Getting the most edited shows that will end in 2016

import operator
import requests
import json
# show_titles = [
#     "Unforgettable_(2011_TV_series)", 
#     "Angel_from_Hell",
#     "Austin_&_Ally", 
#     "Lab_Rats_(U.S._TV_series)", 
#     "Melissa_Harris-Perry_(TV_series)", 
#     "Gravity_Falls", 
#     "MythBusters", 
#     "Of_Kings_and_Prophets", 
#     "American_Idol",
#     "Togetherness_(TV_series)", 
#     "Childrens_Hospital", 
#     "Monopoly_Millionaires'_Club_(U.S._game_show)", 
#     "The_Good_Wife", 
#     "Mike_&_Molly",
#     "Banshee_(TV_series)", 
#     "Littlest_Pet_Shop_(2012_TV_series)", 
#     "The_Soul_Man", 
#     "Person_of_Interest_(TV_series)", 
#     "Wander_Over_Yonder",
#     "Beauty_&_the_Beast_(2012_TV_series)", 
#     "Crazy_Talk_(TV_series)", 
#     "FABLife", 
#     "The_Meredith_Vieira_Show", 
#     "The_Leftovers_(TV_series)"
# ]

show_titles = [
    "Beauty_&_the_Beast_(2012_TV_series)", 
    "Crazy_Talk_(TV_series)", 
    "FABLife", 
    "The_Meredith_Vieira_Show", 
    "The_Leftovers_(TV_series)"
]

Challenge overview

  1. Create a CSV file that contains a list of all the Wikipedia articles for shows that are ending this year, and how many revisions have been made to each article.
  2. Which of the shows that are being cancelled this year have been edited the most? Which ones have been edited the least?

This challenge can be accomplished without any functions at all, just using some combination of (nested) 'for' loops, while loops, etc.

However, using functions helps us avoid writing the same commands over and over again, and keeps our code well organized. Today we'll walk through how to design a Python script that uses functions to solve this challenge.

Step 0. Outlining what our code will do

A good way to start any complex coding project is to write an outline of what needs to be done, using notes and/or pseudocode.

#STEP 1
#create a new, empty dictionary to hold our per-show counts. 
#As we query the API for revisions for each show, we will add those to this dict,
#like {'showname1':2302, 'showname2':435,...}

#STEP 2
#loop through the list of shows and query the API to count how many revisions each has
#add the total revs and the show name to the dict we made in STEP 1

#STEP 3
#Create a CSV file that contains a row for every show, and the number of revisions it has

#STEP 4
#Find out which of the shows has the most revisions

#STEP 5
#Find out which of the shows has the fewest revisions

Outlining your coding challenge (or final project) this way will help you stay on track and avoid getting stuck—even if you don't use functions! But it's especially helpful if you are going to use functions, because when you break the task down into steps like this, it helps you identify discrete sets of operations that naturally 'go together'.

Step 1. Write an API query function

Step 2 involves making a series of API queries using almost exactly the same parameters—the only thing that changes is the title of the show we're asking for revisions for.

Your main method (which I cover in the next step) will show what code is being executed, and in what order. But most of the execution (the for loops, while loops, and if statements) will happen elsewhere, inside functions.

This sounds like a perfect opportunity to write a function. Let's plan it out. Again, using notes/pseudocode first.

Function 1 outline

# define the function for getting revisions (pass it a single show title):
    #parameters = {some parameters go here}
    #endpoint = a url
    #revisions = 0 
    #while True:
        #query the api
        #save the JSON result as a dictionary
        #count the revisions in the dictionary
        #add that count to our variable 'revisions'
        #if there are more revisions to get, do another API call
    #save the final count somewhere, associated with the title
    #send the result somewhere

If you follow all the steps above for every show in the list, you will have collected all the data you need for analysis. Here's what a final version of this function might look like.

Function 1 code

def getRevisions(show_title):
    """
    Input: receives a string (Wikipedia page title)
    Output: returns a dict with the title and the number of revisions
    """
    ENDPOINT = 'https://en.wikipedia.org/w/api.php'

    parameters = {'action' : 'query',
        'prop' : 'revisions', 
        'titles' : show_title,
        'rvlimit' : 500,
        'rvprop' : "ids", 
        'format' : 'json',
        'continue' : ''
    }

    revisions = 0
    while True:
        wp_call = requests.get(ENDPOINT, params=parameters)
        response = wp_call.json()
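        pages = response['query']['pages']
        for page_id in pages:
            page = pages[page_id]
            # a title with no matching article has no 'revisions' key,
            # so fall back to an empty list instead of raising a KeyError
            page_revs = page.get('revisions', [])
            # each batch is a list with one entry per revision
            revisions += len(page_revs)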
        if 'continue' in response:
            parameters.update(response['continue'])
        else:
            break
    result = {}
    result[show_title] = revisions
#   print(result)
    print("Found %d revisions for %s" % (revisions, show_title))
    return result
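
The while loop in getRevisions is easier to follow when you can watch it run without the network. The block below is a minimal offline sketch of the same continuation pattern. The canned responses, the batch tokens, and the names fake_get and count_revisions_offline are all made up for illustration; only the overall shape (a 'query' section plus an optional 'continue' section) mirrors what the function above expects.

```python
# Hypothetical canned responses standing in for three API calls.
# The revision ids and the 'batch2'/'batch3' tokens are invented.
FAKE_RESPONSES = [
    {"query": {"pages": {"1": {"revisions": [{"revid": 1}, {"revid": 2}]}}},
     "continue": {"rvcontinue": "batch2", "continue": "||"}},
    {"query": {"pages": {"1": {"revisions": [{"revid": 3}, {"revid": 4}]}}},
     "continue": {"rvcontinue": "batch3", "continue": "||"}},
    {"query": {"pages": {"1": {"revisions": [{"revid": 5}]}}}},
]


def fake_get(parameters):
    """Pretend API: choose a canned response from the 'rvcontinue' value."""
    token = parameters.get("rvcontinue")
    index = {None: 0, "batch2": 1, "batch3": 2}[token]
    return FAKE_RESPONSES[index]


def count_revisions_offline():
    """Same loop shape as getRevisions, but against the fake API."""
    parameters = {"action": "query", "prop": "revisions", "continue": ""}
    revisions = 0
    calls = 0
    while True:
        response = fake_get(parameters)
        calls += 1
        for page in response["query"]["pages"].values():
            revisions += len(page["revisions"])
        if "continue" in response:
            # copy the continuation values into the next request
            parameters.update(response["continue"])
        else:
            # no 'continue' section means this was the last batch
            break
    return revisions, calls


total, calls = count_revisions_offline()
print("Counted %d revisions in %d calls" % (total, calls))
# Counted 5 revisions in 3 calls
```

The key line is parameters.update(response["continue"]): each response tells us what to send next, and the loop stops as soon as a response arrives without that section.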

Step 2. Create a Main method

By the 'main' method, I mean the part of the script that controls what happens and in what order it happens. Before today, each of our scripts has essentially been one big 'main' method, because we have been ordering everything sequentially: we start executing at the top of the script, and we finish at the bottom.

But when you are writing complex operations in your code (e.g. lots of for loops, while loops, and if statements), the code quickly becomes hard to read. And it becomes even harder to get a bird's-eye view of what the code is doing.

Your main method is where you execute the code that is inside the functions.

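Your main method is usually placed at the bottom of the file. Python still reads the file from top to bottom, but a def statement only defines a function: the code inside it does not run until the function is called. So the main method is where the script's real work begins, even though it comes after the import statements, global variables, and function definitions.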

That's why the functions have to be above the main method: when Python sees a function call in the main method, it checks against the functions that it has seen so far. It's really the same situation as with show_titles: Python can't execute any operations against the contents of show_titles unless it has seen that list before it is asked to do something with it.
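
Here is a small sketch of that ordering rule in action. The function greet is a made-up example, not part of our script; the point is only that calling a name before its def statement has run raises a NameError.

```python
# Calling greet() here fails: Python has not reached 'def greet' yet.
try:
    greet("early")
except NameError as err:
    early_result = "NameError: %s" % err


def greet(name):
    """Hypothetical helper used only for this demonstration."""
    return "Hello, %s" % name


# The same call works now, because the def statement above has run.
late_result = greet("late")

print(early_result)
print(late_result)  # Hello, late
```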

#anatomy of a typical script with functions

#import statements go here

#global variables (like show_titles) go here

#functions go here

###MAIN METHOD STARTS HERE###
shows = {}
for t in show_titles:
    show = getRevisions(t)
    shows.update(show)
print(shows)
Found 1260 revisions for Beauty_&_the_Beast_(2012_TV_series)
Found 47 revisions for Crazy_Talk_(TV_series)
Found 95 revisions for FABLife
Found 78 revisions for The_Meredith_Vieira_Show
Found 668 revisions for The_Leftovers_(TV_series)
{'The_Meredith_Vieira_Show': 78, 'The_Leftovers_(TV_series)': 668, 'Beauty_&_the_Beast_(2012_TV_series)': 1260, 'FABLife': 95, 'Crazy_Talk_(TV_series)': 47}

Step 3: Make a CSV printing function

Now we have written one function. As you can see, it makes it somewhat easier to see what's happening at a high level (especially since we gave our function a descriptive name). But it might not be clear yet why we went through the extra trouble of making a function for this, rather than just using a for loop.

For our next step, we will write a function that takes the dict we just created, shows, and outputs it to a CSV file.

Function 2 outline

#take in a dictionary with show names and revision counts as key/value pairs
#create a new CSV file with column headers and a line for each show
#name the output file something useful
#save the output file

Function 2 code

def saveResults(show_dict, fname):
    """
    Input: takes a dictionary of show titles and revision counts, and a filename
    Output: creates a CSV file with the specified filename, containing the titles and counts
    """
    # 'with' closes the file automatically, even if a write fails
    with open(fname, "w") as fout:
        fout.write("title,revisions\n")  # header row
        for skey, sval in show_dict.items():
            fout.write(skey + "," + str(sval) + "\n")

This function is a bit different from the last one. First, it takes two variables, rather than one. Second, it doesn't return anything. That's because in this case, it doesn't need to: its job is to take input from Main and then save that input to a file on your hard drive or server.
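
One limitation worth knowing about: saveResults joins the fields with a comma by hand, so a title that itself contains a comma would split into extra columns. None of the titles in our list have that problem, but Python's built-in csv module handles the quoting for you. The block below is a sketch of that alternative, not the version used in class; the name save_results_csv and the comma-containing title are invented for the demonstration.

```python
import csv
import os
import tempfile


def save_results_csv(show_dict, fname):
    """Write titles and revision counts to fname using the csv module."""
    with open(fname, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["title", "revisions"])  # header row
        for title, count in show_dict.items():
            writer.writerow([title, count])


# The second title is made up, to show how a comma is handled.
sample = {"FABLife": 95, "Show,_With_Comma": 12}

# Write to a throwaway directory, then read the file back to check it.
with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "sample_shows.csv")
    save_results_csv(sample, out_path)
    with open(out_path, newline="") as fin:
        rows = list(csv.reader(fin))

print(rows)
# [['title', 'revisions'], ['FABLife', '95'], ['Show,_With_Comma', '12']]
```

Notice that the counts come back as strings: a CSV file stores text, so you would convert them with int() before doing arithmetic.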

Here's what our Main method looks like once we've implemented this new function:

###MAIN METHOD STARTS HERE###
shows = {}
for t in show_titles:
    show = getRevisions(t)
    shows.update(show)

#new function call
saveResults(shows, "tv_show_2016.csv")
Found 1260 revisions for Beauty_&_the_Beast_(2012_TV_series)
Found 47 revisions for Crazy_Talk_(TV_series)
Found 95 revisions for FABLife
Found 78 revisions for The_Meredith_Vieira_Show
Found 668 revisions for The_Leftovers_(TV_series)

Things to notice:

  • we didn't have to assign a variable to capture the results of this function (like we did above with 'shows'), because this function didn't return any results; it just makes a file.
  • the names of the variables we passed as parameters are different in the function call than they are in the function declaration. In fact, in one case we just passed a string, not a variable at all.
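
Both points can be seen in a tiny example. The functions add_tax and announce below are hypothetical and have nothing to do with our script; they just contrast a function that returns a value with one that doesn't.

```python
def add_tax(price, rate):
    """Returns a value, so the caller usually captures it in a variable."""
    return price * (1 + rate)


def announce(message):
    """Only prints; there is no return statement."""
    print(message)


subtotal = 100
# The argument names here (subtotal, a bare number) don't have to match
# the parameter names in the declaration (price, rate).
total = add_tax(subtotal, 0.5)

# Capturing the result of a function that returns nothing gives None.
nothing = announce("file saved")

print(total)    # 150.0
print(nothing)  # None
```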

Steps 4-5: getting min and max values

Now that we've saved our results in a dictionary (shows) and as a CSV file (tv_show_2016.csv), there are at least three ways we could find the min/max values:

  1. by opening that file in Microsoft Excel or another spreadsheet program, and sorting by the second column.
  2. by looping through our dictionary and comparing each rev count against the largest one we've seen so far, then doing the same thing looking for the smallest count.
  3. by creating sorted lists out of our dictionary data and picking the first/last value in the list
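
For comparison, here is roughly what option 2 could look like. This is a sketch, not the approach we use below; the dictionary is typed in by hand from the counts printed earlier so that the example runs without calling the API.

```python
# Counts copied from the output shown earlier in this walkthrough.
shows = {
    "Beauty_&_the_Beast_(2012_TV_series)": 1260,
    "Crazy_Talk_(TV_series)": 47,
    "FABLife": 95,
    "The_Meredith_Vieira_Show": 78,
    "The_Leftovers_(TV_series)": 668,
}

most = None   # (title, count) of the most-edited show seen so far
least = None  # (title, count) of the least-edited show seen so far

for title, count in shows.items():
    # replace the current record holder whenever we find a bigger count
    if most is None or count > most[1]:
        most = (title, count)
    # same idea, looking for a smaller count
    if least is None or count < least[1]:
        least = (title, count)

print("%s has the most (%d) revisions" % most)
print("%s has the fewest (%d) revisions" % least)
```

This version finds both values in a single pass through the dictionary, at the cost of a few more lines than sorting.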

All of these solutions are fine, but #3 is probably the quickest.

###MAIN METHOD STARTS HERE###
shows = {}
for t in show_titles:
    show = getRevisions(t)
    shows.update(show)

#new function call
saveResults(shows, "tv_show_2016.csv")

#make a sorted version of our dictionary, with the most-edited show first
shows_sorted_max = sorted(shows.items(), key=operator.itemgetter(1), reverse = True)
max_rev = shows_sorted_max[0]
print("%s has the most (%d) revisions in the dataset" % (max_rev[0], max_rev[1]))

#make a sorted version of our dictionary, with the least-edited show first
shows_sorted_min = sorted(shows.items(), key=operator.itemgetter(1), reverse = False)
min_rev = shows_sorted_min[0]
print("%s has the fewest (%d) revisions in the dataset" % (min_rev[0], min_rev[1]))
Found 1260 revisions for Beauty_&_the_Beast_(2012_TV_series)
Found 47 revisions for Crazy_Talk_(TV_series)
Found 95 revisions for FABLife
Found 78 revisions for The_Meredith_Vieira_Show
Found 668 revisions for The_Leftovers_(TV_series)
Beauty_&_the_Beast_(2012_TV_series) has the most (1260) revisions in the dataset
Crazy_Talk_(TV_series) has the fewest (47) revisions in the dataset

Now we've accomplished our task and we can all hang up our coding gloves and go home. Right?

Not so fast

This solution is reasonably clear and succinct. But it's not very extensible. For example:

  • what if we have another list of shows (let's call it show_titles_2015), and we want to perform the same operations on that list of shows as well?
  • what if we wanted to be able to find the largest or smallest item in ANY dictionary that contains strings for keys and numbers for values (even if they're not TV shows)?

If we want to do any of this, now or in the future, without having to rewrite all this code, we should take a look at our code for finding the min/max value and see if we can make it shorter and more abstract by using a function.

Function 3 outline

#take in a dictionary and instructions on how to sort it (lowest-highest, highest-lowest)
#make a sorted version of it
#return the highest/lowest value

Function 3 code

def findExtremeValue(show_dict, max=True):
    """
    Input: 
        * a dictionary with strings for keys and numbers for values
        * a flag for whether to return item with max or min number (max is default)
    Output: returns the item with the max (or min) value in the dictionary
    """
    shows_sorted = sorted(show_dict.items(), key=operator.itemgetter(1), reverse = max)
    return shows_sorted[0]

This function is very short. But that's okay. Functions can be very short—how much a particular function does is up to you. Many programmers are fond of short functions, because they make the code very modular and abstract.

How long should your functions be? A good rule of thumb is to try to make each function do only one thing... of course what constitutes "one thing" is up to you. Ultimately, the best way to figure out how much a single function should do is to experiment.
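
To see what the abstraction buys us, here is the same function applied to data that has nothing to do with TV shows. The word_counts dictionary is invented for this example, and the function body is repeated only so the block runs on its own.

```python
import operator


def findExtremeValue(show_dict, max=True):
    """Return the (key, value) pair with the max (or min) value."""
    shows_sorted = sorted(show_dict.items(), key=operator.itemgetter(1), reverse=max)
    return shows_sorted[0]


# Hypothetical data: how often each word appears in some text.
word_counts = {"the": 120, "python": 45, "wiki": 7}

most_common = findExtremeValue(word_counts)
least_common = findExtremeValue(word_counts, max=False)

print(most_common)   # ('the', 120)
print(least_common)  # ('wiki', 7)
```

Nothing inside the function mentions shows or revisions, which is why it works unchanged on any dictionary that maps strings to numbers.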

Here's what our Main method looks like once we've implemented this new function:

###MAIN METHOD STARTS HERE###
shows = {}
for t in show_titles:
    show = getRevisions(t)
    shows.update(show)

saveResults(shows, "tv_show_2016.csv")

#new function call - get max
max_rev = findExtremeValue(shows, max=True)
print("%s has the most (%d) revisions in the dataset" % (max_rev[0], max_rev[1]))

#new function call - get min
min_rev = findExtremeValue(shows, max=False)
print("%s has the fewest (%d) revisions in the dataset" % (min_rev[0], min_rev[1]))
Found 1260 revisions for Beauty_&_the_Beast_(2012_TV_series)
Found 47 revisions for Crazy_Talk_(TV_series)
Found 95 revisions for FABLife
Found 78 revisions for The_Meredith_Vieira_Show
Found 668 revisions for The_Leftovers_(TV_series)
Beauty_&_the_Beast_(2012_TV_series) has the most (1260) revisions in the dataset
Crazy_Talk_(TV_series) has the fewest (47) revisions in the dataset

The final piece: if __name__ == "__main__":

The code we've written so far is much clearer, better organized, and easier to update than our first version. But there's one more (optional) thing that we can do to make full use of the functions we've created: add if __name__ == "__main__": at the top of our main method (which until now we've been designating with an all-caps comment string).

Note that when we use if __name__ == "__main__":, we have to indent all of our 'main' stuff under it, just like we would for the body of a function or a for loop. That's because it is an ordinary if statement: the indented code runs only when the condition is true, which happens when the file is run directly as a script rather than imported. (Strictly speaking, this block is not a function, even though we have been calling it the 'main method'.) By using if __name__ == "__main__":, we're making the boundaries of 'main' explicit.

Using if __name__ == "__main__": also allows us to use our script as a module and use the functions we've created in other scripts (or in the terminal) by importing our script into that environment, and passing values to whatever function we want.
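
The block below is a self-contained sketch of that behavior. It does not use week8/wiki_shows3.py; instead it writes a tiny stand-in module (the names tiny_module, double, and main_ran are all made up) to a temporary folder, imports it, and checks that the guarded code did not run. The importlib calls are only there so the example needs no extra files; in everyday use you would simply write an import statement.

```python
import importlib.util
import os
import tempfile

# Source code for a tiny stand-in module (hypothetical).
MODULE_SOURCE = '''
def double(n):
    """Return n multiplied by two."""
    return n * 2

main_ran = False
if __name__ == "__main__":
    # Runs only when the file is executed directly, not when imported.
    main_ran = True
    print(double(21))
'''

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "tiny_module.py")
    with open(path, "w") as f:
        f.write(MODULE_SOURCE)

    # Import the file under the name "tiny_module". Inside the module,
    # __name__ is then "tiny_module" rather than "__main__".
    spec = importlib.util.spec_from_file_location("tiny_module", path)
    tiny_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(tiny_module)

print(tiny_module.__name__)   # tiny_module
print(tiny_module.main_ran)   # False: the guarded block was skipped
print(tiny_module.double(4))  # 8: the function is still available
```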

We will demonstrate this in TextWrangler using the script week8/wiki_shows3.py, but I will paste what the main method of that script looks like below, for clarity.

###MAIN METHOD IS BELOW###
if __name__ == "__main__":
    shows = {}
    for t in show_titles:
        show = getRevisions(t)
        shows.update(show)

    saveResults(shows, "tv_show_2016.csv")

    max_rev = findExtremeValue(shows, max=True)
    print("%s has the most (%d) revisions in the dataset" % (max_rev[0], max_rev[1]))

    min_rev = findExtremeValue(shows, max=False)
    print("%s has the fewest (%d) revisions in the dataset" % (min_rev[0], min_rev[1]))