Lab 2 - Networks

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the second of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will extend the methods in the prior lab about analyzing a single article's revision histories and use network science methods to analyze the networks of coauthorship and hyperlinks. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np
import pandas as pd

# Two related packages for plotting data
import matplotlib.pyplot as plt
import seaborn as sb

# Packages for requesting data via the web and parsing the resulting JSON and HTML
import requests
import json
from bs4 import BeautifulSoup

# Two packages for accessing the MySQL server
import pymysql
import os

# Package for analyzing complex networks
import networkx as nx

# Set up the code environment to use plots with a white background and DataFrames that show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Retrieve the content of the page via API

Write a function that takes an article title and returns the list of links in the body of the article. We don't use the "pagelinks" table in MySQL or the "links" parameter in the API because both include links within templates. Articles that share a template all link to each other, forming over-dense clusters in the resulting networks. We only want the links appearing in the body of the text.

We pass a request to the API, which returns a JSON-formatted string containing the HTML of the page. We use BeautifulSoup to parse through the HTML tree and extract the non-template links and return them as a list.

def get_page_outlinks(page_title,redirects=1):
    # Replace spaces with underscores
    #page_title = page_title.replace(' ','_')
    # Substrings that mark links we want to ignore
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard']
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('{0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    # Initialize an empty list to store the links
    outlinks_list = []
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    title = link['title']
                    # Ignore links that aren't interesting
                    if all(bad not in title for bad in bad_titles):
                        outlinks_list.append(title)

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links that aren't interesting
                        #if 'Special:' not in title and 'Wikipedia:' not in title and 'Help:' not in title and 'International Standard' not in title:
                        if all(bad not in title for bad in bad_titles): # Not working for some reason...
                            outlinks_list.append(title)

    return outlinks_list
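
The title filter inside the function can be checked on its own, without calling the API. This is a minimal sketch: the helper name is_interesting and the short sample list are assumptions made for illustration, with the titles taken from the kinds of links that appear on a Wikipedia page.

```python
# Substrings that mark links we want to ignore (same list as in get_page_outlinks)
bad_titles = ['Special:', 'Wikipedia:', 'Help:', 'Template:', 'Category:', 'International Standard']

# A small hand-picked sample of link titles (hypothetical input, not API output)
sample_titles = [
    'Help:IPA for English',
    'Media franchise',
    'Game Freak',
    'Wikipedia:Identifying reliable sources',
    'Portal:Video games',
]

def is_interesting(title, bad_titles=bad_titles):
    """Return True when none of the unwanted substrings occur in the title."""
    return all(bad not in title for bad in bad_titles)

# Keep only the titles that pass the filter
kept_titles = [title for title in sample_titles if is_interesting(title)]
print(kept_titles)
```

Note that 'Portal:' is not in bad_titles, so portal links pass the filter; add it to the list if you want them dropped too.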

You could write a recursive function like recursively_get_hyperlink_network that would crawl the hyperlink network out to an arbitrary distance, but this becomes exorbitantly expensive at any depth greater than 1. Here's an example function, but it is not executable to prevent you from harming yourself. :)

['Help:Installing Japanese character sets',
 'Help:IPA for English',
 'Help:IPA for English',
 'Help:Pronunciation respelling key',
 'Help:Pronunciation respelling key',
 'Media franchise',
 'The Pokémon Company',
 'Game Freak',
 'Creatures (company)',
 'Satoshi Tajiri',
 'List of Pokémon',
 'Game Boy',
 'Mario (franchise)',
 'Hey You, Pikachu!',
 'Nintendo 64',
 '4Kids Entertainment',
 'United States dollar',
 'Pokémon: Tenth Anniversary',
 'Super Bowl commercials',
 'Super Bowl 50',
 'Super Bowl 50',
 'Augmented reality',
 'Pokémon Go',
 'Pokémon Sun and Moon',
 'Film adaptation',
 'Great Detective Pikachu',
 'Romanization of Japanese',
 'Contraction (grammar)',
 'Help:Installing Japanese character sets',
 'List of Pokémon',
 'Pokémon X and Y',
 'English plurals',
 'Game Boy',
 'Pokémon universe',
 'Insect collecting',
 'Pokémon Trainer',
 'Pokémon (video game series)',
 'Pokémon (anime)',
 'Pokémon Trading Card Game',
 'Poké Ball',
 'Experience point',
 'Pokémon statistics',
 'Pokémon moves',
 'Pokémon evolution',
 'Non-player character',
 'Gym Leader',
 'Elite Four',
 'Satoshi Tajiri',
 'Game Boy',
 'Video game remake',
 'The Pokémon Company International',
 'Pokémon Red and Green',
 'Pokémon Red and Blue',
 'Pokémon Yellow: Special Pikachu Edition',
 'Game Boy Color',
 'Mew (Pokémon)',
 'Kanto (Pokémon)',
 'Kantō region',
 'Pokémon Gold and Silver',
 'Pokémon Crystal',
 'Celebi (Pokémon)',
 'Kansai region',
 'Pokémon mini',
 'Handheld game console',
 'Pokémon Ruby and Sapphire',
 'Game Boy Advance',
 'Pokémon Emerald',
 'Pokémon Diamond and Pearl',
 'Nintendo DS',
 'Types of Pokémon moves',
 'Nintendo Wi-Fi Connection',
 'Pokémon Platinum',
 'Pokémon Battle Revolution',
 'Pokémon HeartGold and SoulSilver',
 'Pokémon Black and White',
 '2010 in video gaming',
 'Pokémon Black and White',
 'Help:Installing Japanese character sets',
 'Help:Installing Japanese character sets',
 'Help:Installing Japanese character sets',
 'Pokémon Black 2 and White 2',
 'Nintendo 3DS',
 'Pokémon Omega Ruby and Alpha Sapphire',
 'Pokémon Sun and Moon',
 'Pokémon types',
 'Nintendo GameCube',
 'Pokémon Colosseum',
 'Pokémon XD: Gale of Darkness',
 'Pokémon Adventures',
 'Exposition (literary technique)',
 'Canon (fiction)',
 'Brock (Pokémon)',
 'Misty (Pokémon)',
 'Tracey Sketchit',
 'May (Pokémon anime character)',
 'Max (Pokémon anime character)',
 'Pokémon Chronicles',
 'Dawn (Pokémon)',
 'Pocket Monsters: Best Wishes!',
 'Help:Installing Japanese character sets',
 'Pokémon movies',
 'Pokémon the Movie: Black—Victini and Reshiram and White—Victini and Zekrom',
 'Pokémon Junior',
 'Warner Bros. Pictures',
 'Sony Pictures',
 'Legendary Pictures',
 'Universal Pictures',
 'Nicole Perlman',
 'Alex Hirsch',
 'Dean Israelite',
 'Robert Rodriguez',
 'Tim Miller (director)',
 'Collectible card game',
 'Wizards of the Coast',
 'Nintendo e-Reader',
 'Pokémon Trading Card Game (video game)',
 'Viz Media',
 'Chuang Yi',
 'Monopoly (game)',
 'Vatican City',
 'United Kingdom',
 'Christian Power Cards (page does not exist)',
 'Anti-Defamation League',
 'Merrick, New York',
 'Problem gambling',
 'Saudi Arabia',
 'Star of David',
 'Economic materialism',
 'People for the Ethical Treatment of Animals',
 'Cruelty to animals',
 'Dog fighting',
 'Dennō Senshi Porygon',
 'The Simpsons',
 'Thirty Minutes over Tokyo',
 'South Park',
 'Manhattan Beach, California',
 'Monster in My Pocket',
 'Bosnian War',
 'Popular culture',
 'List of Pokémon characters',
 "Macy's Thanksgiving Day Parade",
 'Time (magazine)',
 'Drawn Together',
 'List of Drawn Together characters',
 'The Grim Adventures of Billy & Mandy',
 'Robot Chicken',
 'All Grown Up!',
 'Johnny Test',
 "I Love the '90s: Part Deux",
 'Pokémon Live!',
 'Jim Butcher',
 'Codex Alera',
 'Rockefeller Center',
 'Stuffed animal',
 'Nintendo World Store',
 'Wikipedia:Identifying reliable sources',
 'Digital Monster (virtual pet)',
 'Fan site',
 'Twitch Plays Pokémon',
 'Fan made',
 'Game mode',
 'Pokémon Uranium',
 'Pokémon: The Electric Tale of Pikachu',
 'Shōnen manga',
 'Pokémon Adventures',
 'Magical Pokémon Journey',
 'Shōjo manga',
 'Pokémon (manga)',
 'Ash & Pikachu',
 'Pokémon Gold & Silver (manga)',
 'Pokémon Ruby-Sapphire',
 'List of Pokémon volumes',
 'List of Pokémon volumes',
 'List of Pokémon volumes',
 'List of Pokémon volumes',
 'Pokémon Diamond and Pearl Adventure!',
 'List of Pokémon volumes',
 'List of Pokémon volumes',
 'List of Pokémon volumes',
 'List of Pokémon volumes',
 'List of Pokémon volumes',
 'List of Pokémon Black and White chapters',
 'Portal:Video games',
 'List of Pokémon',
 'List of Pokémon chapters',
 'List of Pokémon characters',
 'List of Pokémon episodes',
 'List of Pokémon video games',
 'Pokémon episodes removed from rotation']
def recursively_get_hyperlink_network(seed_page,depth):
    neighbors = {}
    if depth < 0:
        return neighbors
    neighbors[seed_page] = get_page_outlinks(seed_page)
    for neighbor in neighbors[seed_page]:
        neighbors[neighbor] = recursively_get_hyperlink_network(neighbor,depth-1)
    return neighbors
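
To get a feel for how quickly the cost of a recursive crawl grows, here is a sketch that counts page fetches on a tiny made-up link table instead of the live API. The names toy_links and crawl are hypothetical and exist only for this illustration. Unlike the function above, it merges the nested results into one flat dictionary.

```python
# Hypothetical toy link table standing in for live API calls
toy_links = {
    'A': ['B', 'C'],
    'B': ['A', 'C', 'D'],
    'C': ['D'],
    'D': ['A'],
}

def crawl(seed_page, depth, fetch_log):
    """Depth-limited crawl that records every simulated page fetch in fetch_log."""
    neighbors = {}
    if depth < 0:
        return neighbors
    fetch_log.append(seed_page)  # one simulated API request
    neighbors[seed_page] = toy_links.get(seed_page, [])
    for neighbor in neighbors[seed_page]:
        # Merge the results of crawling each neighbor one level shallower
        neighbors.update(crawl(neighbor, depth - 1, fetch_log))
    return neighbors

# Count the simulated requests needed at each crawl depth
fetch_counts = {}
for depth in range(4):
    fetch_log = []
    crawl('A', depth, fetch_log)
    fetch_counts[depth] = len(fetch_log)

print(fetch_counts)
```

Even with only four pages and at most three links per page, the number of requests keeps climbing with depth because pages are fetched again every time they are reached. On real articles with hundreds of links each, the growth is far steeper.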

Instead, define a simple function to get the 1.5-step ego hyperlink network. The "ego" is the seed page you start from, and the "alters" are the neighbors that the ego links out to. We also get the alters of the alters (2nd-order alters), but only keep these 2nd-order connections if they link to 1st-order alters. In other words, the 1.5-step ego hyperlink network consists of all the pages linked from the seed page and the connections among this set of articles.

def get_hyperlink_alters(seed_page):
    # Initialize an empty dictionary to act as an adjacency "list"
    neighbors = {}
    # Get all the alters for the seed page and store them in the adjacency dictionary
    neighbors[seed_page] = get_page_outlinks(seed_page,1)
    # For each of the alters, get their alters and store in the adjacency dictionary
    for neighbor in list(set(neighbors[seed_page])): # Don't recrawl duplicates
        neighbors[neighbor] = get_page_outlinks(neighbor,0)
    # Initialize an empty graph that we will add nodes and edges into
    g = nx.DiGraph()
    # For each entry in the adjacency dictionary, check if the alter's alters are also the seed page's alters
    # If they are and the edge is already in the graph, increment the edge weight by one
    # If they are but the edge is not already in the graph, add the edge with a weight of one
    for article,neighbor_list in neighbors.items():
        for neighbor in neighbor_list:
            if neighbor in neighbors[seed_page] + [seed_page]:
                if g.has_edge(article,neighbor):
                    g[article][neighbor]['weight'] += 1
                else:
                    g.add_edge(article,neighbor,weight=1)
    # Return the weighted graph
    return g
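
The edge-filtering and weighting logic can be traced by hand on a small example. The sketch below uses a plain dictionary and a Counter rather than networkx so that it runs without any network access. The names toy_outlinks and ego_edge_weights are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical out-link lists: the ego links to B twice, A links to a page (X)
# that the ego does not link to, and B only links outside the ego network
toy_outlinks = {
    'Ego': ['A', 'B', 'B'],
    'A': ['B', 'Ego', 'X'],
    'B': ['Y'],
}

def ego_edge_weights(seed_page, outlinks):
    """Count links among the seed page and its alters, dropping all other links."""
    keep = set(outlinks[seed_page]) | {seed_page}
    weights = Counter()
    for article in keep:
        for neighbor in outlinks.get(article, []):
            if neighbor in keep:
                # Repeated links between the same pair increase the weight
                weights[(article, neighbor)] += 1
    return weights

toy_weights = ego_edge_weights('Ego', toy_outlinks)
for (source, target), weight in sorted(toy_weights.items()):
    print('{0} -> {1} (weight {2})'.format(source, target, weight))
```

The links A -> X and B -> Y are dropped because X and Y are not alters of the ego, which is what keeps the network at 1.5 steps rather than 2.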

Run this on an example article and save the resulting graph object to disk. This step could take more than a minute depending on the number of links and size of the neighboring pages.

page_title = 'Pokémon'

# Create the hyperlink network
hyperlink_g = get_hyperlink_alters(page_title)

# Save the graph to disk to visualize in Gephi
nx.write_gexf(hyperlink_g,'hyperlink_{0}.gexf'.format(page_title.replace(' ','_')))
hg_nodes = hyperlink_g.number_of_nodes()
hg_edges = hyperlink_g.number_of_edges()

print("There are {0} nodes and {1} edges in the hyperlink network.".format(hg_nodes,hg_edges))
hg_density = nx.density(hyperlink_g)
print('{0:.2%} of the possible edges actually exist.'.format(hg_density))
5.24% of the possible edges actually exist.
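
For a directed graph without self-loops, density is the number of edges divided by the n*(n-1) possible ordered pairs of nodes. Here is a small sketch of that arithmetic, where directed_density is a hypothetical helper and the node and edge counts are made up:

```python
def directed_density(num_nodes, num_edges):
    """Share of the n*(n-1) possible directed edges (no self-loops) that exist."""
    possible_edges = num_nodes * (num_nodes - 1)
    if possible_edges == 0:
        return 0.0
    return num_edges / float(possible_edges)

# A toy graph with 4 nodes has 4*3 = 12 possible directed edges
toy_density = directed_density(num_nodes=4, num_edges=3)
print('{0:.2%} of the possible edges actually exist.'.format(toy_density))
```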
def reciprocity(g):
    reciprocated_edges = []
    for (i,j) in g.edges():
        # Check the graph passed in, not the global hyperlink_g
        if g.has_edge(j,i):
            reciprocated_edges.append((i,j))
    return len(reciprocated_edges)/float(g.number_of_edges())

hg_reciprocity = reciprocity(hyperlink_g)

print('{0:.2%} of the edges in the hyperlink network are reciprocated.'.format(hg_reciprocity))
30.58% of the edges in the hyperlink network are reciprocated.
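
Reciprocity here means the fraction of directed edges whose reverse edge also exists. As a sanity check, the same calculation can be sketched on a plain list of edge tuples. The name edge_reciprocity and the toy edges are assumptions for illustration.

```python
def edge_reciprocity(edges):
    """Fraction of distinct directed edges (i, j) for which (j, i) is also present."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    reciprocated = [(i, j) for (i, j) in edge_set if (j, i) in edge_set]
    return len(reciprocated) / float(len(edge_set))

# A <-> B is reciprocated (both directions count); A -> C and C -> D are not
toy_edges = [('A', 'B'), ('B', 'A'), ('A', 'C'), ('C', 'D')]
toy_reciprocity = edge_reciprocity(toy_edges)
print('{0:.2%} of the edges are reciprocated.'.format(toy_reciprocity))
```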

Identify the most well-connected nodes

hg_in_degree_d = {node:int(centrality*(len(hyperlink_g) - 1)) for node,centrality in nx.in_degree_centrality(hyperlink_g).items()}
hg_out_degree_d = {node:int(centrality*(len(hyperlink_g) - 1)) for node,centrality in nx.out_degree_centrality(hyperlink_g).items()}
degree_df = pd.DataFrame({'In':hg_in_degree_d,'Out':hg_out_degree_d})
degree_df['In'].sort_values(ascending=False).head(10)
Help:Installing Japanese character sets    100
Pokémon                                     88
Nintendo                                    70
Pokémon (anime)                             68
Pikachu                                     56
Pokémon Red and Blue                        53
Portal:Pokémon                              51
Pokémon (video game series)                 51
Pokémon Gold and Silver                     48
IGN                                         47
Name: In, dtype: int64
degree_df['Out'].sort_values(ascending=False).head(10)
Pokémon                        227
Pokémon (video game series)     67
Gym Leader                      45
List of Pokémon characters      45
Elite Four                      45
Flareon                         44
Jolteon                         44
Vaporeon                        44
Pokémon moves                   41
Types of Pokémon moves          41
Name: Out, dtype: int64
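
One caveat about converting centralities back into raw degrees: degree centrality is the degree divided by (n - 1), so multiplying by (n - 1) recovers the degree only up to floating-point error, and int() truncates rather than rounds. The sketch below assumes the centrality is computed as degree * (1.0 / (n - 1)); recover_degree is a hypothetical helper that rounds before converting.

```python
def recover_degree(centrality, num_nodes):
    """Convert a degree centrality back to an integer degree, rounding instead of truncating."""
    return int(round(centrality * (num_nodes - 1)))

num_nodes = 50
true_degrees = list(range(num_nodes))

# Assumed formula for the centrality: degree * (1 / (n - 1))
centralities = [degree * (1.0 / (num_nodes - 1)) for degree in true_degrees]

# Rounding recovers every degree
recovered = [recover_degree(c, num_nodes) for c in centralities]

# Plain truncation may undercount when the product lands just below an integer
truncated = [int(c * (num_nodes - 1)) for c in centralities]
mismatches = [d for d, t in zip(true_degrees, truncated) if d != t]

print('Degrees recovered with rounding: {0}'.format(recovered == true_degrees))
print('Degrees off by truncation: {0}'.format(mismatches))
```

Calling hyperlink_g.in_degree() and hyperlink_g.out_degree() returns the integer degrees directly and avoids the round trip through centrality altogether.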
in_degree_dist_df = degree_df['In'].value_counts().reset_index()
out_degree_dist_df = degree_df['Out'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
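
The value_counts step above tabulates the degree distribution: for each degree value, how many nodes have it. A dependency-free sketch of the same tabulation, using a made-up dictionary of in-degrees (toy_in_degrees is hypothetical):

```python
from collections import Counter

# Hypothetical in-degrees for six nodes
toy_in_degrees = {'A': 2, 'B': 0, 'C': 1, 'D': 1, 'E': 0, 'F': 0}

# Count how many nodes share each degree value
degree_distribution = Counter(toy_in_degrees.values())

# Sort by degree so the rows are ready to plot (degree on x, node count on y)
distribution_rows = sorted(degree_distribution.items())
for degree, node_count in distribution_rows:
    print('{0} node(s) with in-degree {1}'.format(node_count, degree))
```

Degree distributions in hyperlink networks are typically very skewed, so plotting these rows on logarithmic axes usually makes the pattern easier to read.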

Construct co-authorship network

def get_500_recent_revisions(page_title):
    # Request up to 500 of the most recent revisions for the page
    req = requests.get('{0}&rvprop=ids%7Ctimestamp%7Cuser%7Csize&rvlimit=500'.format(page_title))
    json_payload = json.loads(req.text)
    try:
        pageid = list(json_payload['query']['pages'].keys())[0]
        revisions = json_payload['query']['pages'][pageid]['revisions']
        df = pd.DataFrame(revisions)
        # Convert the ISO 8601 timestamp strings into datetime objects
        df['timestamp'] = pd.to_datetime(df['timestamp'],format='%Y-%m-%dT%H:%M:%SZ')
        df['title'] = json_payload['query']['pages'][pageid]['title']
        return df
    except KeyError:
        # Pages that are missing or have no revisions return None
        print('Error in {0}'.format(page_title))
        return None
def get_neighbors_500_revisions(page_title):
    """ Takes a page title and returns the 500 most-recent revisions for the page and its neighbors.
      page_title = a string for the page title to get its revisions
      Returns:
      A pandas DataFrame containing all the page revisions.
    """
    alters = get_page_outlinks(page_title) + [page_title]
    df_list = []
    for alter in alters:
        _df = get_500_recent_revisions(alter)
        # Skip pages whose revisions could not be retrieved
        if _df is not None:
            df_list.append(_df)
    df = pd.concat(df_list)
    return df
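
The revision functions rely on the nested shape of the API's JSON payload: query, then pages, then a page ID, then a list of revisions. The sketch below walks a small hand-written payload of that shape using only the standard library. The payload values (page ID, users, timestamps) are invented, and parse_revisions is a hypothetical helper.

```python
import json
from datetime import datetime

# Hand-written payload imitating the nested structure the functions above expect
toy_payload = json.loads('''
{"query": {"pages": {"123": {"pageid": 123, "title": "Toy page", "revisions": [
    {"revid": 11, "parentid": 10, "user": "Alice", "timestamp": "2017-01-02T03:04:05Z", "size": 100},
    {"revid": 10, "parentid": 0, "user": "Bob", "timestamp": "2017-01-01T00:00:00Z", "size": 80}
]}}}}
''')

def parse_revisions(payload):
    """Flatten the payload into one dictionary per revision, with parsed timestamps."""
    pages = payload['query']['pages']
    pageid = list(pages.keys())[0]  # the page ID is not known in advance
    page = pages[pageid]
    rows = []
    for revision in page['revisions']:
        row = dict(revision)
        row['timestamp'] = datetime.strptime(revision['timestamp'], '%Y-%m-%dT%H:%M:%SZ')
        row['title'] = page['title']
        rows.append(row)
    return rows

toy_rows = parse_revisions(toy_payload)
for row in toy_rows:
    print(row['revid'], row['user'], row['timestamp'], row['title'])
```

A payload for a missing page has no 'revisions' key, which is why the real function wraps these lookups in a try/except for KeyError.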
hyperlink_g_rev_df = get_neighbors_500_revisions(page_title)

hyperlink_g_gb_user_title = hyperlink_g_rev_df.groupby(['user','title'])
hyperlink_g_agg = hyperlink_g_gb_user_title.agg({'revid':pd.Series.nunique})
hyperlink_g_edgelist_df = hyperlink_g_agg.reset_index()
hyperlink_g_edgelist_df = hyperlink_g_edgelist_df.rename(columns={'revid':'weight'})
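
The groupby and aggregation above turn the revision table into a weighted edge list: one row per (user, title) pair, weighted by the number of distinct revisions that user made to that page. A minimal pure-Python sketch of the same idea, with invented revisions and a hypothetical build_edgelist helper:

```python
from collections import defaultdict

# Invented revision records with the three columns the aggregation needs
toy_revisions = [
    {'user': 'Alice', 'title': 'Page 1', 'revid': 1},
    {'user': 'Alice', 'title': 'Page 1', 'revid': 2},
    {'user': 'Alice', 'title': 'Page 2', 'revid': 3},
    {'user': 'Bob', 'title': 'Page 1', 'revid': 4},
]

def build_edgelist(revisions):
    """Count distinct revision IDs for every (user, title) pair."""
    revids_by_pair = defaultdict(set)
    for revision in revisions:
        revids_by_pair[(revision['user'], revision['title'])].add(revision['revid'])
    return [
        {'user': user, 'title': title, 'weight': len(revids)}
        for (user, title), revids in sorted(revids_by_pair.items())
    ]

toy_edgelist = build_edgelist(toy_revisions)
for edge in toy_edgelist:
    print(edge)
```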

users = hyperlink_g_edgelist_df['user'].unique()
pages = hyperlink_g_edgelist_df['title'].unique()
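# from_pandas_dataframe was renamed to from_pandas_edgelist in networkx 2.x
collab_g = nx.from_pandas_edgelist(hyperlink_g_edgelist_df,source='user',target='title',
                                   edge_attr=['weight'],create_using=nx.DiGraph())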

nx.write_gexf(collab_g,'collaboration_{0}.gexf'.format(page_title.replace(' ','_')))

Compute descriptive statistics for the collaboration network

cg_users = len(users)
cg_pages = len(pages)
cg_edges = collab_g.number_of_edges()

print("There are {0} pages, {1} users, and {2} edges in the collaboration network.".format(cg_pages,cg_users,cg_edges))
cg_density = nx.bipartite.density(collab_g,pages)
print('{0:.2%} of the possible edges actually exist.'.format(cg_density))
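
In a bipartite network, edges can only run between the two node sets, so the number of possible edges is the number of users times the number of pages rather than n*(n-1). The sketch below shows that arithmetic with made-up counts; bipartite_density is a hypothetical helper, not the networkx function.

```python
def bipartite_density(num_users, num_pages, num_edges):
    """Share of the possible user-page pairs that are connected by an edge."""
    possible_edges = num_users * num_pages
    if possible_edges == 0:
        return 0.0
    return num_edges / float(possible_edges)

# 3 users and 2 pages allow 6 user-page pairs; 3 of them are connected
toy_bipartite_density = bipartite_density(num_users=3, num_pages=2, num_edges=3)
print('{0:.2%} of the possible edges actually exist.'.format(toy_bipartite_density))
```

As far as I know, the networkx version divides by twice the number of user-page pairs when the graph is directed, since an edge could run in either direction, so its value for a directed collaboration graph may be half of what this sketch reports.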

Identify the most well-connected nodes

cg_in_degree_d = {node:int(centrality*(len(collab_g) - 1)) for node,centrality in nx.in_degree_centrality(collab_g).items()}
cg_out_degree_d = {node:int(centrality*(len(collab_g) - 1)) for node,centrality in nx.out_degree_centrality(collab_g).items()}
cg_degree_df = pd.DataFrame({'In':cg_in_degree_d,'Out':cg_out_degree_d})
in_degree_dist_df = cg_degree_df['In'].value_counts().reset_index()
out_degree_dist_df = cg_degree_df['Out'].value_counts().reset_index()
revision_dist_df = hyperlink_g_edgelist_df['weight'].value_counts().reset_index()

f,ax = plt.subplots(1,1)