Lab 2 - Networks

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the second of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will extend the methods from the prior lab on analyzing a single article's revision history and use network science methods to analyze networks of coauthorship and hyperlinks. You do not need to be fluent in network science to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # Arrays and numerical functions
import pandas as pd                  # Labeled tables (DataFrames) of data

# Two related packages for plotting data
import matplotlib.pyplot as plt      # Low-level plotting
import seaborn as sb                 # Statistical plotting built on matplotlib

# Package for requesting data via the web and parsing resulting JSON
import requests
import json
from bs4 import BeautifulSoup

# Two packages for accessing the MySQL server
import pymysql                       # Python interface to MySQL
import os                            # Environment variables for credentials

# Package for analyzing complex networks
import networkx as nx                # Graph creation and analysis

# Setup the code environment to use plots with a white background and DataFrames show more columns and rows
sb.set_style('white')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Retrieve the content of the page via API

Write a function that takes an article title and returns the list of links in the body of the article. Note that the reason we don't use the "pagelinks" table in MySQL or the "links" parameter in the API is that these include links within templates. Articles sharing templates link to each other, forming over-dense clusters in the resulting networks. We only want the links appearing in the body of the text.

We pass a request to the API, which returns a JSON-formatted string containing the HTML of the page. We use BeautifulSoup to parse through the HTML tree and extract the non-template links and return them as a list.

def get_page_outlinks(page_title,redirects=1):
    # Replace spaces with underscores
    page_title = page_title.replace(' ','_')
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard']
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    # Initialize an empty list to store the links
    outlinks_list = []
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete table-row tags, which are typically associated with templates and infoboxes
        for tag in soup.find_all('tr'):
            tag.extract()

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    title = link['title']
                    # Ignore links that aren't interesting
                    if all(bad not in title for bad in bad_titles):
                        outlinks_list.append(title)

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links that aren't interesting
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list
get_page_outlinks('Cyclone Pam')
['Tropical cyclone',
 'Natural disasters',
 'Solomon Islands',
 'New Zealand',
 'Barometric pressure',
 'Cyclone Zoe',
 '2002–03 South Pacific cyclone season',
 'Cyclone Gafilo',
 '2003–04 South-West Indian Ocean cyclone season',
 'Maximum sustained wind',
 'South Pacific tropical cyclone',
 'Cyclone Orson',
 'Cyclone Monica',
 'Cyclone Fantala',
 'Solomon Islands',
 'Tropical cyclone scale',
 'Tropical cyclone scales',
 'Saffir–Simpson hurricane wind scale',
 'Bar (unit)',
 'Pascal (unit)',
 'Inches of mercury',
 'New Zealand',
 'Extratropical transition',
 'Extratropical cyclone',
 'Storm surge',
 'State of emergency',
 'Solomon Islands',
 'Santa Cruz Islands',
 'Port Vila',
 'Tanna (island)',
 'Water shortage',
 'North Island',
 'Fiji Meteorological Service',
 'Nadi, Fiji',
 'Numerical weather prediction',
 'Solomon Islands',
 'Tropical depression',
 'Australian tropical cyclone scale',
 'Ridge (meteorology)',
 'Rapid deepening',
 'Atmospheric circulation',
 'Central dense overcast',
 'Eye (cyclone)',
 'Visible light',
 'Santa Cruz Islands',
 'Saffir–Simpson hurricane wind scale',
 'Bar (unit)',
 'Pascal (unit)',
 'Inches of mercury',
 'Cyclone Zoe',
 '2002–03 South Pacific cyclone season',
 'Meteorological Service of New Zealand Limited',
 'Extratropical transition',
 'New Zealand',
 'International Federation of Red Cross and Red Crescent Societies',
 'Port Vila',
 'Cyclone Uma',
 'United Nations Office for the Coordination of Humanitarian Affairs',
 'Tanna (island)',
 'Tongoa (page does not exist)',
 'New Caledonia',
 'Solomon Islands',
 'King tide',
 'Nui (atoll)',
 'Prime Minister of Tuvalu',
 'Enele Sopoaga',
 'Solomon Islands',
 'Makira-Ulawa Province',
 'Temotu Province',
 'Santa Cruz Islands',
 'Northern Division, Fiji',
 'Yasawa Islands',
 'Volvo Ocean Race',
 'New Caledonia',
 'Loyalty Islands',
 'Isle of Pines, New Caledonia',
 'Maré Island',
 'Maré Island',
 'Loyalty Islands',
 'Maré Island',
 'New Zealand',
 'North Island',
 'Cyclone Bola',
 'Hicks Bay',
 'Gisborne, New Zealand',
 'Whangarei District',
 'Tutukaka (page does not exist)',
 'Tolaga Bay',
 'Chatham Islands',
 'Lockheed P-3 Orion',
 'Jim Yong Kim',
 'World Bank',
 'United Nations',
 'Secretary-General of the United Nations',
 'Ban Ki-moon',
 'Climate change',
 'World Conference on Disaster Risk Reduction',
 'President of Vanuatu',
 'Baldwin Lonsdale',
 'Bauerfield International Airport',
 'United Kingdom',
 'European Union',
 'France–New Zealand relations',
 'French frigate Vendémiaire (F734)',
 'Royal Australian Air Force',
 'Boeing C-17 Globemaster III in Australian service',
 'Lockheed C-130 Hercules in Australian service',
 'Airtech CN-235',
 'New Caledonian Armed Forces',
 'Save the Children',
 'Typhoon Haiyan',
 'Moso (island)',
 'Shepherds Islands',
 'Adventist Development and Relief Agency',
 'Swiss franc',
 'Nui (atoll)',
 'Portal:Tropical cyclones',
 'List of the most intense tropical cyclones',
 'Cyclone Zoe',
 '2002–03 South Pacific cyclone season',
 'Cyclone Percy',
 '2004–05 South Pacific cyclone season',
 'Cyclone Ron',
 'Cyclone Susan',
 '1997–98 South Pacific cyclone season',
 'Cyclone Zoe',
 '2002–03 South Pacific cyclone season',
 'Cyclone Percy',
 '2004–05 South Pacific cyclone season',
 'Cyclone Ron',
 'Cyclone Susan',
 '1997–98 South Pacific cyclone season',
 'Cyclone Atu',
 '2010–11 South Pacific cyclone season',
 'Cyclone Fran',
 '1991–92 South Pacific cyclone season',
 'Cyclone Winston',
 '2015–16 South Pacific cyclone season',
 'Cyclone Zoe',
 '2002–03 South Pacific cyclone season',
 'Cyclone Percy',
 '2004–05 South Pacific cyclone season',
 'Cyclone Ron',
 'Cyclone Susan',
 '1997–98 South Pacific cyclone season']

You could write a recursive function like recursively_get_hyperlink_network that would crawl the hyperlink network out to an arbitrary distance, but this becomes exorbitantly expensive at any depth greater than 1. Here's an example function, but it is not executable, to prevent you from harming yourself. :)

def recursively_get_hyperlink_network(seed_page,depth):
    neighbors = {}
    if depth < 0:
        return neighbors
    neighbors[seed_page] = get_page_outlinks(seed_page)
    for neighbor in neighbors[seed_page]:
        neighbors[neighbor] = recursively_get_hyperlink_network(neighbor,depth-1)
    return neighbors

Instead, define a simple function to get the 1.5-step ego hyperlink network. The "ego" is the seed page you start from; the "alters" are the neighbors that the ego links out to. We also get the alters of the alters (2nd order alters), but only include these 2nd order connections if they link to 1st order alters. In other words, the 1.5-step ego hyperlink network consists of all the pages linked from the seed page plus the links among this set of articles.

def get_hyperlink_alters(seed_page):
    # Initialize an empty dictionary to act as an adjacency "list"
    neighbors = {}
    # Get all the alters for the seed page and store them in the adjacency dictionary
    neighbors[seed_page] = get_page_outlinks(seed_page,1)
    # For each of the alters, get their alters and store in the adjacency dictionary
    for neighbor in list(set(neighbors[seed_page])): # Don't recrawl duplicates
        neighbors[neighbor] = get_page_outlinks(neighbor,0)
    # Initialize an empty graph that we will add nodes and edges into
    g = nx.DiGraph()
    # For each entry in the adjacency dictionary, check if the alter's alters are also the seed page's alters
    # If they are and the edge is already in the graph, increment the edge weight by one
    # If they are but the edge is not already in the graph, add the edge with a weight of one
    for article,neighbor_list in neighbors.items():
        for neighbor in neighbor_list:
            if neighbor in neighbors[seed_page] + [seed_page]:
                if g.has_edge(article,neighbor):
                    g[article][neighbor]['weight'] += 1
                else:
                    g.add_edge(article,neighbor,weight=1)
    # Return the weighted graph
    return g

Run this on an example article and save the resulting graph object to disk. This step could take more than a minute depending on the number of links and size of the neighboring pages.

page_title = 'Cyclone Pam'

# Create the hyperlink network
hyperlink_g = get_hyperlink_alters(page_title)

# Save the graph to disk to visualize in Gephi
nx.write_gexf(hyperlink_g,'hyperlink_{0}.gexf'.format(page_title.replace(' ','_')))
hg_nodes = hyperlink_g.number_of_nodes()
hg_edges = hyperlink_g.number_of_edges()

print("There are {0} nodes and {1} edges in the hyperlink network.".format(hg_nodes,hg_edges))
There are 137 nodes and 888 edges in the hyperlink network.
hg_density = nx.density(hyperlink_g)
print('{0:.2%} of the possible edges actually exist.'.format(hg_density))
4.77% of the possible edges actually exist.
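The density figure can be checked by hand: for a directed graph with n nodes and m edges, density is m / (n·(n−1)). A minimal sketch using the node and edge counts reported above:

```python
def directed_density(n_nodes, n_edges):
    """Fraction of possible directed edges that exist: m / (n * (n - 1))."""
    return n_edges / (n_nodes * (n_nodes - 1))

# Using the counts reported above for the hyperlink network
print('{0:.2%}'.format(directed_density(137, 888)))  # → 4.77%
```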
def reciprocity(g):
    reciprocated_edges = []
    for (i,j) in g.edges():
        if g.has_edge(j,i):
            reciprocated_edges.append((i,j))
    return len(reciprocated_edges)/float(g.number_of_edges())

hg_reciprocity = reciprocity(hyperlink_g)

print('{0:.2%} of the edges in the hyperlink network are reciprocated.'.format(hg_reciprocity))
27.03% of the edges in the hyperlink network are reciprocated.
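To see what the reciprocity function is counting, here is a minimal pure-Python sketch on a toy edge list (no networkx), where 'A'↔'B' is the only reciprocated pair; note that both directions of a reciprocated pair count toward the numerator:

```python
def edge_reciprocity(edges):
    """Fraction of directed edges whose reverse edge also exists."""
    edge_set = set(edges)
    reciprocated = [(i, j) for (i, j) in edge_set if (j, i) in edge_set]
    return len(reciprocated) / float(len(edge_set))

# ('A','B') and ('B','A') are both reciprocated; ('A','C') is not
toy_edges = [('A', 'B'), ('B', 'A'), ('A', 'C')]
print(edge_reciprocity(toy_edges))  # → 2/3
```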

Identify the most well-connected nodes

hg_in_degree_d = {node:int(centrality*(len(hyperlink_g) - 1)) for node,centrality in nx.in_degree_centrality(hyperlink_g).items()}
hg_out_degree_d = {node:int(centrality*(len(hyperlink_g) - 1)) for node,centrality in nx.out_degree_centrality(hyperlink_g).items()}
degree_df = pd.DataFrame({'In':hg_in_degree_d,'Out':hg_out_degree_d})
degree_df['In'].sort_values(ascending=False).head(10)
Tropical cyclone            34
Vanuatu                     32
New Zealand                 31
Fiji                        29
Tuvalu                      26
Portal:Tropical cyclones    24
New Caledonia               23
United Nations              22
Solomon Islands             22
Cyclone Pam                 21
Name: In, dtype: int64
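The dictionary comprehensions above undo networkx's normalization: in_degree_centrality divides the raw in-degree by n−1, so multiplying by n−1 recovers the integer degree. A sketch with the numbers from the table above (Tropical cyclone, in-degree 34, in a 137-node graph):

```python
n = 137                                 # nodes in the hyperlink network
raw_in_degree = 34                      # e.g. 'Tropical cyclone' in the table above
centrality = raw_in_degree / (n - 1)    # what nx.in_degree_centrality reports
recovered = int(centrality * (n - 1))   # the conversion used above
print(centrality, recovered)            # → 0.25 34
```

One caution: int() truncates toward zero, so if floating-point error makes the product land slightly below the true degree, round() is the safer conversion.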
in_degree_dist_df = degree_df['In'].value_counts().reset_index()
out_degree_dist_df = degree_df['Out'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
in_degree_dist_df.plot.scatter(x='index',y='In',ax=ax,c='blue',label='In')
out_degree_dist_df.plot.scatter(x='index',y='Out',ax=ax,c='red',label='Out')
ax.set_xlabel('Degree')
ax.set_ylabel('Number of nodes')

Construct co-authorship network

def get_500_recent_revisions(page_title):
    # Replace spaces with underscores
    page_title = page_title.replace(' ','_')
    req = requests.get('https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles={0}&rvprop=ids%7Ctimestamp%7Cuser%7Csize&rvlimit=500'.format(page_title))
    json_payload = json.loads(req.text)
    try:
        pageid = list(json_payload['query']['pages'].keys())[0]
        revisions = json_payload['query']['pages'][pageid]['revisions']
        df = pd.DataFrame(revisions)
        df['timestamp'] = df['timestamp'].apply(lambda x:pd.datetime.strptime(x,'%Y-%m-%dT%H:%M:%SZ'))
        df['title'] = json_payload['query']['pages'][pageid]['title']
        return df
    except KeyError:
        print('Error in {0}'.format(page_title))
def get_neighbors_500_revisions(page_title):
    """ Takes a page title and returns the 500 most-recent revisions for the page and its neighbors.
      page_title = a string for the page title to get its revisions
      Returns a pandas DataFrame containing all the page revisions.
    """
    alters = get_page_outlinks(page_title) + [page_title]
    df_list = []
    for alter in alters:
        _df = get_500_recent_revisions(alter)
        if _df is not None:
            df_list.append(_df)
    df = pd.concat(df_list)
    return df
hyperlink_g_rev_df = get_neighbors_500_revisions(page_title)

hyperlink_g_gb_user_title = hyperlink_g_rev_df.groupby(['user','title'])
hyperlink_g_agg = hyperlink_g_gb_user_title.agg({'revid':pd.Series.nunique})
hyperlink_g_edgelist_df = hyperlink_g_agg.reset_index()
hyperlink_g_edgelist_df = hyperlink_g_edgelist_df.rename(columns={'revid':'weight'})

users = hyperlink_g_edgelist_df['user'].unique()
pages = hyperlink_g_edgelist_df['title'].unique()
collab_g = nx.from_pandas_dataframe(hyperlink_g_edgelist_df,source='user',target='title',
                                    edge_attr='weight',create_using=nx.DiGraph())

nx.write_gexf(collab_g,'collaboration_{0}.gexf'.format(page_title.replace(' ','_')))
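The graph saved above is bipartite: users on one side, pages on the other. A common next step is projecting it onto a user–user coauthorship network, linking two editors when they revised the same page, with the edge weight counting shared pages. A minimal pure-Python sketch using hypothetical toy data (not the actual Cyclone Pam edit history):

```python
from itertools import combinations
from collections import Counter

# Hypothetical (user, page) edit pairs for illustration
edits = [('Alice', 'Cyclone Pam'), ('Bob', 'Cyclone Pam'),
         ('Alice', 'Vanuatu'), ('Bob', 'Vanuatu'), ('Carol', 'Vanuatu')]

# Group the set of editors by the page they edited
page_editors = {}
for user, page in edits:
    page_editors.setdefault(page, set()).add(user)

# Weight each pair of users by the number of pages they both edited
coauthorship = Counter()
for editors in page_editors.values():
    for u, v in combinations(sorted(editors), 2):
        coauthorship[(u, v)] += 1

print(coauthorship[('Alice', 'Bob')])  # → 2 (two pages in common)
```

networkx offers the same idea via its bipartite projection functions, but the counting logic is worth seeing spelled out.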

Compute descriptive statistics for the collaboration network

cg_users = len(users)
cg_pages = len(pages)
cg_edges = collab_g.number_of_edges()

print("There are {0} users, {1} pages, and {2} edges in the collaboration network.".format(cg_users,cg_pages,cg_edges))
cg_density = nx.bipartite.density(collab_g,pages)
print('{0:.2%} of the possible edges actually exist.'.format(cg_density))
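Bipartite density is measured against possible user–page pairs rather than all node pairs: with U users and P pages, an undirected bipartite graph can have at most U·P edges (twice that if directed). A sketch of the undirected case, using hypothetical counts rather than this network's actual values:

```python
def bipartite_density(n_users, n_pages, n_edges):
    """Fraction of possible user-page edges in an undirected bipartite graph."""
    return n_edges / (n_users * n_pages)

# Hypothetical counts: 500 users, 120 pages, 1,500 editing edges
print('{0:.2%}'.format(bipartite_density(500, 120, 1500)))  # → 2.50%
```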

Identify the most well-connected nodes

cg_in_degree_d = {node:int(centrality*(len(collab_g) - 1)) for node,centrality in nx.in_degree_centrality(collab_g).items()}
cg_out_degree_d = {node:int(centrality*(len(collab_g) - 1)) for node,centrality in nx.out_degree_centrality(collab_g).items()}
cg_degree_df = pd.DataFrame({'In':cg_in_degree_d,'Out':cg_out_degree_d})
in_degree_dist_df = cg_degree_df['In'].value_counts().reset_index()
out_degree_dist_df = cg_degree_df['Out'].value_counts().reset_index()
revision_dist_df = hyperlink_g_edgelist_df['weight'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
in_degree_dist_df.plot.scatter(x='index',y='In',ax=ax,c='blue',label='In')
out_degree_dist_df.plot.scatter(x='index',y='Out',ax=ax,c='red',label='Out')
ax.set_xlabel('Degree')
ax.set_ylabel('Number of nodes')