Lab 2 - Networks

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the second of five lab notebooks exploring introductory data extraction and analysis from Wikipedia. This lab extends the methods from the prior lab, which analyzed a single article's revision history, and uses network science methods to analyze the coauthorship and hyperlink networks around an article. You do not need to be fluent in network analysis or the MediaWiki API to complete the lab, but there are many options for extending the analyses we do here with more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4
a**b
81

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Package for requesting data via the web and parsing resulting JSON
import requests
import json
from bs4 import BeautifulSoup

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Package for analyzing complex networks
import networkx as nx                # https://networkx.github.io/

# Set up the environment: plots with a white grid background and DataFrames that show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Retrieve the content of the page via API

Write a function that takes an article title and returns the list of links in the body of the article. Note that we don't use the "pagelinks" table in MySQL or the "links" parameter in the API because these include links inside templates. Articles that share templates link to each other, forming over-dense clusters in the resulting networks, so we only want the links appearing in the body of the text.

We pass a request to the API, which returns a JSON-formatted string containing the HTML of the page. We use BeautifulSoup to parse the HTML tree, extract the non-template links, and return them as a list.
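If you want to see the shape of that JSON payload before parsing it, here is a minimal sketch; the example title and the default formatversion=1 response shape (HTML under ['parse']['text']['*']) are assumptions that match the function below.

# Inspect the raw API response for one page before handing it to BeautifulSoup
req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Freeskiing&redirects=1&prop=text&disableeditsection=1&disabletoc=1')
payload = json.loads(req.text)

print(payload.keys())                       # top level: 'parse'
print(payload['parse'].keys())              # includes 'title', 'pageid', 'text'
print(payload['parse']['text']['*'][:200])  # the first 200 characters of the article HTML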

def get_page_outlinks(page_title,redirects=1):
    # Replace spaces with underscores
    #page_title = page_title.replace(' ','_')
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    outlinks_list = [] 
    
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    outlinks_list.append(link['title'])

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links that aren't interesting
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list
get_page_outlinks('Freeskiing')
['Alpine skiing',
 'Terrain park',
 'Snowboarding',
 'Freeriding',
 'Big mountain skier',
 'Freestyle skiing',
 'Half-pipe skiing',
 'Slopestyle',
 'International Ski Federation',
 'Dynastar',
 'Terrain parks',
 'Ski jumping',
 'Quarterpipe',
 'Halfpipe',
 'Twin-tip ski',
 'Association of Freeskiing Professionals (page does not exist)',
 'Sarah Burke',
 'International Olympic Committee',
 'Sochi',
 'United States Ski and Snowboard Association',
 'Piste',
 'Avalanche',
 'Breckenridge Ski Resort',
 'Mammoth Mountain Ski Area',
 'Aspen/Snowmass',
 'Park City Mountain Resort',
 'Poley Mountain',
 'Whistler Blackcomb',
 'Mount Snow',
 'Line Skis',
 'Skis',
 'Ski boot',
 'Ski Bindings',
 'Jesper tjäder (page does not exist)',
 'Mike Douglas',
 'Mark Abma',
 'JP Auclair',
 'Ingrid Backstrom',
 'Noah Bowman',
 'Bill Briggs (skier)',
 'Bobby Brown (freestyle skier)',
 'Sarah Burke',
 'Sammy Carlson',
 'Guerlain Chicherit',
 'Doug Coombs',
 'Chris Davenport',
 'Justin Dorey',
 'Simon Dumont',
 'Nick Goepper',
 'Tanner Hall',
 'Janette Hargin',
 'Henrik Harlaut',
 'Russ Henshaw',
 'Eric Hjorleifson',
 'C. R. Johnson',
 'Kristi Leskinen',
 'Jossi Wells',
 'Pepe Gay (page does not exist)',
 'Mike Oilchange (page does not exist)',
 'Aidan Bharti (page does not exist)',
 'Shane McConkey',
 'Eric Miscimarra (page does not exist)',
 'Seth Morrison (skier)',
 'Jonny Moseley',
 'Jon Olsson',
 'Sean Pettit',
 'Glen Plake',
 'Eric Pollard',
 'Mike Riddle',
 'Kevin Rolland',
 'Sylvain Saudan',
 'TJ Schiller',
 'Scot Schmidt',
 'Candide Thovex',
 'Kaya Turski',
 'Tom Wallisch',
 'Torin Yater-Wallace',
 'Aerial skiing',
 'Alpine skiing',
 'Backcountry skiing',
 'Extreme skiing',
 'FIS Freestyle World Ski Championships',
 'FIS Freestyle Skiing World Cup',
 'Freeriding',
 'Freestyle skiing',
 'Freestyle skiing at the Winter Olympics',
 'Half-pipe skiing',
 'List of Olympic venues in freestyle skiing',
 'List of skiing topics',
 'Mogul skiing',
 'Ski ballet',
 'Ski cross',
 'Slopestyle',
 'X Games',
 'IF3 International Freeski Film Festival']

You could write a recursive function like recursively_get_hyperlink_network that would crawl the hyperlink network out to an arbitrary distance, but this becomes exorbitantly expensive at any depth greater than 1. Here's an example function, shown but not executed to prevent you from harming yourself. :)

def recursively_get_hyperlink_network(seed_page,depth):
    neighbors = {}
    if depth < 0:
        return neighbors
    neighbors[seed_page] = get_page_outlinks(seed_page)
    for neighbor in neighbors[seed_page]:
        neighbors[neighbor] = recursively_get_hyperlink_network(neighbor,depth-1)
    return neighbors
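To see why, here's a rough back-of-the-envelope sketch: the branching factor of 90 is an assumption based on the roughly 90 outlinks of the example article below, and the number of requests grows geometrically with depth.

# Rough cost estimate for a recursive crawl; 90 is a hypothetical branching factor
branching_factor = 90

for depth in range(4):
    # Number of pages requested by a crawl of this depth, ignoring duplicate pages
    requests_needed = sum(branching_factor**d for d in range(depth + 1))
    print('Depth {0}: roughly {1:,} API requests'.format(depth,requests_needed))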

Instead, define a simple function to get the 1.5-step ego hyperlink network. The "ego" is the seed page you start from, and the "alters" are the neighbors the ego links out to. We also get the alters of the alters (2nd-order alters), but only include these 2nd-order connections if they link back to the seed or to 1st-order alters. In other words, the 1.5-step ego hyperlink network consists of all the pages linked from the seed page plus the links among that set of articles.
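Before the full implementation below, here is a toy sketch of that filtering rule using hypothetical pages 'A' (the seed), 'B', 'C', and 'D': an edge is kept only if its target is the seed itself or one of the seed's alters.

# Toy adjacency dictionary: the seed 'A' links to 'B' and 'C'; 'D' is only a 2nd-order alter
toy_neighbors = {'A': ['B','C'],
                 'B': ['C','D'],
                 'C': ['A']}

toy_g = nx.DiGraph()
for article,neighbor_list in toy_neighbors.items():
    for neighbor in neighbor_list:
        # Keep the edge only if it points back into the seed's ego network
        if neighbor in toy_neighbors['A'] + ['A']:
            toy_g.add_edge(article,neighbor)

print(toy_g.edges())   # contains A->B, A->C, B->C, C->A; the B->D edge is dropped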

def get_hyperlink_alters(seed_page):
    # Initialize an empty dictionary to act as an adjacency "list"
    neighbors = {}
    
    # Get all the alters for the seed page and store them in the adjacency dictionary
    neighbors[seed_page] = get_page_outlinks(seed_page,1)
    
    # For each of the alters, get their alters and store in the adjacency dictionary
    for neighbor in list(set(neighbors[seed_page])): # Don't recrawl duplicates
        neighbors[neighbor] = get_page_outlinks(neighbor,0)
    
    # Initialize an empty graph that we will add nodes and edges into
    g = nx.DiGraph()
    
    # For each entry in the adjacency dictionary, check if the alter's alters are also the seed page's alters
    # If they are and the edge is already in the graph, increment the edge weight by one
    # If they are but the edge is not already in the graph, add the edge with a weight of one
    for article,neighbor_list in neighbors.items():
        for neighbor in neighbor_list:
            if neighbor in neighbors[seed_page] + [seed_page]:
                if g.has_edge(article,neighbor):
                    g[article][neighbor]['weight'] += 1
                else:
                    g.add_edge(article,neighbor,weight=1)
    
    # Return the weighted graph
    return g

Run this on an example article and save the resulting graph object to disk. This step could take more than a minute depending on the number of links and size of the neighboring pages.

page_title = 'Freeskiing'

# Create the hyperlink network
hyperlink_g = get_hyperlink_alters(page_title)

# Save the graph to disk to visualize in Gephi
nx.write_gexf(hyperlink_g,'hyperlink_{0}.gexf'.format(page_title.replace(' ','_')))
hg_nodes = hyperlink_g.number_of_nodes()
hg_edges = hyperlink_g.number_of_edges()

print("There are {0} nodes and {1} edges in the hyperlink network.".format(hg_nodes,hg_edges))
There are 90 nodes and 333 edges in the hyperlink network.
hg_density = nx.density(hyperlink_g)
print('{0:.2%} of the possible edges actually exist.'.format(hg_density))
4.16% of the possible edges actually exist.
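As a sanity check, the density of a directed graph is the number of observed edges divided by the n*(n-1) possible directed edges, which you can recompute by hand from the counts above.

# Manual density check for a directed graph: m / (n * (n - 1))
manual_density = hg_edges / (hg_nodes * (hg_nodes - 1))
print('{0:.2%} by hand, which should match nx.density above.'.format(manual_density))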
def reciprocity(g):
    # Collect the edges (i,j) whose reverse edge (j,i) also exists in the graph
    reciprocated_edges = []
    
    for (i,j) in g.edges():
        if g.has_edge(j,i):
            reciprocated_edges.append((i,j))
    
    return len(reciprocated_edges)/float(g.number_of_edges())

hg_reciprocity = reciprocity(hyperlink_g)

print('{0:.2%} of the edges in the hyperlink network are reciprocated.'.format(hg_reciprocity))
27.93% of the edges in the hyperlink network are reciprocated.
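If you're running a NetworkX release from the 2.x series (this lab was written against the 1.x API, so treat this as an assumption about your environment), the library's built-in overall_reciprocity should agree with the hand-rolled function above.

# Built-in cross-check, available in NetworkX 2.0 and later
print('{0:.2%} of the edges are reciprocated (NetworkX built-in).'.format(nx.overall_reciprocity(hyperlink_g)))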

Identify the most well-connected nodes

hg_in_degree_d = {node:int(centrality*(len(hyperlink_g) - 1)) for node,centrality in nx.in_degree_centrality(hyperlink_g).items()}
hg_out_degree_d = {node:int(centrality*(len(hyperlink_g) - 1)) for node,centrality in nx.out_degree_centrality(hyperlink_g).items()}
degree_df = pd.DataFrame({'In':hg_in_degree_d,'Out':hg_out_degree_d})
degree_df['In'].sort_values(ascending=False).head(10)
Freestyle skiing                         29
Freeskiing                               16
Alpine skiing                            15
International Ski Federation             12
Slopestyle                               12
X Games                                  11
Snowboarding                             11
Extreme skiing                           10
Mogul skiing                             10
FIS Freestyle World Ski Championships     9
Name: In, dtype: int64
degree_df['Out'].sort_values(ascending=False).head(10)
Freeskiing                                 89
List of skiing topics                      20
Half-pipe skiing                           14
Freestyle skiing at the Winter Olympics    10
Ski ballet                                  9
FIS Freestyle Skiing World Cup              9
FIS Freestyle World Ski Championships       9
Big mountain skier                          7
Extreme skiing                              7
Aerial skiing                               6
Name: Out, dtype: int64
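The conversion above works because in_degree_centrality and out_degree_centrality divide the raw degree by n - 1, so multiplying by n - 1 recovers it; a more direct sketch using the graph's own degree methods should give the same ranking.

# Raw in- and out-degrees straight from the graph, no centrality conversion needed
direct_degree_df = pd.DataFrame({'In':dict(hyperlink_g.in_degree()),
                                 'Out':dict(hyperlink_g.out_degree())})
direct_degree_df['In'].sort_values(ascending=False).head(10)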
in_degree_dist_df = degree_df['In'].value_counts().reset_index()
out_degree_dist_df = degree_df['Out'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
in_degree_dist_df.plot.scatter(x='index',y='In',ax=ax,c='blue',label='In')
out_degree_dist_df.plot.scatter(x='index',y='Out',ax=ax,c='red',label='Out')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlim((0,1e3))
ax.set_ylim((0,1e3));

Construct co-authorship network

def get_500_recent_revisions(page_title):
    # Request the 500 most recent revisions for the page, including revision IDs,
    # timestamps, editor names, and sizes
    req = requests.get('https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles={0}&rvprop=ids%7Ctimestamp%7Cuser%7Csize&rvlimit=500'.format(page_title))
    json_payload = json.loads(req.text)
    
    try:
        # The payload is keyed by an internal page ID, so grab the first (only) one
        pageid = list(json_payload['query']['pages'].keys())[0]
        revisions = json_payload['query']['pages'][pageid]['revisions']
        
        # Convert the list of revision dicts into a DataFrame and parse the timestamps
        df = pd.DataFrame(revisions)
        df['timestamp'] = df['timestamp'].apply(lambda x:pd.datetime.strptime(x,'%Y-%m-%dT%H:%M:%SZ'))
        df['title'] = json_payload['query']['pages'][pageid]['title']
        return df
        
    except KeyError:
        # Pages that don't exist (red links) have no 'revisions' key
        print('Error in {0}'.format(page_title))
        pass
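A quick single-page sanity check of this helper (the title is just an example); note that newer pandas releases removed pd.datetime, so pd.to_datetime(df['timestamp']) is the more durable way to parse the timestamps if you update the function.

# One page's revision history; expect one row per revision with columns such as
# revid, parentid, timestamp, size, user, and title
single_page_df = get_500_recent_revisions('Freeskiing')
print(single_page_df.shape)
single_page_df.head()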
def get_neighbors_500_revisions(page_title):
    """ Takes a page title and returns the 500 most-recent revisions for the page and its neighbors.
      page_title = a string for the page title to get its revisions
      
    Returns:
      A pandas DataFrame containing all the page revisions.
    """
    
    alters = get_page_outlinks(page_title) + [page_title]
    
    df_list = []
    
    for alter in alters:
        _df = get_500_recent_revisions(alter)
        df_list.append(_df)
        
    df = pd.concat(df_list)
    return df
 
hyperlink_g_rev_df = get_neighbors_500_revisions(page_title)

hyperlink_g_rev_df.head()
Error in Association of Freeskiing Professionals (page does not exist)
Error in Jesper tjäder (page does not exist)
Error in Pepe Gay (page does not exist)
Error in Mike Oilchange (page does not exist)
Error in Aidan Bharti (page does not exist)
Error in Eric Miscimarra (page does not exist)
anon parentid revid size timestamp title user
0 NaN 741680108 741680231 9303 2016-09-29 01:17:03 Alpine skiing Stephanieking
1 NaN 739775429 741680108 9296 2016-09-29 01:16:11 Alpine skiing Stephanieking
2 NaN 739775392 739775429 8207 2016-09-16 22:33:51 Alpine skiing Bahooka
3 NaN 737189099 739775392 8278 2016-09-16 22:33:26 Alpine skiing Clcaskey
4 NaN 737008021 737189099 8207 2016-09-01 07:16:45 Alpine skiing Bgwhite
# Count the number of unique revisions each user made to each page
hyperlink_g_gb_user_title = hyperlink_g_rev_df.groupby(['user','title'])
hyperlink_g_agg = hyperlink_g_gb_user_title.agg({'revid':pd.Series.nunique})

# Flatten the grouped counts into an edge list with a 'weight' column
hyperlink_g_edgelist_df = hyperlink_g_agg.reset_index()
hyperlink_g_edgelist_df = hyperlink_g_edgelist_df.rename(columns={'revid':'weight'})

# Build a directed, weighted bipartite graph with users on one side and pages on the other
users = hyperlink_g_edgelist_df['user'].unique()
pages = hyperlink_g_edgelist_df['title'].unique()
collab_g = nx.from_pandas_dataframe(hyperlink_g_edgelist_df,source='user',target='title',
                                    edge_attr=['weight'],create_using=nx.DiGraph())

# Label the node types so users and pages can be distinguished in Gephi
collab_g.add_nodes_from(users,nodetype='user')
collab_g.add_nodes_from(pages,nodetype='page')

# Save the collaboration graph to disk to visualize in Gephi
nx.write_gexf(collab_g,'collaboration_{0}.gexf'.format(page_title.replace(' ','_')))
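Note that nx.from_pandas_dataframe is the NetworkX 1.x name; if you're on NetworkX 2.0 or later (an assumption about your environment), the equivalent call is from_pandas_edgelist.

# Equivalent construction on NetworkX >= 2.0, where the function was renamed
collab_g = nx.from_pandas_edgelist(hyperlink_g_edgelist_df,source='user',target='title',
                                   edge_attr=['weight'],create_using=nx.DiGraph())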
 

Compute descriptive statistics for the collaboration network

cg_users = len(users)
cg_pages = len(pages)
cg_edges = collab_g.number_of_edges()

print("There are {0} pages, {1} users, and {2} edges in the collaboration network.".format(cg_pages,cg_users,cg_edges))
cg_density = nx.bipartite.density(collab_g,pages)
print('{0:.2%} of the possible edges actually exist.'.format(cg_density))

Identify the most well-connected nodes

cg_in_degree_d = {node:int(centrality*(len(collab_g) - 1)) for node,centrality in nx.in_degree_centrality(collab_g).items()}
cg_out_degree_d = {node:int(centrality*(len(collab_g) - 1)) for node,centrality in nx.out_degree_centrality(collab_g).items()}
cg_degree_df = pd.DataFrame({'In':cg_in_degree_d,'Out':cg_out_degree_d})
cg_degree_df['In'].sort_values(ascending=False).head(10)
cg_degree_df['Out'].sort_values(ascending=False).head(10)
in_degree_dist_df = cg_degree_df['In'].value_counts().reset_index()
out_degree_dist_df = cg_degree_df['Out'].value_counts().reset_index()
revision_dist_df = hyperlink_g_edgelist_df['weight'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
in_degree_dist_df.plot.scatter(x='index',y='In',ax=ax,c='blue',label='Page')
out_degree_dist_df.plot.scatter(x='index',y='Out',ax=ax,c='red',label='Editor')
revision_dist_df.plot.scatter(x='index',y='weight',ax=ax,c='green',label='Weight')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlim((0,1e3))
ax.set_ylim((0,1e5));