Lab 2 - Networks

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the second of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will extend the methods in the prior lab about analyzing a single article's revision histories and use network science methods to analyze the networks of coauthorship and hyperlinks. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4
a**b
81

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Package for requesting data via the web and parsing resulting JSON
import requests
import json
from bs4 import BeautifulSoup

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Package for analyzing complex networks
import networkx as nx                # https://networkx.github.io/

# Set up plots to use a white grid background and let DataFrames display more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Retrieve the content of the page via the API

Write a function that takes an article title and returns the list of links in the body of the article. Note that the reason we don't use the "pagelinks" table in MySQL or the "links" parameter in the API is that these include links within templates. Articles that share templates link to each other, forming over-dense clusters in the resulting networks. We only want the links that appear in the body of the text.

We pass a request to the API, which returns a JSON payload containing the HTML of the current version of the page. We then use BeautifulSoup to parse the HTML tree, extract the non-template links, and return them as a list.
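Before wrapping this in a function, it can help to peek at the shape of that JSON payload. The sketch below requests a single hard-coded page (Pit_bull) with the same action=parse endpoint used in the function that follows; the exact keys may vary slightly across MediaWiki versions:

# Minimal sketch: inspect the JSON payload returned by the parse API
_req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Pit_bull&prop=text&disableeditsection=1&disabletoc=1')
_payload = json.loads(_req.text)

print(_payload.keys())                        # should contain 'parse'
print(_payload['parse'].keys())               # typically includes 'title', 'pageid', and 'text'
print(_payload['parse']['text']['*'][:200])   # first 200 characters of the article HTML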

def get_page_outlinks(page_title,redirects=1):
    # Replacing spaces with underscores is unnecessary here; the API accepts titles with spaces
    #page_title = page_title.replace(' ','_')
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    outlinks_list = [] 
    
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    outlinks_list.append(link['title'])

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore administrative and other uninteresting links
                        # (note: this filter is applied only here, not in the paragraph loop above)
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list
get_page_outlinks('Pit bull')
['Dog type',
 'American Pit Bull Terrier',
 'American Staffordshire Terrier',
 'American Bully',
 'Staffordshire Bull Terrier',
 'American Bulldog',
 'List of dog fighting breeds',
 'Terrier',
 'Blood sport',
 'Catch dog',
 'Dog fighting',
 'Bulldog',
 'Terrier',
 'United Kingdom',
 'Bull-baiting',
 'Bear-baiting',
 'Cock fighting',
 'Catch dog',
 'Companion dog',
 'Police dog',
 'Therapy dogs',
 'Dog fighting',
 'Attack dog',
 'San Francisco SPCA',
 'Center for Animal Care and Control',
 'Center for Disease Control and Prevention',
 'American Veterinary Medical Association',
 'Ammonia',
 'Ampule',
 'Breed-specific legislation',
 'Insurance premium',
 'Liability insurance',
 'Ontario',
 'Miami',
 'Denver',
 'Singapore',
 'Franklin County, Ohio',
 'American Bulldog',
 'Breed-specific legislation',
 'Breed-specific legislation',
 'U.S. Army',
 'United States Marine Corps',
 'Legal presumption',
 'Prima facie',
 'Municipal government',
 'Caselaw',
 'Neutering',
 'Microchip implant (animal)',
 'Liability insurance',
 'Felony',
 'ASPCA',
 'Dangerous Dogs Act 1991',
 'Legal liability',
 'Strict liability',
 'Insurance',
 'American Kennel Club',
 'Canine Good Citizen',
 'Rottweiler',
 'German Shepherd Dog',
 'Doberman Pinscher',
 'Akita Inu',
 'American Akita',
 'Chow-Chow',
 'Farmers Insurance',
 'Embargo',
 'Cephalic index',
 'United Airlines',
 'Wikipedia:Citation needed',
 'Soldiers',
 'Police dogs',
 'Search and rescue dogs',
 'Actors',
 'Television personalities',
 'Seeing eye dog',
 'Celebrity',
 'Bull Terrier',
 'Nipper',
 'American Staffordshire Terrier',
 'Pete the Pup',
 'Little Rascals',
 'Billie Holiday',
 'Helen Keller',
 'Buster Brown',
 'Horatio Nelson Jackson',
 'Theodore Roosevelt',
 'Sergeant Stubby',
 '26th Infantry Division (United States)',
 'Aneurysm',
 'Daddy (dog)',
 'Portal:Dogs',
 'Digital object identifier']
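Notice that some titles appear more than once in this list (e.g., 'Terrier', 'Catch dog', 'Dog fighting') because the article links to those pages several times. These duplicates are what become edge weights greater than one when we build the network below. A quick way to see them, using the standard library's Counter:

from collections import Counter

# Count repeated outlink titles; duplicates become edge weights > 1 later on
Counter(get_page_outlinks('Pit bull')).most_common(5)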

You could write a recursive function like recursively_get_hyperlink_network that would crawl the hyperlink network out to an arbitrary distance, but this becomes exorbitantly expensive at any depth greater than 1. The example function below is intentionally not run, to prevent you from harming yourself. :)

def recursively_get_hyperlink_network(seed_page,depth):
    neighbors = {}
    if depth < 0:
        return neighbors
    neighbors[seed_page] = get_page_outlinks(seed_page)
    for neighbor in neighbors[seed_page]:
        neighbors[neighbor] = recursively_get_hyperlink_network(neighbor,depth-1)
    return neighbors
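To see why depth greater than 1 is impractical, a rough back-of-the-envelope estimate helps: each additional level of depth multiplies the number of pages to request by the average number of outlinks per article. This sketch assumes roughly 100 outlinks per article, which is purely illustrative:

# Rough cost estimate for a recursive crawl (illustrative numbers only)
avg_outlinks = 100   # assumed average outlinks per article

for depth in range(1,4):
    approx_requests = sum(avg_outlinks**d for d in range(depth + 1))
    print('Depth {0}: ~{1:,} API requests'.format(depth,approx_requests))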

Instead, define a simpler function to get the 1.5-step ego hyperlink network. The "ego" is the seed page you start from and the "alters" are the neighbors the ego links out to. We also get the alters of the alters (2nd-order alters), but only keep these 2nd-order connections if they point back to the ego or to 1st-order alters. In other words, the 1.5-step ego hyperlink network consists of all the pages linked from the seed page plus the links among that set of articles.
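To make the 1.5-step idea concrete before hitting the API, here is a toy sketch on a hand-made adjacency dictionary (the page names A through D are hypothetical): the ego A links to B and C, we also crawl B and C, but we only keep the links among {A, B, C}; B's link to the outside page D is dropped.

# Toy illustration of a 1.5-step ego network with hypothetical pages A-D
toy_links = {'A': ['B','C'],        # the ego's outlinks
             'B': ['C','A','D'],    # alter B links to C, A, and an outside page D
             'C': ['A']}            # alter C links back to the ego

ego = 'A'
alters = toy_links[ego]

toy_g = nx.DiGraph()
for page,links in toy_links.items():
    for link in links:
        if link in alters + [ego]:  # keep only edges that stay inside the ego network
            toy_g.add_edge(page,link)

sorted(toy_g.edges())               # the edge to 'D' never appears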

def get_hyperlink_alters(seed_page):
    # Initialize an empty dictionary to act as an adjacency "list"
    neighbors = {}
    
    # Get all the alters for the seed page and store them in the adjacency dictionary
    neighbors[seed_page] = get_page_outlinks(seed_page,1)
    
    # For each of the alters, get their alters and store in the adjacency dictionary
    for neighbor in list(set(neighbors[seed_page])): # Don't recrawl duplicates
        neighbors[neighbor] = get_page_outlinks(neighbor,0)
    
    # Initialize an empty graph that we will add nodes and edges into
    g = nx.DiGraph()
    
    # For each entry in the adjacency dictionary, check if the alter's alters are also the seed page's alters
    # If they are and the edge is already in the graph, increment the edge weight by one
    # If they are but the edge is not already in the graph, add the edge with a weight of one
    for article,neighbor_list in neighbors.items():
        for neighbor in neighbor_list:
            if neighbor in neighbors[seed_page] + [seed_page]:
                if g.has_edge(article,neighbor):
                    g[article][neighbor]['weight'] += 1
                else:
                    g.add_edge(article,neighbor,weight=1)
    
    # Return the weighted graph
    return g

Run this on an example article and save the resulting graph object to disk. This step could take more than a minute depending on the number of links and size of the neighboring pages.

page_title = 'Pit bull'

# Create the hyperlink network
hyperlink_g = get_hyperlink_alters(page_title)

# Save the graph to disk to visualize in Gephi
nx.write_gexf(hyperlink_g,'hyperlink_{0}.gexf'.format(page_title.replace(' ','_')))
hg_nodes = hyperlink_g.number_of_nodes()
hg_edges = hyperlink_g.number_of_edges()

print("There are {0} nodes and {1} edges in the hyperlink network.".format(hg_nodes,hg_edges))
There are 84 nodes and 357 edges in the hyperlink network.
hg_density = nx.density(hyperlink_g)
print('{0:.2%} of the possible edges actually exist.'.format(hg_density))
5.12% of the possible edges actually exist.
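As a quick sanity check, networkx defines the density of a directed graph as the number of edges divided by the n*(n-1) possible directed edges, so the reported value can be reproduced by hand:

# Density of a directed graph = edges / (nodes * (nodes - 1))
manual_density = hg_edges/(hg_nodes*(hg_nodes - 1))
print('{0:.2%} computed by hand vs. {1:.2%} from networkx.'.format(manual_density,hg_density))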
def reciprocity(g):
    reciprocated_edges = []
    
    for (i,j) in g.edges():
        if g.has_edge(j,i):
            reciprocated_edges.append((i,j))
    
    return len(reciprocated_edges)/float(g.number_of_edges())

hg_reciprocity = reciprocity(hyperlink_g)

print('{0:.2%} of the edges in the hyperlink network are reciprocated.'.format(hg_reciprocity))
23.81% of the edges in the hyperlink network are reciprocated.
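More recent releases of networkx (2.0 and later) include a built-in nx.reciprocity that computes the same ratio of reciprocated edges to total edges, which makes a handy cross-check if your installation has it:

# Cross-check against the built-in reciprocity (networkx >= 2.0 only)
try:
    print('{0:.2%}'.format(nx.reciprocity(hyperlink_g)))
except AttributeError:
    print('nx.reciprocity is not available in this version of networkx.')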

Identify the most well-connected nodes

hg_in_degree_d = {node:int(centrality*(len(hyperlink_g) - 1)) for node,centrality in nx.in_degree_centrality(hyperlink_g).items()}
hg_out_degree_d = {node:int(centrality*(len(hyperlink_g) - 1)) for node,centrality in nx.out_degree_centrality(hyperlink_g).items()}
degree_df = pd.DataFrame({'In':hg_in_degree_d,'Out':hg_out_degree_d})
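The centrality-times-(n - 1) conversion above recovers the raw in- and out-degree counts (up to floating-point rounding) because networkx defines degree centrality as degree divided by n - 1. If you prefer, the same DataFrame can be built directly from the degree methods; a minimal equivalent sketch:

# Equivalent construction straight from the degree methods
# (wrapping in dict() works for both older and newer networkx releases)
direct_degree_df = pd.DataFrame({'In':dict(hyperlink_g.in_degree()),
                                 'Out':dict(hyperlink_g.out_degree())})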
degree_df['In'].sort_values(ascending=False).head(10)
Wikipedia:Citation needed     47
American Kennel Club          15
American Pit Bull Terrier     14
United Kingdom                12
Bulldog                       11
Staffordshire Bull Terrier    11
Rottweiler                    11
Companion dog                  9
Dog fighting                   9
Bull Terrier                   9
Name: In, dtype: int64
degree_df['Out'].sort_values(ascending=False).head(10)
Pit bull                          83
American Pit Bull Terrier         21
Breed-specific legislation        18
Staffordshire Bull Terrier        11
Dog fighting                      10
American Akita                     9
Akita Inu                          9
American Staffordshire Terrier     7
Doberman Pinscher                  7
List of dog fighting breeds        7
Name: Out, dtype: int64
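Unsurprisingly, the seed article tops the out-degree ranking: by construction, every other node in the 1.5-step network is one of its alters, so its out-degree equals the number of remaining nodes. A quick check:

# The ego links to every other node in its 1.5-step network
print(hyperlink_g.out_degree(page_title), hyperlink_g.number_of_nodes() - 1)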
in_degree_dist_df = degree_df['In'].value_counts().reset_index()
out_degree_dist_df = degree_df['Out'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
in_degree_dist_df.plot.scatter(x='index',y='In',ax=ax,c='blue',label='In')
out_degree_dist_df.plot.scatter(x='index',y='Out',ax=ax,c='red',label='Out')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlim((0,1e3))
ax.set_ylim((0,1e3));

Construct co-authorship network

def get_500_recent_revisions(page_title):
    req = requests.get('https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles={0}&rvprop=ids%7Ctimestamp%7Cuser%7Csize&rvlimit=500'.format(page_title))
    json_payload = json.loads(req.text)
    
    try:
        pageid = list(json_payload['query']['pages'].keys())[0]
        revisions = json_payload['query']['pages'][pageid]['revisions']
        
        df = pd.DataFrame(revisions)
        df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%dT%H:%M:%SZ')
        df['title'] = json_payload['query']['pages'][pageid]['title']
        return df
        
    except KeyError:
        print('Error in {0}'.format(page_title))
        pass
def get_neighbors_500_revisions(page_title):
    """ Takes a page title and returns the 500 most-recent revisions for the page and its neighbors.
      page_title = a string for the page title to get its revisions
      
    Returns:
      A pandas DataFrame containing all the page revisions.
    """
    
    alters = get_page_outlinks(page_title) + [page_title]
    
    df_list = []
    
    for alter in alters:
        _df = get_500_recent_revisions(alter)
        df_list.append(_df)
        
    df = pd.concat(df_list)
    return df
 
hyperlink_g_rev_df = get_neighbors_500_revisions(page_title)

hyperlink_g_rev_df.head()
  anon   parentid      revid   size            timestamp     title          user
0  NaN  736301376  736301425  22827  2016-08-26 15:03:53  Dog type     Oknazevad
1  NaN  732104489  736301376  22859  2016-08-26 15:03:26  Dog type     Oknazevad
2  NaN  732094041  732104489  22873  2016-07-29 17:20:58  Dog type  Ohnoitsjamie
3  NaN  729290525  732094041  23063  2016-07-29 15:40:21  Dog type     Bender235
4  NaN  729149090  729290525  23072  2016-07-11 06:21:05  Dog type       BG19bot
hyperlink_g_gb_user_title = hyperlink_g_rev_df.groupby(['user','title'])
hyperlink_g_agg = hyperlink_g_gb_user_title.agg({'revid':pd.Series.nunique})
hyperlink_g_edgelist_df = hyperlink_g_agg.reset_index()
hyperlink_g_edgelist_df = hyperlink_g_edgelist_df.rename(columns={'revid':'weight'})

users = hyperlink_g_edgelist_df['user'].unique()
pages = hyperlink_g_edgelist_df['title'].unique()
collab_g = nx.from_pandas_dataframe(hyperlink_g_edgelist_df,source='user',target='title',
                                    edge_attr=['weight'],create_using=nx.DiGraph())
collab_g.add_nodes_from(users,nodetype='user')
collab_g.add_nodes_from(pages,nodetype='page')

nx.write_gexf(collab_g,'collaboration_{0}.gexf'.format(page_title.replace(' ','_')))
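The collaboration network saved above is bipartite: editors connect to pages, never directly to each other. If you want a one-mode network, for example one where two pages are tied when they share an editor, networkx's bipartite projection functions can build it. Here is a hedged sketch projecting onto the much smaller set of pages; page_copresence_g is just an illustrative name:

# Project the bipartite editor-page graph onto pages: two pages are connected
# with a weight equal to the number of editors they share. Drop edge direction
# first so that each page's neighbors are simply its editors.
collab_ug = nx.Graph(collab_g)
page_copresence_g = nx.bipartite.weighted_projected_graph(collab_ug,pages)

print(page_copresence_g.number_of_nodes(), page_copresence_g.number_of_edges())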
 

Compute descriptive statistics for the collaboration network

cg_users = len(users)
cg_pages = len(pages)
cg_edges = collab_g.number_of_edges()

print("There are {0} pages, {1} users, and {2} edges in the collaboration network.".format(cg_users,cg_pages,cg_edges))
There are 10539 pages, 84 users, and 13185 edges in the collaboration network.
cg_density = nx.bipartite.density(collab_g,pages)
print('{0:.2%} of the possible edges actually exist.'.format(cg_density))
0.74% of the possible edges actually exist.
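The value is lower than a naive edges/(users*pages) calculation would give because, for directed bipartite graphs, networkx divides by 2 * |users| * |pages| (each user-page pair could in principle carry an edge in either direction). Assuming that convention, the reported figure can be reproduced by hand:

# Bipartite density for a directed graph: edges / (2 * users * pages)
manual_bipartite_density = cg_edges/(2*cg_users*cg_pages)
print('{0:.2%} computed by hand vs. {1:.2%} from networkx.'.format(manual_bipartite_density,cg_density))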

Identify the most well-connected nodes

cg_in_degree_d = {node:int(centrality*(len(collab_g) - 1)) for node,centrality in nx.in_degree_centrality(collab_g).items()}
cg_out_degree_d = {node:int(centrality*(len(collab_g) - 1)) for node,centrality in nx.out_degree_centrality(collab_g).items()}
cg_degree_df = pd.DataFrame({'In':cg_in_degree_d,'Out':cg_out_degree_d})
cg_degree_df['In'].sort_values(ascending=False).head(10)
Embargo                352
Felony                 335
Police dog             312
Bear-baiting           303
Terrier                297
Ammonia                293
Helen Keller           288
Billie Holiday         284
Cephalic index         277
Liability insurance    277
Name: In, dtype: int64
cg_degree_df['Out'].sort_values(ascending=False).head(10)
ClueBot NG           49
AnomieBOT            49
Yobot                46
Addbot               33
SmackBot             32
FrescoBot            29
Cydebot              28
Rjwilmsi             27
Materialscientist    27
BG19bot              25
Name: Out, dtype: int64
in_degree_dist_df = cg_degree_df['In'].value_counts().reset_index()
out_degree_dist_df = cg_degree_df['Out'].value_counts().reset_index()
revision_dist_df = hyperlink_g_edgelist_df['weight'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
in_degree_dist_df.plot.scatter(x='index',y='In',ax=ax,c='blue',label='Page')
out_degree_dist_df.plot.scatter(x='index',y='Out',ax=ax,c='red',label='Editor')
revision_dist_df.plot.scatter(x='index',y='weight',ax=ax,c='green',label='Weight')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlim((0,1e3))
ax.set_ylim((0,1e5));