Lab 2 - Hyperlink Networks

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the second of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab extends the methods in the prior lab about analyzing a single article's revision histories and uses network science methods to analyze the networks of hyperlinks around a single article. You do not need to be fluent in these methods to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4
a**b
81

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Package for requesting data via the web and parsing resulting JSON
import requests
import json
from bs4 import BeautifulSoup

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Packages for analyzing complex networks
import networkx as nx                # https://networkx.github.io/
import igraph as ig

# Set up the code environment: plots use a white background and DataFrames show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Define the name of the article you want to use for the rest of the lab.

page_title = 'American_Expeditionary_Forces' # World_War_I is a long page with lots of hyperlinks; American_Expeditionary_Forces seems more manageable as long as we keep to current links

Network 1 - Hyperlinks

The first part of this lab will examine the hyperlinks among Wikipedia articles.

Retrieve the content of the page via API

Write a function that takes an article title and returns the list of links in the body of the article. Note that the reason we don't use the "pagelinks" table in MySQL or the "links" parameter in the API is that these include links within templates. Articles with templates link to each other, forming over-dense clusters in the resulting networks. We only want the links appearing in the body of the text.

We pass a request to the API, which returns a JSON-formatted string containing the HTML of the page. We use BeautifulSoup to parse through the HTML tree and extract the non-template links and return them as a list.
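For comparison, here is roughly what the simpler "links" query module returns; it includes template-generated links (navboxes, infoboxes, etc.), which is exactly what we want to avoid. This is only an illustrative sketch and isn't used elsewhere in the lab.

# For comparison only: the "links" query module returns every outgoing link,
# including links generated by templates, which is why we parse the HTML instead.
# (Only the first batch of results; a full crawl would follow the 'continue' parameter.)
req = requests.get('https://en.wikipedia.org/w/api.php?action=query&format=json&prop=links&pllimit=max&plnamespace=0&titles={0}'.format(page_title))
pages = json.loads(req.text)['query']['pages']
[link['title'] for page in pages.values() for link in page.get('links',[])][:10]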

def get_page_outlinks(page_title,redirects=1):
    # Replace spaces with underscores
    #page_title = page_title.replace(' ','_')
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard','Portal:','s:','File:']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    outlinks_list = [] 
    
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    title = link['title']
                    # Ignore links that aren't interesting
                    if all(bad not in title for bad in bad_titles):
                        outlinks_list.append(title)

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links that aren't interesting
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list

Run this on the example article, showing the first 10 linked articles.

page_outlinks = get_page_outlinks(page_title)
page_outlinks[:10]
['United States Armed Forces',
 'John J. Pershing',
 'World War I',
 'United States campaigns in World War I',
 'German Army (German Empire)',
 'Austro-Hungarian Army',
 'Western Front (World War I)',
 'Third Battle of the Aisne',
 'Battle of Château-Thierry (1918)',
 'Battle of Belleau Wood']
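Note that the same article can appear in the list more than once if it's linked repeatedly in the body text; these repeats are what drive the edge weights later in the lab. A quick way to inspect them (a minimal sketch using the standard library's Counter):

# Count how many times each article is linked from the seed page's body text
from collections import Counter
Counter(page_outlinks).most_common(5)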

You could write a recursive function like recursively_get_hyperlink_network that would crawl the hyperlink network out to an arbitrary distance, but this becomes exorbitantly expensive at any depth greater than 1.

Here's an example function, but it is not executable to prevent you from harming yourself. :)

def recursively_get_hyperlink_network(seed_page,depth):
    neighbors = {}
    if depth < 0:
        return neighbors
    neighbors[seed_page] = get_page_outlinks(seed_page)
    for neighbor in neighbors[seed_page]:
        neighbors[neighbor] = recursively_get_hyperlink_network(neighbor,depth-1)
    return neighbors

Instead, define a simple function to get the 1.5-step ego hyperlink network. The "ego" is the seed page you start from, and the "alters" are the neighbors that the ego links out to. We also get the alters of the alters (2nd-order alters), but only include these 2nd-order connections if they link back to 1st-order alters. In other words, the 1.5-step ego hyperlink network is all the pages linked from the seed page plus the connections among this set of articles.

def get_hyperlink_alters(seed_page):
    # Initialize an empty dictionary to act as an adjacency "list"
    neighbors = {}
    
    # Get all the alters for the seed page and store them in the adjacency dictionary
    neighbors[seed_page] = get_page_outlinks(seed_page,1)
    
    # For each of the alters, get their alters and store in the adjacency dictionary
    for neighbor in list(set(neighbors[seed_page])): # Don't recrawl duplicates
        neighbors[neighbor] = get_page_outlinks(neighbor,0)
    
    # Initialize an empty graph that we will add nodes and edges into
    g = nx.DiGraph()
    
    # For each entry in the adjacency dictionary, check if the alter's alters are also the seed page's alters
    # If they are and the edge is already in the graph, increment the edge weight by one
    # If they are but the edge is not already in the graph, add the edge with a weight of one
    for article,neighbor_list in neighbors.items():
        for neighbor in neighbor_list:
            if neighbor in neighbors[seed_page] + [seed_page]:
                if g.has_edge(article,neighbor):
                    g[article][neighbor]['weight'] += 1
                else:
                    g.add_edge(article,neighbor,weight=1)
    
    # Return the weighted graph
    return g

Run this on an example article and save the resulting graph object to disk.

This step could take more than a minute depending on the number of links and size of the neighboring pages.

# Create the hyperlink network
hyperlink_g = get_hyperlink_alters(page_title)

# Save the graph to disk to visualize in Gephi
nx.write_graphml(hyperlink_g,'hyperlink_{0}.graphml'.format(page_title.replace(' ','_')))
hg_nodes = hyperlink_g.number_of_nodes()
hg_edges = hyperlink_g.number_of_edges()

print("There are {0} nodes and {1} edges in the hyperlink network.".format(hg_nodes,hg_edges))

There are 93 nodes and 575 edges in the hyperlink network.
hg_density = nx.density(hyperlink_g)
print('{0:.2%} of the possible edges actually exist.'.format(hg_density))
6.72% of the possible edges actually exist.
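As a sanity check, the density of a directed graph is just the number of edges divided by the number of possible ordered pairs, n*(n-1); recomputing it from the counts above should give the same value.

# Sanity check: directed density = edges / (nodes * (nodes - 1))
manual_density = hg_edges / (hg_nodes * (hg_nodes - 1))
print('Manually computed density: {0:.2%}'.format(manual_density))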
def reciprocity(g):
    reciprocated_edges = []
    
    for (i,j) in g.edges():
        # Check whether the reverse edge also exists in the same graph
        if g.has_edge(j,i):
            reciprocated_edges.append((i,j))
    
    return len(reciprocated_edges)/float(g.number_of_edges())

hg_reciprocity = reciprocity(hyperlink_g)

print('{0:.2%} of the edges in the hyperlink network are reciprocated.'.format(hg_reciprocity))

29.04% of the edges in the hyperlink network are reciprocated.
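Newer releases of networkx also include a built-in reciprocity function with the same definition (the fraction of edges whose reverse edge also exists); if your installed version has it, this should agree with the hand-rolled value above.

# Built-in alternative (only in newer networkx releases); should match hg_reciprocity
if hasattr(nx,'reciprocity'):
    print('{0:.2%} of the edges are reciprocated (networkx built-in).'.format(nx.reciprocity(hyperlink_g)))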

Play the Wikipedia Game!

Using only the hyperlinks on the article, try to get from the first article to the second article.

page1,page2 = np.random.choice(list(hyperlink_g.nodes()),2)
print("Try to navigate from \"{0}\" to \"{1}\" using only hyperlinks.\n".format(page1,page2))
print("Start at: https://en.wikipedia.org/wiki/{0}".format(page1.replace(' ','_')))
Try to navigate from "Arab League" to "Chicago Tribune" using only hyperlinks.

Start at: https://en.wikipedia.org/wiki/Arab_League

No cheating!

After you've played the game a few times, see what an optimal shortest path is. You may get an error indicating there is no shortest path, in which case, try a new pair of nodes.

nx.shortest_path(hyperlink_g,page1,page2)
['Arab League', 'The New York Times', 'Chicago Tribune']

The shortest path between two nodes is the path connecting them in the fewest steps, and the shortest path length is the number of steps in that path. This is related to the "small world" effect where everyone in the world is just a few handshakes from each other. It's rare to find complex networks where the longest shortest path is above 5. Nodes that are this far from each other are likely about very unrelated topics.

If there are no paths greater than 5, lower the path_length_threshold from 5 to 4.

The long_path_lengths dictionary below is populated by computing all the shortest path lengths between nodes in the network and only keeping those paths that are longer than path_length_threshold steps. In a directed graph like our hyperlink network, it's important to follow the direction of the arrows: if page A links to page B but page B doesn't link back to page A, then we can't make a shortest path from B to A; we have to find another path.

path_length_threshold = 4
long_path_lengths = {}

for k,d in nx.all_pairs_shortest_path_length(hyperlink_g).items():
    long_paths = [v for v,l in d.items() if l > path_length_threshold]
    if len(long_paths) > 0:
        long_path_lengths[k] = long_paths
        
long_path_lengths.keys()
dict_keys(['Thomas A. Pope', 'Canon de 155 C modèle 1917 Schneider', 'Nieuport 28', 'Leonard Porter Ayres', '372nd Infantry Regiment (United States)', 'Mark V tank', '332nd Infantry Regiment (United States)', 'Frederick Funston', 'New York (state)', 'Tours', 'Doughboy', '41st Infantry Division (United States)', 'John Pershing', 'Canon de 75 modèle 1897', 'American Expeditionary Force Siberia', '33rd Infantry Division (United States)', 'Harlem Hellfighters', 'Brest, France', '26th Infantry Division (United States)', 'Fort Omaha Balloon School', 'Renault FT', 'Austro-Hungarian Army', 'Remington Arms', 'Legion of Merit', 'Saint Nazaire', 'M1903 Springfield', 'Combined arms', '27th Infantry Regiment (United States)', 'Verdun', 'Henry Johnson (World War I soldier)', 'Battle of Hamel', 'Allies of World War I', 'Bordeaux', 'German Army (German Empire)', 'United States campaigns in World War I', '93rd Infantry Division (United States)', 'First United States Army', 'Sergeant', 'Second Battle of the Marne', 'Third Battle of the Aisne', 'Meuse-Argonne offensive', '369th Infantry Regiment (United States)', 'Battle of Saint-Mihiel', 'Medal of Honor', 'New Jersey', 'Croix de guerre 1914–1918 (France)', 'Adrian helmet', 'Battle of Belleau Wood', 'United States Marine Corps', 'World War I', 'Battle of Château-Thierry (1918)', 'Polar Bear Expedition', 'Battlefield 1', '339th Infantry Regiment', 'Corporal', 'United States Armed Forces', 'Western Front (World War I)', 'First Australian Imperial Force', 'Newport News, Virginia', '1st Infantry Division (United States)', '371st Infantry Regiment (United States)', 'SPAD XIII', '370th Infantry Regiment (United States)', 'James J. Cooke', '92nd Infantry Division (United States)', 'Battle of Cantigny', 'Virginia', 'John Monash', 'National Guard (United States)', '42nd Infantry Division (United States)', 'Canon de 155mm GPF', 'Robert J. Dalessandro', 'John J. Pershing', 'United States home front during World War I', 'Armistice of 11 November 1918', 'Division (military)', 'American Expeditionary Forces on the Western Front order of battle', 'Woodrow Wilson', 'M1917 Enfield', '2nd Infantry Division (United States)', 'Meuse-Argonne Offensive', 'United States Army Center of Military History'])
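To find the most distant reachable pair of articles (the longest finite shortest path, i.e. the diameter over reachable pairs), we can take the maximum over the same all-pairs shortest path lengths. This is a minimal sketch reusing the computation from the cell above:

# Find the most distant reachable pair of articles in the network
# (dict() makes this work whether networkx returns a dict or an iterator of (node, lengths) pairs)
longest_length = 0
longest_pair = None

for source, lengths in dict(nx.all_pairs_shortest_path_length(hyperlink_g)).items():
    for target, length in lengths.items():
        if length > longest_length:
            longest_length = length
            longest_pair = (source, target)

print("The most distant reachable pair is {0}, at {1} steps.".format(longest_pair, longest_length))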

The shortest path between the articles can be identified using the shortest_path function and supplying the graph and the names of two nodes.

# Randomly choose two articles in the list of long shortest paths
page1,page2 = np.random.choice(list(long_path_lengths.keys()),2)
print("The two pages randomly selected are: \"{0}\" and \"{1}\"".format(page1,page2))

# Display the path between these articles
nx.shortest_path(hyperlink_g,page1,page2)
The two pages randomly selected are: "Combined arms" and "Croix de guerre 1914–1918 (France)"
['Combined arms',
 'United States Marine Corps',
 'World War I',
 'Croix de guerre 1914–1918 (France)']

Test out different combinations of articles from the long_path_lengths to find the articles that are farthest apart by entering different article names for page1 and page2.

page1 = 'John Pershing'
#page2 = 'Douglas_Haig,_1st_Earl_Haig' # apparently no connection
#page2 = 'Joseph_Joffre'               # also not connected
page2 = 'United States Marine Corps'
nx.shortest_path(hyperlink_g,page1,page2)
['John Pershing', 'Medal of Honor', 'United States Marine Corps']
# Convert in/out-degree centralities back into raw degree counts: centrality = degree / (n - 1)
# (round() guards against floating point truncation when multiplying back)
hg_in_degree_d = {node:int(round(centrality*(len(hyperlink_g) - 1))) for node,centrality in nx.in_degree_centrality(hyperlink_g).items()}
hg_out_degree_d = {node:int(round(centrality*(len(hyperlink_g) - 1))) for node,centrality in nx.out_degree_centrality(hyperlink_g).items()}
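The conversion above works because in/out-degree centrality is simply the degree divided by n-1. An equivalent and more direct way to get the same dictionaries, assuming the dict-like return of in_degree() and out_degree():

# Equivalent approach: ask networkx for the raw in- and out-degree counts directly
hg_in_degree_d = dict(hyperlink_g.in_degree())
hg_out_degree_d = dict(hyperlink_g.out_degree())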

Look at the nodes with the highest in-degree: other pages in the network point to this page.

degree_df = pd.DataFrame({'In':hg_in_degree_d,'Out':hg_out_degree_d})
degree_df['In'].sort_values(ascending=False).head(10)
World War I                                      62
Western Front (World War I)                      23
Medal of Honor                                   23
John J. Pershing                                 20
Allies of World War I                            18
United States Army Center of Military History    18
93rd Infantry Division (United States)           14
United States Marine Corps                       14
Meuse-Argonne Offensive                          12
Armistice of 11 November 1918                    12
Name: In, dtype: int64

Look at the nodes with the highest-out-degree: these pages point to many other pages.

degree_df['Out'].sort_values(ascending=False).head(10)
American_Expeditionary_Forces                                         92
John Pershing                                                         25
John J. Pershing                                                      25
American Expeditionary Forces on the Western Front order of battle    23
World War I                                                           18
United States campaigns in World War I                                16
42nd Infantry Division (United States)                                12
93rd Infantry Division (United States)                                12
Battle of Hamel                                                       12
1st Infantry Division (United States)                                 12
Name: Out, dtype: int64

Look at the nodes that have no links out.

degree_df.query('Out == 0')['Out']
Electronic Arts     0
Internet Archive    0
La Pallice          0
Nancy, France       0
Name: Out, dtype: int64

Look at nodes that have a single link in. These are also known as (in-) pendants. If there are none, it should appear as an empty series.

degree_df.query('In == 1')['In']
41st Infantry Division (United States)                                1
American Expeditionary Forces on the Western Front order of battle    1
Fort Omaha Balloon School                                             1
La Pallice                                                            1
National Guard (United States)                                        1
Newport News, Virginia                                                1
Nieuport 28                                                           1
Robert J. Dalessandro                                                 1
Saint Nazaire                                                         1
Tours                                                                 1
Name: In, dtype: int64

Look at the nodes with a single link out. These are also known as (out-)pendants. If there are none, it should appear as an empty series.

degree_df.query('Out == 1')['Out']
8mm Lebel                                        1
Adrian helmet                                    1
Austro-Hungarian Army                            1
Berthier rifle                                   1
Brest, France                                    1
Combined arms                                    1
Fort Omaha Balloon School                        1
Mark V tank                                      1
SPAD XIII                                        1
United States Army Center of Military History    1
White American                                   1
Name: Out, dtype: int64

Given a page, which neighbors link in to it? To inspect a specific article, assign its title to the page1 variable in place of the np.random.choice(degree_df.index) call.

page1 = np.random.choice(degree_df.index)

# list() makes this robust if predecessors() returns an iterator rather than a list (newer networkx)
in_connections = list(hyperlink_g.predecessors(page1))
print("The links into node \"{0}\" are:\n{1}".format(page1,in_connections))
The links into node "Battle of Château-Thierry (1918)" are:
['American Expeditionary Forces on the Western Front order of battle', 'American_Expeditionary_Forces', 'World War I', '2nd Infantry Division (United States)', 'John J. Pershing', 'John Pershing']
out_connections = list(hyperlink_g.successors(page1))
print("The links out from node \"{0}\" are:\n{1}".format(page1,out_connections))
The links out from node "Battle of Château-Thierry (1918)" are:
['World War I', 'John J. Pershing', 'Second Battle of the Marne']
# Count how many nodes have each in-degree and out-degree value
in_degree_dist_df = degree_df['In'].value_counts().reset_index()
out_degree_dist_df = degree_df['Out'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
in_degree_dist_df.plot.scatter(x='index',y='In',ax=ax,c='blue',label='In')
out_degree_dist_df.plot.scatter(x='index',y='Out',ax=ax,c='red',label='Out')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlim((0,1e3))
ax.set_ylim((0,1e3))

ax.set_xlabel('Connections')
ax.set_ylabel('Count')
<matplotlib.text.Text at 0x7f9dcc354940>

Calculate communities within the network

Define a function to compute node community memberships for multiple community detection algorithms within igraph. The output is a dictionary of dictionaries: the top-level key is the name of the algorithm, and its value is a second-level dictionary keyed by page name with values giving the community membership. Documentation and details about these algorithms can be found under the igraph graph-class documentation.

def comparative_community_detector(igraph):
    memberships = {}
    
    # Directed memberships
    memberships['betweenness'] = igraph.community_edge_betweenness().as_clustering().membership
    memberships['infomap'] = igraph.community_infomap().membership
    memberships['spinglass'] = igraph.community_spinglass().membership
    memberships['walktrap'] = igraph.community_walktrap().as_clustering().membership
    
    # Undirected memberships
    undirected = igraph.as_undirected()
    memberships['fastgreedy'] = undirected.community_fastgreedy().as_clustering().membership
    memberships['leading_eigenvector'] = undirected.community_leading_eigenvector().membership
    memberships['multilevel'] = undirected.community_multilevel().membership
    
    labelled_memberships = {}
    for label,membership in memberships.items():
        labelled_memberships[label] = dict(zip(igraph.vs['id'],membership))
        
    return labelled_memberships

Not included in the comparative_community_detector function are two additional community detection algorithms that are too intensive or are not working properly. They're documented below if you ever care to explore in the future.

# Uses up a ton of memory and crashes the kernel immediately
ig_hg_optimal_modularity = ig_hg.community_optimal_modularity().membership
ig_hg_optimal_modularity_labels = dict(zip(ig_hg.vs['id'],ig_hg_optimal_modularity))
pd.Series(ig_hg_optimal_modularity_labels).value_counts().head(10)

# Lumps everyone into a single community
ig_hg_label_propagation = ig_hg.community_label_propagation(initial=range(ig_hg.vcount()),fixed=[False]*ig_hg.vcount()).membership
ig_hg_label_propagation_labels = dict(zip(ig_hg.vs['id'],ig_hg_label_propagation))
pd.Series(ig_hg_label_propagation_labels).value_counts().head(10)

Here we need to shift from the networkx library to the igraph library. The former is written purely in Python, which makes it easier to use but somewhat slower; the latter is a "wrapper" that lets us write in Python but does the calculations in much faster C code behind the scenes.

# Load the hyperlink network data from disk into a networkx graph object
nx_hg = nx.read_graphml('hyperlink_{0}.graphml'.format(page_title.replace(' ','_')))

# Load the hyperlink network data from disk into a igraph graph object
ig_hg = ig.read('hyperlink_{0}.graphml'.format(page_title.replace(' ','_')))
ig.summary(ig_hg) # Get summary statistics about the graph: D = directed, W = weighted, then the node and edge counts
IGRAPH D-W- 93 575 -- 
+ attr: id (v), weight (e)

Run the function on the igraph version of the hyperlink network.

This may take a minute or more since these are intensive calculations

# Run the community detection labelling on the igraph graph object
comparative_community_labels = comparative_community_detector(ig_hg)

# Convert the labels into a dict-of-dicts keyed by page name, with the inner dict mapping each algorithm to that page's community label
comparative_community_labels_transposed = pd.DataFrame(comparative_community_labels).to_dict('index')

# Update each node in the networkx graph object to reflect the community membership labels
for _node in nx_hg.nodes():
    try:
        nx_hg.node[_node]['label'] = _node
        for (label,membership) in comparative_community_labels_transposed[_node].items():
            nx_hg.node[_node][label] = int(membership)
    except KeyError: # Concerning that some labels aren't present, but skip them for now
        print("Error in assigning \"{0}\" to a community.".format(_node))
        pass

# Write the labeled graph back to disk to visualize in Gephi
nx.write_graphml(nx_hg,'hyperlink_communities_{0}.graphml'.format(page_title.replace(' ','_')))
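As a quick follow-up before opening the labeled file in Gephi, you can compare how many communities each algorithm found. A minimal sketch using the comparative_community_labels dictionary computed above:

# Count the number of distinct communities each algorithm detected
for algorithm, membership in comparative_community_labels.items():
    print('{0}: {1} communities'.format(algorithm, len(set(membership.values()))))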