Lab 2 - Networks

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the second of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will extend the methods in the prior lab about analyzing a single article's revision histories and use network science methods to analyze the networks of coauthorship and hyperlinks. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4

a**b
81

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Packages for requesting data via the web and parsing the resulting JSON and HTML
import requests
import json
from bs4 import BeautifulSoup

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Package for analyzing complex networks
import networkx as nx                # https://networkx.github.io/

# Set up the environment so plots use a white grid background and DataFrames display more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Retrieve the content of the page via API

Write a function that takes an article title and returns the list of links in the body of the article. We don't use the "pagelinks" table in MySQL or the "links" parameter in the API because both include links within templates. Articles that share a template all link to each other, forming over-dense clusters in the resulting networks. We only want the links appearing in the body of the text.

We pass a request to the API, which returns a JSON-formatted string containing the HTML of the page. We use BeautifulSoup to parse through the HTML tree and extract the non-template links and return them as a list.
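
The request itself can be sketched offline. The snippet below only builds the URL (it makes no network call), using the same parameters as the function that follows. `build_parse_url` is a hypothetical helper name, not part of the lab, and encoding the title with `urllib.parse.urlencode` is an addition: it guards against titles containing characters such as `&`, which would otherwise be read as a parameter separator.

```python
from urllib.parse import urlencode

def build_parse_url(page_title, redirects=1):
    """Hypothetical helper: build the action=parse URL without sending a request."""
    params = {
        'action': 'parse',
        'format': 'json',
        'page': page_title,
        'redirects': redirects,
        'prop': 'text',
        'disableeditsection': 1,
        'disabletoc': 1,
    }
    # urlencode percent-encodes reserved characters in each value
    return 'https://en.wikipedia.org/w/api.php?' + urlencode(params)

url = build_parse_url('AT&T')
print(url)
```

With `requests`, passing the same dictionary as `params=` to `requests.get` performs the equivalent encoding.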

 
def get_page_outlinks(page_title,redirects=1):
    # Replace spaces with underscores
    #page_title = page_title.replace(' ','_')
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    outlinks_list = [] 
    
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    outlinks_list.append(link['title'])

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links to non-article namespaces and other uninteresting pages
                        # Note: this filter applies only to list links; paragraph links above are appended unfiltered,
                        # which is why titles such as 'Help:IPA for English' still appear in the output
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list
get_page_outlinks('Barack Obama')
['American English',
 'Listen',
 'File:En-us-Barack-Hussein-Obama.ogg',
 'Help:IPA for English',
 'President of the United States',
 'African American',
 'Contiguous United States',
 'Honolulu',
 'Columbia University',
 'Harvard Law School',
 'Harvard Law Review',
 'Community organizing',
 'Juris Doctor',
 'Civil and political rights',
 'Constitutional law',
 'University of Chicago Law School',
 'Illinois Senate career of Barack Obama',
 'Illinois Senate',
 "Illinois's 1st congressional district election, 2000",
 'United States House of Representatives',
 'Bobby Rush',
 'United States Senate election in Illinois, 2004',
 'United States Senate',
 'Democratic Party (United States)',
 'Primary election',
 '2004 Democratic National Convention keynote address',
 '2004 Democratic National Convention',
 'Barack Obama presidential primary campaign, 2008',
 'Hillary Clinton',
 'Democratic Party presidential primaries, 2008',
 'Republican Party (United States)',
 'John McCain',
 'United States presidential election, 2008',
 'First inauguration of Barack Obama',
 '2009 Nobel Peace Prize',
 'Stimulus (economics)',
 'Great Recession',
 'American Recovery and Reinvestment Act of 2009',
 'Tax Relief, Unemployment Insurance Reauthorization, and Job Creation Act of 2010',
 'Patient Protection and Affordable Care Act',
 'Dodd–Frank Wall Street Reform and Consumer Protection Act',
 "Don't Ask, Don't Tell Repeal Act of 2010",
 'Withdrawal of U.S. troops from Iraq',
 'Iraq War',
 'War in Afghanistan (2001–14)',
 'New START',
 'Russia',
 '2011 military intervention in Libya',
 'Muammar Gaddafi',
 'Death of Osama bin Laden',
 'United States House of Representatives elections, 2010',
 'United States debt ceiling',
 'Budget Control Act of 2011',
 'American Taxpayer Relief Act of 2012',
 'United States presidential election, 2012',
 'Mitt Romney',
 'Second inauguration of Barack Obama',
 'Gun politics in the United States',
 'Sandy Hook Elementary School shooting',
 'LGBT American',
 'Supreme Court of the United States',
 'Defense of Marriage Act',
 'United States v. Windsor',
 'Same-sex marriage in the United States',
 'Obergefell v. Hodges',
 'American-led intervention in Iraq (2014–present)',
 'Iraqi insurgency (2011–13)',
 'Islamic State of Iraq and the Levant',
 'Withdrawal of U.S. troops from Iraq',
 'Withdrawal of U.S. troops from Afghanistan',
 'Paris Agreement',
 'Joint Comprehensive Plan of Action',
 'United States–Cuban Thaw',
 'Cuba–United States relations',
 'Kapiolani Medical Center for Women and Children',
 'Honolulu',
 'Ann Dunham',
 'Wichita, Kansas',
 'English Americans',
 'Barack Obama Sr.',
 'Luo people of Kenya and Tanzania',
 'Nyang’oma Kogelo',
 'Russian language',
 'University of Hawaii at Manoa',
 'Foreign student',
 'Wailuku, Hawaii',
 'Maui',
 'University of Washington',
 'Harvard University',
 'Lolo Soetoro',
 'Indonesia',
 'East–West Center',
 'Graduate student',
 'University of Hawaii',
 'Molokai',
 'J-1 visa',
 'Tebet, South Jakarta',
 'Menteng',
 'Besuki Public School',
 'Calvert School',
 'Madelyn Dunham',
 'Stanley Armour Dunham',
 'Punahou School',
 'University-preparatory school',
 'Anthropology',
 'Doctor of Philosophy',
 'Ovarian cancer',
 'Uterine cancer',
 'Marijuana',
 'Cocaine',
 'Occidental College',
 'Disinvestment from South Africa',
 'Apartheid',
 'Columbia College, Columbia University',
 'Political science',
 'International relations',
 'Bachelor of Arts',
 'Business International Corporation',
 'New York Public Interest Research Group',
 'New York City Subway',
 'Metropolitan Transportation Authority',
 '137th Street – City College (IRT Broadway – Seventh Avenue Line)',
 'Developing Communities Project',
 'Roseland, Chicago',
 'West Pullman, Chicago',
 'Riverdale, Chicago',
 'South Side, Chicago',
 'Altgeld Gardens Homes (Chicago, Illinois)',
 'Gamaliel Foundation',
 'Family of Barack Obama',
 'Harvard Law School',
 'Somerville, Massachusetts',
 'Harvard Law Review',
 'Laurence Tribe',
 'Associate attorney',
 'Sidley Austin',
 'Hopkins & Sutter',
 'Juris Doctor',
 'Magna cum laude',
 'List of African-American firsts',
 'Dreams from My Father',
 'University of Chicago Law School',
 'Constitutional law',
 'Project Vote',
 'Voter registration campaign',
 'African Americans',
 "Crain's Chicago Business",
 'Of counsel',
 'Woods Fund of Chicago',
 'Joyce Foundation',
 'Chicago Annenberg Challenge',
 'Illinois Senate',
 'Alice Palmer (politician)',
 'Hyde Park, Chicago',
 'Kenwood, Chicago',
 'South Shore, Chicago',
 'Chicago Lawn, Chicago',
 'Tax credit',
 'Payday loan',
 'Predatory lending',
 "Illinois's 1st congressional district election, 2000",
 "Illinois's 1st congressional district",
 'United States House of Representatives',
 'Bobby Rush',
 'Racial profiling',
 'Capital punishment in the United States',
 'David Axelrod',
 'George W. Bush',
 '2003 invasion of Iraq',
 'Iraq Resolution',
 'Protests against the Iraq War',
 'Peter Fitzgerald (politician)',
 'Carol Moseley Braun',
 'Democratic Party (United States)',
 '2004 Democratic National Convention',
 'Jack Ryan (politician)',
 'Alan Keyes',
 'United States Senate election in Illinois, 2004',
 'Congressional Black Caucus',
 'Congressional Quarterly',
 'Resignation from the United States Senate',
 'Lame duck (politics)',
 'Sponsor (legislative)',
 'Secure America and Orderly Immigration Act',
 'Nunn–Lugar Cooperative Threat Reduction',
 'Federal Funding Accountability and Transparency Act of 2006',
 'Tom Carper',
 'Tom Coburn',
 'John McCain',
 'Tort reform',
 'Class Action Fairness Act of 2005',
 'Foreign Intelligence Surveillance Act of 1978 Amendments Act of 2008',
 'NSA warrantless surveillance (2001–07)',
 'Democratic Republic of the Congo',
 'Honest Leadership and Open Government Act',
 'Deceptive Practices and Voter Intimidation Prevention Act',
 'Iraq War De-Escalation Act of 2007',
 'Disinvestment from Iran',
 "State Children's Health Insurance Program",
 'United States Senate Committee on Foreign Relations',
 'United States Senate Committee on Environment and Public Works',
 "United States Senate Committee on Veterans' Affairs",
 'United States Senate Committee on Health, Education, Labor and Pensions',
 'United States Senate Committee on Homeland Security and Governmental Affairs',
 'United States Senate Foreign Relations Subcommittee on Europe and Regional Security Cooperation',
 'Mahmoud Abbas',
 'President of the Palestinian National Authority',
 'University of Nairobi',
 'Old State Capitol State Historic Site (Illinois)',
 'Springfield, Illinois',
 'Abraham Lincoln',
 "Lincoln's House Divided Speech",
 'Iraq War',
 'Energy policy of the United States',
 'Health care reform in the United States',
 'Democratic Party presidential primaries, 2008',
 'Hillary Clinton',
 'Delegate',
 'Caucus',
 'Delaware',
 'Joe Biden',
 'Indiana Governor',
 'Evan Bayh',
 'Virginia Governor',
 'Tim Kaine',
 '2008 Democratic National Convention',
 'Bill Clinton',
 'Invesco Field at Mile High',
 'Campaign finance in the United States',
 'United States presidential election debates',
 'Electoral College (United States)',
 'Election',
 'Barack Obama election victory speech, 2008',
 'Grant Park (Chicago)',
 'Federal Election Commission',
 'Democratic Party presidential primaries, 2012',
 '2012 Democratic National Convention',
 '2012 Democratic National Convention',
 'Charlotte, North Carolina',
 'Joe Biden',
 'Bill Clinton',
 'Mitt Romney',
 'Paul Ryan',
 'Electoral College (United States)',
 'Franklin D. Roosevelt',
 'List of United States presidential elections by popular vote margin',
 'First inauguration of Barack Obama',
 'Guantanamo Bay detention camp',
 'George W. Bush',
 'Ronald Reagan',
 'Mexico City Policy',
 'Lilly Ledbetter Fair Pay Act of 2009',
 'Statute of limitations',
 'Embryonic stem cell',
 'Sonia Sotomayor',
 'Associate Justice of the Supreme Court of the United States',
 'David Souter',
 'Hispanic',
 'Elena Kagan',
 'John Paul Stevens',
 'Health Care and Education Reconciliation Act of 2010',
 'Reconciliation (United States Congress)',
 'Pell Grant',
 'Space policy of the Barack Obama administration',
 'NASA',
 'Human spaceflight',
 'Ares I',
 'Ares V',
 'Constellation program',
 'International Space Station',
 '2011 State of the Union Address',
 'Innovation economics',
 'Earmark (politics)',
 'Sustainable energy',
 'Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act',
 'Hate crime laws in the United States',
 "Don't Ask, Don't Tell Repeal Act of 2010",
 "Don't ask, don't tell",
 'United States Armed Forces',
 'Same-sex marriage in the United States',
 'Inaugural address',
 'LGBT rights in the United States',
 'Supreme Court of the United States',
 'Hollingsworth v. Perry',
 'Same-sex marriage',
 'United States v. Windsor',
 'Defense of Marriage Act',
 'Obergefell v. Hodges',
 'White House Council on Women and Girls',
 'Executive order',
 's:Executive Order 13506',
 'Senior Advisor to the President',
 'Valerie Jarrett',
 'White House Task Force to Protect Students from Sexual Assault',
 'Joe Biden',
 'Office of the Vice President of the United States',
 'Violence Against Women Act',
 'American Recovery and Reinvestment Act of 2009',
 'Stimulus (economics)',
 'Great Recession',
 'Tax incentive',
 'Timothy Geithner',
 'Financial crisis of 2007–08',
 'Public-Private Investment Program for Legacy Assets',
 'Automotive industry crisis of 2008–10',
 'General Motors',
 'Chrysler',
 'Chrysler Chapter 11 reorganization',
 'Fiat',
 'General Motors Chapter 11 reorganization',
 'Car Allowance Rebate System',
 'Congressional Budget Office',
 '2010 United States federal budget',
 'Debt ceiling',
 'Budget Control Act of 2011',
 'Federal government of the United States',
 'Default (finance)',
 'Federal Reserve System',
 'Ben Bernanke',
 'National Association for Business Economics',
 'World War II',
 'United States elections, 2010',
 'Bush tax cuts',
 'Federal Insurance Contributions Act tax',
 'Estate tax in the United States',
 'Tax Relief, Unemployment Insurance Reauthorization, and Job Creation Act of 2010',
 'Income inequality in the United States',
 'Fast food worker strikes',
 'Pope Francis',
 'Trickle-down economics',
 'Trans-Pacific Partnership',
 'Global warming',
 'Drilling rig',
 'Macondo Prospect',
 'Gulf of Mexico',
 'Deepwater Horizon oil spill',
 'United States Secretary of the Interior',
 'Ken Salazar',
 'Deepwater drilling',
 'Keystone XL pipeline',
 'Petroleum exploration in the Arctic',
 'United States Congress',
 'Health care in the United States',
 'Public health insurance option',
 'Pre-existing condition',
 'Barack Obama speech to joint session of Congress, September 2009',
 'Patient Protection and Affordable Care Act',
 'Provisions of the Patient Protection and Affordable Care Act',
 'Medicaid',
 'Federal poverty level',
 'Health insurance exchange',
 'Tax bracket',
 'Indoor tanning',
 'Medicare Advantage',
 'National Federation of Independent Business v. Sebelius',
 'Burwell v. Hobby Lobby Stores, Inc.',
 'Religious Freedom Restoration Act',
 'King v. Burwell',
 'Sandy Hook Elementary School shooting',
 'Federal Assault Weapons Ban',
 'Bureau of Alcohol, Tobacco, Firearms and Explosives',
 'Executive order',
 "Women's suffrage",
 'United States House of Representatives elections, 2010',
 'Federal Communications Commission',
 'Internet access',
 'Telecommunication',
 'Net neutrality',
 'United States Secretary of State',
 'Russian reset',
 'Al Arabiya',
 'Cairo University',
 'A New Beginning',
 'Iranian presidential election, 2009',
 'President of the United Nations Security Council',
 'United Nations Security Council',
 'Benjamin Netanyahu',
 'East Jerusalem',
 'President of Russia',
 'Dmitry Medvedev',
 'START I',
 'New START',
 'United States Senate',
 'LGBT rights by country or territory',
 'United States–Cuban Thaw',
 'Cuba–United States relations',
 'Saudi Arabian-led intervention in Yemen',
 'United States Marine Corps',
 'Counter-terrorism',
 'Northern Iraq offensive (June 2014)',
 'Islamic State of Iraq and the Levant',
 'Islamic State of Iraq and the Levant',
 'Sinjar massacre',
 'American-led intervention in Iraq (2014–present)',
 '82nd Airborne Division',
 'David D. McKiernan',
 'Special Forces (United States Army)',
 'Stanley A. McChrystal',
 'David Petraeus',
 'Israeli settlement',
 'Two-state solution',
 'Arab–Israeli conflict',
 'Joint Political Military Group',
 'Iron Dome',
 'Palestinian rocket attacks on Israel',
 'Jeffrey Goldberg',
 'Zionism',
 'African-American Civil Rights Movement (1954–68)',
 'Muammar Gaddafi',
 'Arab Spring',
 'Arab League',
 'United Nations Security Council Resolution 1973',
 'Tomahawk (missile)',
 'Northrop Grumman B-2 Spirit',
 'NATO',
 'Operation Unified Protector',
 'Syrian Civil War',
 'Bashar al-Assad',
 'Ghouta chemical attack',
 "Destruction of Syria's chemical weapons",
 'Chlorine gas',
 'Military intervention against ISIL',
 'Osama bin Laden',
 "Osama bin Laden's compound in Abbottabad",
 'Leon Panetta',
 'United States Navy SEALs',
 'World Trade Center site',
 'Times Square',
 'Reactions to the death of Osama bin Laden',
 'Negotiations leading to the Joint Comprehensive Plan of Action',
 'Nuclear weapon',
 'Joint Plan of Action',
 'Joint Comprehensive Plan of Action',
 'Benjamin Netanyahu',
 'Vatican City',
 'Pope Francis',
 'Prisoner exchange',
 'President of Cuba',
 'Raúl Castro',
 'Death of Nelson Mandela',
 'Johannesburg',
 'Pope Francis',
 'Cuban Thaw',
 'The New Republic',
 'Calvin Coolidge',
 'African Union',
 'Addis Ababa',
 'Education in Africa',
 'Economy of Africa',
 'LGBT',
 'Democratization',
 'United States presidential visits to Sub-Saharan Africa',
 'Atomic bombings of Hiroshima and Nagasaki',
 'Shinzō Abe',
 'Hiroshima Peace Memorial Museum',
 'Ivy League',
 'African-American Civil Rights Movement (1954–68)',
 'National Association of Black Journalists',
 'Gallup Organization',
 'Ronald Reagan',
 'Bill Clinton',
 'Death of Osama bin Laden',
 'Tony Blair',
 'Democratic Party (Italy)',
 'Walter Veltroni',
 'President of France',
 'Nicolas Sarkozy',
 'Harris Interactive',
 'France 24',
 'International Herald Tribune',
 'Grammy Award for Best Spoken Word Album',
 'Grammy Award',
 'Audiobook',
 'Dreams from My Father',
 'The Audacity of Hope',
 'Barack Obama presidential primary campaign, 2008',
 'Yes We Can (will.i.am song)',
 'Daytime Emmy Award',
 'Time (magazine)',
 'Time Person of the Year',
 'Parliament of the United Kingdom',
 'Westminster Hall',
 'Charles de Gaulle',
 'Nelson Mandela',
 'Monarchy of the United Kingdom',
 'Elizabeth II',
 'Pope Benedict XVI',
 'Norwegian Nobel Committee',
 '2009 Nobel Peace Prize',
 'Oslo',
 'The New York Times',
 'Geir Lundestad',
 'List of things named after Barack Obama',
 'Nystalus obamai (page does not exist)',
 'Obamadon',
 'Presidential library',
 'University of Chicago',
 'Jackson Park (Chicago)',
 'South Side, Chicago',
 'Chicago',
 'Illinois',
 'National Archives and Records Administration',
 'Family of Barack Obama',
 'Bernie Mac',
 'Margaret Thatcher',
 'Maya Soetoro-Ng',
 'Moneygall',
 'Jefferson Davis',
 'President of the Confederate States of America',
 'American Civil War',
 'Chicago White Sox',
 '2005 American League Championship Series',
 '2009 Major League Baseball All-Star Game',
 'Chicago Bears',
 'National Football League',
 'Steeler Nation',
 'Super Bowl XLIII',
 '1985 Chicago Bears season',
 'Super Bowl XX',
 'Space Shuttle Challenger disaster',
 'Michelle Obama',
 'Sidley Austin',
 'University of Chicago Laboratory Schools',
 'Sidwell Friends School',
 'Portuguese Water Dog',
 'Bo (dog)',
 'Ted Kennedy',
 'Sunny (dog)',
 'Hyde Park, Chicago',
 'Kenwood, Chicago',
 'Tony Rezko',
 'Money (magazine)',
 'Fisher House Foundation',
 'Glamour (magazine)',
 'Feminist',
 'Black church',
 'Community organizing',
 'Christianity Today',
 'Resurrection of Jesus',
 'Trinity United Church of Christ',
 'Jeremiah Wright',
 'Jeremiah Wright controversy',
 'Shiloh Baptist Church (Washington, D.C.)',
 "St. John's Episcopal Church, Lafayette Square",
 'Camp David',
 'The Bridge: The Life and Rise of Barack Obama',
 'Jesse White (politician)',
 'Illinois Secretary of State',
 'Evan Thomas',
 'PublicAffairs',
 'Hartford Courant',
 'The Huffington Post',
 'The Bridge: The Life and Rise of Barack Obama',
 'Organizing for Action',
 'JAMA (journal)',
 'DMOZ',
 'Biographical Directory of the United States Congress',
 'Project Vote Smart',
 'Federal Election Commission',
 'C-SPAN',
 'Chicago Tribune',
 'PolitiFact.com',
 'Project Gutenberg',
 'Internet Archive']

You could write a recursive function like recursively_get_hyperlink_network that crawls the hyperlink network out to an arbitrary distance, but this becomes exorbitantly expensive at any depth greater than 1. Here's an example function; it is deliberately left as non-executable text to keep you from harming yourself. :)

def recursively_get_hyperlink_network(seed_page,depth):
    neighbors = {}
    if depth < 0:
        return neighbors
    neighbors[seed_page] = get_page_outlinks(seed_page)
    for neighbor in neighbors[seed_page]:
        # Merge each neighbor's sub-network into the adjacency dictionary
        neighbors.update(recursively_get_hyperlink_network(neighbor,depth-1))
    return neighbors
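
To see how the crawl grows without touching the API, here is a minimal offline sketch of a depth-limited crawl. `TOY_LINKS` and `toy_outlinks` are made-up stand-ins for Wikipedia and `get_page_outlinks`, and the `visited` set is an addition that prevents re-crawling a page reached by more than one path.

```python
from collections import deque

# Hypothetical toy link structure standing in for Wikipedia
TOY_LINKS = {
    'A': ['B', 'C'],
    'B': ['A', 'D'],
    'C': ['D'],
    'D': ['E'],
    'E': [],
}

def toy_outlinks(page):
    """Stand-in for get_page_outlinks that reads from TOY_LINKS."""
    return TOY_LINKS.get(page, [])

def crawl(seed_page, depth, get_links=toy_outlinks):
    """Breadth-first crawl that fetches links for pages up to `depth` steps from the seed."""
    neighbors = {}
    queue = deque([(seed_page, 0)])
    visited = {seed_page}
    while queue:
        page, dist = queue.popleft()
        neighbors[page] = get_links(page)   # one "request" per crawled page
        if dist < depth:
            for link in neighbors[page]:
                if link not in visited:
                    visited.add(link)
                    queue.append((link, dist + 1))
    return neighbors

for d in range(3):
    print(d, sorted(crawl('A', d)))
```

Each extra level multiplies the number of requests by roughly the number of links per page, so for an article with several hundred outlinks a depth of 2 can already mean tens of thousands of requests.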

Instead, define a simple function to get the 1.5-step ego hyperlink network. The "ego" is the seed page you start from, and the "alters" are the neighbors that the ego links out to. We also get the alters of the alters (2nd-order alters), but only keep these 2nd-order connections if they point to 1st-order alters. In other words, the 1.5-step ego hyperlink network consists of all the pages linked from the seed page and the connections among this set of articles.

def get_hyperlink_alters(seed_page):
    # Initialize an empty dictionary to act as an adjacency "list"
    neighbors = {}
    
    # Get all the alters for the seed page and store them in the adjacency dictionary
    neighbors[seed_page] = get_page_outlinks(seed_page,1)
    
    # For each of the alters, get their alters and store in the adjacency dictionary
    for neighbor in list(set(neighbors[seed_page])): # Don't recrawl duplicates
        neighbors[neighbor] = get_page_outlinks(neighbor,0)
    
    # Initialize an empty graph that we will add nodes and edges into
    g = nx.DiGraph()
    
    # For each entry in the adjacency dictionary, check if the alter's alters are also the seed page's alters
    # If they are and the edge is already in the graph, increment the edge weight by one
    # If they are but the edge is not already in the graph, add the edge with a weight of one
    valid_nodes = set(neighbors[seed_page]) | {seed_page} # Build the lookup set once; set membership is faster than list membership
    for article,neighbor_list in neighbors.items():
        for neighbor in neighbor_list:
            if neighbor in valid_nodes:
                if g.has_edge(article,neighbor):
                    g[article][neighbor]['weight'] += 1
                else:
                    g.add_edge(article,neighbor,weight=1)
    
    # Return the weighted graph
    return g
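
The filtering and weighting logic can be checked without any network access. This sketch applies the same rule to a hand-made adjacency dictionary (the page names are hypothetical) and uses `collections.Counter` in place of a networkx graph to hold the edge weights.

```python
from collections import Counter

seed = 'Ego'
# Hypothetical adjacency dictionary: page -> list of outlink titles (duplicates allowed)
neighbors = {
    'Ego':    ['Alter1', 'Alter2', 'Alter1'],
    'Alter1': ['Alter2', 'Outsider', 'Ego'],
    'Alter2': ['Outsider'],
}

# Only links that point at the ego or one of its alters are kept
keep = set(neighbors[seed]) | {seed}

weights = Counter()
for article, neighbor_list in neighbors.items():
    for neighbor in neighbor_list:
        if neighbor in keep:
            weights[(article, neighbor)] += 1   # repeated links raise the weight

for edge, weight in sorted(weights.items()):
    print(edge, weight)
```

The link to 'Outsider' is dropped because the ego never links to it, while the ego's repeated link to 'Alter1' becomes a single edge of weight 2.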

Run this on an example article and save the resulting graph object to disk. This step could take more than a minute depending on the number of links and size of the neighboring pages.

page_title = 'Wii'

# Create the hyperlink network
hyperlink_g = get_hyperlink_alters(page_title)

# Save the graph to disk to visualize in Gephi
nx.write_gexf(hyperlink_g,'hyperlink_{0}.gexf'.format(page_title.replace(' ','_')))
hg_nodes = hyperlink_g.number_of_nodes()
hg_edges = hyperlink_g.number_of_edges()

#print("There are {0} nodes and {1} edges in the hyperlink network.".format(num_nodes,num_edges))
print("There are {0} nodes and {1} edges in the hyperlink network.".format(hg_nodes,hg_edges))
There are 250 nodes and 2874 edges in the hyperlink network.
hg_density = nx.density(hyperlink_g)
print('{0:.2%} of the possible edges actually exist.'.format(hg_density))
4.62% of the possible edges actually exist.
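
For a directed graph with n nodes and m edges, the density is m / (n(n - 1)), since every ordered pair of distinct nodes is a possible edge. As a quick check, the node and edge counts printed above reproduce the reported figure:

```python
n_nodes = 250    # from the output above
n_edges = 2874   # from the output above

# Each of the n*(n-1) ordered pairs of distinct nodes is a possible directed edge
possible_edges = n_nodes * (n_nodes - 1)
density = n_edges / possible_edges

print('{0:.2%}'.format(density))
```
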
def reciprocity(g):
    reciprocated_edges = []
    
    for (i,j) in g.edges():
        if g.has_edge(j,i): # Use the graph passed in rather than the global hyperlink_g
            reciprocated_edges.append((i,j))
    
    return len(reciprocated_edges)/float(g.number_of_edges())

hg_reciprocity = reciprocity(hyperlink_g)

print('{0:.2%} of the edges in the hyperlink network are reciprocated.'.format(hg_reciprocity))
31.07% of the edges in the hyperlink network are reciprocated.
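
As a sanity check on the definition, the same calculation can be run by hand on a toy edge list (the node names are made up). Note that a reciprocated pair contributes two directed edges to the numerator, one for each direction.

```python
# Hypothetical directed edges: A<->B is reciprocated, A->C and C->D are not
edges = {('A', 'B'), ('B', 'A'), ('A', 'C'), ('C', 'D')}

# An edge (i, j) is reciprocated when (j, i) is also present
reciprocated = [(i, j) for (i, j) in edges if (j, i) in edges]

toy_reciprocity = len(reciprocated) / len(edges)
print(toy_reciprocity)
```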

Identify the most well-connected nodes

# Read the degrees directly; rescaling centralities and casting to int can truncate (e.g. 2.9999 becomes 2)
hg_in_degree_d = dict(hyperlink_g.in_degree())
hg_out_degree_d = dict(hyperlink_g.out_degree())
degree_df = pd.DataFrame({'In':hg_in_degree_d,'Out':hg_out_degree_d})
degree_df['In'].sort_values(ascending=False).head(10)
degree_df['Out'].sort_values(ascending=False).head(10)
in_degree_dist_df = degree_df['In'].value_counts().reset_index()
out_degree_dist_df = degree_df['Out'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
in_degree_dist_df.plot.scatter(x='index',y='In',ax=ax,c='blue',label='In')
out_degree_dist_df.plot.scatter(x='index',y='Out',ax=ax,c='red',label='Out')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlim((0,1e3))
ax.set_ylim((0,1e3));
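
The `value_counts` calls above tabulate how many nodes have each degree. A minimal sketch of the same tabulation with `collections.Counter`, using made-up degrees, shows the shape of the data being plotted:

```python
from collections import Counter

# Hypothetical in-degrees for six pages
in_degree = {'P1': 0, 'P2': 1, 'P3': 1, 'P4': 1, 'P5': 3, 'P6': 3}

# Map each degree value to the number of nodes that have it
degree_dist = Counter(in_degree.values())

for degree, count in sorted(degree_dist.items()):
    print(degree, count)
```

The symlog axes are useful here because a degree of 0 cannot be shown on a pure log scale.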

Construct co-authorship network

def get_500_recent_revisions(page_title):
    req = requests.get('https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles={0}&rvprop=ids%7Ctimestamp%7Cuser%7Csize&rvlimit=500'.format(page_title))
    json_payload = json.loads(req.text)
    
    try:
        pageid = list(json_payload['query']['pages'].keys())[0]
        revisions = json_payload['query']['pages'][pageid]['revisions']
        
        df = pd.DataFrame(revisions)
        df['timestamp'] = pd.to_datetime(df['timestamp'],format='%Y-%m-%dT%H:%M:%SZ') # pd.datetime is not available in recent pandas releases
        df['title'] = json_payload['query']['pages'][pageid]['title']
        return df
        
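    except KeyError:
        # Pages without revision data (e.g. pages that do not exist) return None, which pd.concat skips
        print('Error in {0}'.format(page_title))
        return None
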
def get_neighbors_500_revisions(page_title):
    """ Takes a page title and returns the 500 most-recent revisions for the page and its neighbors.
    
    Args:
      page_title = a string for the page title to get its revisions
      
    Returns:
      A pandas DataFrame containing all the page revisions.
    """
    
    alters = list(set(get_page_outlinks(page_title) + [page_title])) # Don't request the same page more than once
    
    df_list = []
    
    for alter in alters:
        _df = get_500_recent_revisions(alter)
        df_list.append(_df)
        
    df = pd.concat(df_list)
    return df
 
hyperlink_g_rev_df = get_neighbors_500_revisions(page_title)

hyperlink_g_rev_df.head()
hyperlink_g_gb_user_title = hyperlink_g_rev_df.groupby(['user','title'])
hyperlink_g_agg = hyperlink_g_gb_user_title.agg({'revid':pd.Series.nunique})
hyperlink_g_edgelist_df = hyperlink_g_agg.reset_index()
hyperlink_g_edgelist_df = hyperlink_g_edgelist_df.rename(columns={'revid':'weight'})

users = hyperlink_g_edgelist_df['user'].unique()
pages = hyperlink_g_edgelist_df['title'].unique()
# Note: in networkx 2.0 and later this function is named nx.from_pandas_edgelist
collab_g = nx.from_pandas_dataframe(hyperlink_g_edgelist_df,source='user',target='title',
                                    edge_attr=['weight'],create_using=nx.DiGraph())
collab_g.add_nodes_from(users,nodetype='user')
collab_g.add_nodes_from(pages,nodetype='page')

nx.write_gexf(collab_g,'collaboration_{0}.gexf'.format(page_title.replace(' ','_')))
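
The group-by step above turns the revision table into a weighted edge list: one row per (user, title) pair, weighted by the number of distinct revisions. Here is a small sketch of that aggregation with invented revision records, using plain Python instead of pandas:

```python
from collections import defaultdict

# Hypothetical revision records: (revid, user, title)
revisions = [
    (101, 'Alice', 'Wii'),
    (102, 'Alice', 'Wii'),
    (103, 'Bob',   'Wii'),
    (104, 'Alice', 'Nintendo'),
    (102, 'Alice', 'Wii'),       # duplicate row, e.g. if a page were fetched twice
]

# Collect the distinct revision ids for each (user, title) pair
revids = defaultdict(set)
for revid, user, title in revisions:
    revids[(user, title)].add(revid)

# Edge weight = number of unique revisions, mirroring pd.Series.nunique
edgelist = {pair: len(ids) for pair, ids in revids.items()}
print(edgelist)
```

Counting unique revision ids rather than rows means a duplicated record does not inflate the edge weight.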
 

Compute descriptive statistics for the collaboration network

cg_users = len(users)
cg_pages = len(pages)
cg_edges = collab_g.number_of_edges()

print("There are {0} pages, {1} users, and {2} edges in the collaboration network.".format(cg_pages,cg_users,cg_edges))
cg_density = nx.bipartite.density(collab_g,pages)
print('{0:.2%} of the possible edges actually exist.'.format(cg_density))
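
In a two-mode network where edges only run from users to pages, the number of possible edges is the number of users times the number of pages. The sketch below computes that ratio on made-up counts. One caveat worth verifying against your installed version: for directed graphs, I believe networkx's bipartite `density` divides by twice that product (allowing edges in both directions), so it would report half of the value computed this way.

```python
n_users = 4     # hypothetical counts
n_pages = 3
n_edges = 6     # user -> page edges

# Every user could edit every page, so there are n_users * n_pages possible edges
possible_edges = n_users * n_pages
user_page_density = n_edges / possible_edges

print('{0:.2%}'.format(user_page_density))
```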

Identify the most well-connected nodes

# Read the degrees directly; rescaling centralities and casting to int can truncate (e.g. 2.9999 becomes 2)
cg_in_degree_d = dict(collab_g.in_degree())
cg_out_degree_d = dict(collab_g.out_degree())
cg_degree_df = pd.DataFrame({'In':cg_in_degree_d,'Out':cg_out_degree_d})
cg_degree_df['In'].sort_values(ascending=False).head(10)
cg_degree_df['Out'].sort_values(ascending=False).head(10)
in_degree_dist_df = cg_degree_df['In'].value_counts().reset_index()
out_degree_dist_df = cg_degree_df['Out'].value_counts().reset_index()
revision_dist_df = hyperlink_g_edgelist_df['weight'].value_counts().reset_index()

f,ax = plt.subplots(1,1)
in_degree_dist_df.plot.scatter(x='index',y='In',ax=ax,c='blue',label='Page')
out_degree_dist_df.plot.scatter(x='index',y='Out',ax=ax,c='red',label='Editor')
revision_dist_df.plot.scatter(x='index',y='weight',ax=ax,c='green',label='Weight')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlim((0,1e3))
ax.set_ylim((0,1e5));