Lab 4 - Pageviews

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyright and made available under the Apache License v2.0 license.

This is the third of five lab notebooks that will explore how to analyze the structure of collaborations in Wikipedia data about users' revisions across multiple articles. This lab will extend the methods in the previous two labs about analyzing a single article's revision histories and analyzing the hyperlink networks around a single Wikipedia page. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4
c = 5
(c-a)**b

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retreive information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Package for requesting data via the web and parsing resulting JSON
import requests                      # http://docs.python-requests.org/en/master/
import json                          # https://docs.python.org/3/library/json.html
from bs4 import BeautifulSoup        # https://www.crummy.com/software/BeautifulSoup/bs4/doc/

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Setup the code environment to use plots with a white background and DataFrames show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Define an article to examine pageview dynamics.

page_title = 'Brexit'

Get pageview data for a single article

Details about the Wikimedia REST API for pageviews is available here. Unfortunately, this data end point only provides information going back to July 1, 2015.

This is what the API returns as an example.

# Get today's date and yesterday's date
today = pd.datetime.today()
yesterday = pd.datetime.today() - pd.to_timedelta('1 day')

# Date
today_date_s = str(today.date())
yesterday_date_s = str(yesterday.date())

# Convert to strings
today_s = pd.datetime.strftime(today,'%Y%m%d00')
yesterday_s = pd.datetime.strftime(yesterday,'%Y%m%d00')

# Get the pageviews for today and yesterday
url_string = 'http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{0}/daily/{1}/{2}'
print(url_string.format(page_title.replace(' ','_'),yesterday_s,today_s))
http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Brexit/daily/2016110100/2016110200

Write a function to get the pageviews from January 1, 2015 (in practice, the start date will be as late as August or as early as May) until yesterday.

def get_daily_pageviews(page_title,today_s):
    url_string = 'http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{0}/daily/2015010100/{1}'
    req = requests.get(url_string.format(page_title,today_s))

    json_s = json.loads(req.text)
    if 'items' in json_s.keys():
        _df = pd.DataFrame(json_s['items'])[['timestamp','views','article']]
        _df['timestamp'] = pd.to_datetime(_df['timestamp'],format='%Y%m%d00')
        _df['weekday'] = _df['timestamp'].apply(lambda x:x.weekday())
        return _df

Get the data for your page.

pageview_df = get_daily_pageviews(page_title,today_s)
pageview_df.head()
timestamp views article weekday
0 2015-07-01 59 Brexit 2
1 2015-07-02 44 Brexit 3
2 2015-07-03 40 Brexit 4
3 2015-07-04 49 Brexit 5
4 2015-07-05 64 Brexit 6

Interpret page view results

What does the pageview activity look like? Are there any bursts of attention? What might these bursts be linked to?

ax = pageview_df.plot.line(x='timestamp',y='views',logy=False,legend=False)
#ax = pageview_df.plot.line(x='timestamp',y='views',logy=True,legend=False)

ax.set_xlabel('')
ax.set_ylabel('Pageviews')

plt.savefig('problem2_F.png')

Use a logarithmic scaling for the y-axis to see more of the detail in the lower-traffic days.

ax = pageview_df.plot.line(x='timestamp',y='views',logy=True,legend=False)
ax.set_xlabel('')
ax.set_ylabel('Pageviews')
<matplotlib.text.Text at 0x7f80f2515128>

What are the dates for the biggest pageview outliers? Here we define an "outlier" to be more than 3 standard deviations above the average number of pageviews over the time window.

std_threshold = 4
threshold_val = pageview_df['views'].mean() + pageview_df['views'].std() * std_threshold
peak_days = pageview_df[pageview_df['views'] > threshold_val]

peak_days.head(10)
timestamp views article weekday
359 2016-06-24 42436 Brexit 4
467 2016-10-10 25067 Brexit 0
468 2016-10-11 32849 Brexit 1
469 2016-10-12 38317 Brexit 2

How much of the total pageview activity occurred on these days compared to the rest of the pageviews?

peak_fraction = pageview_df.loc[peak_days.index,'views'].sum()/pageview_df['views'].sum()

print('{0:.1%} of all pageviews occurred on the {1} peak days.'.format(peak_fraction,len(peak_days)))
11.2% of all pageviews occurred on the 4 peak days.

How does pageview activity change over the course of a week?

g = sb.factorplot(x='weekday',y='views',data=pageview_df,kind='bar',color='grey',
                  aspect=1.67,estimator=np.median)
ax = g.axes[0][0]
ax.set_xticklabels(['Mon','Tue','Wed','Thu','Fri','Sat','Sun'],rotation=0)
ax.set_xlabel('')
ax.set_ylabel('Average pageviews')

plt.savefig('problem2_Days.png')

Compare pageviews to another page

Lets write a function that takes a list of article names and returns a DataFrame indexed by date, columned by articles, and values being the number of pageviews.

def get_multiple_pageviews(page_list,today_s):
    multiple_pv_df = pd.DataFrame(index=pd.date_range('2015-05-01', today_date_s))
    for page in page_list:
        pv_df = get_daily_pageviews(page,today_s)
        try:
            multiple_pv_df[page] = pv_df.set_index('timestamp')['views'] 
        except:
            print("Error on: {0}".format(page))
            multiple_pv_df[page] = np.nan
    return multiple_pv_df.dropna(how='all')

Enter two related pages for which you want to compare their pageview behavior.

page_list = ['Brexit','European Union']

Get both of their data.

# Get the data
multiple_pvs = get_multiple_pageviews(page_list,today_s)

# Show the top rows
multiple_pvs.tail()
Brexit European Union
2016-10-28 13620.0 10769.0
2016-10-29 10641.0 8528.0
2016-10-30 9905.0 10081.0
2016-10-31 10623.0 11455.0
2016-11-01 10430.0 11419.0

Plot the data.

multiple_pvs.plot(logy=True)
#plt.savefig('problem3_Pit.png')
<matplotlib.axes._subplots.AxesSubplot at 0x7f80f2573630>

What is the correlation coefficient between these two articles' behavior?

multiple_pvs.apply(np.log).corr()
Brexit European Union
Brexit 1.000000 0.277592
European Union 0.277592 1.000000

How did the ratio between the two articles' pageviews change over time?

ratio_s = multiple_pvs[page_list[0]].div(multiple_pvs[page_list[1]])
ax = ratio_s.plot()
ax.set_ylabel('{0}/{1}'.format(page_list[0],page_list[1]))
<matplotlib.text.Text at 0x7f80f25c46d8>

Use the functions for resolving redirects and getting page outlinks from prior labs.

# From http://stackoverflow.com/a/312464/1574687
def make_chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def resolve_redirects(page_title_list):
    # Chunk the pages into a list of lists of size 50
    chunks = make_chunks(page_title_list,50)
    # Create an empty list to fill with the redirected titles
    redirected_page_titles = []
    # For each chunk try to get the redirects
    for chunk in chunks:
        # Create the query string that separates spaces within page titles by '+' 
        # and separates page titles by '|'
        page_titles_string = '|'.join([page.replace(' ','+') for page in chunk])
        # Put this large string into the URL
        url_string = 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=pageprops&titles={0}+&redirects=1'.format(page_titles_string)
        # Do the query and parse the JSON response into a dictionary
        req = json.loads(requests.get(url_string).text)
        # Convert the returned values containing redirects into a dictionary
        if 'redirects' in req['query'].keys():
            redirected_titles = {d['from']:d['to'] for d in req['query']['redirects']}
            # Add the redirected titles to the list
            for title in chunk:
                try:
                    #print(len(redirected_page_titles), title, redirected_titles[title])
                    redirected_page_titles.append(redirected_titles[title])
                # If they don't have a redirect just add the original title
                except KeyError:
                    #print(len(redirected_page_titles), '\nFrom: ', title, '\nTo: ', title)
                    redirected_page_titles.append(title)
        else:
            for title in chunk:
                redirected_page_titles.append(title)
    # Make sure the number of page titles remained the same, otherwise raise a warning
    if len(page_title_list) == len(redirected_page_titles):
        return redirected_page_titles
    else:
        print("WARNING! The number of page titles in the redirected list ({0}) is not equal to the input list ({1})".format(len(redirected_page_titles),len(page_title_list)))
        return redirected_page_titles

def get_page_outlinks(page_title,redirects=1):
    # Replace spaces with underscores
    #page_title = page_title.replace(' ','_')
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard','Portal:','s:']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    outlinks_list = [] 
    
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    title = link['title']
                    # Ignore links that aren't interesting
                    if all(bad not in title for bad in bad_titles):
                        outlinks_list.append(title)

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links that aren't interesting
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list

Get the outlinks.

raw_outlinks = get_page_outlinks(page_title)
redirected_outlinks = resolve_redirects(raw_outlinks)

Get the data.

This stage may take several minutes.

# Get the data
hl_pvs_df = get_multiple_pageviews(redirected_outlinks + [page_title],today_s)

# Show the top rows
hl_pvs_df.head()
Error on: Common Commercial Policy (page does not exist)
United Kingdom Withdrawal from the European Union United Kingdom European Union membership referendum, 2016 Portmanteau Theresa May Article 50 of the Treaty on European Union Bill (law) European Communities Act 1972 (UK) Statute book European Union law Law of the United Kingdom European Economic Community United Kingdom European Communities membership referendum, 1975 Labour Party (UK) Trade union Conservative Party (UK) UK Independence Party Greek withdrawal from the eurozone Greece Treaty of Rome President of France Charles de Gaulle Treaty of Accession 1972 Edward Heath Harold Wilson United Kingdom general election, October 1974 Labour Party (UK) Conference Pro-Europeanism Euroscepticism Constitutional convention (political custom) Cabinet collective responsibility Electoral district Shetland Outer Hebrides European Exchange Rate Mechanism Euro United Kingdom general election, 1983 Margaret Thatcher Single European Act Government of the United Kingdom Pound sterling Deutsche Mark Black Wednesday Maastricht Treaty European Union Treaty of Lisbon Treaty on the Functioning of the European Union Referendum Party James Goldsmith United Kingdom general election, 1997 ... Angus Dalgleish CERN European Molecular Biology Laboratory Lisbon Treaty London School of Economics Alan Sked Israel Treaty of European Union The Guardian Scottish independence Proposed second Scottish independence referendum Henry McLeish London, England Greater London Sadiq Khan London independence City-state Singapore Fabian Picardo Treaty of Utrecht José García-Margallo y Marfil Disputed status of Gibraltar Common Travel Area Enda Kenny Saint-Quentin, Aisne Xavier Bertrand Calais Dover Juxtaposed controls Emmanuel Macron François Hollande Bernard Cazeneuve Electricity sector in Ireland German Institute for International and Security Affairs Vienna Convention on the Law of Treaties European Communities (Amendment) Act 1986 European Union Act 2011 Book:Brexit, Article 50, and other articles Project Fear (British politics) Euroscepticism in the United Kingdom Multi-speed Europe Referendums related to the European Union Causes of the vote in favour of Brexit Frexit Dutch withdrawal from the European Union Digital object identifier Steve Peers OCLC DMOZ Brexit
2015-07-01 24031.0 1048.0 NaN 3373.0 789.0 NaN 406.0 25.0 9.0 271.0 274.0 1073.0 172.0 1877.0 1107.0 1743.0 1124.0 6144.0 60376.0 319.0 1512.0 2074.0 13.0 553.0 921.0 210.0 31.0 90.0 503.0 143.0 69.0 205.0 1173.0 459.0 983.0 9004.0 321.0 4594.0 111.0 832.0 3512.0 1044.0 592.0 953.0 16551.0 858.0 59.0 50.0 690.0 453.0 ... 7.0 2357.0 67.0 110.0 1100.0 32.0 12228.0 2.0 3280.0 495.0 NaN 44.0 155.0 1486.0 236.0 NaN 734.0 16127.0 38.0 295.0 16.0 123.0 210.0 1140.0 114.0 24.0 1103.0 655.0 64.0 126.0 2100.0 54.0 15.0 16.0 192.0 3.0 12.0 NaN NaN 71.0 74.0 151.0 NaN NaN NaN 2328.0 NaN 683.0 747.0 59.0
2015-07-02 24806.0 771.0 NaN 3295.0 698.0 NaN 372.0 27.0 18.0 272.0 347.0 923.0 191.0 1830.0 1257.0 1647.0 1058.0 4548.0 36390.0 346.0 1475.0 1917.0 8.0 636.0 846.0 201.0 40.0 73.0 568.0 121.0 90.0 210.0 1212.0 433.0 748.0 7485.0 288.0 5292.0 128.0 908.0 3626.0 835.0 516.0 851.0 14013.0 774.0 61.0 66.0 402.0 465.0 ... 5.0 2341.0 65.0 94.0 1190.0 63.0 11372.0 NaN 3122.0 482.0 NaN 42.0 151.0 1519.0 373.0 NaN 811.0 15631.0 42.0 352.0 16.0 137.0 266.0 1229.0 86.0 15.0 1035.0 549.0 39.0 147.0 1770.0 67.0 22.0 16.0 207.0 9.0 14.0 NaN NaN 88.0 80.0 225.0 NaN NaN NaN 3169.0 NaN 858.0 1012.0 44.0
2015-07-03 23958.0 968.0 NaN 3111.0 578.0 NaN 285.0 36.0 5.0 221.0 212.0 798.0 160.0 1628.0 1015.0 1386.0 1175.0 4459.0 32940.0 258.0 1519.0 2000.0 10.0 464.0 831.0 200.0 37.0 83.0 436.0 109.0 66.0 185.0 2851.0 494.0 693.0 6280.0 305.0 4855.0 121.0 668.0 3006.0 746.0 545.0 712.0 12084.0 602.0 33.0 38.0 430.0 479.0 ... 8.0 2923.0 30.0 105.0 1000.0 48.0 10653.0 1.0 3069.0 482.0 NaN 38.0 141.0 1387.0 308.0 NaN 662.0 13895.0 49.0 300.0 13.0 152.0 267.0 1152.0 118.0 30.0 771.0 667.0 47.0 176.0 1743.0 68.0 20.0 10.0 181.0 6.0 6.0 NaN NaN 64.0 65.0 160.0 NaN NaN NaN 1972.0 NaN 708.0 704.0 40.0
2015-07-04 22813.0 777.0 NaN 2882.0 557.0 NaN 263.0 26.0 5.0 151.0 205.0 731.0 171.0 1501.0 1244.0 1433.0 1136.0 3799.0 33687.0 230.0 1444.0 1993.0 8.0 553.0 858.0 150.0 20.0 74.0 417.0 82.0 85.0 155.0 1553.0 441.0 647.0 6464.0 295.0 4983.0 78.0 624.0 2988.0 733.0 455.0 738.0 11045.0 491.0 20.0 71.0 365.0 416.0 ... 2.0 2080.0 40.0 83.0 916.0 416.0 12132.0 2.0 3278.0 592.0 NaN 41.0 159.0 1269.0 371.0 NaN 642.0 13420.0 74.0 697.0 10.0 133.0 210.0 1030.0 98.0 30.0 1221.0 491.0 76.0 144.0 1389.0 78.0 8.0 11.0 122.0 1.0 6.0 NaN NaN 61.0 51.0 803.0 NaN NaN NaN 1847.0 NaN 823.0 676.0 49.0
2015-07-05 23968.0 1373.0 NaN 3065.0 616.0 NaN 220.0 146.0 8.0 200.0 249.0 1081.0 264.0 1970.0 992.0 1710.0 1124.0 9412.0 69144.0 374.0 1519.0 2235.0 10.0 546.0 951.0 209.0 28.0 104.0 623.0 79.0 85.0 145.0 1254.0 470.0 1139.0 10812.0 320.0 5963.0 105.0 670.0 3109.0 931.0 684.0 1041.0 17501.0 988.0 47.0 55.0 481.0 504.0 ... 8.0 2264.0 37.0 91.0 1136.0 51.0 12054.0 3.0 3203.0 557.0 NaN 29.0 163.0 1422.0 297.0 NaN 618.0 14580.0 53.0 582.0 16.0 184.0 223.0 1124.0 143.0 75.0 1093.0 518.0 43.0 359.0 2377.0 33.0 12.0 9.0 148.0 1.0 7.0 NaN NaN 91.0 74.0 566.0 NaN NaN NaN 2399.0 NaN 715.0 716.0 64.0

5 rows × 200 columns

What are the most-viewed articles in the hyperlink network?

most_viewed_articles = hl_pvs_df.cumsum().ix[yesterday_date_s]
most_viewed_articles = most_viewed_articles.sort_values(ascending=False)
most_viewed_articles.head(10)
United Kingdom    13720334.0
European Union     8657136.0
Singapore          8182760.0
Iceland            7208042.0
Israel             6897555.0
Netherlands        6179920.0
Switzerland        5913059.0
Norway             4605733.0
David Cameron      4418606.0
Greece             4394214.0
Name: 2016-11-01 00:00:00, dtype: float64

Most and least correlated articles

Which articles are most correlated with each other?

# Log the pageview data to reduce skew from bursty outliers abd make the correlation table
hl_corr_df = hl_pvs_df.apply(np.log).corr()

# Correlation table is symmetric, drop one half of them
# From: http://stackoverflow.com/questions/34417685/melt-the-upper-triangular-matrix-of-a-pandas-dataframe
hl_corr_df = hl_corr_df.where(np.triu(np.ones(hl_corr_df.shape)).astype(np.bool))

# Unstack the DataFrame into a series and sort
hl_corr_s = hl_corr_df.unstack().sort_values(ascending=False)

# Drop NaNs
hl_corr_s = hl_corr_s.dropna()

# Drop values equal to 1
hl_corr_s = hl_corr_s[hl_corr_s < 1]

List out the 10 most correlated articles.

hl_corr_s.head(10)
Book:Brexit, Article 50, and other articles  University of Bath                                      1.0
                                             David Cameron                                           1.0
                                             Police and Judicial Co-operation in Criminal Matters    1.0
                                             Angela Merkel                                           1.0
                                             Royal prerogative in the United Kingdom                 1.0
                                             George Osborne                                          1.0
                                             United Kingdom                                          1.0
                                             The Observer                                            1.0
                                             David Pannick, Baron Pannick                            1.0
                                             Attorney General for England and Wales                  1.0
dtype: float64

Inspect this correlation from the raw data.

_df = hl_pvs_df[list(hl_corr_s.index[0])]

ax = _df.plot(logy=True)

plt.savefig('problem4_HCor.png')

Look at the 10 least-correlated articles.

hl_corr_s.tail(10)
Book:Brexit, Article 50, and other articles  Liechtenstein                                       -1.0
                                             Electoral district                                  -1.0
                                             Withdrawal from the European Union                  -1.0
                                             Coming into force                                   -1.0
                                             David Davis (British politician)                    -1.0
                                             Secretary of State for Exiting the European Union   -1.0
                                             Better Off Out                                      -1.0
                                             Bernard Cazeneuve                                   -1.0
                                             Grassroots Out                                      -1.0
                                             European Union Act 2011                             -1.0
dtype: float64

Plot the correlation between the two most anti-correlated articles. These show some kinda wacky properties that are interesting to explore or think more about.

_df = hl_pvs_df[list(hl_corr_s.index[-1])]

ax = _df.plot(logy=True)

plt.savefig('problem4_LCor.png')

Is there a relationship between the position of the link on the page and the correlation between the linked article's pageviews and the seed article's pageviews? For instance, links closer to the top of the page might reflect more important topics while links towards the end of the page might be less relevant.

link_corrs = []

for num,link in enumerate(redirected_outlinks):
    try:
        link_corrs.append({'position':num,'title':link,'corr':hl_corr_s.ix[(page_title,link)]})
    except KeyError:
        print("Error on: {0}".format(link))
Error on: Common Commercial Policy (page does not exist)
Error on: Book:Brexit, Article 50, and other articles

Plot the results.

ax = pd.DataFrame(link_corrs).plot.scatter(x='position',y='corr')
ax.set_xlim((0,len(link_corrs)))
ax.set_ylim((-1,1))
ax.set_xlabel('Link position')
ax.set_ylabel('Correlation')
<matplotlib.text.Text at 0x7f80f0eedba8>

Get page revisions

In this section, we'll repurpose and adapt code from the last lab to get data about page revisions. Rather than looking at the number of times a user contributed to a given article, we'll simply count the number of times the article was edited on a given date.

def get_page_edits_by_date(page_title,conn,date_string='2014-12-31'):
    """ Takes a page title and returns the number of revisions made on each date.
      page_title = a string for the page title to get its revisions
      date_string = a string for the date in YYYY-MM-DD format
      conn = a database connection
      
    Returns:
      A DataFrame with username, page title, edit count, and min/max timestamps
    """
    # In case you pass a page title with spaces in it, replace the spaces with underscores
    page_title = page_title.replace(' ','_').encode('utf8').decode('latin1')
    
    # The MySQL query string used to retrieve the data. By line, it is
    ## converting the timestamp to a date and 
    ## counting the number of elements
    ## from the "revisions" table
    ## joining the "page" table on it
    ## using the page_id and rev_page columns as keys
    ## limiting the results to entries that have the pagetitle, 
    ## occur in the namespace, and happen after Dec 31, 2014
    ## grouping the results by date
    s = """
            SELECT
                DATE(rev_timestamp) as date,
                page_title,
                COUNT(*) as edits
            FROM 
                revision 
            JOIN 
                page ON page.page_id = revision.rev_page
            WHERE 
                page.page_title = "{0}" 
                AND page_namespace = 0
                AND DATE(rev_timestamp) > '{1}'
            GROUP BY
                date
        """.format(page_title,date_string)

    # Use the connection to run the query and return the results as a DataFrame
    _df = pd.read_sql_query(s,conn)
    
    _df['page_title'] = _df['page_title'].str.decode('utf8')
    _df['page_title'] = _df['page_title'].str.replace('_',' ')
    
    # Return the data, with a clean index
    return _df

def get_neighbors_revisions(page_title,conn):
    """ Takes a page title and returns revisions for the page and its neighbors.
      page_title = a string for the page title to get its revisions
      
    Returns:
      A pandas DataFrame containing all the page revisions.
    """
    # Get the outlinks from the page and include the page itself in the list
    alters = get_page_outlinks(page_title) + [page_title]
    # Resolve the redirects in the list of alters
    alters = list(set(resolve_redirects(alters)))
    # Create an empty container to hold the DataFrames
    df_list = []
    # For each page, get the revision counts and append to the df_list
    for alter in alters:
        _df = get_page_edits_by_date(alter,conn)
        df_list.append(_df)
    # Concatenate the list of revision count DataFrames into a giant DataFrame
    df = pd.concat(df_list)
    # Return the data
    return df.reset_index(drop=True)

Get the authentication information and connect to the database.

host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']
conn = pymysql.connect(host=host,user=user,password=password,database='enwiki_p',connect_timeout=3600)
conn.cursor().execute('use enwiki_p');

Get the number of revisions per day for all the articles.

hl_daily_rev_df = get_neighbors_revisions(page_title,conn)
hl_daily_rev_df.head()
date page_title edits
0 2016-06-26 European Union (Amendment) Act 2008 1.0
1 2016-06-27 European Union (Amendment) Act 2008 1.0
2 2015-05-27 Get Britain Out 3.0
3 2015-06-08 Get Britain Out 7.0
4 2016-02-06 Get Britain Out 2.0

Reindex the edit data so it's starting and ending on the same dates as the pageviews data.

# Convert into a format like the hl_pageviews DataFrame
# Index are dates between Jan 1, 2015 and today; columns are article titles; values are number of edits
hl_edits_df = hl_daily_rev_df.set_index(['date','page_title'])['edits'].unstack(1)

# Reindex so dates are continuous
pv_start_ix = str(hl_pvs_df.index.min().date())
_date_range = pd.date_range(pv_start_ix,yesterday_date_s)
hl_edits_df = hl_edits_df.reindex(index=_date_range)

# Fill in empty observations with 0s
hl_edits_df = hl_edits_df.fillna(0)

hl_edits_df.head()
page_title A-G v De Keyser's Royal Hotel Ltd Alan Sked Angela Merkel Angus Dalgleish Article 50 of the Treaty on European Union Attorney General for England and Wales Bernard Cazeneuve Better Off Out Bill (law) Black Wednesday Brexit Britain Stronger in Europe Budget of the European Union CERN Cabinet collective responsibility Calais Causes of the vote in favour of Brexit Chancellor of the Exchequer Charles de Gaulle Citizens Rights Directive 2004 City-state Clacton (UK Parliament constituency) Common Agricultural Policy Common Foreign and Security Policy Common Travel Area Conservative Party (UK) Constitution Unit Constitution of the United Kingdom Constitutional Reform and Governance Act 2010 Constitutional convention (political custom) Council of the European Union Country attractiveness Curia advisari vult DMOZ Daniel Hannan David Cameron David Davis (British politician) David Pannick, Baron Pannick Department for Exiting the European Union Deposit (politics) Deutsche Mark Devilling Digital object identifier Disputed status of Gibraltar Divisional court (England and Wales) Donald Tusk Dover Dutch withdrawal from the European Union Economic and Social Research Council Edward Heath ... Royal prerogative in the United Kingdom Sadiq Khan Saint-Quentin, Aisne Scottish independence Secretary of State for Exiting the European Union Secretary of State for Foreign and Commonwealth Affairs Senate of Berlin Shetland Signatories to the Treaty establishing a Constitution for Europe Signing of the Treaty of Lisbon Singapore Single European Act St George's, University of London Statute book Steve Peers Stuart Rose Supreme Court of the United Kingdom Swiss immigration referendum, February 2014 Switzerland Switzerland–European Union relations Tampon Terence Etherton The Guardian The Observer The Times Theresa May Tony Blair Trade union Treaty of Accession 1972 Treaty of Lisbon Treaty of Rome Treaty of Utrecht Treaty on the Functioning of the European Union UK Independence Party Udo Di Fabio United Kingdom United Kingdom European Communities membership referendum, 1975 United Kingdom European Union membership referendum, 2016 United Kingdom general election, 1983 United Kingdom general election, 1997 United Kingdom general election, 2015 United Kingdom general election, December 1910 United Kingdom general election, October 1974 University of Bath Vienna Convention on the Law of Treaties Vote Leave Voting in the Council of the European Union Withdrawal from the European Union World Pensions & Investments Forum Xavier Bertrand
2015-07-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 4.0 1.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2015-07-02 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 2.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2015-07-03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 24.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 15.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2015-07-04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 6.0 0.0 0.0
2015-07-05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 2.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 10.0 0.0 0.0

5 rows × 193 columns

Are pageviews and edits correlated with each other?

_s1 = hl_pvs_df[page_title]
_s2 = hl_edits_df[page_title]

np.corrcoef(_s1.apply(np.log),_s2)[0][1]
0.41526284297648647
single_pv_edits_df = pd.DataFrame({'pageviews':_s1,'edits':_s2})
ax = single_pv_edits_df.plot(secondary_y='edits',logy=True)
ax.right_ax.set_yscale('log')
ax.set_ylabel('Pageviews')
ax.right_ax.set_ylabel('Edits')

plt.savefig('problem5_PVE.png')

Can Wikipedia supply information to keep up with demand?

The ratio between the cumulative pageviews and cumulative edits.

ax = (_s1.cumsum()/_s2.cumsum()).plot()

ax.set_ylabel('Cumulative pageviews per edit')

plt.savefig('problem5_PV_E.png')
def zscore(series):
    return np.abs((series - series.mean())/series.std())

Look at the normalized (z-score) excitation and relaxation in edits and pageviews by day. Each point is a single day in the article's history and they're connected if they come one day after each other. Values along the diagonal in red suggest that increases in attention to the article are matched by similar increases in editing activity on the article. Alternatively, data points in the upper-left triangle suggest increases in pageviews are not matched by increases in edits while data points in the lower-right triangle suggest increases in edits are not matched by increases in pageviews.

f,ax = plt.subplots(1,1)

ax.set_xlabel('Edits (z-score)')
ax.set_ylabel('Pageviews (z-score)')
ax.set_xlim((1e-3,1e2))
ax.set_ylim((1e-3,1e2))
ax.set_xscale('log')
ax.set_yscale('log')

plt.text(1e-1,1e1,'More views than edits',ha='center',weight='bold')
plt.text(1e1,1e-1,'More edits than views',ha='center',weight='bold')

plt.plot([1e-3,1e2],[1e-3,1e2],axes=ax,c='r')

_s1 = zscore(hl_edits_df[page_title])
_s2 = zscore(hl_pvs_df[page_title])
plt.plot(_s1,_s2,'o-',axes=ax,c='grey');
f,ax = plt.subplots(1,1)

ax.set_xlabel('Edits (z-score)')
ax.set_ylabel('Pageviews (z-score)')
ax.set_xlim((1e-3,1e2))
ax.set_ylim((1e-3,1e2))
ax.set_xscale('log')
ax.set_yscale('log')

for page in hl_edits_df.columns:
    _s1 = zscore(hl_edits_df[page])
    _s2 = zscore(hl_pvs_df[page])
    #plt.plot(_s1.ix[_s2.index],_s2,'.-',axes=ax,alpha=.125,c='grey')
    plt.scatter(_s1,_s2,axes=ax,alpha=.125,c='grey')
    
# Diagonal red line showing perfect matching
plt.plot([1e-3,1e2],[1e-3,1e2],axes=ax,c='r',zorder=-1);

plt.savefig('problem5_Scat.png')

Did a burst of pageviews diffuse to adjacent pages?

On what day did the maximum number of pageviews occur?

page_max_pv_date = hl_pvs_df[page_title].idxmax()
print(page_max_pv_date.date())
2016-06-24
# For each article, what was the date when the maximum number of 
max_pv_dates_df = hl_pvs_df.idxmax()

# Check whether these dates were also the page_max_pv_date
_fraction = max_pv_dates_df.isin([page_max_pv_date,page_max_pv_date+1]).sum()/len(max_pv_dates_df)

print("{0:.1%} of articles in the hyperlink network also had the max pageview on the same date as the article.".format(_fraction))
41.0% of articles in the hyperlink network also had the max pageview on the same date as the article.

What are these "co-bursting" pages?

cobursting_pv_articles = max_pv_dates_df[max_pv_dates_df.isin([page_max_pv_date,page_max_pv_date+1])].sort_index()
cobursting_pv_articles
Alan Sked                                                         2016-06-24
Better Off Out                                                    2016-06-24
Black Wednesday                                                   2016-06-24
Brexit                                                            2016-06-24
Britain Stronger in Europe                                        2016-06-24
Budget of the European Union                                      2016-06-24
City-state                                                        2016-06-24
Clacton (UK Parliament constituency)                              2016-06-24
Common Agricultural Policy                                        2016-06-24
Common Foreign and Security Policy                                2016-06-24
Conservative Party (UK)                                           2016-06-24
Constitution Unit                                                 2016-06-24
Constitution of the United Kingdom                                2016-06-24
Council of the European Union                                     2016-06-24
Country attractiveness                                            2016-06-25
Daniel Hannan                                                     2016-06-24
David Cameron                                                     2016-06-24
Deutsche Mark                                                     2016-06-25
Disputed status of Gibraltar                                      2016-06-24
Donald Tusk                                                       2016-06-24
Euro                                                              2016-06-24
European Council                                                  2016-06-24
European Court of Justice                                         2016-06-24
European Economic Area                                            2016-06-24
European Economic Community                                       2016-06-24
European Exchange Rate Mechanism                                  2016-06-24
European Free Trade Association                                   2016-06-24
European Parliament election, 2004 (United Kingdom)               2016-06-24
European Parliament election, 2009 (United Kingdom)               2016-06-24
European Parliament election, 2014 (United Kingdom)               2016-06-24
European Union                                                    2016-06-24
European Union (Amendment) Act 2008                               2016-06-25
European Union Act 2011                                           2016-06-24
European Union Referendum Act 2015                                2016-06-24
European Union law                                                2016-06-24
Euroscepticism                                                    2016-06-24
Euroscepticism in the United Kingdom                              2016-06-24
Geert Wilders                                                     2016-06-24
German Institute for International and Security Affairs           2016-06-24
Gordon Brown                                                      2016-06-24
Government of the United Kingdom                                  2016-06-24
Grassroots Out                                                    2016-06-24
Greater London                                                    2016-06-24
Greek withdrawal from the eurozone                                2016-06-24
José García-Margallo y Marfil                                     2016-06-24
Juxtaposed controls                                               2016-06-24
Labour In for Britain                                             2016-06-24
Labour Party (UK)                                                 2016-06-24
Lisbon Treaty                                                     2016-06-24
Maastricht Treaty                                                 2016-06-24
Multi-speed Europe                                                2016-06-24
Nick Clegg                                                        2016-06-24
Nigel Farage                                                      2016-06-24
Outer Hebrides                                                    2016-06-24
Portmanteau                                                       2016-06-24
Pound sterling                                                    2016-06-24
Pro-Europeanism                                                   2016-06-24
Project Fear (British politics)                                   2016-06-24
Ratification of the Treaty of Lisbon                              2016-06-24
Referendums related to the European Union                         2016-06-24
Scottish independence                                             2016-06-24
Signing of the Treaty of Lisbon                                   2016-06-24
Single European Act                                               2016-06-24
Swiss immigration referendum, February 2014                       2016-06-24
Switzerland–European Union relations                              2016-06-24
Treaty of Accession 1972                                          2016-06-24
Treaty of European Union                                          2016-06-24
Treaty of Lisbon                                                  2016-06-24
Treaty of Rome                                                    2016-06-24
Treaty on the Functioning of the European Union                   2016-06-24
UK Independence Party                                             2016-06-24
United Kingdom                                                    2016-06-24
United Kingdom European Communities membership referendum, 1975   2016-06-24
United Kingdom European Union membership referendum, 2016         2016-06-24
United Kingdom general election, 1983                             2016-06-24
United Kingdom general election, 1997                             2016-06-24
United Kingdom general election, 2015                             2016-06-24
United Kingdom general election, December 1910                    2016-06-24
United Kingdom general election, October 1974                     2016-06-24
Vienna Convention on the Law of Treaties                          2016-06-24
Vote Leave                                                        2016-06-24
Voting in the Council of the European Union                       2016-06-24
dtype: datetime64[ns]

Did this burst of pageviews translate into edits on these pages as well?

Get the articles with the peak amount of activity by date and then subset to only include those articles whose peak dates are the same as the peak pageview dates.

max_edits_dates_df = hl_edits_df.idxmax()
cobursting_edit_articles = max_edits_dates_df[max_edits_dates_df.isin([page_max_pv_date,page_max_pv_date+1])].sort_index()
cobursting_edit_articles
page_title
Brexit                                                      2016-06-24
Constitution Unit                                           2016-06-24
David Cameron                                               2016-06-24
European Union                                              2016-06-24
London independence                                         2016-06-25
Portmanteau                                                 2016-06-25
Scottish independence                                       2016-06-24
Swiss immigration referendum, February 2014                 2016-06-24
United Kingdom                                              2016-06-24
United Kingdom European Union membership referendum, 2016   2016-06-24
Withdrawal from the European Union                          2016-06-24
dtype: datetime64[ns]

How many edits beyond the target article did this translate into? Subset the dataframe to the relevant dates and articles and inspect.

dates = [page_max_pv_date,page_max_pv_date+1]
cobursting_edit_articles_list = list(set(cobursting_edit_articles.index) - set(page_title))
cobursting_hl_edits_df = hl_edits_df.loc[dates,cobursting_edit_articles.index]
cobursting_hl_edits_df
page_title Brexit Constitution Unit David Cameron European Union London independence Portmanteau Scottish independence Swiss immigration referendum, February 2014 United Kingdom United Kingdom European Union membership referendum, 2016 Withdrawal from the European Union
2016-06-24 178.0 1.0 161.0 93.0 0.0 1.0 27.0 6.0 44.0 346.0 48.0
2016-06-25 26.0 0.0 37.0 27.0 39.0 5.0 3.0 1.0 4.0 173.0 15.0

Now sum up all these edits.

cobursting_hl_edits_df.sum().sum()
1235.0