Lab 4 - Pageviews

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the fourth of five lab notebooks that explore how to analyze the structure of collaborations in Wikipedia data about users' revisions across multiple articles. This lab extends the methods from the previous two labs, which analyzed a single article's revision history and the hyperlink network around a single Wikipedia page. You do not need to be fluent in either to complete this lab, but there are many options for extending the analyses we do here with more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4
c = 5
(c-a)**b
16

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Package for requesting data via the web and parsing resulting JSON
import requests                      # http://docs.python-requests.org/en/master/
import json                          # https://docs.python.org/3/library/json.html
from bs4 import BeautifulSoup        # https://www.crummy.com/software/BeautifulSoup/bs4/doc/

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Setup the code environment to use plots with a white background and DataFrames show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Define an article whose pageview dynamics we'll examine.

page_title = 'Brexit'

Get pageview data for a single article

Details about the Wikimedia REST API for pageviews are available here. Unfortunately, this endpoint only provides data going back to July 1, 2015.

This is what the API returns as an example.

# Get today's date and a date two days back: the API's most recent data can lag behind the current day
today = pd.datetime.today()
yesterday = pd.datetime.today() - pd.to_timedelta('2 day')

# Convert to YYYY-MM-DD date strings
today_date_s = str(today.date())
yesterday_date_s = str(yesterday.date())

# Convert to the YYYYMMDDHH strings the API expects
today_s = pd.datetime.strftime(today,'%Y%m%d00')
yesterday_s = pd.datetime.strftime(yesterday,'%Y%m%d00')

# Get the pageviews for today and yesterday
url_string = 'http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{0}/daily/{1}/{2}'
print(url_string.format(page_title.replace(' ','_'),yesterday_s,today_s))
http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Brexit/daily/2016103000/2016110100
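
To see what the API actually returns, we can request this URL and parse the JSON payload ourselves. This is a quick sketch: the 'items', 'timestamp', 'views', and 'article' fields it prints are the same ones the get_daily_pageviews function below relies on.

# Request the example URL built above and parse the JSON response
req = requests.get(url_string.format(page_title.replace(' ','_'),yesterday_s,today_s))
response = json.loads(req.text)

# The payload is a dictionary with an 'items' key containing one record per day
for item in response.get('items',[]):
    print(item['timestamp'], item['views'], item['article'])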

Write a function to request the pageviews from January 1, 2015 until yesterday. In practice, the API's data only begins in mid-2015, so the earliest returned dates will fall somewhere between May and August of that year.

def get_daily_pageviews(page_title,today_s):
    url_string = 'http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{0}/daily/2015010100/{1}'
    # Replace spaces with underscores so the page title is URL-safe
    req = requests.get(url_string.format(page_title.replace(' ','_'),today_s))

    json_s = json.loads(req.text)
    # If the API has no data for the page, 'items' is absent and the function returns None
    if 'items' in json_s.keys():
        _df = pd.DataFrame(json_s['items'])[['timestamp','views','article']]
        _df['timestamp'] = pd.to_datetime(_df['timestamp'],format='%Y%m%d00')
        _df['weekday'] = _df['timestamp'].apply(lambda x:x.weekday())
        return _df

Get the data for your page.

pageview_df = get_daily_pageviews(page_title,today_s)
pageview_df.head()
timestamp views article weekday
0 2015-07-01 59 Brexit 2
1 2015-07-02 44 Brexit 3
2 2015-07-03 40 Brexit 4
3 2015-07-04 49 Brexit 5
4 2015-07-05 64 Brexit 6
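
Before plotting, a quick look at the distribution of daily views can flag how skewed the data are:

# Summary statistics for the daily pageview counts
pageview_df['views'].describe()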

Interpret pageview results

What does the pageview activity look like? Are there any bursts of attention? What might these bursts be linked to?

ax = pageview_df.plot.line(x='timestamp',y='views',logy=False,legend=False)
ax.set_xlabel('')
ax.set_ylabel('Pageviews')

Use a logarithmic scaling for the y-axis to see more of the detail in the lower-traffic days.

ax = pageview_df.plot.line(x='timestamp',y='views',logy=True,legend=False)
ax.set_xlabel('')
ax.set_ylabel('Pageviews')

What are the dates for the biggest pageview outliers? Here we define an "outlier" to be more than 4 standard deviations above the average number of pageviews over the time window.

std_threshold = 4
threshold_val = pageview_df['views'].mean() + pageview_df['views'].std() * std_threshold
peak_days = pageview_df[pageview_df['views'] > threshold_val]

peak_days.head(10)
timestamp views article weekday
359 2016-06-24 42436 Brexit 4
467 2016-10-10 25067 Brexit 0
468 2016-10-11 32849 Brexit 1
469 2016-10-12 38317 Brexit 2

How much of the total pageview activity occurred on these days compared to the rest of the pageviews?

peak_fraction = pageview_df.loc[peak_days.index,'views'].sum()/pageview_df['views'].sum()

print('{0:.1%} of all pageviews occurred on the {1} peak days.'.format(peak_fraction,len(peak_days)))
11.3% of all pageviews occurred on the 4 peak days.
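
Note that the mean and standard deviation are themselves inflated by the very bursts we want to detect. A more robust alternative (a sketch, not used in the rest of the lab) thresholds on the median absolute deviation instead:

# Flag days more than std_threshold scaled MADs above the median
# (1.4826 scales the MAD to match the standard deviation for normal data)
median_views = pageview_df['views'].median()
mad_views = (pageview_df['views'] - median_views).abs().median()
robust_threshold = median_views + 1.4826 * mad_views * std_threshold
pageview_df[pageview_df['views'] > robust_threshold].head(10)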

How does pageview activity change over the course of a week?

g = sb.factorplot(x='weekday',y='views',data=pageview_df,kind='bar',color='grey',
                  aspect=1.67,estimator=np.median)
ax = g.axes[0][0]
ax.set_xticklabels(['Mon','Tue','Wed','Thu','Fri','Sat','Sun'],rotation=0)
ax.set_xlabel('')
ax.set_ylabel('Median pageviews')
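
The numbers behind the bar plot can also be computed directly with a groupby; weekday 0 is Monday and 6 is Sunday.

# Median pageviews by day of the week
pageview_df.groupby('weekday')['views'].median()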

Compare pageviews to another page

Let's write a function that takes a list of article names and returns a DataFrame indexed by date, with one column per article and the number of pageviews as values.

def get_multiple_pageviews(page_list,today_s):
    multiple_pv_df = pd.DataFrame(index=pd.date_range('2015-05-01', today_date_s))
    for page in page_list:
        pv_df = get_daily_pageviews(page,today_s)
        try:
            multiple_pv_df[page] = pv_df.set_index('timestamp')['views'] 
        # get_daily_pageviews returns None when the API has no data for the page
        except (AttributeError, KeyError):
            print("Error on: {0}".format(page))
            multiple_pv_df[page] = np.nan
    return multiple_pv_df.dropna(how='all')

Enter two related pages whose pageview behavior you want to compare.

page_list = ['Brexit','European Union']

Get both of their data.

# Get the data
multiple_pvs = get_multiple_pageviews(page_list,today_s)

# Show the last rows
multiple_pvs.tail()
Brexit European Union
2016-10-27 11014.0 13550.0
2016-10-28 13620.0 10769.0
2016-10-29 10641.0 8528.0
2016-10-30 9905.0 10081.0
2016-10-31 10623.0 11455.0

Plot the data.

multiple_pvs.plot(logy=True)

What is the correlation coefficient between these two articles' behavior?

multiple_pvs.apply(np.log).corr()
Brexit European Union
Brexit 1.000000 0.279141
European Union 0.279141 1.000000
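
A single coefficient can hide how the relationship changes over time. A rolling-window correlation (a sketch; the 30-day window is an arbitrary choice) shows when the two series move together and when they decouple:

# 30-day rolling correlation between the two logged pageview series
logged = multiple_pvs.apply(np.log)
rolling_corr = logged[page_list[0]].rolling(window=30).corr(logged[page_list[1]])
ax = rolling_corr.plot()
ax.set_ylabel('30-day rolling correlation')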

How did the ratio between the two articles' pageviews change over time?

ratio_s = multiple_pvs[page_list[0]].div(multiple_pvs[page_list[1]])
ax = ratio_s.plot()
ax.set_ylabel('{0}/{1}'.format(page_list[0],page_list[1]))

Use the functions for resolving redirects and getting page outlinks from prior labs.

# From http://stackoverflow.com/a/312464/1574687
def make_chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def resolve_redirects(page_title_list):
    # Chunk the pages into a list of lists of size 50
    chunks = make_chunks(page_title_list,50)
    # Create an empty list to fill with the redirected titles
    redirected_page_titles = []
    # For each chunk try to get the redirects
    for chunk in chunks:
        # Create the query string that separates spaces within page titles by '+' 
        # and separates page titles by '|'
        page_titles_string = '|'.join([page.replace(' ','+') for page in chunk])
        # Put this large string into the URL
        url_string = 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=pageprops&titles={0}&redirects=1'.format(page_titles_string)
        # Do the query and parse the JSON response into a dictionary
        req = json.loads(requests.get(url_string).text)
        # Convert the returned values containing redirects into a dictionary
        if 'redirects' in req['query'].keys():
            redirected_titles = {d['from']:d['to'] for d in req['query']['redirects']}
            # Add the redirected titles to the list
            for title in chunk:
                try:
                    #print(len(redirected_page_titles), title, redirected_titles[title])
                    redirected_page_titles.append(redirected_titles[title])
                # If they don't have a redirect just add the original title
                except KeyError:
                    #print(len(redirected_page_titles), '\nFrom: ', title, '\nTo: ', title)
                    redirected_page_titles.append(title)
        else:
            for title in chunk:
                redirected_page_titles.append(title)
    # Make sure the number of page titles remained the same, otherwise raise a warning
    if len(page_title_list) == len(redirected_page_titles):
        return redirected_page_titles
    else:
        print("WARNING! The number of page titles in the redirected list ({0}) is not equal to the input list ({1})".format(len(redirected_page_titles),len(page_title_list)))
        return redirected_page_titles

def get_page_outlinks(page_title,redirects=1):
    # Replace spaces with underscores
    #page_title = page_title.replace(' ','_')
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard','Portal:','s:']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    outlinks_list = [] 
    
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    title = link['title']
                    # Ignore links that aren't interesting
                    if all(bad not in title for bad in bad_titles):
                        outlinks_list.append(title)

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links that aren't interesting
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list

Get the outlinks.

raw_outlinks = get_page_outlinks(page_title)
redirected_outlinks = resolve_redirects(raw_outlinks)

Get the data.

This stage may take several minutes.

# Get the data
hl_pvs_df = get_multiple_pageviews(redirected_outlinks + [page_title],today_s)

# Show the top rows
hl_pvs_df.head()
Error on: Common Commercial Policy (page does not exist)
[First five rows of the DataFrame omitted: 5 rows × 200 columns of daily pageviews, one column per linked article, indexed by date beginning 2015-07-01.]
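
Because this stage is slow, it can be worth caching the result to disk so re-running the notebook doesn't re-query the API. A sketch, with a hypothetical filename:

# Save the pageview DataFrame for later sessions
hl_pvs_df.to_csv('hl_pageviews.csv')

# Reload it later with the dates parsed back into the index
# hl_pvs_df = pd.read_csv('hl_pageviews.csv', index_col=0, parse_dates=True)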

What are the most-viewed articles in the hyperlink network?

most_viewed_articles = hl_pvs_df.cumsum().loc[yesterday_date_s]
most_viewed_articles = most_viewed_articles.sort_values(ascending=False)
most_viewed_articles.head(10)
United Kingdom    13671336.0
European Union     8634262.0
Singapore          8152456.0
Iceland            7185600.0
Israel             6873488.0
Netherlands        6156433.0
Switzerland        5891675.0
Norway             4589061.0
David Cameron      4411145.0
Greece             4380074.0
Name: 2016-10-30 00:00:00, dtype: float64
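
Cumulative totals favor articles whose pageview histories cover the whole window. Ranking by mean daily views is an alternative that treats articles with shorter histories more evenly:

# Rank articles by average daily pageviews instead of cumulative totals
hl_pvs_df.mean().sort_values(ascending=False).head(10)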

Most and least correlated articles

Which articles are most correlated with each other?

# Log the pageview data to reduce skew from bursty outliers and make the correlation table
hl_corr_df = hl_pvs_df.apply(np.log).corr()

# Correlation table is symmetric, drop one half of them
# From: http://stackoverflow.com/questions/34417685/melt-the-upper-triangular-matrix-of-a-pandas-dataframe
hl_corr_df = hl_corr_df.where(np.triu(np.ones(hl_corr_df.shape)).astype(bool))

# Unstack the DataFrame into a series and sort
hl_corr_s = hl_corr_df.unstack().sort_values(ascending=False)

# Drop NaNs
hl_corr_s = hl_corr_s.dropna()

# Drop values equal to 1
hl_corr_s = hl_corr_s[hl_corr_s < 1]

List the 10 most-correlated article pairs.

hl_corr_s.head(10)
Council of the European Union                                    European Council                                             0.971216
Dutch withdrawal from the European Union                         Frexit                                                       0.953087
Switzerland–European Union relations                             United Kingdom European Union membership referendum, 2016    0.944353
Secretary of State for Exiting the European Union                David Davis (British politician)                             0.944072
                                                                 Secretary of State for Foreign and Commonwealth Affairs      0.939215
London independence                                              United Kingdom European Union membership referendum, 2016    0.938595
Treaty of Rome                                                   European Economic Community                                  0.935124
United Kingdom European Communities membership referendum, 1975  United Kingdom European Union membership referendum, 2016    0.932271
Maastricht Treaty                                                European Economic Community                                  0.927433
European Free Trade Association                                  European Economic Area                                       0.927009
dtype: float64

Inspect this correlation from the raw data.

_df = hl_pvs_df[list(hl_corr_s.index[0])]

ax = _df.plot(logy=True)

Look at the 10 least-correlated article pairs.

hl_corr_s.tail(10)
London independence             Citizens Rights Directive 2004                              -0.749153
Citizens Rights Directive 2004  United Kingdom general election, 1997                       -0.751446
                                United Kingdom European Union membership referendum, 2016   -0.751910
                                Gordon Brown                                                -0.757985
                                George Osborne                                              -0.759081
                                United Kingdom general election, October 1974               -0.765404
                                Conservative Party (UK)                                     -0.765705
                                David Cameron                                               -0.772506
                                Withdrawal from the European Union                          -0.777345
London independence             European Single Market                                      -0.847026
dtype: float64

Plot the two most anti-correlated articles against each other. These show some unusual dynamics that are worth exploring and thinking about further.

_df = hl_pvs_df[list(hl_corr_s.index[-1])]

ax = _df.plot(logy=True)

Is there a relationship between the position of the link on the page and the correlation between the linked article's pageviews and the seed article's pageviews? For instance, links closer to the top of the page might reflect more important topics while links towards the end of the page might be less relevant.

link_corrs = []

for num,link in enumerate(redirected_outlinks):
    try:
        link_corrs.append({'position':num,'title':link,'corr':hl_corr_s.loc[(page_title,link)]})
    except KeyError:
        print("Error on: {0}".format(link))
Error on: Common Commercial Policy (page does not exist)
Error on: Book:Brexit, Article 50, and other articles

Plot the results.

ax = pd.DataFrame(link_corrs).plot.scatter(x='position',y='corr')
ax.set_xlim((0,len(link_corrs)))
ax.set_ylim((-1,1))
ax.set_xlabel('Link position')
ax.set_ylabel('Correlation')
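
To quantify what the scatter plot suggests, a rank correlation between link position and pageview correlation (a sketch) tests whether links nearer the top of the page track the seed article more closely:

# Spearman rank correlation between link position and pageview correlation
link_corr_df = pd.DataFrame(link_corrs)
link_corr_df[['position','corr']].corr(method='spearman')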

Get page revisions

In this section, we'll repurpose and adapt code from the last lab to get data about page revisions. Rather than looking at the number of times a user contributed to a given article, we'll simply count the number of times the article was edited on a given date.

def get_page_edits_by_date(page_title,conn,date_string='2014-12-31'):
    """ Takes a page title and returns the number of revisions made on each date.
      page_title = a string for the page title to get its revisions
      conn = a database connection
      date_string = a string for the cutoff date in YYYY-MM-DD format
      
    Returns:
      A DataFrame with date, page title, and edit count
    """
    # In case you pass a page title with spaces in it, replace the spaces with underscores
    page_title = page_title.replace(' ','_').encode('utf8').decode('latin1')
    
    # The MySQL query string used to retrieve the data. By line, it is
    ## converting the timestamp to a date and 
    ## counting the number of elements
    ## from the "revisions" table
    ## joining the "page" table on it
    ## using the page_id and rev_page columns as keys
    ## limiting the results to entries that have the page title, 
    ## occur in the main (article) namespace, and happen after the cutoff date
    ## grouping the results by date
    s = """
            SELECT
                DATE(rev_timestamp) as date,
                page_title,
                COUNT(*) as edits
            FROM 
                revision 
            JOIN 
                page ON page.page_id = revision.rev_page
            WHERE 
                page.page_title = "{0}" 
                AND page_namespace = 0
                AND DATE(rev_timestamp) > '{1}'
            GROUP BY
                date
        """.format(page_title,date_string)

    # Use the connection to run the query and return the results as a DataFrame
    _df = pd.read_sql_query(s,conn)
    
    _df['page_title'] = _df['page_title'].str.decode('utf8')
    _df['page_title'] = _df['page_title'].str.replace('_',' ')
    
    # Return the data, with a clean index
    return _df

def get_neighbors_revisions(page_title,conn):
    """ Takes a page title and returns revisions for the page and its neighbors.
      page_title = a string for the page title to get its revisions
      conn = a database connection
      
    Returns:
      A pandas DataFrame containing all the page revisions.
    """
    # Get the outlinks from the page and include the page itself in the list
    alters = get_page_outlinks(page_title) + [page_title]
    # Resolve the redirects in the list of alters
    alters = list(set(resolve_redirects(alters)))
    # Create an empty container to hold the DataFrames
    df_list = []
    # For each page, get the revision counts and append to the df_list
    for alter in alters:
        _df = get_page_edits_by_date(alter,conn)
        df_list.append(_df)
    # Concatenate the list of revision count DataFrames into a giant DataFrame
    df = pd.concat(df_list)
    # Return the data
    return df.reset_index(drop=True)

Get the authentication information and connect to the database.

host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']
conn = pymysql.connect(host=host,user=user,password=password,database='enwiki_p',connect_timeout=3600)
conn.cursor().execute('use enwiki_p');
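
A quick query (a sketch, reusing the same schema as the function above) confirms the connection works before launching the longer jobs below:

# Sanity check: count this article's revisions in the main namespace
pd.read_sql_query("""
    SELECT COUNT(*) as revisions
    FROM revision
    JOIN page ON page.page_id = revision.rev_page
    WHERE page.page_title = "{0}"
      AND page_namespace = 0
""".format(page_title.replace(' ','_')), conn)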

Get the number of revisions per day for all the articles.

hl_daily_rev_df = get_neighbors_revisions(page_title,conn)
hl_daily_rev_df.head()
date page_title edits
0 2015-01-10 Common Foreign and Security Policy 1.0
1 2015-01-16 Common Foreign and Security Policy 1.0
2 2015-04-03 Common Foreign and Security Policy 1.0
3 2015-05-01 Common Foreign and Security Policy 3.0
4 2015-05-03 Common Foreign and Security Policy 1.0

Reindex the edit data so it starts and ends on the same dates as the pageview data.

# Convert into a format like the hl_pageviews DataFrame
# Index is a continuous range of dates from the start of the pageview data through yesterday; columns are article titles; values are numbers of edits
hl_edits_df = hl_daily_rev_df.set_index(['date','page_title'])['edits'].unstack(1)

# Reindex so dates are continuous
pv_start_ix = str(hl_pvs_df.index.min().date())
_date_range = pd.date_range(pv_start_ix,yesterday_date_s)
hl_edits_df = hl_edits_df.reindex(index=_date_range)

# Fill in empty observations with 0s
hl_edits_df = hl_edits_df.fillna(0)

hl_edits_df.head()
[First five rows of the reindexed DataFrame omitted: 5 rows × 193 columns of daily edit counts, one column per article, indexed by date beginning 2015-07-01.]
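
Even after reindexing, the pageview and edit DataFrames may not cover identical date ranges, since the pageview API can lag by a day or two. This matters for the correlations below, so check the coverage first:

# Compare the date coverage and lengths of the two DataFrames
print(hl_pvs_df.index.min(), hl_pvs_df.index.max(), len(hl_pvs_df))
print(hl_edits_df.index.min(), hl_edits_df.index.max(), len(hl_edits_df))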

Are pageviews and edits correlated with each other?

_s1 = hl_pvs_df[page_title]
_s2 = hl_edits_df[page_title]

# The two series cover slightly different date ranges, so align them
# on their shared dates before computing the correlation
_s1, _s2 = _s1.align(_s2, join='inner')

np.corrcoef(_s1.apply(np.log),_s2)[0][1]

single_pv_edits_df = pd.DataFrame({'pageviews':_s1,'edits':_s2})
ax = single_pv_edits_df.plot(secondary_y='edits',logy=True)
ax.right_ax.set_yscale('log')
ax.set_ylabel('Pageviews')
ax.right_ax.set_ylabel('Edits')

Can Wikipedia supply information to keep up with demand?

Plot the ratio between the cumulative pageviews and cumulative edits.

ax = (_s1.cumsum()/_s2.cumsum()).plot()

ax.set_ylabel('Cumulative pageviews per edit')
def zscore(series):
    # Absolute z-score: distance from the series mean in units of its standard deviation
    return np.abs((series - series.mean())/series.std())

Look at the normalized (z-score) excitation and relaxation in edits and pageviews by day. Each point is a single day in the article's history and they're connected if they come one day after each other. Values along the diagonal in red suggest that increases in attention to the article are matched by similar increases in editing activity on the article. Alternatively, data points in the upper-left triangle suggest increases in pageviews are not matched by increases in edits while data points in the lower-right triangle suggest increases in edits are not matched by increases in pageviews.

f,ax = plt.subplots(1,1)

ax.set_xlabel('Edits (z-score)')
ax.set_ylabel('Pageviews (z-score)')
ax.set_xlim((1e-3,1e2))
ax.set_ylim((1e-3,1e2))
ax.set_xscale('log')
ax.set_yscale('log')

plt.text(1e-1,1e1,'More views than edits',ha='center',weight='bold')
plt.text(1e1,1e-1,'More edits than views',ha='center',weight='bold')

plt.plot([1e-3,1e2],[1e-3,1e2],axes=ax,c='r')

_s1 = zscore(hl_edits_df[page_title])
_s2 = zscore(hl_pvs_df[page_title])

# Align the two series on their shared dates so the x and y arrays match
_s1, _s2 = _s1.align(_s2, join='inner')
plt.plot(_s1,_s2,'o-',axes=ax,c='grey');

f,ax = plt.subplots(1,1)

ax.set_xlabel('Edits (z-score)')
ax.set_ylabel('Pageviews (z-score)')
ax.set_xlim((1e-3,1e2))
ax.set_ylim((1e-3,1e2))
ax.set_xscale('log')
ax.set_yscale('log')

for page in hl_edits_df.columns:
    # Skip pages that are missing from the pageview data
    if page not in hl_pvs_df.columns:
        continue
    _s1 = zscore(hl_edits_df[page])
    _s2 = zscore(hl_pvs_df[page])
    # Align on shared dates so the x and y arrays have matching lengths
    _s1, _s2 = _s1.align(_s2, join='inner')
    plt.scatter(_s1,_s2,axes=ax,alpha=.125,c='grey')
    
# Diagonal red line showing perfect matching
plt.plot([1e-3,1e2],[1e-3,1e2],axes=ax,c='r',zorder=-1);