Lab 4 - Pageviews

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the fourth of five lab notebooks that will explore how to analyze the structure of collaborations in Wikipedia data about users' revisions across multiple articles. This lab extends the methods from the previous two labs on analyzing a single article's revision history and on analyzing the hyperlink network around a single Wikipedia page. You do not need to be fluent in either to complete this lab, but there are many options for extending the analyses we do here with more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4
c = 5
(c-a)**b
16

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Package for requesting data via the web and parsing resulting JSON
import requests                      # http://docs.python-requests.org/en/master/
import json                          # https://docs.python.org/3/library/json.html
from bs4 import BeautifulSoup        # https://www.crummy.com/software/BeautifulSoup/bs4/doc/

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Set up the environment so plots have a white background and DataFrames show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Define an article whose pageview dynamics we will examine.

page_title = 'Lady Gaga'

Get pageview data for a single article

Details about the Wikimedia REST API for pageviews are available here. Unfortunately, this data endpoint only provides information going back to July 1, 2015.

This is what the API returns as an example.

# Get today's date and the date from two days ago (pageview counts for the most recent day or two may be incomplete)
today = pd.datetime.today()
yesterday = pd.datetime.today() - pd.to_timedelta('2 day')

# Convert to strings
today_s = pd.datetime.strftime(today,'%Y%m%d00')
yesterday_s = pd.datetime.strftime(yesterday,'%Y%m%d00')

# Get the pageviews for today and yesterday
url_string = 'http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{0}/daily/{1}/{2}'
print(url_string.format(page_title.replace(' ','_'),yesterday_s,today_s))
http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Lady_Gaga/daily/2016102500/2016102700
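As an optional sanity check, fetch the URL above and peek at the JSON the API sends back. A successful response should have an 'items' key whose elements contain fields such as 'article', 'timestamp', and 'views' — the fields the function below relies on.

# Fetch the example URL and inspect the structure of the JSON response
req = requests.get(url_string.format(page_title.replace(' ','_'),yesterday_s,today_s))
json_response = json.loads(req.text)
print(list(json_response.keys()))
if 'items' in json_response:
    print(sorted(json_response['items'][0].keys()))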

Write a function to get the daily pageviews from January 1, 2015 through today (in practice, the earliest data the API returns falls somewhere between May and August 2015).

def get_daily_pageviews(page_title):
    url_string = 'http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{0}/daily/2015010100/{1}'
    today = pd.datetime.strftime(pd.datetime.today(),'%Y%m%d00')
    # Replace spaces with underscores so the title forms a valid URL path element
    req = requests.get(url_string.format(page_title.replace(' ','_'),today))

    json_s = json.loads(req.text)
    if 'items' in json_s.keys():
        _df = pd.DataFrame(json_s['items'])[['timestamp','views','article']]
        _df['timestamp'] = pd.to_datetime(_df['timestamp'],format='%Y%m%d00')
        _df['weekday'] = _df['timestamp'].apply(lambda x:x.weekday())
        return _df
    # If the API returned an error payload instead, fall through and return None

Get the data for your page.

pageview_df = get_daily_pageviews(page_title)
pageview_df.head()
timestamp views article weekday
0 2015-07-01 8586 Lady_Gaga 2
1 2015-07-02 8331 Lady_Gaga 3
2 2015-07-03 7903 Lady_Gaga 4
3 2015-07-04 8123 Lady_Gaga 5
4 2015-07-05 8953 Lady_Gaga 6

Interpret page view results

What does the pageview activity look like? Are there any bursts of attention? What might these bursts be linked to?

ax = pageview_df.plot.line(x='timestamp',y='views',logy=False,legend=False)
ax.set_xlabel('')
ax.set_ylabel('Pageviews')
<matplotlib.text.Text at 0x7f4adca08358>

Use a logarithmic scaling for the y-axis to see more of the detail in the lower-traffic days.

ax = pageview_df.plot.line(x='timestamp',y='views',logy=True,legend=False)
ax.set_xlabel('')
ax.set_ylabel('Pageviews')
<matplotlib.text.Text at 0x7f4adca4ba20>

What are the dates for the biggest pageview outliers? Here we define an "outlier" to be more than 3 standard deviations above the average number of pageviews over the time window.

std_threshold = 3
threshold_val = pageview_df['views'].mean() + pageview_df['views'].std() * std_threshold
peak_days = pageview_df[pageview_df['views'] > threshold_val]

peak_days.head(10)
timestamp views article weekday
194 2016-01-11 144035 Lady_Gaga 0
221 2016-02-07 122755 Lady_Gaga 6
222 2016-02-08 188591 Lady_Gaga 0
223 2016-02-09 70809 Lady_Gaga 1
230 2016-02-16 106072 Lady_Gaga 1
243 2016-02-29 215743 Lady_Gaga 0
480 2016-10-23 87536 Lady_Gaga 6

How much of the total pageview activity occurred on these days compared to the rest of the pageviews?

peak_fraction = pageview_df.loc[peak_days.index,'views'].sum()/pageview_df['views'].sum()

print('{0:.1%} of all pageviews occurred on the {1} peak days.'.format(peak_fraction,len(peak_days)))
12.6% of all pageviews occurred on the 7 peak days.
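Because bursty pageview data is heavy-tailed, the mean and standard deviation are themselves inflated by the very outliers we are hunting. As a robustness check, here is a sketch (not part of the lab's method) of the same idea using the median and the median absolute deviation (MAD) instead; the 1.4826 factor makes the MAD comparable to a standard deviation for roughly normal data.

# A more robust outlier rule using the median and the MAD
median_views = pageview_df['views'].median()
mad_views = (pageview_df['views'] - median_views).abs().median()
robust_threshold = median_views + std_threshold * 1.4826 * mad_views
robust_peaks = pageview_df[pageview_df['views'] > robust_threshold]
print('{0} peak days under the MAD rule vs. {1} under the 3-sigma rule.'.format(len(robust_peaks),len(peak_days)))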

How does pageview activity change over the course of a week?

g = sb.factorplot(x='weekday',y='views',data=pageview_df,kind='bar',color='grey',
                  aspect=1.67,estimator=np.median)
ax = g.axes[0][0]
ax.set_xticklabels(['Mon','Tue','Wed','Thu','Fri','Sat','Sun'],rotation=0)
ax.set_xlabel('')
ax.set_ylabel('Median pageviews')
<matplotlib.text.Text at 0x7f4adc361208>
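The bars above are medians (note the estimator=np.median argument). As a cross-check, the same numbers can be computed directly with a pandas groupby:

# Median pageviews by day of week, computed without seaborn
weekday_medians = pageview_df.groupby('weekday')['views'].median()
weekday_medians.index = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']  # assumes all 7 weekdays occur in the data
weekday_medians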

Compare pageviews to another page

Let's write a function that takes a list of article names and returns a DataFrame indexed by date, with one column per article and daily pageview counts as values.

def get_multiple_pageviews(page_list):
    today = str(pd.datetime.today().date())
    multiple_pv_df = pd.DataFrame(index=pd.date_range('2015-05-01',today))
    for page in page_list:
        pv_df = get_daily_pageviews(page)
        # get_daily_pageviews returns None when the API errors out (e.g., titles with slashes)
        if pv_df is not None:
            multiple_pv_df[page] = pv_df.set_index('timestamp')['views']
        else:
            print("Error on: {0}".format(page))
            multiple_pv_df[page] = np.nan
    return multiple_pv_df.dropna(how='all')

Enter two related pages whose pageview behavior you want to compare.

page_list = ['Lady Gaga','Halsey_(singer)']

Get both of their data.

# Get the data
multiple_pvs = get_multiple_pageviews(page_list)

# Show the top rows
multiple_pvs.head()

Plot the data.

multiple_pvs.plot(logy=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4adc361f98>

What is the correlation coefficient between these two articles' behavior?

multiple_pvs.corr()
           Lady Gaga      Korn
Lady Gaga   1.000000  0.086955
Korn        0.086955  1.000000

How did the ratio between the two articles' pageviews change over time?

ratio_s = multiple_pvs[page_list[0]].div(multiple_pvs[page_list[1]])
ax = ratio_s.plot()
ax.set_ylabel('{0}/{1}'.format(page_list[0],page_list[1]))
<matplotlib.text.Text at 0x7f4adc26d2e8>
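The daily ratio is noisy. A sketch of one way to smooth out the weekday effects seen above is a 7-day rolling median (the .rolling API requires pandas 0.18 or newer):

# Smooth the daily ratio with a 7-day rolling median
smoothed_ratio = ratio_s.rolling(7).median()
ax = smoothed_ratio.plot()
ax.set_ylabel('{0}/{1} (7-day rolling median)'.format(page_list[0],page_list[1]))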

Use the functions for resolving redirects and getting page outlinks from prior labs.

# From http://stackoverflow.com/a/312464/1574687
def make_chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def resolve_redirects(page_title_list):
    # Chunk the pages into a list of lists of size 50
    chunks = make_chunks(page_title_list,50)
    # Create an empty list to fill with the redirected titles
    redirected_page_titles = []
    # For each chunk try to get the redirects
    for chunk in chunks:
        # Create the query string that separates spaces within page titles by '+' 
        # and separates page titles by '|'
        page_titles_string = '|'.join([page.replace(' ','+') for page in chunk])
        # Put this large string into the URL
        url_string = 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=pageprops&titles={0}&redirects=1'.format(page_titles_string)
        # Do the query and parse the JSON response into a dictionary
        req = json.loads(requests.get(url_string).text)
        # Convert the returned values containing redirects into a dictionary
        if 'redirects' in req['query'].keys():
            redirected_titles = {d['from']:d['to'] for d in req['query']['redirects']}
            # Add the redirected titles to the list
            for title in chunk:
                try:
                    #print(len(redirected_page_titles), title, redirected_titles[title])
                    redirected_page_titles.append(redirected_titles[title])
                # If they don't have a redirect just add the original title
                except KeyError:
                    #print(len(redirected_page_titles), '\nFrom: ', title, '\nTo: ', title)
                    redirected_page_titles.append(title)
        else:
            for title in chunk:
                redirected_page_titles.append(title)
    # Make sure the number of page titles remained the same, otherwise raise a warning
    if len(page_title_list) == len(redirected_page_titles):
        return redirected_page_titles
    else:
        print("WARNING! The number of page titles in the redirected list ({0}) is not equal to the input list ({1})".format(len(redirected_page_titles),len(page_title_list)))
        return redirected_page_titles

def get_page_outlinks(page_title,redirects=1):
    # Replace spaces with underscores
    #page_title = page_title.replace(' ','_')
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard','Portal:','s:']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    outlinks_list = [] 
    
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    title = link['title']
                    # Ignore links that aren't interesting
                    if all(bad not in title for bad in bad_titles):
                        outlinks_list.append(title)

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links that aren't interesting
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list

Get the outlinks.

raw_outlinks = get_page_outlinks(page_title)
redirected_outlinks = resolve_redirects(raw_outlinks)

Get the data.

This stage may take several minutes.

# Get the data
hl_pvs_df = get_multiple_pageviews(redirected_outlinks + [page_title])

# Show the top rows
hl_pvs_df.head()
Error on: Sony/ATV Music Publishing
Error on: Sony/ATV Music Publishing
Error on: Grammy Award for Best Dance/Electronica Album
Error on: wikt:à la
Error on: wikt:outré
Error on: HIV/AIDS
Error on: Re/code
[Output truncated: the first five rows of a wide DataFrame whose columns are the 422 linked article titles (Collaborative Arts Project 21 (CAP21), New York University, ..., Lady Gaga) and whose values are daily pageview counts.]

5 rows × 422 columns
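The "Error on:" messages above mostly involve titles containing a slash, which the REST API reads as a path separator (the wikt: entries are Wiktionary interwiki links with no English Wikipedia pageview data). A possible workaround, sketched here but not used in the rest of the lab, is to percent-encode the title so slashes become %2F before inserting it into the URL:

# Percent-encode a title for the pageviews REST API, including slashes
from urllib.parse import quote

def encode_rest_title(page_title):
    # safe='' forces quote() to encode '/' as %2F rather than leaving it alone
    return quote(page_title.replace(' ','_'),safe='')

print(encode_rest_title('Sony/ATV Music Publishing'))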

What are the most-viewed articles in the hyperlink network?

most_viewed_articles = hl_pvs_df.cumsum().loc[str(yesterday.date())]
most_viewed_articles = most_viewed_articles.sort_values(ascending=False)
most_viewed_articles.head(10)
Facebook                        21561113.0
David Bowie                     19989022.0
Barack Obama                    17349389.0
Hillary Clinton                 16054338.0
YouTube                         13909262.0
Michael Jackson                 10801839.0
American Horror Story: Hotel    10790869.0
Bill Clinton                    10242354.0
Beyoncé                          9007759.0
American Horror Story            8896126.0
Name: 2016-10-25 00:00:00, dtype: float64
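Taking the last row of the cumulative sum should give nearly the same ranking as simply summing each column, though a column whose final day is missing will show NaN under cumsum while sum() skips it. A simpler one-liner:

# Total pageviews per article over the whole window
hl_pvs_df.sum().sort_values(ascending=False).head(10)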

Most and least correlated articles

Which articles are most correlated with each other?

# Log the pageview data to reduce skew from bursty outliers and make the correlation table
hl_corr_df = hl_pvs_df.apply(np.log).corr()

# Correlation table is symmetric, drop one half of them
# From: http://stackoverflow.com/questions/34417685/melt-the-upper-triangular-matrix-of-a-pandas-dataframe
hl_corr_df = hl_corr_df.where(np.triu(np.ones(hl_corr_df.shape)).astype(bool))

# Unstack the DataFrame into a series and sort
hl_corr_s = hl_corr_df.unstack().sort_values(ascending=False)

# Drop NaNs
hl_corr_s = hl_corr_s.dropna()

# Drop values equal to 1
hl_corr_s = hl_corr_s[hl_corr_s < 1]

List out the 10 most correlated articles.

hl_corr_s.head(10)
The Fame Monster             The Fame                       0.947405
Born This Way (album)        The Fame Monster               0.931940
2011 MTV Video Music Awards  2010 MTV Video Music Awards    0.929003
Artpop                       The Fame Monster               0.927394
                             Born This Way (album)          0.922916
Born This Way (album)        The Fame                       0.922399
MTV Video Music Award        2010 MTV Video Music Awards    0.915732
Lady Gaga                    Lady Starlight                 0.911772
Artpop                       The Fame                       0.911746
The Remix (Lady Gaga album)  Born This Way (album)          0.910227
dtype: float64

Inspect this correlation from the raw data.

_df = hl_pvs_df[list(hl_corr_s.index[0])]

ax = _df.plot(logy=True)

Look at the 10 least-correlated articles.

hl_corr_s.tail(10)
Sam Smith (singer)           Band (rock and pop)           -0.644386
Intersex                     Perfect Illusion              -0.655298
Perfect Illusion             Vincent Herbert               -0.666736
IMDb                         87th Academy Awards           -0.722458
Sam Smith (singer)           French Canadians              -0.732706
2009 MTV Video Music Awards  Perfect Illusion              -0.768817
Perfect Illusion             2010 MTV Video Music Awards   -0.782531
MTV Video Music Award        Perfect Illusion              -0.783498
2011 MTV Video Music Awards  Perfect Illusion              -0.819068
IMDb                         Sam Smith (singer)            -0.821329
dtype: float64

Plot the pageviews for the two most anti-correlated articles. These series often show peculiar dynamics that are worth exploring or thinking more about.

_df = hl_pvs_df[list(hl_corr_s.index[-1])]

ax = _df.plot(logy=True)

Is there a relationship between the position of the link on the page and the correlation between the linked article's pageviews and the seed article's pageviews? For instance, links closer to the top of the page might reflect more important topics while links towards the end of the page might be less relevant.

link_corrs = []

for num,link in enumerate(redirected_outlinks):
    try:
        link_corrs.append({'position':num,'title':link,'corr':hl_corr_s.loc[(page_title,link)]})
    except KeyError:
        print("Error on: {0}".format(link))
Error on: Sony/ATV Music Publishing
Error on: Sony/ATV Music Publishing
Error on: Grammy Award for Best Dance/Electronica Album
Error on: wikt:à la
Error on: wikt:outré
Error on: HIV/AIDS
Error on: Re/code

Plot the results.

ax = pd.DataFrame(link_corrs).plot.scatter(x='position',y='corr')
ax.set_xlim((0,len(link_corrs)))
ax.set_ylim((-1,1))
ax.set_xlabel('Link position')
ax.set_ylabel('Correlation')
<matplotlib.text.Text at 0x7f4adc2eb048>
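To put a number on the trend (or lack of one) in this scatterplot, here is a quick sketch of a rank (Spearman) correlation between link position and pageview correlation:

# Spearman rank correlation between a link's position and its pageview correlation
link_corrs_df = pd.DataFrame(link_corrs)
link_corrs_df[['position','corr']].corr(method='spearman')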

Get page revisions

In this section, we'll repurpose and adapt code from the last lab to get data about page revisions. Rather than looking at the number of times a user contributed to a given article, we'll simply count the number of times the article was edited on a given date.

def get_page_edits_by_date(page_title,conn,date_string='2014-12-31'):
    """ Takes a page title and returns the number of revisions made on each date.
      page_title = a string for the page title to get its revisions
      conn = a database connection
      date_string = a string for the earliest date to include, in YYYY-MM-DD format
      
    Returns:
      A DataFrame with date, page title, and number of edits
    """
    # In case you pass a page title with spaces in it, replace the spaces with underscores
    page_title = page_title.replace(' ','_').encode('utf8').decode('latin1')
    
    # The MySQL query string used to retrieve the data. By line, it is
    ## converting the timestamp to a date and 
    ## counting the number of elements
    ## from the "revisions" table
    ## joining the "page" table on it
    ## using the page_id and rev_page columns as keys
## limiting the results to entries that have the page title, 
## occur in the article namespace (0), and happen after the date_string cutoff
    ## grouping the results by date
    s = """
            SELECT
                DATE(rev_timestamp) as date,
                page_title,
                COUNT(*) as edits
            FROM 
                revision 
            JOIN 
                page ON page.page_id = revision.rev_page
            WHERE 
                page.page_title = "{0}" 
                AND page_namespace = 0
                AND DATE(rev_timestamp) > '{1}'
            GROUP BY
                date
        """.format(page_title,date_string)

    # Use the connection to run the query and return the results as a DataFrame
    _df = pd.read_sql_query(s,conn)
    
    _df['page_title'] = _df['page_title'].str.decode('utf8')
    _df['page_title'] = _df['page_title'].str.replace('_',' ')
    
    # Return the data, with a clean index
    return _df

def get_neighbors_revisions(page_title,conn):
    """ Takes a page title and returns revisions for the page and its neighbors.
      page_title = a string for the page title to get its revisions
      
    Returns:
      A pandas DataFrame containing all the page revisions.
    """
    # Get the outlinks from the page and include the page itself in the list
    alters = get_page_outlinks(page_title) + [page_title]
    # Resolve the redirects in the list of alters
    alters = list(set(resolve_redirects(alters)))
    # Create an empty container to hold the DataFrames
    df_list = []
    # For each page, get the revision counts and append to the df_list
    for alter in alters:
        _df = get_page_edits_by_date(alter,conn)
        df_list.append(_df)
    # Concatenate the list of revision count DataFrames into a giant DataFrame
    df = pd.concat(df_list)
    # Return the data
    return df.reset_index(drop=True)

Get the authentication information and connect to the database.

host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']
conn = pymysql.connect(host=host,user=user,password=password,database='enwiki_p',connect_timeout=3600)
conn.cursor().execute('use enwiki_p');
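Optionally, run a quick sanity check that the connection and query function work before launching the full crawl (a sketch; assumes the page exists in enwiki_p):

# A quick test query against the seed article
test_df = get_page_edits_by_date(page_title,conn)
test_df.head()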

Get the number of revisions per day for all the articles.

hl_daily_rev_df = get_neighbors_revisions(page_title,conn)
hl_daily_rev_df.head()
date page_title edits
0 2015-01-01 The Pussycat Dolls 1.0
1 2015-01-04 The Pussycat Dolls 2.0
2 2015-01-06 The Pussycat Dolls 1.0
3 2015-01-07 The Pussycat Dolls 1.0
4 2015-01-08 The Pussycat Dolls 1.0

Reindex the edit data so it's starting and ending on the same dates as the pageviews data.

# Convert into a format like the hl_pvs_df DataFrame
# Index are the dates covered by the pageview data; columns are article titles; values are numbers of edits
hl_edits_df = hl_daily_rev_df.set_index(['date','page_title'])['edits'].unstack(1)

# Reindex so dates are continuous
pv_start_ix = str(hl_pvs_df.index.min().date())
pv_end_ix = str((pd.Timestamp.today() - pd.Timedelta('1 d')).date())
_date_range = pd.date_range(pv_start_ix,pv_end_ix)
hl_edits_df = hl_edits_df.reindex(index=_date_range)

# Fill in empty observations with 0s
hl_edits_df = hl_edits_df.fillna(0)

hl_edits_df.head()
[Output truncated: the first five rows of a wide DataFrame whose columns are the 411 article titles (14th Dalai Lama, 2009 MTV Video Music Awards, ..., YouTube) and whose values are daily edit counts.]

5 rows × 411 columns

Are pageviews and edits correlated with each other?

_s1 = hl_pvs_df[page_title]
_s2 = hl_edits_df[page_title]

# Align the two series on the dates they share before correlating,
# otherwise np.corrcoef raises a dimension-mismatch error
common_dates = _s1.index.intersection(_s2.index)
np.corrcoef(_s1.loc[common_dates].apply(np.log),_s2.loc[common_dates])[0][1]
Plot the pageviews and edits together, with edits on a secondary axis.

single_pv_edits_df = pd.DataFrame({'pageviews':_s1,'edits':_s2})
ax = single_pv_edits_df.plot(secondary_y='edits',logy=True)
ax.right_ax.set_yscale('log')
ax.set_ylabel('Pageviews')
ax.right_ax.set_ylabel('Edits')
<matplotlib.text.Text at 0x7f4adab605f8>

Can Wikipedia's supply of information (edits) keep up with demand (pageviews)?

Plot the ratio of cumulative pageviews to cumulative edits.

ax = (_s1.cumsum()/_s2.cumsum()).plot()

ax.set_ylabel('Cumulative pageviews per edit')
<matplotlib.text.Text at 0x7f4adaea5cc0>
def zscore(series):
    # Absolute number of standard deviations each value sits from the series mean
    return np.abs((series - series.mean())/series.std())

Look at the normalized (z-score) excitation and relaxation in edits and pageviews by day. Each point is a single day in the article's history and they're connected if they come one day after each other. Values along the diagonal in red suggest that increases in attention to the article are matched by similar increases in editing activity on the article. Alternatively, data points in the upper-left triangle suggest increases in pageviews are not matched by increases in edits while data points in the lower-right triangle suggest increases in edits are not matched by increases in pageviews.

f,ax = plt.subplots(1,1)

ax.set_xlabel('Edits (z-score)')
ax.set_ylabel('Pageviews (z-score)')
ax.set_xlim((1e-3,1e2))
ax.set_ylim((1e-3,1e2))
ax.set_xscale('log')
ax.set_yscale('log')

ax.text(1e-1,1e1,'More views than edits',ha='center',weight='bold')
ax.text(1e1,1e-1,'More edits than views',ha='center',weight='bold')

# Diagonal red line showing perfect matching
ax.plot([1e-3,1e2],[1e-3,1e2],c='r')

# Align the z-scored series on their shared dates, otherwise plot() raises a dimension error
_s1 = zscore(hl_edits_df[page_title])
_s2 = zscore(hl_pvs_df[page_title])
common_dates = _s1.index.intersection(_s2.index)
ax.plot(_s1.loc[common_dates],_s2.loc[common_dates],'o-',c='grey');
f,ax = plt.subplots(1,1)

ax.set_xlabel('Edits (z-score)')
ax.set_ylabel('Pageviews (z-score)')
ax.set_xlim((1e-3,1e2))
ax.set_ylim((1e-3,1e2))
ax.set_xscale('log')
ax.set_yscale('log')

# Only plot pages present in both DataFrames, aligned on their shared dates
for page in hl_edits_df.columns.intersection(hl_pvs_df.columns):
    _s1 = zscore(hl_edits_df[page])
    _s2 = zscore(hl_pvs_df[page])
    common_dates = _s1.index.intersection(_s2.index)
    ax.scatter(_s1.loc[common_dates],_s2.loc[common_dates],alpha=.125,c='grey')
    
# Diagonal red line showing perfect matching
ax.plot([1e-3,1e2],[1e-3,1e2],c='r',zorder=-1);

Did a burst of pageviews diffuse to adjacent pages?

On what day did the maximum number of pageviews occur?

page_max_pv_date = hl_pvs_df[page_title].idxmax()
print(page_max_pv_date.date())
2016-03-06
# For each article, on what date did the maximum number of pageviews occur?
max_pv_dates_df = hl_pvs_df.idxmax()

# Check whether these dates match the seed article's peak date (or the day after)
next_day = page_max_pv_date + pd.Timedelta('1 day')
_fraction = max_pv_dates_df.isin([page_max_pv_date,next_day]).sum()/len(max_pv_dates_df)

print("{0:.1%} of articles in the hyperlink network also had the max pageview on the same date as the article.".format(_fraction))
19.1% of articles in the hyperlink network also had the max pageview on the same date as the article.

What are these "co-bursting" pages?

cobursting_pv_articles = max_pv_dates_df[max_pv_dates_df.isin([page_max_pv_date,next_day])].sort_index()
cobursting_pv_articles
Adolfo (designer)                          2016-03-07
Alla Nazimova                              2016-03-06
Barbara Bush                               2016-03-06
Bel Air, Los Angeles                       2016-03-06
Bess Truman                                2016-03-06
Betty Ford                                 2016-03-06
Brenda Marshall                            2016-03-06
Chasen's                                   2016-03-07
Death and state funeral of Gerald Ford     2016-03-07
Death and state funeral of Ronald Reagan   2016-03-06
Donald Regan                               2016-03-06
Donovan's Brain (film)                     2016-03-07
East Gate Bel Air, Los Angeles             2016-03-07
Edith Luckett Davis                        2016-03-07
First Lady of the United States            2016-03-06
General Electric Theater                   2016-03-07
Gerald Ford                                2016-03-07
Governor of California                     2016-03-06
Governor's Mansion State Historic Park     2016-03-07
Gridiron Club                              2016-03-06
Hellcats of the Navy                       2016-03-07
James Galanos                              2016-03-07
Jane Wyman                                 2016-03-07
Jimmy Carter                               2016-03-06
Joan Quigley                               2016-03-06
John F. Kennedy                            2016-03-07
John Sears (political strategist)          2016-03-06
Just Say No                                2016-03-07
Kitty Kelley                               2016-03-07
Latin School of Chicago                    2016-03-07
Laura Bush                                 2016-03-06
Limoges Box                                2016-03-07
Lou Cannon                                 2016-03-06
Lumpectomy                                 2016-03-07
Lute Song (musical)                        2016-03-07
Maureen Reagan                             2016-03-07
Michael Deaver                             2016-03-06
My Turn (memoir)                           2016-03-07
Nancy Reagan                               2016-03-06
Portrait of Jennie                         2016-03-07
Power behind the throne                    2016-03-07
Rancho del Cielo                           2016-03-07
Ronald Reagan                              2016-03-06
Roosevelt Avenue                           2016-03-06
Senior Corps                               2016-03-07
Sloane Hospital for Women                  2016-03-06
Smith College                              2016-03-06
Steven Ford                                2016-03-07
Stop the Madness                           2016-03-07
The Dark Wave                              2016-03-06
White House Chief of Staff                 2016-03-06
White House china                          2016-03-07
William Holden                             2016-03-06
dtype: datetime64[ns]

Did this burst of pageviews translate into edits on these pages as well?

Get the articles with the peak amount of activity by date and then subset to only include those articles whose peak dates are the same as the peak pageview dates.

max_edits_dates_df = hl_edits_df.idxmax()
cobursting_edit_articles = max_edits_dates_df[max_edits_dates_df.isin([page_max_pv_date,next_day])].sort_index()
cobursting_edit_articles
page_title
Alla Nazimova                             2016-03-07
Council of Fashion Designers of America   2016-03-07
Edith Luckett Davis                       2016-03-07
Framing (social sciences)                 2016-03-07
Hellcats of the Navy                      2016-03-07
Just Say No                               2016-03-07
Maria Shriver                             2016-03-06
Maureen Reagan                            2016-03-07
Michael Reagan                            2016-03-06
Mr. T                                     2016-03-07
Nancy Reagan                              2016-03-06
Patti Davis                               2016-03-07
Rancho Mirage, California                 2016-03-06
Ronald Reagan Presidential Library        2016-03-07
The Next Voice You Hear...                2016-03-06
WorldCat                                  2016-03-07
dtype: datetime64[ns]

How many edits beyond the seed article did this burst translate into? Subset the DataFrame to the relevant dates and articles and inspect.

dates = [page_max_pv_date,next_day]
# Exclude the seed article itself; note set([page_title]), since set(page_title) would be a set of characters
cobursting_edit_articles_list = sorted(set(cobursting_edit_articles.index) - set([page_title]))
cobursting_hl_edits_df = hl_edits_df.loc[dates,cobursting_edit_articles_list]
cobursting_hl_edits_df
(Output transposed for readability: rows are articles, columns are the two peak dates.)

page_title                                2016-03-06  2016-03-07
Alla Nazimova                                    0.0         3.0
Council of Fashion Designers of America          0.0         6.0
Edith Luckett Davis                              5.0         6.0
Framing (social sciences)                        0.0         8.0
Hellcats of the Navy                             1.0        11.0
Just Say No                                      2.0         4.0
Maria Shriver                                    9.0         1.0
Maureen Reagan                                   3.0        10.0
Michael Reagan                                  15.0         7.0
Mr. T                                            0.0        10.0
Nancy Reagan                                   251.0        68.0
Patti Davis                                      1.0         5.0
Rancho Mirage, California                       10.0         0.0
Ronald Reagan Presidential Library               0.0         6.0
The Next Voice You Hear...                       2.0         0.0
WorldCat                                         0.0         4.0

Now sum up all these edits.

cobursting_hl_edits_df.sum().sum()
448.0
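For context, compare this to the number of edits the seed article itself received on those two days (a sketch; assumes the seed title is a column of hl_edits_df):

# Edits to the seed article on the peak dates
hl_edits_df.loc[dates,page_title].sum()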