Lab 4 - Pageviews

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License, Version 2.0.

This is the third of five lab notebooks that will explore how to analyze the structure of collaborations in Wikipedia data about users' revisions across multiple articles. This lab will extend the methods in the previous two labs about analyzing a single article's revision histories and analyzing the hyperlink networks around a single Wikipedia page. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4
c = 5
(c-a)**b
16

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Package for requesting data via the web and parsing resulting JSON
import requests                      # http://docs.python-requests.org/en/master/
import json                          # https://docs.python.org/3/library/json.html
from bs4 import BeautifulSoup        # https://www.crummy.com/software/BeautifulSoup/bs4/doc/

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Set up the environment: plots on a white grid background, and DataFrames display more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Define an article whose pageview dynamics we'll examine.

page_title = 'Patient Protection and Affordable Care Act'

Get pageview data for a single article

Details about the Wikimedia REST API for pageviews are available here. Unfortunately, this endpoint only provides data going back to July 1, 2015.

As an example, this is the URL such an API request uses.

# Get today's date and yesterday's date
today = pd.Timestamp.today()
yesterday = today - pd.Timedelta('1 day')

# Date strings (YYYY-MM-DD) for indexing DataFrames later
today_date_s = str(today.date())
yesterday_date_s = str(yesterday.date())

# Convert to the API's YYYYMMDDHH string format
today_s = today.strftime('%Y%m%d00')
yesterday_s = yesterday.strftime('%Y%m%d00')

# Get the pageviews for today and yesterday
url_string = 'http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{0}/daily/{1}/{2}'
print(url_string.format(page_title.replace(' ','_'),yesterday_s,today_s))
http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Patient_Protection_and_Affordable_Care_Act/daily/2016110100/2016110200

Write a function to get the pageviews from January 1, 2015 until yesterday (in practice, the API's data begins between May and August 2015, so earlier dates return nothing).

def get_daily_pageviews(page_title,today_s):
    url_string = 'http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{0}/daily/2015010100/{1}'
    # The REST API expects underscores rather than spaces in page titles
    req = requests.get(url_string.format(page_title.replace(' ','_'),today_s))

    json_s = json.loads(req.text)
    if 'items' in json_s:
        _df = pd.DataFrame(json_s['items'])[['timestamp','views','article']]
        _df['timestamp'] = pd.to_datetime(_df['timestamp'],format='%Y%m%d00')
        _df['weekday'] = _df['timestamp'].apply(lambda x:x.weekday())
        return _df
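A note on the format string used above: the API reports each day as a ten-digit timestamp with a literal '00' hour suffix, and '%Y%m%d00' consumes that suffix during parsing. A quick check:

```python
import pandas as pd

# The pageviews API reports each day as a 10-digit timestamp ending in a
# literal '00' hour (e.g. '2015070100'); the '%Y%m%d00' format string
# consumes that suffix so only the date survives.
ts = pd.to_datetime('2015070100', format='%Y%m%d00')
print(ts.date())  # 2015-07-01
```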

Get the data for your page.

pageview_df = get_daily_pageviews(page_title,today_s)
pageview_df.head()
timestamp views article weekday
0 2015-07-01 3754 Patient_Protection_and_Affordable_Care_Act 2
1 2015-07-02 3369 Patient_Protection_and_Affordable_Care_Act 3
2 2015-07-03 2646 Patient_Protection_and_Affordable_Care_Act 4
3 2015-07-04 2396 Patient_Protection_and_Affordable_Care_Act 5
4 2015-07-05 2368 Patient_Protection_and_Affordable_Care_Act 6

Interpret pageview results

What does the pageview activity look like? Are there any bursts of attention? What might these bursts be linked to?

ax = pageview_df.plot.line(x='timestamp',y='views',logy=False,legend=False)
ax.set_xlabel('')
ax.set_ylabel('Pageviews')
<matplotlib.text.Text at 0x7f9c32bb0c50>

Use a logarithmic scaling for the y-axis to see more of the detail in the lower-traffic days.

ax = pageview_df.plot.line(x='timestamp',y='views',logy=True,legend=False)
ax.set_xlabel('')
ax.set_ylabel('Pageviews')
<matplotlib.text.Text at 0x7f9c30963c88>

What are the dates for the biggest pageview outliers? Here we define an "outlier" to be a day whose pageviews are more than std_threshold (here, 4) standard deviations above the average number of pageviews over the time window.

std_threshold = 4
threshold_val = pageview_df['views'].mean() + pageview_df['views'].std() * std_threshold
peak_days = pageview_df[pageview_df['views'] > threshold_val]

peak_days.head(10)
timestamp views article weekday
467 2016-10-10 14398 Patient_Protection_and_Affordable_Care_Act 0
477 2016-10-20 11893 Patient_Protection_and_Affordable_Care_Act 3
482 2016-10-25 13619 Patient_Protection_and_Affordable_Care_Act 1
483 2016-10-26 13216 Patient_Protection_and_Affordable_Care_Act 2
484 2016-10-27 9674 Patient_Protection_and_Affordable_Care_Act 3
489 2016-11-01 10614 Patient_Protection_and_Affordable_Care_Act 1

How much of the total pageview activity occurred on these days compared to the rest of the pageviews?

peak_fraction = pageview_df.loc[peak_days.index,'views'].sum()/pageview_df['views'].sum()

print('{0:.1%} of all pageviews occurred on the {1} peak days.'.format(peak_fraction,len(peak_days)))
4.2% of all pageviews occurred on the 6 peak days.
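The mean-plus-k-standard-deviations rule above is itself pulled around by the very bursts it tries to flag. A more robust alternative (a sketch, not part of this lab's method) replaces the mean and standard deviation with the median and the median absolute deviation:

```python
import pandas as pd

def mad_outliers(views, k=3.5):
    """Flag values whose modified z-score exceeds k, using the median
    absolute deviation (MAD) in place of the standard deviation."""
    med = views.median()
    mad = (views - med).abs().median()
    # 0.6745 makes the MAD comparable to a standard deviation for normal data
    modified_z = 0.6745 * (views - med) / mad
    return views[modified_z > k]

# Toy data: one bursty day among otherwise stable traffic
toy = pd.Series([10, 12, 11, 9, 10, 11, 100])
print(mad_outliers(toy).tolist())  # [100]
```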

How does pageview activity change over the course of a week?

g = sb.factorplot(x='weekday',y='views',data=pageview_df,kind='bar',color='grey',
                  aspect=1.67,estimator=np.median)
ax = g.axes[0][0]
ax.set_xticklabels(['Mon','Tue','Wed','Thu','Fri','Sat','Sun'],rotation=0)
ax.set_xlabel('')
ax.set_ylabel('Median pageviews')
<matplotlib.text.Text at 0x7f9c309a3898>

Compare pageviews to another page

Let's write a function that takes a list of article names and returns a DataFrame indexed by date, with one column per article and daily pageview counts as values.

def get_multiple_pageviews(page_list,today_s):
    multiple_pv_df = pd.DataFrame(index=pd.date_range('2015-05-01', today_date_s))
    for page in page_list:
        pv_df = get_daily_pageviews(page,today_s)
        try:
            multiple_pv_df[page] = pv_df.set_index('timestamp')['views']
        except Exception:
            print("Error on: {0}".format(page))
            multiple_pv_df[page] = np.nan
    return multiple_pv_df.dropna(how='all')

Enter two related pages whose pageview behavior you want to compare.

page_list = ['Patient Protection and Affordable Care Act','Chipotle Mexican Grill']

Get both of their data.

# Get the data
multiple_pvs = get_multiple_pageviews(page_list,today_s)

# Show the last rows
multiple_pvs.tail()
Patient Protection and Affordable Care Act Chipotle Mexican Grill
2016-10-28 7713.0 2144.0
2016-10-29 6024.0 1897.0
2016-10-30 6032.0 1834.0
2016-10-31 6758.0 2247.0
2016-11-01 10614.0 2296.0

Plot the data.

multiple_pvs.plot(logy=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f9c2a1a33c8>

What is the correlation coefficient between these two articles' behavior?

multiple_pvs.apply(np.log).corr()
Patient Protection and Affordable Care Act Chipotle Mexican Grill
Patient Protection and Affordable Care Act 1.000000 0.103682
Chipotle Mexican Grill 0.103682 1.000000
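Taking logs before a Pearson correlation, as above, damps the influence of bursty days. An alternative (not used in this lab) is a rank-based Spearman correlation, which is invariant to any monotone transform, so no log is needed. A toy illustration:

```python
import pandas as pd

# Spearman correlates the ranks of the data, so it is unchanged by any
# monotone transform (like the log applied above); toy data for illustration.
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 8, 16, 32]})
# b is a monotone (exponential) function of a, so the rank correlation is 1.0
print(toy.corr(method='spearman').loc['a', 'b'])  # 1.0
```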

How did the ratio between the two articles' pageviews change over time?

ratio_s = multiple_pvs[page_list[0]].div(multiple_pvs[page_list[1]])
ax = ratio_s.plot()
ax.set_ylabel('{0}/{1}'.format(page_list[0],page_list[1]))
<matplotlib.text.Text at 0x7f9c2a241320>

Use the functions for resolving redirects and getting page outlinks from prior labs.

# From http://stackoverflow.com/a/312464/1574687
def make_chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
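A quick sanity check of the chunking generator (the definition is repeated here so the snippet is self-contained); note that the last chunk may be shorter than n:

```python
def make_chunks(l, n):
    """Yield successive n-sized chunks from l; the last chunk may be shorter."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

print(list(make_chunks(list(range(5)), 2)))  # [[0, 1], [2, 3], [4]]
```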

def resolve_redirects(page_title_list):
    # Chunk the pages into a list of lists of size 50
    chunks = make_chunks(page_title_list,50)
    # Create an empty list to fill with the redirected titles
    redirected_page_titles = []
    # For each chunk try to get the redirects
    for chunk in chunks:
        # Create the query string, replacing spaces within page titles with '+'
        # and separating page titles with '|'
        page_titles_string = '|'.join([page.replace(' ','+') for page in chunk])
        # Put this large string into the URL
        url_string = 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=pageprops&titles={0}&redirects=1'.format(page_titles_string)
        # Do the query and parse the JSON response into a dictionary
        req = json.loads(requests.get(url_string).text)
        # Convert the returned values containing redirects into a dictionary
        if 'redirects' in req['query'].keys():
            redirected_titles = {d['from']:d['to'] for d in req['query']['redirects']}
            # Add the redirected titles to the list
            for title in chunk:
                try:
                    #print(len(redirected_page_titles), title, redirected_titles[title])
                    redirected_page_titles.append(redirected_titles[title])
                # If they don't have a redirect just add the original title
                except KeyError:
                    #print(len(redirected_page_titles), '\nFrom: ', title, '\nTo: ', title)
                    redirected_page_titles.append(title)
        else:
            for title in chunk:
                redirected_page_titles.append(title)
    # Make sure the number of page titles remained the same, otherwise raise a warning
    if len(page_title_list) == len(redirected_page_titles):
        return redirected_page_titles
    else:
        print("WARNING! The number of page titles in the redirected list ({0}) is not equal to the input list ({1})".format(len(redirected_page_titles),len(page_title_list)))
        return redirected_page_titles

def get_page_outlinks(page_title,redirects=1):
    # Replace spaces with underscores
    #page_title = page_title.replace(' ','_')
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard','Portal:','s:']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    outlinks_list = [] 
    
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    title = link['title']
                    # Ignore links that aren't interesting
                    if all(bad not in title for bad in bad_titles):
                        outlinks_list.append(title)

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links that aren't interesting
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list

Get the outlinks.

raw_outlinks = get_page_outlinks(page_title)
redirected_outlinks = resolve_redirects(raw_outlinks)

Get the data.

This stage may take several minutes.

# Get the data
hl_pvs_df = get_multiple_pageviews(redirected_outlinks + [page_title],today_s)

# Show the top rows
hl_pvs_df.head()
5 rows × 269 columns (daily pageviews indexed by date, one column per linked article, from "Law of the United States" through "Democracy Now!"; the wide tabular output is omitted here)

What are the most-viewed articles in the hyperlink network?

# The cumulative sum's final row is the total pageviews over the window
most_viewed_articles = hl_pvs_df.cumsum().loc[yesterday_date_s]
most_viewed_articles = most_viewed_articles.sort_values(ascending=False)
most_viewed_articles.head(10)
Barack Obama                        17564775.0
Hillary Clinton                     16462238.0
Bill Clinton                        10398691.0
George H. W. Bush                    6257659.0
California                           6214725.0
President of the United States       5187922.0
Republican Party (United States)     5165128.0
Washington, D.C.                     5153025.0
Twitter                              5032233.0
United States Constitution           3583113.0
Name: 2016-10-30 00:00:00, dtype: float64

Most and least correlated articles

Which articles are most correlated with each other?

# Log the pageview data to reduce skew from bursty outliers and make the correlation table
hl_corr_df = hl_pvs_df.apply(np.log).corr()

# Correlation table is symmetric, drop one half of them
# From: http://stackoverflow.com/questions/34417685/melt-the-upper-triangular-matrix-of-a-pandas-dataframe
hl_corr_df = hl_corr_df.where(np.triu(np.ones(hl_corr_df.shape)).astype(bool))

# Unstack the DataFrame into a series and sort
hl_corr_s = hl_corr_df.unstack().sort_values(ascending=False)

# Drop NaNs
hl_corr_s = hl_corr_s.dropna()

# Drop values equal to 1
hl_corr_s = hl_corr_s[hl_corr_s < 1]

List the 10 most-correlated article pairs.

hl_corr_s.head(10)
United States House of Representatives  United States Congress          0.929328
National health insurance               Long-term care insurance        0.928820
Medicaid                                Medicare (United States)        0.920835
Hospital-acquired infection             Trade union                     0.915777
National health insurance               Hospital-acquired infection     0.914001
                                        Health insurance                0.913471
Co-insurance                            Copayment                       0.907352
Loss ratio                              Deductible                      0.907232
112th United States Congress            111th United States Congress    0.906779
Hillary Clinton                         Bill Clinton                    0.899148
dtype: float64

Inspect this correlation in the raw data.

_df = hl_pvs_df[list(hl_corr_s.index[0])]

ax = _df.plot(logy=True)

Look at the 10 least-correlated article pairs.

hl_corr_s.tail(10)
Jim Bunning                       Clinton health care plan of 1993                -0.365712
                                  Premium tax credit                              -0.375894
                                  Tricare                                         -0.401831
African Americans                 Cadillac insurance plan                         -0.407881
King v. Burwell                   Single-payer healthcare                         -0.431747
Hospital readmission              Democratic Party presidential primaries, 2008   -0.439345
King v. Burwell                   African Americans                               -0.472968
The Peter G. Peterson Foundation  Jim Bunning                                     -0.510160
Jim Bunning                       Single-payer healthcare                         -0.529140
African Americans                 Jim Bunning                                     -0.544210
dtype: float64

Plot the pageviews for the two most anti-correlated articles. These series show some wacky dynamics that are interesting to explore or think more about.

_df = hl_pvs_df[list(hl_corr_s.index[-1])]

ax = _df.plot(logy=True)

Is there a relationship between the position of the link on the page and the correlation between the linked article's pageviews and the seed article's pageviews? For instance, links closer to the top of the page might reflect more important topics while links towards the end of the page might be less relevant.

link_corrs = []

for num,link in enumerate(redirected_outlinks):
    try:
        link_corrs.append({'position':num,'title':link,'corr':hl_corr_s.loc[(page_title,link)]})
    # Only the upper triangle of the correlation table was kept, so pairs
    # stored under the reversed key order (or dropped as NaN) raise a KeyError
    except KeyError:
        print("Error on: {0}".format(link))
Error on: Patient Protection and Affordable Care Act
Error on: Centers for Disease Control and Prevention
Error on: Gallup (company)
Error on: Poverty in the United States
...
Error on: Democracy Now!
(output truncated: about 300 similar lines, one per outlink whose pairing with the seed article was dropped when the correlation table was reduced to its upper triangle)

Plot the results.

ax = pd.DataFrame(link_corrs).plot.scatter(x='position',y='corr')
ax.set_xlim((0,len(link_corrs)))
ax.set_ylim((-1,1))
ax.set_xlabel('Link position')
ax.set_ylabel('Correlation')

Get page revisions

In this section, we'll repurpose and adapt code from the last lab to get data about page revisions. Rather than looking at the number of times a user contributed to a given article, we'll simply count the number of times the article was edited on a given date.

def get_page_edits_by_date(page_title,conn,date_string='2014-12-31'):
    """ Takes a page title and returns the number of revisions made on each date.
      page_title = a string for the page title to get its revisions
      conn = a database connection
      date_string = a string for a date in YYYY-MM-DD format; only revisions
        made after this date are counted
      
    Returns:
      A DataFrame with a row per date, containing the date, page title, and edit count
    """
    # In case you pass a page title with spaces in it, replace the spaces with underscores
    page_title = page_title.replace(' ','_').encode('utf8').decode('latin1')
    
    # The MySQL query string used to retrieve the data. By line, it is
    ## converting the timestamp to a date and 
    ## counting the number of elements
    ## from the "revisions" table
    ## joining the "page" table on it
    ## using the page_id and rev_page columns as keys
    ## limiting the results to entries that have the pagetitle, 
    ## occur in the namespace, and happen after Dec 31, 2014
    ## grouping the results by date
    s = """
            SELECT
                DATE(rev_timestamp) as date,
                page_title,
                COUNT(*) as edits
            FROM 
                revision 
            JOIN 
                page ON page.page_id = revision.rev_page
            WHERE 
                page.page_title = "{0}" 
                AND page_namespace = 0
                AND DATE(rev_timestamp) > '{1}'
            GROUP BY
                date
        """.format(page_title,date_string)

    # Use the connection to run the query and return the results as a DataFrame
    _df = pd.read_sql_query(s,conn)
    
    _df['page_title'] = _df['page_title'].str.decode('utf8')
    _df['page_title'] = _df['page_title'].str.replace('_',' ')
    
    # Return the data
    return _df
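
The query ultimately just groups revision timestamps by calendar date and counts them. If you have raw revision timestamps on hand, the same aggregation can be sketched in pure pandas as a sanity check. The data below are made up for illustration; only the column names mirror the query's output.

```python
import pandas as pd

# Hypothetical raw revision timestamps for a single page
revs = pd.DataFrame({
    'rev_timestamp': pd.to_datetime(['2015-01-01 08:00', '2015-01-01 17:30',
                                     '2015-01-03 12:00']),
    'page_title': 'Example page'
})

# Group by calendar date and count rows, mirroring the SQL's
# DATE(rev_timestamp) ... COUNT(*) ... GROUP BY date
daily = (revs
         .groupby([revs['rev_timestamp'].dt.date, 'page_title'])
         .size()
         .reset_index(name='edits')
         .rename(columns={'rev_timestamp': 'date'}))
```

The two edits on January 1 collapse into a single row with an edit count of 2, just as the SQL `GROUP BY` does.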

def get_neighbors_revisions(page_title,conn):
    """ Takes a page title and returns revisions for the page and its neighbors.
      page_title = a string for the page title to get its revisions
      conn = a database connection
      
    Returns:
      A pandas DataFrame containing all the page revisions.
    """
    # Get the outlinks from the page and include the page itself in the list
    alters = get_page_outlinks(page_title) + [page_title]
    # Resolve the redirects in the list of alters
    alters = list(set(resolve_redirects(alters)))
    # Create an empty container to hold the DataFrames
    df_list = []
    # For each page, get the revision counts and append to the df_list
    for alter in alters:
        _df = get_page_edits_by_date(alter,conn)
        df_list.append(_df)
    # Concatenate the list of revision count DataFrames into a giant DataFrame
    df = pd.concat(df_list)
    # Return the data
    return df.reset_index(drop=True)

Get the authentication information and connect to the database.

host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']
conn = pymysql.connect(host=host,user=user,password=password,database='enwiki_p',connect_timeout=3600)
conn.cursor().execute('use enwiki_p');

Get the number of revisions per day for all the articles.

hl_daily_rev_df = get_neighbors_revisions(page_title,conn)
hl_daily_rev_df.head()
date page_title edits
0 2015-02-03 David Cutler 1.0
1 2015-03-17 David Cutler 1.0
2 2015-05-30 David Cutler 1.0
3 2015-06-04 David Cutler 1.0
4 2016-02-15 David Cutler 1.0

Reindex the edit data so it starts and ends on the same dates as the pageviews data.

# Convert into a format like the hl_pageviews DataFrame
# Index are dates between Jan 1, 2015 and today; columns are article titles; values are number of edits
hl_edits_df = hl_daily_rev_df.set_index(['date','page_title'])['edits'].unstack(1)

# Reindex so dates are continuous
pv_start_ix = str(hl_pvs_df.index.min().date())
_date_range = pd.date_range(pv_start_ix,yesterday_date_s)
hl_edits_df = hl_edits_df.reindex(index=_date_range)

# Fill in empty observations with 0s
hl_edits_df = hl_edits_df.fillna(0)

hl_edits_df.head()
[Wide DataFrame output: rows are dates beginning 2015-07-01, columns are the 267 article titles (from "105th United States Congress" through "White House Chief of Staff"), and values are daily edit counts, mostly 0.]

5 rows × 267 columns
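
The set_index/unstack/reindex/fillna pipeline above can be illustrated on a toy long-format frame (the page names here are invented):

```python
import pandas as pd

# Long-format daily edit counts, shaped like get_neighbors_revisions output
long_df = pd.DataFrame({
    'date': pd.to_datetime(['2015-07-01', '2015-07-01', '2015-07-03']),
    'page_title': ['Page A', 'Page B', 'Page A'],
    'edits': [2.0, 1.0, 3.0]})

# Pivot to wide format: one row per date, one column per page
wide = long_df.set_index(['date', 'page_title'])['edits'].unstack(1)

# Make the date index continuous, filling days with no edits with 0
full_range = pd.date_range('2015-07-01', '2015-07-04')
wide = wide.reindex(index=full_range).fillna(0)
```

Days on which a page had no revisions (July 2 and 4 here) become rows of zeros rather than missing rows, which is what lets the edit series line up with the continuous pageview series.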

Are pageviews and edits correlated with each other?

_s1 = hl_pvs_df[page_title]
_s2 = hl_edits_df[page_title]

np.corrcoef(_s1.apply(np.log),_s2)[0][1]
-0.019636102098212628
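One caveat with this computation: np.log of a zero-pageview day yields -inf, which propagates into np.corrcoef as nan. If your article has any zero-count days, a common hedge is np.log1p, which maps 0 to 0. A toy version of the same computation, with synthetic numbers rather than the article's data:

```python
import numpy as np
import pandas as pd

# Synthetic daily pageview and edit counts, including a zero-traffic day
pageviews = pd.Series([120, 340, 90, 0, 560])
edits = pd.Series([1, 3, 0, 0, 4])

# log1p avoids -inf on zero-pageview days, where plain np.log would fail
r = np.corrcoef(np.log1p(pageviews), edits)[0, 1]
```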
single_pv_edits_df = pd.DataFrame({'pageviews':_s1,'edits':_s2})
ax = single_pv_edits_df.plot(secondary_y='edits',logy=True)
ax.right_ax.set_yscale('log')
ax.set_ylabel('Pageviews')
ax.right_ax.set_ylabel('Edits')

Can Wikipedia supply information to keep up with demand?

The ratio between the cumulative pageviews and cumulative edits.

ax = (_s1.cumsum()/_s2.cumsum()).plot()

ax.set_ylabel('Cumulative pageviews per edit')
def zscore(series):
    """Return the absolute value of each observation's z-score."""
    return np.abs((series - series.mean())/series.std())
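
Note that this returns the absolute z-score, which is what keeps every point in the positive quadrant of the log-log plots that follow. A quick check on synthetic data (the function is repeated here so the cell stands alone):

```python
import numpy as np
import pandas as pd

def zscore(series):
    """Return the absolute value of each observation's z-score."""
    return np.abs((series - series.mean()) / series.std())

# The mean observation maps to 0; observations equidistant from the
# mean on either side map to the same positive value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
z = zscore(s)
```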

Look at the normalized (z-score) excitation and relaxation in edits and pageviews by day. Each point is a single day in the article's history, and consecutive days are connected by a line. Values along the red diagonal suggest that increases in attention to the article are matched by similar increases in editing activity. Data points in the upper-left triangle indicate increases in pageviews that are not matched by increases in edits, while data points in the lower-right triangle indicate increases in edits that are not matched by increases in pageviews.

f,ax = plt.subplots(1,1)

ax.set_xlabel('Edits (z-score)')
ax.set_ylabel('Pageviews (z-score)')
ax.set_xlim((1e-3,1e2))
ax.set_ylim((1e-3,1e2))
ax.set_xscale('log')
ax.set_yscale('log')

ax.text(1e-1,1e1,'More views than edits',ha='center',weight='bold')
ax.text(1e1,1e-1,'More edits than views',ha='center',weight='bold')

# Diagonal red line showing perfect matching
ax.plot([1e-3,1e2],[1e-3,1e2],c='r')

_s1 = zscore(hl_edits_df[page_title])
_s2 = zscore(hl_pvs_df[page_title])
ax.plot(_s1,_s2,'o-',c='grey');
f,ax = plt.subplots(1,1)

ax.set_xlabel('Edits (z-score)')
ax.set_ylabel('Pageviews (z-score)')
ax.set_xlim((1e-3,1e2))
ax.set_ylim((1e-3,1e2))
ax.set_xscale('log')
ax.set_yscale('log')

# Plot every article's daily (edits, pageviews) z-scores as translucent points
for page in hl_edits_df.columns:
    _s1 = zscore(hl_edits_df[page])
    _s2 = zscore(hl_pvs_df[page])
    #ax.plot(_s1.loc[_s2.index],_s2,'.-',alpha=.125,c='grey')
    ax.scatter(_s1,_s2,alpha=.125,c='grey')
    
# Diagonal red line showing perfect matching
ax.plot([1e-3,1e2],[1e-3,1e2],c='r',zorder=-1);