Lab 4 - Pageviews

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the fourth of five lab notebooks that will explore how to analyze the structure of collaborations in Wikipedia data about users' revisions across multiple articles. This lab extends the methods from the previous two labs on analyzing a single article's revision history and analyzing the hyperlink network around a single Wikipedia page. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here with more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm that basic Python commands work

a = 3
b = 4
c = 5
(c-a)**b
16

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Packages for requesting data via the web and parsing the resulting JSON and HTML
import requests                      # http://docs.python-requests.org/en/master/
import json                          # https://docs.python.org/3/library/json.html
from bs4 import BeautifulSoup        # https://www.crummy.com/software/BeautifulSoup/bs4/doc/

# Packages for accessing the MySQL server and reading credentials from the environment
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Set up plots to use a white grid background and let DataFrames display more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Define an article whose pageview dynamics we will examine.

page_title = 'Halsey (singer)'

Get pageview data for a single article

Details about the Wikimedia REST API for pageviews are available here. Unfortunately, this data endpoint only provides information going back to July 1, 2015.

Here we construct an example request URL for this endpoint and then make the request to see what it returns.

# Get today's date and yesterday's date
today = pd.datetime.today()
yesterday = pd.datetime.today() - pd.to_timedelta('1 day')

# Convert the dates to YYYY-MM-DD strings
today_date_s = str(today.date())
yesterday_date_s = str(yesterday.date())

# Convert the dates to the YYYYMMDDHH format the API expects
today_s = pd.datetime.strftime(today,'%Y%m%d00')
yesterday_s = pd.datetime.strftime(yesterday,'%Y%m%d00')

# Get the pageviews for today and yesterday
url_string = 'http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{0}/daily/{1}/{2}'
print(url_string.format(page_title.replace(' ','_'),yesterday_s,today_s))
http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Halsey_(singer)/daily/2016103100/2016110100
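
To see what this request actually returns, we can fetch the URL printed above and inspect the JSON payload. This is a quick sketch rather than part of the lab's main pipeline; note that the most recent day may not be available yet, in which case the response has no 'items' key.

# A minimal sketch: request the example URL above and inspect the JSON that comes back
example_req = requests.get(url_string.format(page_title.replace(' ','_'),yesterday_s,today_s))
example_json = json.loads(example_req.text)

# Each entry in 'items' is a dict with fields such as 'article', 'timestamp', and 'views';
# if 'items' is absent the whole payload is shown instead
example_json.get('items', example_json)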

Write a function to get the daily pageviews from January 1, 2015 until yesterday. (In practice the API only has data back to mid-2015, so the earliest date returned will fall somewhere between May and August 2015.)

def get_daily_pageviews(page_title,today_s):
    url_string = 'http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{0}/daily/2015010100/{1}'
    req = requests.get(url_string.format(page_title,today_s))

    json_s = json.loads(req.text)
    if 'items' in json_s.keys():
        _df = pd.DataFrame(json_s['items'])[['timestamp','views','article']]
        _df['timestamp'] = pd.to_datetime(_df['timestamp'],format='%Y%m%d00')
        _df['weekday'] = _df['timestamp'].apply(lambda x:x.weekday())
        return _df

Get the data for your page.

pageview_df = get_daily_pageviews(page_title,today_s)
pageview_df.head()
timestamp views article weekday
0 2015-07-01 4459 Halsey_(singer) 2
1 2015-07-02 3623 Halsey_(singer) 3
2 2015-07-03 3553 Halsey_(singer) 4
3 2015-07-04 4167 Halsey_(singer) 5
4 2015-07-05 3922 Halsey_(singer) 6

Interpret pageview results

What does the pageview activity look like? Are there any bursts of attention? What might these bursts be linked to?

ax = pageview_df.plot.line(x='timestamp',y='views',legend=False)
ax.set_xlabel('')
ax.set_ylabel('Pageviews')

Use a logarithmic scaling for the y-axis to see more of the detail in the lower-traffic days.

ax = pageview_df.plot.line(x='timestamp',y='views',logy=True,legend=False)
ax.set_xlabel('')
ax.set_ylabel('Pageviews')

What are the dates for the biggest pageview outliers? Here we define an "outlier" to be more than 4 standard deviations above the average number of pageviews over the time window.

std_threshold = 4
threshold_val = pageview_df['views'].mean() + pageview_df['views'].std() * std_threshold
peak_days = pageview_df[pageview_df['views'] > threshold_val]

peak_days.head(10)
timestamp views article weekday
425 2016-08-29 79443 Halsey_(singer) 0
426 2016-08-30 56326 Halsey_(singer) 1
427 2016-08-31 37843 Halsey_(singer) 2

How much of the total pageview activity occurred on these days compared to the rest of the pageviews?

peak_fraction = pageview_df.loc[peak_days.index,'views'].sum()/pageview_df['views'].sum()

print('{0:.1%} of all pageviews occurred on the {1} peak days.'.format(peak_fraction,len(peak_days)))
4.5% of all pageviews occurred on the 3 peak days.
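
For comparison, it is worth checking what share of the days in the window those peaks represent. This is a one-line sketch using the same DataFrames built above.

print('{0:.1%} of days in the window are peak days.'.format(len(peak_days)/len(pageview_df)))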

How does pageview activity change over the course of a week?

g = sb.factorplot(x='weekday',y='views',data=pageview_df,kind='bar',color='grey',
                  aspect=1.67,estimator=np.median)
ax = g.axes[0][0]
ax.set_xticklabels(['Mon','Tue','Wed','Thu','Fri','Sat','Sun'],rotation=0)
ax.set_xlabel('')
ax.set_ylabel('Median pageviews')
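
For a numeric view of the same weekday pattern, the medians can also be computed directly from the weekday column added earlier (a quick sketch; 0 is Monday and 6 is Sunday).

# Median daily pageviews by day of the week
pageview_df.groupby('weekday')['views'].median()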

Compare pageviews to another page

Let's write a function that takes a list of article titles and returns a DataFrame with dates as the index, one column per article, and daily pageview counts as the values.

def get_multiple_pageviews(page_list,today_s):
    multiple_pv_df = pd.DataFrame(index=pd.date_range('2015-05-01', today_date_s))
    for page in page_list:
        pv_df = get_daily_pageviews(page,today_s)
        try:
            multiple_pv_df[page] = pv_df.set_index('timestamp')['views'] 
        except Exception:
            # If the request failed or returned no usable data for this page, note it and fill with NaNs
            print("Error on: {0}".format(page))
            multiple_pv_df[page] = np.nan
    return multiple_pv_df.dropna(how='all')

Enter two related pages for which you want to compare their pageview behavior.

page_list = ['Halsey_(singer)','Lady_Gaga']

Get both of their data.

# Get the data
multiple_pvs = get_multiple_pageviews(page_list,today_s)

# Show the top rows
multiple_pvs.tail()
Halsey_(singer) Lady_Gaga
2016-10-26 10572.0 88346.0
2016-10-27 9729.0 81683.0
2016-10-28 9048.0 59796.0
2016-10-29 9226.0 49229.0
2016-10-30 9651.0 44431.0

Plot the data.

multiple_pvs.plot(logy=True)

What is the correlation coefficient between these two articles' behavior?

multiple_pvs.apply(np.log).corr()
Halsey_(singer) Lady_Gaga
Halsey_(singer) 1.000000 0.296763
Lady_Gaga 0.296763 1.000000
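
The single correlation coefficient above collapses the whole time window into one number. A rolling correlation shows how the relationship between the two articles changes over time; this is a minimal sketch, and the 30-day window is an assumption rather than part of the lab.

# Correlate the log pageviews of the two articles over a rolling 30-day window
logged_pvs = multiple_pvs.apply(np.log)
rolling_corr = logged_pvs[page_list[0]].rolling(window=30).corr(logged_pvs[page_list[1]])

ax = rolling_corr.plot()
ax.set_ylabel('30-day rolling correlation')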

How did the ratio between the two articles' pageviews change over time?

ratio_s = multiple_pvs[page_list[0]].div(multiple_pvs[page_list[1]])
ax = ratio_s.plot()
ax.set_ylabel('{0}/{1}'.format(page_list[0],page_list[1]))

Use the functions for resolving redirects and getting page outlinks from prior labs.

# From http://stackoverflow.com/a/312464/1574687
def make_chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def resolve_redirects(page_title_list):
    # Chunk the pages into a list of lists of size 50
    chunks = make_chunks(page_title_list,50)
    # Create an empty list to fill with the redirected titles
    redirected_page_titles = []
    # For each chunk try to get the redirects
    for chunk in chunks:
        # Create the query string that separates spaces within page titles by '+' 
        # and separates page titles by '|'
        page_titles_string = '|'.join([page.replace(' ','+') for page in chunk])
        # Put this large string into the URL
        url_string = 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=pageprops&titles={0}&redirects=1'.format(page_titles_string)
        # Do the query and parse the JSON response into a dictionary
        req = json.loads(requests.get(url_string).text)
        # Convert the returned values containing redirects into a dictionary
        if 'redirects' in req['query'].keys():
            redirected_titles = {d['from']:d['to'] for d in req['query']['redirects']}
            # Add the redirected titles to the list
            for title in chunk:
                try:
                    #print(len(redirected_page_titles), title, redirected_titles[title])
                    redirected_page_titles.append(redirected_titles[title])
                # If they don't have a redirect just add the original title
                except KeyError:
                    #print(len(redirected_page_titles), '\nFrom: ', title, '\nTo: ', title)
                    redirected_page_titles.append(title)
        else:
            for title in chunk:
                redirected_page_titles.append(title)
    # Make sure the number of page titles remained the same, otherwise raise a warning
    if len(page_title_list) == len(redirected_page_titles):
        return redirected_page_titles
    else:
        print("WARNING! The number of page titles in the redirected list ({0}) is not equal to the input list ({1})".format(len(redirected_page_titles),len(page_title_list)))
        return redirected_page_titles

def get_page_outlinks(page_title,redirects=1):
    # Replace spaces with underscores
    #page_title = page_title.replace(' ','_')
    
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard','Portal:','s:']
    
    # Get the response from the API for a query
    # After passing a page title, the API returns the HTML markup of the current article version within a JSON payload
    req = requests.get('https://en.wikipedia.org/w/api.php?action=parse&format=json&page={0}&redirects={1}&prop=text&disableeditsection=1&disabletoc=1'.format(page_title,redirects))
    
    # Read the response into JSON to parse and extract the HTML
    json_string = json.loads(req.text)
    
    # Initialize an empty list to store the links
    outlinks_list = [] 
    
    if 'parse' in json_string.keys():
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html,'lxml')

        # Delete tags associated with templates
        for tag in soup.find_all('tr'):
            tag.replace_with('')

        # For each paragraph tag, extract the titles within the links
        for para in soup.find_all('p'):
            for link in para.find_all('a'):
                if link.has_attr('title'):
                    title = link['title']
                    # Ignore links that aren't interesting
                    if all(bad not in title for bad in bad_titles):
                        outlinks_list.append(title)

        # For each unordered list, extract the titles within the child links
        for unordered_list in soup.find_all('ul'):
            for item in unordered_list.find_all('li'):
                for link in item.find_all('a'):
                    if link.has_attr('title'):
                        title = link['title']
                        # Ignore links that aren't interesting
                        if all(bad not in title for bad in bad_titles):
                            outlinks_list.append(title)

    return outlinks_list

Get the outlinks.

raw_outlinks = get_page_outlinks(page_title)
redirected_outlinks = resolve_redirects(raw_outlinks)

Get the data.

This stage may take several minutes.

# Get the data
hl_pvs_df = get_multiple_pageviews(redirected_outlinks + [page_title],today_s)

# Show the top rows
hl_pvs_df.head()
(Output: the first five rows of a DataFrame indexed by date, starting 2015-07-01, with one pageview column per article in the hyperlink network: Astralwerks, Extended play, Room 93, Badlands (Halsey album), New Jersey, Nickname, YouTube, Taylor Swift, ..., Badlands Tour, Young Rising Sons, and Halsey_(singer).)

What are the most-viewed articles in the hyperlink network?

most_viewed_articles = hl_pvs_df.cumsum().loc[str(yesterday.date())]
most_viewed_articles = most_viewed_articles.sort_values(ascending=False)
most_viewed_articles.head(10)
YouTube                 14042361.0
Taylor Swift            11023550.0
Kanye West              10448695.0
Tupac Shakur             9272378.0
The Weeknd               8857980.0
Lady Gaga                7745612.0
One Direction            6729110.0
Quentin Tarantino        6280362.0
The Notorious B.I.G.     4930118.0
Halsey_(singer)          3851557.0
Name: 2016-10-30 00:00:00, dtype: float64

Most and least correlated articles

Which articles are most correlated with each other?

# Log the pageview data to reduce skew from bursty outliers and make the correlation table
hl_corr_df = hl_pvs_df.apply(np.log).corr()

# Correlation table is symmetric, drop one half of them
# From: http://stackoverflow.com/questions/34417685/melt-the-upper-triangular-matrix-of-a-pandas-dataframe
hl_corr_df = hl_corr_df.where(np.triu(np.ones(hl_corr_df.shape)).astype(bool))

# Unstack the DataFrame into a series and sort
hl_corr_s = hl_corr_df.unstack().sort_values(ascending=False)

# Drop NaNs
hl_corr_s = hl_corr_s.dropna()

# Drop values equal to 1
hl_corr_s = hl_corr_s[hl_corr_s < 1]

List out the 10 most correlated articles.

hl_corr_s.head(10)
Taking Back Sunday       Adam Lazzara               0.883020
Harry Styles             One Direction              0.787293
Badlands (Halsey album)  Astralwerks                0.759586
Smoke + Mirrors Tour     Imagine Dragons            0.742321
Imagine Dragons          Badlands (Halsey album)    0.665905
Halsey_(singer)          Astralwerks                0.660839
Young Rising Sons        Imagine Dragons            0.659175
Billboard (magazine)     Extended play              0.650561
Smoke + Mirrors Tour     Badlands (Halsey album)    0.649752
The Weeknd               Badlands (Halsey album)    0.641230
dtype: float64

Inspect this correlation from the raw data.

_df = hl_pvs_df[list(hl_corr_s.index[0])]

ax = _df.plot(logy=True)

Look at the 10 least-correlated articles.

hl_corr_s.tail(10)
Nirvana (band)                      The Chainsmokers                     -0.548528
The Chainsmokers                    Badlands (Halsey album)              -0.548899
                                    I Knew You Were Trouble              -0.586601
                                    The Late Show with Stephen Colbert   -0.598018
The Weeknd                          The Chainsmokers                     -0.605107
The Late Show with Stephen Colbert  Room 93                              -0.606741
Young Rising Sons                   The Chainsmokers                     -0.612024
The Chainsmokers                    Smoke + Mirrors Tour                 -0.657159
Bone Thugs-n-Harmony                Room 93                              -0.660283
The Chainsmokers                    Imagine Dragons                      -0.755484
dtype: float64

Plot the pageviews for the two most anti-correlated articles. These can show some surprising patterns that are worth exploring or thinking about further.

_df = hl_pvs_df[list(hl_corr_s.index[-1])]

ax = _df.plot(logy=True)

Is there a relationship between the position of the link on the page and the correlation between the linked article's pageviews and the seed article's pageviews? For instance, links closer to the top of the page might reflect more important topics while links towards the end of the page might be less relevant.

link_corrs = []

# The correlation index uses the column labels from hl_pvs_df, so match the seed article's label
seed_col = page_title if page_title in hl_pvs_df.columns else page_title.replace(' ','_')

for num,link in enumerate(redirected_outlinks):
    try:
        link_corrs.append({'position':num,'title':link,'corr':hl_corr_s.loc[(seed_col,link)]})
    except KeyError:
        print("Error on: {0}".format(link))

Plot the results.

ax = pd.DataFrame(link_corrs).plot.scatter(x='position',y='corr')
ax.set_xlim((0,len(link_corrs)))
ax.set_ylim((-1,1))
ax.set_xlabel('Link position')
ax.set_ylabel('Correlation')
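
To put a rough number on the question above, the rank relationship between link position and correlation can be checked directly. This is a quick sketch using the link_corrs list built above; the choice of Spearman rank correlation is an assumption, not part of the lab.

# Rank correlation between where a link appears on the page and how correlated its pageviews are with the seed article
pd.DataFrame(link_corrs)[['position','corr']].corr(method='spearman')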

Get page revisions

In this section, we'll repurpose and adapt code from the last lab to get data about page revisions. Rather than looking at the number of times a user contributed to a given article, we'll simply count the number of times the article was edited on a given date.

def get_page_edits_by_date(page_title,conn,date_string='2014-12-31'):
    """ Takes a page title and returns the number of revisions made on each date.
      page_title = a string for the page title to get its revisions
      date_string = a string for the date in YYYY-MM-DD format
      conn = a database connection
      
    Returns:
      A DataFrame with the date, page title, and number of edits made on that date
    """
    # In case you pass a page title with spaces in it, replace the spaces with underscores
    page_title = page_title.replace(' ','_').encode('utf8').decode('latin1')
    
    # The MySQL query string used to retrieve the data. By line, it is
    ## converting the timestamp to a date and 
    ## counting the number of elements
    ## from the "revisions" table
    ## joining the "page" table on it
    ## using the page_id and rev_page columns as keys
    ## limiting the results to entries that have the pagetitle, 
    ## occur in the namespace, and happen after Dec 31, 2014
    ## grouping the results by date
    s = """
            SELECT
                DATE(rev_timestamp) as date,
                page_title,
                COUNT(*) as edits
            FROM 
                revision 
            JOIN 
                page ON page.page_id = revision.rev_page
            WHERE 
                page.page_title = "{0}" 
                AND page_namespace = 0
                AND DATE(rev_timestamp) > '{1}'
            GROUP BY
                date
        """.format(page_title,date_string)

    # Use the connection to run the query and return the results as a DataFrame
    _df = pd.read_sql_query(s,conn)
    
    _df['page_title'] = _df['page_title'].str.decode('utf8')
    _df['page_title'] = _df['page_title'].str.replace('_',' ')
    
    # Return the data, with a clean index
    return _df
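
If you do not have access to the Wikimedia database replicas, a similar per-day edit count can be assembled from the public MediaWiki API's revisions endpoint. The sketch below is illustrative only; the helper name is made up and it is not used elsewhere in this lab.

def get_page_edits_by_date_api(page_title,date_string='2014-12-31'):
    """ Hypothetical helper: count revisions per day using the public API instead of the replica database. """
    params = {'action':'query','format':'json','prop':'revisions','titles':page_title,
              'rvprop':'timestamp','rvlimit':'max','rvdir':'newer',
              'rvstart':date_string + 'T00:00:00Z'}
    timestamps = []
    while True:
        req = json.loads(requests.get('https://en.wikipedia.org/w/api.php',params=params).text)
        # Collect the timestamp of every revision returned in this batch
        for page in req['query']['pages'].values():
            timestamps += [rev['timestamp'] for rev in page.get('revisions',[])]
        # Follow the API's continuation parameters until all revisions are retrieved
        if 'continue' in req:
            params.update(req['continue'])
        else:
            break
    # Count revisions per calendar date
    _s = pd.Series(pd.to_datetime(timestamps)).dt.date.value_counts().sort_index()
    return _s.rename('edits')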

def get_neighbors_revisions(page_title,conn):
    """ Takes a page title and returns revisions for the page and its neighbors.
      page_title = a string for the page title to get its revisions
      
    Returns:
      A pandas DataFrame containing all the page revisions.
    """
    # Get the outlinks from the page and include the page itself in the list
    alters = get_page_outlinks(page_title) + [page_title]
    # Resolve the redirects in the list of alters
    alters = list(set(resolve_redirects(alters)))
    # Create an empty container to hold the DataFrames
    df_list = []
    # For each page, get the revision counts and append to the df_list
    for alter in alters:
        _df = get_page_edits_by_date(alter,conn)
        df_list.append(_df)
    # Concatenate the list of revision count DataFrames into a giant DataFrame
    df = pd.concat(df_list)
    # Return the data
    return df.reset_index(drop=True)

Get the authentication information and connect to the database.

host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']
conn = pymysql.connect(host=host,user=user,password=password,database='enwiki_p',connect_timeout=3600)
conn.cursor().execute('use enwiki_p');

Get the number of revisions per day for all the articles.

hl_daily_rev_df = get_neighbors_revisions(page_title,conn)
hl_daily_rev_df.head()
date page_title edits
0 2015-01-01 Taylor Swift 2
1 2015-01-02 Taylor Swift 2
2 2015-01-03 Taylor Swift 16
3 2015-01-04 Taylor Swift 12
4 2015-01-05 Taylor Swift 3

Reindex the edit data so that it starts and ends on the same dates as the pageview data.

# Convert into a format like the hl_pageviews DataFrame
# Index is the dates covered by the pageview data (through yesterday); columns are article titles; values are number of edits
hl_edits_df = hl_daily_rev_df.set_index(['date','page_title'])['edits'].unstack(1)

# Reindex so dates are continuous
pv_start_ix = str(hl_pvs_df.index.min().date())
_date_range = pd.date_range(pv_start_ix,yesterday_date_s)
hl_edits_df = hl_edits_df.reindex(index=_date_range)

# Fill in empty observations with 0s
hl_edits_df = hl_edits_df.fillna(0)

hl_edits_df.head()
(Output: the first five rows of a DataFrame indexed by date, starting 2015-07-01, with one edit-count column per article in the hyperlink network: Adam Lazzara, Alanis Morissette, Astralwerks, Badlands (Halsey album), ..., YouTube, and Young Rising Sons; most values are 0 with occasional small daily edit counts.)

Are pageviews and edits correlated with each other?

page_title
'Halsey (singer)'

# hl_pvs_df labels the seed article with underscores; hl_edits_df uses spaces
_s1 = hl_pvs_df['Halsey_(singer)'].dropna()
_s2 = hl_edits_df['Halsey (singer)']

# Align the two series on their common dates before correlating
_s1, _s2 = _s1.align(_s2, join='inner')

np.corrcoef(_s1.apply(np.log),_s2)[0][1]

single_pv_edits_df = pd.DataFrame({'pageviews':_s1,'edits':_s2})
ax = single_pv_edits_df.plot(secondary_y='edits',logy=True)
ax.right_ax.set_yscale('log')
ax.set_ylabel('Pageviews')
ax.right_ax.set_ylabel('Edits')

Can Wikipedia supply information to keep up with demand?

Plot the ratio of cumulative pageviews to cumulative edits over time.

ax = (_s1.cumsum()/_s2.cumsum()).plot()

ax.set_ylabel('Cumulative pageviews per edit')

# Compute the absolute z-score of a series (distance from the mean in standard deviations)
def zscore(series):
    return np.abs((series - series.mean())/series.std())

Look at the normalized (z-score) excitation and relaxation in edits and pageviews by day. Each point is a single day in the article's history, and points are connected when they fall on consecutive days. Values along the red diagonal suggest that increases in attention to the article are matched by similar increases in editing activity. Data points in the upper-left triangle indicate increases in pageviews that are not matched by increases in edits, while data points in the lower-right triangle indicate increases in edits that are not matched by increases in pageviews.

f,ax = plt.subplots(1,1)

ax.set_xlabel('Edits (z-score)')
ax.set_ylabel('Pageviews (z-score)')
ax.set_xlim((1e-3,1e2))
ax.set_ylim((1e-3,1e2))
ax.set_xscale('log')
ax.set_yscale('log')

plt.text(1e-1,1e1,'More views than edits',ha='center',weight='bold')
plt.text(1e1,1e-1,'More edits than views',ha='center',weight='bold')

plt.plot([1e-3,1e2],[1e-3,1e2],axes=ax,c='r')

_s1 = zscore(hl_edits_df['Halsey (singer)'])
_s2 = zscore(hl_pvs_df['Halsey_(singer)'])
# Align the two series on their common dates so they have the same length
_s1, _s2 = _s1.align(_s2, join='inner')
plt.plot(_s1,_s2,'o-',axes=ax,c='grey');

f,ax = plt.subplots(1,1)

ax.set_xlabel('Edits (z-score)')
ax.set_ylabel('Pageviews (z-score)')
ax.set_xlim((1e-3,1e2))
ax.set_ylim((1e-3,1e2))
ax.set_xscale('log')
ax.set_yscale('log')

for page in hl_edits_df.columns:
    # The pageview DataFrame labels the seed article with underscores, so map the edit column to its pageview column
    pv_page = page if page in hl_pvs_df.columns else page.replace(' ','_')
    if pv_page not in hl_pvs_df.columns:
        continue
    _s1 = zscore(hl_edits_df[page])
    _s2 = zscore(hl_pvs_df[pv_page])
    # Align on common dates so the two series have the same length
    _s1, _s2 = _s1.align(_s2, join='inner')
    plt.scatter(_s1,_s2,axes=ax,alpha=.125,c='grey')
    
# Diagonal red line showing perfect matching
plt.plot([1e-3,1e2],[1e-3,1e2],axes=ax,c='r',zorder=-1);