English Wikipedia page views, 2008–2017

Here I retrieve, aggregate, and visualize monthly page views on English Wikipedia from January 2008 through September 2017. I group the data by access method: desktop (the desktop website) and mobile (the mobile website plus the mobile app). I also present total views by month, which is simply the sum of monthly desktop and mobile views. I visualize the results (see below) and save the output to a .csv file.

This work is meant to be fully reproducible. I welcome comments if any errors or hurdles are discovered during an attempted reproduction.

Setup

We run a few lines of code to set up the environment before we get going.

#https://github.com/rexthompson/data-512-a1/blob/master/hcds-a1-data-curation.ipynb
import json
import matplotlib.pyplot as plt
import os
import pandas as pd
import requests

%matplotlib inline

Data Ingest

The first step in all data projects is data retrieval. For this project we will be downloading data from Wikimedia's REST API. We'll pull from two endpoints:

  • Legacy Pagecounts API
  • Pageview API
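
As a quick orientation, the two endpoints return JSON with a top-level 'items' list, but the record schemas differ slightly, which is why the extraction code differs between them. The sketches below are based on the API documentation; the numeric values are placeholders, not real data.

```python
# Abridged record sketches for each endpoint (placeholder counts).
pagecounts_item = {
    'project': 'en.wikipedia',
    'access-site': 'desktop-site',
    'granularity': 'monthly',
    'timestamp': '2008010100',
    'count': 123456789,          # Pagecounts uses 'count'
}
pageviews_item = {
    'project': 'en.wikipedia',
    'access': 'desktop',
    'agent': 'user',
    'granularity': 'monthly',
    'timestamp': '2015070100',
    'views': 123456789,          # Pageviews uses 'views'
}
```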

Before we make our first call, we define an empty dictionary to hold the results. We also define a second dictionary with some contact information which we'll pass to the API calls when they are made.

# create a dictionary to hold API response data
data_dict = dict()

# define headers to pass to the API call
headers = {'User-Agent': 'https://github.com/rexthompson', 'From': 'rext@uw.edu'}

Pagecounts (Legacy data)

The code below performs the following steps for the Pagecounts data, for both the desktop and mobile sites:

  • retrieves data from the API, or loads the data from a local file if it already exists
  • saves the raw data for each source as .json files if it does not already exist
  • extracts the year, month and count data and converts it to a dataframe
  • saves the dataframe to a dictionary for later combination with other sources

I chose to perform this step in a loop since we would otherwise be repeating much of the same code.

# define endpoint and access sites
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'
access_sites = ['desktop-site', 'mobile-site']

# make sure the output directory exists before we try to write to it
os.makedirs('data/raw', exist_ok=True)

# repeat for each access site of interest
for access_site in access_sites:
    
    # set filename for access-specific API call response JSON file
    filename = 'data/raw/pagecounts_' + access_site + '_200807_201709.json'
    
    # check if file already exists; load if so, create if not
    if os.path.isfile(filename):
        with open(filename) as json_data:
            response = json.load(json_data)
        print('loaded JSON data from ./' + filename)
    else:
        # define parameters
        params = {'project' : 'en.wikipedia.org',
                  'access-site' : access_site,     # [all-sites, desktop-site, mobile-site]
                  'granularity' : 'monthly',       # [hourly, daily, monthly]
                  'start' : '2008010100',
                  'end' : '2017100100' # use the first day of the following month to ensure a full month of data is collected
                  }
        
        # fetch and format data
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()
        
        # format and save output as JSON file
        with open(filename, 'w') as f:
             json.dump(response, f)
        print('saved JSON data to ./' + filename)
    
    # convert to dataframe
    temp_df = pd.DataFrame.from_dict(response['items'])
    temp_df['yyyymm'] = temp_df.timestamp.str[0:6]
    col_name = 'pc_' + access_site
    temp_df.rename(columns={'count': col_name}, inplace=True)
        
    # save to dictionary for later combination
    data_dict[col_name] = temp_df[['yyyymm', col_name]]

Pageviews (New data)

We now repeat the same process as above, but for the Pageviews data. I perform this step separately from the loop above since the API call parameters and response schema differ enough to warrant splitting the two. See Pageview API and Legacy Pagecounts API.

The loop below continues appending dataframes to our data dictionary. When complete, the dictionary contains data for all five API call endpoints.

# define endpoint and access sites
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'
access_sites = ['desktop', 'mobile-app', 'mobile-web']

# repeat for each access site of interest
for access_site in access_sites:
    
    # set filename for access-specific API call response JSON file
    filename = 'data/raw/pageviews_' + access_site  + '_200807_201709.json'

    # check if file already exists; load if so, create if not
    if os.path.isfile(filename):
        with open(filename) as f:
            response = json.load(f)
        print('loaded JSON data from ./' + filename)
    else:
        # define parameters
        params = {'project' : 'en.wikipedia.org',
                  'access' : access_site,         # [all-access, desktop, mobile-app, mobile-web]
                  'agent' : 'user',               # [all-agents, user, spider]
                  'granularity' : 'monthly',      # [hourly, daily, monthly]
                  'start' : '2008010100',
                  'end' : '2017100100' # use the first day of the following month to ensure a full month of data is collected
                  }

        # fetch and format data
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()

        # format and save output as JSON file
        with open(filename, 'w') as f:
             json.dump(response, f)
        print('saved JSON data to ./' + filename)

    # convert to dataframe
    temp_df = pd.DataFrame.from_dict(response['items'])
    temp_df['yyyymm'] = temp_df.timestamp.str[0:6]
    col_name = 'pv_' + access_site
    temp_df.rename(columns={'views': col_name}, inplace=True)
        
    # save to dictionary for later combination
    data_dict[col_name] = temp_df[['yyyymm', col_name]]
saved JSON data to ./data/raw/pageviews_desktop_200807_201709.json
saved JSON data to ./data/raw/pageviews_mobile-app_200807_201709.json
saved JSON data to ./data/raw/pageviews_mobile-web_200807_201709.json

Data Aggregation

At this point our data consists of five separate dataframes, stored in a single data dictionary. We want to get this all into the same dataframe so we can work with it, export it, and plot it. To do this, we will define a new dataframe (df) which consists of the first dataframe in the dictionary. We'll then loop over the remaining dictionary elements and merge them all together.

The end result is a single dataframe with a single date column (yyyymm) and a column for each of the original five dataframes. The dataframes are merged on the date column so they are sure to align vertically.
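
A toy example (with hypothetical values) shows how the outer merge aligns rows on the shared 'yyyymm' key instead of relying on row order:

```python
import pandas as pd

# Two small frames with partially overlapping months.
a = pd.DataFrame({'yyyymm': ['201601', '201602'], 'pc_desktop-site': [10, 20]})
b = pd.DataFrame({'yyyymm': ['201602', '201603'], 'pv_desktop': [7, 8]})
merged = a.merge(b, how='outer', on='yyyymm')
# '201601' has no pageview record and '201603' has no pagecount record;
# the outer merge keeps both rows and fills the missing cells with NaN.
print(merged)
```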

keys = list(data_dict.keys())

df = data_dict[keys[0]]
for i in range(1, len(keys)):
    df = df.merge(data_dict[keys[i]], how='outer', on='yyyymm')

Data Export

First we run the following line to convert missing values to zero, as per the assignment instructions.

df = df.fillna(0)

We then run the following line to convert the data to integers, since we are working with discrete counts. Omitting this step would result in unnecessary decimal precision in the output .csv file.

df.iloc[:,1:] = df.iloc[:,1:].astype(int)
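
A minimal illustration of why this cast is needed: NaN is a float, so a merge that introduces missing values silently promotes an integer count column to a float dtype, and it stays float even after fillna(0) until explicitly cast back.

```python
import pandas as pd

# A two-element series with one missing value.
s = pd.Series([100, None])
print(s.dtype)                 # float64, because NaN is a float
s = s.fillna(0).astype(int)
print(s.dtype)                 # an integer dtype again
```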

Now we perform a few simple string splitting and aggregation steps to create the final dataframe that we'll export to CSV.

# create dataframe for .csv and plot
df_new = pd.DataFrame({'year':df['yyyymm'].str[0:4],
                       'month':df['yyyymm'].str[4:6],
                       'pagecount_all_views':df['pc_desktop-site'] + df['pc_mobile-site'],
                       'pagecount_desktop_views':df['pc_desktop-site'],
                       'pagecount_mobile_views':df['pc_mobile-site'],
                       'pageview_all_views':df['pv_desktop'] + df['pv_mobile-app'] + df['pv_mobile-web'],
                       'pageview_desktop_views':df['pv_desktop'],
                       'pageview_mobile_views':df['pv_mobile-app'] + df['pv_mobile-web']})

# reorder columns
df_new = df_new[['year',
                 'month',
                 'pagecount_all_views',
                 'pagecount_desktop_views',
                 'pagecount_mobile_views',
                 'pageview_all_views',
                 'pageview_desktop_views',
                 'pageview_mobile_views']]

Now we'll save to CSV, unless the file already exists, in which case we load it instead so that repeated runs stay consistent. We'll call this data "prelim" since we aren't yet sure whether there are any problems with it.

# set filename for combined data CSV
filename = 'data/en-wikipedia_traffic_200801-201709_prelim.csv'

# check if file already exists; load if so, create if not
if os.path.isfile(filename):
    df_new = pd.read_csv(filename)
    print('loaded CSV data from ./' + filename)
else:
    df_new.to_csv(filename, index=False)
    print('saved CSV data to ./' + filename)
saved CSV data to ./data/en-wikipedia_traffic_200801-201709_prelim.csv

Data Visualization

We want to have a look at the data. But first, we need a few cleanup steps.

We don't want the placeholder zeros to clutter the plot, so we reverse the earlier step and convert the 0's back to NaN.

df_new.replace(0, float('nan'), inplace=True)

We also build a series of proper timestamps. This is a necessary step since the plot function won't recognize the raw year and month values as dates.

yyyymm = pd.to_datetime(df_new['year'].astype(str) + df_new['month'].astype(str), format='%Y%m')

Now we plot the data.

plt.rcParams["figure.figsize"] = (14, 4)
plt.plot(yyyymm, df_new['pagecount_desktop_views']/1e6, 'g--')
plt.plot(yyyymm, df_new['pagecount_mobile_views']/1e6, 'b--')
plt.plot(yyyymm, df_new['pagecount_all_views']/1e6, 'k--')
plt.legend(['main site','mobile site','total'], loc=2, framealpha=1)
plt.plot(yyyymm, df_new['pageview_desktop_views']/1e6, 'g-')
plt.plot(yyyymm, df_new['pageview_mobile_views']/1e6, 'b-')
plt.plot(yyyymm, df_new['pageview_all_views']/1e6, 'k-')
plt.grid(True)
plt.ylim((0,12000))
plt.xlim(('2008-01-01','2017-10-01'))
plt.title('Page Views on English Wikipedia (x 1,000,000)')
plt.suptitle('May 2015: a new pageview definition took effect, which eliminated all crawler traffic. Solid lines mark new definition.', y=0.04, color='#b22222');

This looks pretty good! However, we notice in the plot above that there appears to be some bad data in mid-2016.

Data Cleaning

Let's see if we can figure out what's going on here. We display the data for all of 2016 below to look for bad values.

df_new.loc[(pd.DatetimeIndex(yyyymm).year == 2016)]
year month pagecount_all_views pagecount_desktop_views pagecount_mobile_views pageview_all_views pageview_desktop_views pageview_mobile_views
96 2016 01 9.309261e+09 5.569633e+09 3.739629e+09 8.154016e+09 4.436179e+09 3.717837e+09
97 2016 02 8.680941e+09 5.347709e+09 3.333231e+09 7.585859e+09 4.250997e+09 3.334862e+09
98 2016 03 8.827530e+09 5.407676e+09 3.419854e+09 7.673275e+09 4.286590e+09 3.386684e+09
99 2016 04 8.873621e+09 5.572235e+09 3.301385e+09 7.408148e+09 4.149384e+09 3.258764e+09
100 2016 05 8.748968e+09 5.330532e+09 3.418436e+09 7.586811e+09 4.191778e+09 3.395033e+09
101 2016 06 8.347711e+09 4.975092e+09 3.372618e+09 7.243631e+09 3.888840e+09 3.354791e+09
102 2016 07 8.864628e+09 5.363966e+09 3.500661e+09 7.834440e+09 4.337866e+09 3.496574e+09
103 2016 08 1.393717e+09 9.136759e+08 4.800413e+08 8.210866e+09 4.695046e+09 3.515819e+09
104 2016 09 NaN NaN NaN 7.528292e+09 4.135006e+09 3.393286e+09
105 2016 10 NaN NaN NaN 7.871022e+09 4.361738e+09 3.509284e+09
106 2016 11 NaN NaN NaN 7.983113e+09 4.392068e+09 3.591045e+09
107 2016 12 NaN NaN NaN 7.986152e+09 4.209609e+09 3.776544e+09

It looks like row 103 (August 2016) has much lower values in the three Pagecounts columns. This makes sense: the Legacy Pagecounts series was retired in early August 2016, so that month contains only a few days of data. The month appears in our results at all only because we set the API end date to the first day of the following month, as instructed.
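
The same row can be flagged programmatically with a crude screen for partial-month artifacts (a hypothetical sketch, using the July and August totals from the table above):

```python
import pandas as pd

# Flag any month whose pagecount total falls below half of the previous
# month's total, a rough indicator of a partial month of data.
totals = pd.Series([8.864628e9, 1.393717e9], index=['201607', '201608'])
suspect = totals < 0.5 * totals.shift(1)
print(suspect[suspect].index.tolist())  # ['201608']
```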

Thus, we set these values to NaN so they won't clutter the analysis.

# update values
df_new.loc[103,['pagecount_all_views', 'pagecount_desktop_views', 'pagecount_mobile_views']] = None
df_new.iloc[100:106]
year month pagecount_all_views pagecount_desktop_views pagecount_mobile_views pageview_all_views pageview_desktop_views pageview_mobile_views
100 2016 05 8.748968e+09 5.330532e+09 3.418436e+09 7.586811e+09 4.191778e+09 3.395033e+09
101 2016 06 8.347711e+09 4.975092e+09 3.372618e+09 7.243631e+09 3.888840e+09 3.354791e+09
102 2016 07 8.864628e+09 5.363966e+09 3.500661e+09 7.834440e+09 4.337866e+09 3.496574e+09
103 2016 08 NaN NaN NaN 8.210866e+09 4.695046e+09 3.515819e+09
104 2016 09 NaN NaN NaN 7.528292e+09 4.135006e+09 3.393286e+09
105 2016 10 NaN NaN NaN 7.871022e+09 4.361738e+09 3.509284e+09

That looks better. Let's plot again to see how it looks graphically.

plt.rcParams["figure.figsize"] = (14, 4)
plt.plot(yyyymm, df_new['pagecount_desktop_views']/1e6, 'g--')
plt.plot(yyyymm, df_new['pagecount_mobile_views']/1e6, 'b--')
plt.plot(yyyymm, df_new['pagecount_all_views']/1e6, 'k--')
plt.legend(['main site','mobile site','total'], loc=2, framealpha=1)
plt.plot(yyyymm, df_new['pageview_desktop_views']/1e6, 'g-')
plt.plot(yyyymm, df_new['pageview_mobile_views']/1e6, 'b-')
plt.plot(yyyymm, df_new['pageview_all_views']/1e6, 'k-')
plt.grid(True)
plt.ylim((0,12000))
plt.xlim(('2008-01-01','2017-10-01'))
plt.title('Page Views on English Wikipedia (x 1,000,000)')
plt.suptitle('May 2015: a new pageview definition took effect, which eliminated all crawler traffic. Solid lines mark new definition.', y=0.04, color='#b22222')
plt.savefig('en-wikipedia_traffic_200801-201709.png', dpi=80);

That looks better! It seems we have successfully reproduced the plot found here: https://wiki.communitydata.cc/upload/a/a8/PlotPageviewsEN_overlap.png

Data Export, Attempt 2

Finally, we'll clean up the data (in the same way as before) and output the final data file to .csv.

# data formatting updates
df_new = df_new.fillna(0)
df_new.iloc[:,1:] = df_new.iloc[:,1:].astype(int)

# set filename for combined data CSV
filename = 'data/en-wikipedia_traffic_200801-201709.csv'

# check if file already exists; load if so, create if not
if os.path.isfile(filename):
    df_new = pd.read_csv(filename)
    print('loaded CSV data from ./' + filename)
else:
    df_new.to_csv(filename, index=False)
    print('saved CSV data to ./' + filename)
saved CSV data to ./data/en-wikipedia_traffic_200801-201709.csv

This concludes the analysis.