Lab 1 - Revision Histories

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the first of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will use a combination of MySQL and Python to retrieve the revision histories for a selection of articles. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm basic Python commands work

a = 1
b = 2
a+b
3
list(range(0,5))
[0, 1, 2, 3, 4]
your_name = 'Brian Keegan'
print("Hello, {0}".format(your_name))
Hello, Brian Keegan

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize the results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Setup the code environment to use plots with a white background and DataFrames show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Setup a connection to the English Wikipedia database

Retrieve the host name, username, and (encrypted) password associated with your account so you can login to the database.

host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']

Connect to the system using the credentials from above, then select the database for the English Wikipedia.

conn = pymysql.connect(host=host,user=user,password=password,database='enwiki_p',connect_timeout=3600)
conn.cursor().execute('use enwiki_p');

What tables are available for use?

pd.read_sql_query('show tables',conn)['Tables_in_enwiki_p'].values
array(['abuse_filter', 'abuse_filter_action', 'abuse_filter_log',
       'aft_article_answer', 'aft_article_answer_text',
       'aft_article_feedback', 'aft_article_feedback_properties',
       'aft_article_feedback_ratings_rollup',
       'aft_article_feedback_select_rollup', 'aft_article_field',
       'aft_article_field_group', 'aft_article_field_option',
       'aft_article_filter_count',
       'aft_article_revision_feedback_ratings_rollup',
       'aft_article_revision_feedback_select_rollup', 'archive',
       'archive_userindex', 'article_feedback', 'article_feedback_pages',
       'article_feedback_properties', 'article_feedback_ratings',
       'article_feedback_revisions', 'article_feedback_stats',
       'article_feedback_stats_types', 'category', 'categorylinks',
       'change_tag', 'ep_articles', 'ep_cas', 'ep_courses', 'ep_events',
       'ep_instructors', 'ep_oas', 'ep_orgs', 'ep_revisions',
       'ep_students', 'ep_users_per_course', 'externallinks',
       'filearchive', 'filearchive_userindex', 'flaggedimages',
       'flaggedpage_config', 'flaggedpage_pending', 'flaggedpages',
       'flaggedrevs', 'flaggedrevs_promote', 'flaggedrevs_statistics',
       'flaggedrevs_stats', 'flaggedrevs_stats2', 'flaggedrevs_tracking',
       'flaggedtemplates', 'geo_tags', 'global_block_whitelist',
       'hitcounter', 'image', 'imagelinks', 'interwiki', 'ipblocks',
       'ipblocks_ipindex', 'iwlinks', 'l10n_cache', 'langlinks',
       'localisation', 'localisation_file_hash', 'logging',
       'logging_logindex', 'logging_userindex', 'mark_as_helpful', 'math',
       'module_deps', 'msg_resource_links', 'oldimage',
       'oldimage_userindex', 'page', 'page_props', 'page_restrictions',
       'pagelinks', 'pagetriage_log', 'pagetriage_page',
       'pagetriage_page_tags', 'pagetriage_tags', 'pif_edits',
       'povwatch_log', 'povwatch_subscribers', 'protected_titles',
       'recentchanges', 'recentchanges_userindex', 'redirect', 'revision',
       'revision_userindex', 'site_identifiers', 'site_stats', 'sites',
       'tag_summary', 'templatelinks', 'transcode', 'updatelog', 'updates',
       'user', 'user_former_groups', 'user_groups', 'user_properties',
       'user_properties_anon', 'valid_tag', 'watchlist_count',
       'wbc_entity_usage', 'wikilove_image_log', 'wikilove_log'], dtype=object)
s = """
        SELECT 
            *
        FROM 
            revision
        ORDER BY
            rev_timestamp DESC
        LIMIT
            5
    """
    
pd.read_sql_query(s,conn)
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format
0 741667858 11543122 0 b'/* Regular season */http→https for [[Go... 28903366 b'Bender the Bot' b'20160928233641' 1 0 15497 715723257 b'1bo8olwdicg6jcwr2yurupifcabchrf' None None
1 741667859 5610945 0 b'/* The UE Red Warriors Basketball Team* */' 19591522 b'Joseroyal' b'20160928233641' 0 0 42547 741667337 b'fyi1fm7zf2326cy96f8q8uiayox6xw1' None None
2 741667856 51783769 0 b"[[WP:AES|\xe2\x86\x90]]Created page with '{{... 7852030 b'Materialscientist' b'20160928233639' 0 0 441 0 b'gcv1xewvmnlcf78043zkzgtw05s1gvc' None None
3 741667857 22459546 0 b'Reverted to revision 740844275 by [[Special:... 11951198 b'Gzuufy' b'20160928233639' 0 0 8589 741660215 b'ojput91puh0wuw1sawfjhwf52121k4x' None None
4 741667855 50989734 0 b'/* Episodes */' 0 b'2607:FCC8:F6C7:4300:E0A5:328C:32B9:5B4C' b'20160928233638' 0 0 6848 741521075 b'oe7dxm4f1eggdkoqjxa3gjoy46vjiu7' None None

Important note

Article page titles need to be separated by underscores "_" rather than spaces. It's easy to replace spaces with underscores using the replace method called on the string.

page_title = 'Tropical cyclone scales'.replace(' ','_')
page_title
'Tropical_cyclone_scales'

Get revisions for a single article

Write a function get_page_revisions that takes the name of a Wikipedia page and its namespace and uses the database connection to return a pandas DataFrame containing all the revisions for the article. Some of the fields inside need to be cleaned up by converting them to other data types to help with analysis in later steps.

def get_page_revisions(page_title,namespace,conn):
    """ Takes a page title and returns its revision history.
      page_title = a string for the page title to get its revisions
      namespace = an int to get the revisions from a specific namespace
      conn = a database connection
      
    Returns:
      A pandas DataFrame containing all the article revisions and page information
    """
    # In case you pass a page title with spaces in it, replace the spaces with underscores
    page_title = page_title.replace(' ','_').encode('utf8').decode('latin1')

    # The MySQL query string used to retrieve the data. By line, it is
    ## selecting all columns 
    ## from the "revisions" table
    ## joining the "page" table on it
    ## using the page_id and rev_page columns as keys
    ## limiting the results to entries that occur in the namespace and pagetitle
    ## and ordering the results by the timestamp
    s = """
            SELECT 
                *
            FROM 
                revision 
            JOIN 
                page ON page.page_id = revision.rev_page
            WHERE 
                page.page_namespace = {1} AND page.page_title = '{0}' 
            ORDER BY 
                revision.rev_timestamp ASC
        """.format(page_title,namespace)
    
    # Use the connection to run the query and return the results as a DataFrame
    _df = pd.read_sql_query(s,conn)

    # Some of the results have a "bytestring" format
    byte_columns = ['rev_comment','rev_user_text','rev_timestamp','rev_sha1','page_content_model',
                    'page_title','page_restrictions','page_touched','page_links_updated']
    
    # For each column, convert it from bytestring to a utf8 string
    for col in byte_columns:
        _df[col] = _df[col].str.decode('utf8')

    # Several of the columns are timestamps. Convert to datetimes
    _df['rev_timestamp'] = _df['rev_timestamp'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    _df['page_touched'] = _df['page_touched'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    _df['page_links_updated'] = _df['page_links_updated'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    
    # Return the data
    return _df

Use the function to get the revisions for a single article and use the .head() method to inspect the first 5 rows in the DataFrame.

conn.ping()
conn.cursor().execute('use enwiki_p')

page_title = 'Tropical_cyclone_scales'
rev_df = get_page_revisions(page_title,0,conn)
rev_df.head()
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model
0 29973485 3306341 0 stub, w/ graph from the article on the America... 21860 Jerzy 2005-12-03 01:49:37 0 0 943 0 scdznf23xx3nfzcpv4fx9mfjama29j9 None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext
1 30995467 3306341 0 161837 Jdorje 2005-12-12 01:37:31 0 0 976 29973485 beo9jm3t43zidq76dvk5xyf9r3synt9 None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext
2 30995486 3306341 0 fix cat 161837 Jdorje 2005-12-12 01:37:46 0 0 985 30995467 pfdknw8c4qu3lpr7ekcegmweu9x475q None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext
3 36222687 3306341 0 527442 Jamie C 2006-01-22 14:59:02 0 0 977 30995486 rr65xfscgspggur5oluev35ri6201i8 None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext
4 44683941 3306341 0 remove duplicate link 161837 Jdorje 2006-03-20 17:24:21 0 0 988 36222687 oic55j31hpmbbgo6vxk8in14hpp6khb None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext

Count the total number of revisions in the article.

len(rev_df)
585

Count the total number of unique users contributing to the article.

len(rev_df['rev_user_text'].unique())
189
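
As a quick follow-up, here is a minimal sketch relying on the MediaWiki convention that anonymous edits are stored with a "rev_user" of 0 (the IP address is then recorded in "rev_user_text"). It counts how many revisions came from anonymous versus registered editors.

# Revisions with rev_user equal to 0 were made anonymously; rev_user_text then holds the IP address
anon_revs = (rev_df['rev_user'] == 0).sum()
registered_revs = (rev_df['rev_user'] != 0).sum()
print("Anonymous revisions: {0}, registered revisions: {1}".format(anon_revs, registered_revs))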

When was the first revision made to the article?

rev_df['rev_timestamp'].min()
Timestamp('2005-12-03 01:49:37')

When was the most recent revision made to the article?

rev_df['rev_timestamp'].max()
Timestamp('2016-09-15 03:40:39')
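
Combining the two values above, a small sketch of how many days elapsed between the article's first and most recent revisions. It uses the same timedelta-division trick applied later in the notebook.

# Subtracting the timestamps yields a Timedelta; divide by a one-day timedelta to get a float number of days
(rev_df['rev_timestamp'].max() - rev_df['rev_timestamp'].min()) / np.timedelta64(1,'D')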

How has the size of the article changed over time?

ax = rev_df['rev_len'].plot()
ax.set_xlabel('Revisions')
ax.set_ylabel('Article size (bytes)')
<matplotlib.text.Text at 0x7fb0274dfdd8>

Who made the largest change to the article?

# Use the idxmax() method to return the index value for the largest value in the series
loc_max_revision = rev_df['rev_len'].idxmax()

# Show the values of a single row
rev_df.ix[loc_max_revision]
rev_id                                      562993276
rev_page                                      3306341
rev_text_id                                         0
rev_comment           /* Atlantic and East Pacific */
rev_user                                     19294761
rev_user_text                      Marco Antonio 2345
rev_timestamp                     2013-07-05 16:28:58
rev_minor_edit                                      1
rev_deleted                                         0
rev_len                                         37427
rev_parent_id                               562988593
rev_sha1              rrtus56k7qelymhne52a1ndxyx2shky
rev_content_model                                None
rev_content_format                               None
page_id                                       3306341
page_namespace                                      0
page_title                    Tropical_cyclone_scales
page_restrictions                                    
page_counter                                        0
page_is_redirect                                    0
page_is_new                                         0
page_random                                  0.397351
page_touched                      2016-10-02 01:25:03
page_links_updated                2016-10-01 01:25:06
page_latest                                 739511023
page_len                                        31229
page_content_model                           wikitext
Name: 447, dtype: object

Go inspect what this largest revision was. Be warned: for some articles the largest single revision is vandalism and may be extremely NSFW.

rev_id = rev_df.ix[loc_max_revision,'rev_id']
print('https://en.wikipedia.org/w/index.php?oldid={0}'.format(rev_id))
https://en.wikipedia.org/w/index.php?oldid=562993276

Group revisions by date

Multiple revisions can be made in a single day. We want to be able to count the number of revisions per day and see how the article's length has changed over time.

# For each timestamp extract only its date (ignoring the time of day) 
# and save as a new column "rev_date"
rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())

# Compare the full "rev_timestamp" to the new "rev_date"
rev_df[['rev_timestamp','rev_date']].head()
rev_timestamp rev_date
0 2005-12-03 01:49:37 2005-12-03
1 2005-12-12 01:37:31 2005-12-12
2 2005-12-12 01:37:46 2005-12-12
3 2006-01-22 14:59:02 2006-01-22
4 2006-03-20 17:24:21 2006-03-20

What dates had the most revisions in a single day? A simple way of doing this is to count the number of times each "rev_date" occurs in the data with the value_counts method. Let's look at the top 10, six of which fall in late 2008 and early 2009.

rev_df['rev_date'].value_counts().head(10)
2008-12-23    32
2015-11-07    28
2009-01-16    18
2008-12-26    17
2012-07-26    13
2008-11-29    12
2013-06-09    10
2009-01-18     8
2009-07-28     7
2008-10-26     7
Name: rev_date, dtype: int64

We can use the groupby functionality to create "baby" DataFrames, each containing all the rows that share a particular value. For example, if we group by date, each baby DataFrame contains all the revisions that occurred on that date. Store the groupby in the "rev_gb_date" object for subsequent steps.

Inspect a single baby DataFrame by passing the key (the date values we grouped on) to the get_group method and look at the first 5 entries.

rev_gb_date = rev_df.groupby(['rev_date'])

_date = rev_df['rev_date'].value_counts().idxmax()
rev_gb_date.get_group(_date).head(5)
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date
127 259661224 3306341 0 /* Atlantic and East Pacific */ 4761569 Nergaal 2008-12-23 04:29:20 0 0 14410 259445228 mndul3pu9aal2nkqfdmqvrd7pb395t4 None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext 2008-12-23
128 259661394 3306341 0 /* Atlantic and East Pacific */ 4761569 Nergaal 2008-12-23 04:30:43 0 0 14432 259661224 18ipu81zflzw6j96gyxohmj6xsgn977 None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext 2008-12-23
129 259662244 3306341 0 /* Atlantic and East Pacific */ 4761569 Nergaal 2008-12-23 04:37:39 0 0 14596 259661394 cs6571msogjxbdorf2p9f4lapzo8vh7 None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext 2008-12-23
130 259662439 3306341 0 4761569 Nergaal 2008-12-23 04:39:14 0 0 14606 259662244 kp2orusyezm2fd0nzvo0f6kkuyptqqv None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext 2008-12-23
131 259662556 3306341 0 4761569 Nergaal 2008-12-23 04:40:10 0 0 14616 259662439 4mv27nr47gwznocn8w9ikc2toqbcvqd None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext 2008-12-23

Compute unique revisions per day

We can also use aggregate functionality to perform computations efficiently across all the baby DataFrames.

daily_revs = rev_gb_date.agg({'rev_id':pd.Series.nunique})
daily_revs.head() 
rev_id
rev_date
2005-12-03 1
2005-12-12 2
2006-01-22 1
2006-03-20 3
2006-03-21 1

Reindex the axis to make it continuous over the date range.

# Compute the first and last days revisions were observed
_min, _max = daily_revs.index.min(), daily_revs.index.max()

# Create a new index that is continuous over the whole date range
_ix = pd.date_range(_min,_max)

# Reindex the daily revisions over the continuous date range filling in the blanks with 0s
daily_revs_reindexed = daily_revs.reindex(_ix).fillna(0)
daily_revs_reindexed.head()
rev_id
2005-12-03 1.0
2005-12-04 0.0
2005-12-05 0.0
2005-12-06 0.0
2005-12-07 0.0
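
A quick sanity check on the reindexing (a sketch): the reindexed series should contain one row for every calendar day between the first and last revision, inclusive.

# The reindexed series should span every day in the range, including both endpoints
len(daily_revs_reindexed) == (_max - _min).days + 1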

What was the most active day of editing? Here's another way of returning the same values.

daily_revs_reindexed['rev_id'].sort_values(ascending=False).head(5)
2008-12-23    32.0
2015-11-07    28.0
2009-01-16    18.0
2008-12-26    17.0
2012-07-26    13.0
Name: rev_id, dtype: float64
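
Yet another equivalent approach (a sketch) uses the nlargest method, which behaves like sorting in descending order and taking the head.

# nlargest returns the n largest values of the series, already sorted
daily_revs_reindexed['rev_id'].nlargest(5)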

Plot the number of revisions by day.

ax = daily_revs_reindexed['rev_id'].plot(lw=.5)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fb025195438>

This is pretty noisy. We can clean it up by applying a rolling mean over a 60-day window.

smoothed_s = daily_revs_reindexed['rev_id'].rolling(60).mean()
ax = smoothed_s.plot(lw=3)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fb027b5da20>

Plot the cumulative sum of the number of revisions over time using the cumsum method.

ax = daily_revs_reindexed['rev_id'].cumsum().plot(lw=2)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative Revisions')
<matplotlib.text.Text at 0x7fb027aa88d0>

Compute article length by day

Using the same "rev_gb_date" object, aggregate each of the baby DataFrames and compute the median article size ("rev_len"). Reindex again to make a continuous time axis.

daily_length = rev_gb_date.agg({'rev_len':np.median})

# Reindex
daily_length_reindexed = daily_length.reindex(_ix).fillna(method='ffill')
daily_length_reindexed.head()
rev_len
2005-12-03 943.0
2005-12-04 943.0
2005-12-05 943.0
2005-12-06 943.0
2005-12-07 943.0

Plot the reindexed series, making the y-axis be on a log scale to pick up detail in the early days when the article was orders of magnitude smaller.

ax = daily_length_reindexed['rev_len'].plot(lw=2,logy=True)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Article size  (bytes)')
<matplotlib.text.Text at 0x7fb0279ee668>

Compute diffs on time and article size

The "rev_len" only records the length of the revision at that point in time. By taking the difference with the previous observation, we can compute how much net content the user added or removed from the previous revision. Using the diff method on the "rev_len" series, we store the resulting differences as "rev_len_diff".

Compare the first five elements of the two series to verify it's worked as intended.

# Compute the diff and store as a new series
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()
rev_len rev_len_diff
0 943 NaN
1 976 33.0
2 985 9.0
3 977 -8.0
4 988 11.0

Do some more cleanup of the diffs. In particular, when a revision is deleted (which is rare) the "rev_len" value is null, which makes the diff null as well. Forward-fill the null "rev_len" values, recompute the diffs, and then set the first revision's "rev_len_diff" equal to the size of the article.

# Set the revision lengths for the null rows to be the previous value
rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')

# Compute the diffs again 
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# Set the value for the first rev_len_diff to be the size of the rev_len
rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']

# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()
rev_len rev_len_diff
0 943 943.0
1 976 33.0
2 985 9.0
3 977 -8.0
4 988 11.0
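
As a sanity check (a sketch), confirm that no null values remain in either column after the forward-fill and the first-row fix.

# Count any remaining nulls in the cleaned-up columns; both counts should be 0
rev_df[['rev_len','rev_len_diff']].isnull().sum()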

Similarly, the "rev_timestamp" series only records the time when the revision occurred. By taking the difference with the previous revision, we can compute how much time has elapsed since the last edit. Using the diff method again on the "rev_timestamp" series, we store the resulting differences as "rev_timestamp_diff".

The resulting values are Timedeltas, which we convert back to floating-point numbers of days by dividing by a one-day np.timedelta64.

Compare the first five elements of the two series to verify it's worked as intended.

# Compute the diff and store as a new series
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

# Convert the values of the new series to floats by dividing by 1 day time delta.
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))

# Verify the differences are accurate.
rev_df[['rev_timestamp','rev_timestamp_diff']].head()
rev_timestamp rev_timestamp_diff
0 2005-12-03 01:49:37 NaN
1 2005-12-12 01:37:31 8.991597
2 2005-12-12 01:37:46 0.000174
3 2006-01-22 14:59:02 41.556435
4 2006-03-20 17:24:21 57.100914
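
A small sketch summarizing the new series: the median time, in days, between successive edits to the article.

# Median number of days between successive revisions (the NaN for the first revision is ignored by median)
rev_df['rev_timestamp_diff'].median()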

Combine all the computations into one daily DataFrame

Because we added new series to the "rev_df" we need to re-run the groupby to update it. We can also combine all the different aggregate functions we did before into one step by defining an "agg_dict" that applies an aggregation function over the values in a specific column. This will return a DataFrame with only a subset of columns with the aggregated values we specified.

# Groupby again
rev_gb_date = rev_df.groupby(['rev_date'])

# Define an aggregation mapping of functions to column names
agg_dict_date = {'rev_id':pd.Series.nunique,
                  'rev_len':np.median,
                  'rev_len_diff':np.median,
                  'rev_timestamp_diff':np.mean
                 }

# Execute the aggregation functions on the groupby object
rev_agg_date = rev_gb_date.agg(agg_dict_date)

# Reindex to continuous dates
rev_agg_date = rev_agg_date.reindex(_ix)

# Fill in the resulting values of the reindexed and aggregated "rev_len" from the last value
rev_agg_date[['rev_len']] = rev_agg_date[['rev_len']].fillna(method='ffill')

# Fill in the resulting values of the reindexed and aggregated other series as 0s.
rev_agg_date[['rev_id','rev_len_diff']] = rev_agg_date[['rev_id','rev_len_diff']].fillna(0)
#rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']] = rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']].fillna(0)

# Show the first 10 values to verify it's worked as intended
rev_agg_date.tail(10)
rev_id rev_timestamp_diff rev_len rev_len_diff
2016-09-06 0.0 NaN 31214.0 0.0
2016-09-07 0.0 NaN 31214.0 0.0
2016-09-08 0.0 NaN 31214.0 0.0
2016-09-09 0.0 NaN 31214.0 0.0
2016-09-10 0.0 NaN 31214.0 0.0
2016-09-11 0.0 NaN 31214.0 0.0
2016-09-12 0.0 NaN 31214.0 0.0
2016-09-13 0.0 NaN 31214.0 0.0
2016-09-14 0.0 NaN 31214.0 0.0
2016-09-15 2.0 31.380874 31222.5 8.0

Plot the changes in the average time since the last revision by day.

_s = rev_agg_date['rev_timestamp_diff']
ax = _s.plot(logy=True)
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')
<matplotlib.text.Text at 0x7fb02786fbe0>

That's pretty noisy. We can apply the rolling method over a 60-day window and get the mean to smooth out some of the noise.

_s = rev_agg_date['rev_timestamp_diff'].fillna(0).rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')
<matplotlib.text.Text at 0x7fb02750ec18>

Similarly, what was the median size of a change to the article over time?

_s = rev_agg_date['rev_len_diff']
ax = _s.plot(lw=.5)
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')
ax.set_yscale('symlog')

Again, pretty noisy. We can cleanup with a rolling mean over a 60-day window.

_s = rev_agg_date['rev_len_diff'].rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')
<matplotlib.text.Text at 0x7fb025069668>

Top editors on the article

Which editors made the most contributions to the article?

top_users = rev_df['rev_user_text'].value_counts()
top_users.head(10)
Jason Rees        189
Titoxd             43
Nergaal            33
Thegreatdr         10
Meow               10
200.104.70.140      9
ClueBot NG          8
DocWatson42         7
Dustin V. S.        6
Momoko              6
Name: rev_user_text, dtype: int64

This extremely skewed distribution of activity is characteristic of many online participation patterns: the "long tail" in which fewer than 20% of users are responsible for more than 80% of the contributions.

rev_count_occurrences = top_users.value_counts().reset_index()
rev_count_occurrences = rev_count_occurrences.rename(columns={'index':'Revision count','rev_user_text':'Occurrences'})
rev_count_occurrences.head()
Revision count Occurrences
0 1 129
1 2 29
2 3 10
3 4 8
4 5 3

Plot the resulting distribution on log-log axes to see the shape of the long-tail distribution.

rev_count_occurrences.plot(x='Revision count',y='Occurrences',kind='scatter',logy=True,logx=True,s=50)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb024fe52b0>
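
To put a rough number on how concentrated the activity is (a sketch using the top_users series computed above), compute the share of all revisions made by the most active 20% of editors.

# top_users is already sorted in descending order of revision counts
top_20pct_n = int(np.ceil(len(top_users) * 0.2))

# Fraction of all revisions made by the most active 20% of editors
top_users.head(top_20pct_n).sum() / top_users.sum()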

Group revisions by users

We can also use the groupby-aggregate framework to group all the revisions together by the users who made them, rather than grouping revisions by the date as we did before.

rev_gb_user = rev_df.groupby(['rev_user_text'])

Look at a baby DataFrame within the groupby for a single user. These are all the revisions made by this user to this article.

rev_gb_user.get_group(top_users.idxmax()).head()
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date rev_len_diff rev_timestamp_diff
91 229291630 3306341 0 /* Comparisons across basins */ Changed SPAC S... 1977478 Jason Rees 2008-08-01 20:40:10 0 0 12291 227941364 l2b4p0jrilvr049tg0e4nz3p2x4ttq0 None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext 2008-08-01 152.0 6.758160
101 247675948 3306341 0 /* Atlantic and East Pacific */ Rewrote Top Pa... 1977478 Jason Rees 2008-10-26 00:12:23 0 0 12506 247515738 dwy135fs46o95kx23jtbpal34ciyoud None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext 2008-10-26 65.0 0.909826
102 247691080 3306341 0 Rewrote Australian Scale 1977478 Jason Rees 2008-10-26 01:50:29 0 0 11658 247675948 1axs68nw0va6kwm7ipv2thkjuh4a5zv None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext 2008-10-26 -848.0 0.068125
103 247692184 3306341 0 /* Australian Tropical Cyclone Intensity Scale */ 1977478 Jason Rees 2008-10-26 01:58:18 0 0 11659 247691080 d0hxx9asl2222ujy40m8st46lph1wa8 None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext 2008-10-26 1.0 0.005428
104 247692506 3306341 0 /* Australian Tropical Cyclone Intensity Scale */ 1977478 Jason Rees 2008-10-26 02:00:43 0 0 11664 247692184 qweq5ye3klkz78irb6cpbfw2xw9niop None None 3306341 0 Tropical_cyclone_scales 0 0 0 0.397351 2016-10-02 01:25:03 2016-10-01 01:25:06 739511023 31229 wikitext 2008-10-26 5.0 0.001678

As before we can aggregate all the information in these baby DFs into a user-level dataframe that summarizes different attributes.

# Define the aggregation function mappings to variables.
# We can apply multiple functions to the same column/series by nesting a second dictionary within
agg_dict_user = {'rev_id':{'revs_made':pd.Series.nunique},
                  'rev_timestamp':{'first_rev':np.min,'last_rev':np.max},
                  'rev_len_diff':{'rev_len_min':np.min,'rev_len_max':np.max,
                                  'rev_len_med':np.median,'rev_len_sum':np.sum},
                  'rev_timestamp_diff':{'rev_ts_min':np.min,'rev_ts_max':np.max,
                                        'rev_ts_avg':np.mean}
                 }

# Perform the aggregation
rev_agg_user = rev_gb_user.agg(agg_dict_user)

# Returns a multi-index column, simplify by dropping the top level
rev_agg_user.columns = rev_agg_user.columns.droplevel(0)

# Sort by number of revisions, and look at the first 5 rows
rev_agg_user.sort_values('revs_made',ascending=False).head()
revs_made rev_ts_max rev_ts_min rev_ts_avg first_rev last_rev rev_len_min rev_len_max rev_len_sum rev_len_med
rev_user_text
Jason Rees 189 120.757708 0.000162 2.530478 2008-08-01 20:40:10 2016-05-06 10:21:55 -2979.0 1740.0 6309.0 4.0
Titoxd 43 86.782894 0.000069 6.113390 2006-03-27 05:29:33 2011-08-16 01:06:20 -1967.0 6413.0 11371.0 27.0
Nergaal 33 45.360683 0.000150 2.537751 2008-10-25 00:50:39 2009-07-24 23:31:56 -223.0 335.0 1477.0 10.0
Thegreatdr 10 8.614398 0.000255 1.210505 2007-03-18 11:58:44 2009-03-17 15:30:26 -741.0 2541.0 2770.0 26.5
Meow 10 46.822604 0.001065 11.862049 2011-08-17 10:52:08 2016-05-18 07:47:12 -2.0 555.0 1214.0 15.0

Define some additional user-level statistics based on this aggregated data.

Tenure on article

# Compute tenure as difference between last and first revision, convert back to 1-day floats by dividing by timedelta64
rev_agg_user['tenure'] = (rev_agg_user['last_rev'] - rev_agg_user['first_rev'])/np.timedelta64(1,'D')

# Round values up to whole integers
rev_agg_user['tenure'] = rev_agg_user['tenure'].apply(np.ceil)

# Look at the longest-tenure editors
rev_agg_user['tenure'].sort_values(ascending=False).head()
rev_user_text
DocWatson42    3497.0
Jason Rees     2835.0
AySz88         1990.0
Titoxd         1968.0
Meow           1736.0
Name: tenure, dtype: float64
# Plot a histogram of editor tenures; use a log-scaled y-axis since most editors have very short tenures
ax = rev_agg_user['tenure'].plot('hist',logy=True,bins=25)
ax.set_xlabel('Tenure (Days)')
ax.set_ylabel('Number of editors')
<matplotlib.text.Text at 0x7fb024d25710>
# Use logarithmically-spaced bins to better see the structure of the skewed tenure distribution
logspace_bins = np.logspace(start=0, stop=4, num=25)
ax = rev_agg_user['tenure'].plot('hist',logy=True,logx=True,bins=logspace_bins)
ax.set_xlabel('Tenure (Days)')
ax.set_ylabel('Number of editors')
<matplotlib.text.Text at 0x7fb024ba6f60>
# Plot the relationship between an editor's tenure on the article and the number of revisions they made
g = sb.lmplot(x='tenure',y='revs_made',data=rev_agg_user.query('tenure > 1'),aspect=1.5)
ax = g.axes[0][0]
ax.set_xlim((0,1e4))
ax.set_ylim((0,1e4))
ax.set_xscale('symlog')
ax.set_yscale('symlog')

Get user info

Define a function get_user_info that takes a revision DataFrame containing a "rev_user" column/series. For all the users in the revision history, it gets their user information such as registration date and current edit count.

 
def get_user_info(rev_df,conn):
    if 'rev_user' in rev_df.columns:
        
        # Get all the user IDs from the revision DF
        _ids = rev_df['rev_user'].unique().tolist() 
        
        # Remove the 0s and convert the rest to strings
        _ids = [str(i) for i in _ids if i != 0 and not np.isnan(i)]
        
        # Join the list of strings into a big comma-separated string
        user_string = ', '.join(_ids)
        
        # Write the query
        ## Select specific columns
        ## from the user table
        ## where the user_ids are among the users revising the article and have more than 0 edits
        s = """
                SELECT 
                    user_id, user_name, user_registration, user_editcount
                FROM 
                    user
                WHERE
                    user_id IN ({0}) AND user_editcount > 0
                ORDER BY
                    user_editcount DESC
            """.format(user_string)

        # Run the query
        _df = pd.read_sql_query(s,conn)
        
        # Some of the results have a "bytestring" format
        byte_columns = ['user_name','user_registration']
    
        # For each column, convert it from bytestring to a utf8 string
        for col in byte_columns:
            _df[col] = _df[col].str.decode('utf8')

        # Some of the values are missing, recode them as Jan 1, 2000
        _df['user_registration'] = _df['user_registration'].replace({None:'20000101000000'})
        
        # Convert timestamps to datetimes
        _df['user_registration'] = _df['user_registration'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
        
        # Return the results
        return _df

Run the get_user_info function on the "rev_df" DataFrame to get information about each of the users who contributed to the article.

# Refresh the database connection since too much time has passed since the last query
conn.ping()
conn.cursor().execute('use enwiki_p');

# Get the user information.
rev_user_info_df = get_user_info(rev_df,conn)
rev_user_info_df.head()
user_id user_name user_registration user_editcount
0 13286072 ClueBot NG 2010-10-20 17:01:19 4108393
1 433328 SmackBot 2005-09-13 22:47:59 3734324
2 6569922 Addbot 2008-03-01 21:58:03 2838809
3 7611264 AnomieBOT 2008-08-07 17:09:33 2564270
4 4928500 ClueBot 2007-07-23 18:45:42 1596818

Merge the "rev_agg_user" and the "rev_user_info_df" DataFrames together by joining on the user name.

# Merge the two DataFrames together using the index of "rev_agg_user"
# and the "user_name" field from "rev_user_info_df", using a left join to keep every editor in "rev_agg_user"
merged_rev_agg_user = pd.merge(rev_agg_user,rev_user_info_df,left_index=True,right_on='user_name',how='left')

# Check if the two DataFrames have the same number of users, if False then there will be NaNs
print(len(rev_agg_user) == len(rev_user_info_df))

# Chris discovered this exception: some users edited articles but have 0 user_editcount in the user table
# We should just delete them!
#bad_users = merged_rev_agg_user[merged_rev_agg_user['user_editcount'] == 0]

# Show the first five rows
merged_rev_agg_user.head()
False
revs_made rev_ts_max rev_ts_min rev_ts_avg first_rev last_rev rev_len_min rev_len_max rev_len_sum rev_len_med tenure user_id user_name user_registration user_editcount
126 1 28.716447 28.716447 28.716447 2011-12-08 09:48:19 2011-12-08 09:48:19 34.0 34.0 34.0 34.0 0.0 NaN 109.154.156.244 NaT NaN
126 1 5.931655 5.931655 5.931655 2014-12-13 08:17:28 2014-12-13 08:17:28 -9.0 -9.0 -9.0 -9.0 0.0 NaN 114.45.36.220 NaT NaN
126 1 7.573414 7.573414 7.573414 2008-09-09 07:45:58 2008-09-09 07:45:58 18.0 18.0 18.0 18.0 0.0 NaN 12.210.198.245 NaT NaN
126 1 5.650880 5.650880 5.650880 2008-02-13 03:40:13 2008-02-13 03:40:13 -8277.0 -8277.0 -8277.0 -8277.0 0.0 NaN 122.104.34.87 NaT NaN
126 1 13.054074 13.054074 13.054074 2007-10-13 09:55:38 2007-10-13 09:55:38 0.0 0.0 0.0 0.0 0.0 NaN 122.53.89.101 NaT NaN
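
A small sketch counting how many of the merged rows have no matching entry in the user table. These are mostly anonymous IP editors, which is why the length check above printed False.

# Rows with a null user_id had no match in the user table (mostly IP editors, plus any accounts filtered out above)
merged_rev_agg_user['user_id'].isnull().sum()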

What fraction of the user's total edits are devoted to this single article?

merged_rev_agg_user['rev_fraction'] = (merged_rev_agg_user['revs_made']/merged_rev_agg_user['user_editcount']).fillna(.1)

# Plot the distribution. Use log-bins since the data is so skewed
bin_min = np.round(np.log10(merged_rev_agg_user['rev_fraction'].min()))
bin_max = np.round(np.log10(merged_rev_agg_user['rev_fraction'].max()))
logspace_bins = np.logspace(start=bin_min-1, stop=bin_max+1, num=25)
ax = merged_rev_agg_user['rev_fraction'].plot('hist',logy=False,logx=True,bins=logspace_bins)
ax.set_xlabel('Fraction of total edits')
ax.set_ylabel('Count')
<matplotlib.text.Text at 0x7fb027d839e8>

How long after a user's account registration did they make their first edit to this article?

merged_rev_agg_user['first_rev_account_age'] = merged_rev_agg_user['first_rev'] - merged_rev_agg_user['user_registration']
merged_rev_agg_user['first_rev_account_age'] = merged_rev_agg_user['first_rev_account_age']/np.timedelta64(1,'D')

# Plot the distribution. Use log-bins since the data is so skewed
logspace_bins = np.logspace(start=-1, stop=4, num=25)
ax = merged_rev_agg_user['first_rev_account_age'].plot('hist',logy=False,logx=True,bins=logspace_bins)
ax.set_xlabel('Account age at first article revision')
ax.set_ylabel('Count')
<matplotlib.text.Text at 0x7fb0244e1828>

Do users editing the article earlier in their career end up contributing more to the article?

g = sb.lmplot(x='first_rev_account_age',y='revs_made',data=merged_rev_agg_user.query('tenure > 1'),aspect=1.5)
ax = g.axes[0][0]
ax.set_xlim((0,1e4))
ax.set_ylim((0,1e4))
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlabel('Account age at first article revision')
ax.set_ylabel('Total article revisions')
<matplotlib.text.Text at 0x7fb02443a860>
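
To put a rough number on the relationship in the plot (a sketch using pandas' built-in rank correlation), compute the Spearman correlation between account age at first article revision and total revisions made. A negative value would suggest that editors who find the article earlier in their careers go on to contribute more.

# Spearman rank correlation between account age at first revision and total revisions made
merged_rev_agg_user[['first_rev_account_age','revs_made']].corr(method='spearman')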

Compare two articles

# Refresh the database connection since too much time has passed since the last query
conn.ping()
conn.cursor().execute('use enwiki_p');

# Get the second set of revisions
rev2_df = get_page_revisions('Cyclone_Pam',0,conn)

Do the data cleanup and feature engineering for later steps. Define a function to do all the feature engineering steps with one call.

def data_cleanup(rev_df):
    # Compute the rev_date from the timestamps
    rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())

    # Set the revision lengths for the null rows to be the previous value
    rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')

    # Compute the diffs and store as a new series 
    rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

    # Set the value for the first rev_len_diff to be the size of the rev_len
    rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']

    # Compute the diffs and store as a new series
    rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

    # Convert the values of the new series to floats by dividing by 1 day time delta.
    rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))
    
    return rev_df

Run the cleanup on rev2_df and confirm it has all the variables (columns) we need.

rev2_df = data_cleanup(rev2_df)
rev2_df.head()
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date rev_len_diff rev_timestamp_diff
0 466800178 34132047 0 [[WP:AES|←]]Redirected page to [[1997-98 South... 1977478 Jason Rees 2011-12-20 02:44:40 0 0 71 0 ohfp8b9v13k0392n0up4h88750yhsal None None 34132047 0 Cyclone_Pam 0 0 0 0.079168 2016-10-01 06:20:52 2016-09-20 11:02:13 740326892 57032 wikitext 2011-12-20 71.0 NaN
1 466806469 34132047 0 Robot: Fixing double redirect to [[1997–98 Sou... 11952314 AvicBot 2011-12-20 03:41:18 1 0 73 466800178 r5u1gwisy7zg7ext5fdsy5wi8e9fi0x None None 34132047 0 Cyclone_Pam 0 0 0 0.079168 2016-10-01 06:20:52 2016-09-20 11:02:13 740326892 57032 wikitext 2011-12-20 2.0 0.039329
2 504117784 34132047 0 [[WP:AES|←]]Redirected page to [[Cyclone Pam (... 1977478 Jason Rees 2012-07-25 14:26:18 0 0 32 466806469 3f8qj6td6434doygmgk7ivguj27ot6e None None 34132047 0 Cyclone_Pam 0 0 0 0.079168 2016-10-01 06:20:52 2016-09-20 11:02:13 740326892 57032 wikitext 2012-07-25 -41.0 218.447917
3 505957705 34132047 0 Robot: Fixing double redirect to [[1997–98 Sou... 8066546 Xqbot 2012-08-05 20:35:02 1 0 52 504117784 5mu4a1vwgi0xjgnzebb747xsuwkf667 None None 34132047 0 Cyclone_Pam 0 0 0 0.079168 2016-10-01 06:20:52 2016-09-20 11:02:13 740326892 57032 wikitext 2012-08-05 20.0 11.256065
4 650791673 34132047 0 more likely redirect 379243 Hurricanehink 2015-03-10 18:21:52 1 0 80 505957705 nwsox5ewnnyasdztl7toska11cmslj6 None None 34132047 0 Cyclone_Pam 0 0 0 0.079168 2016-10-01 06:20:52 2016-09-20 11:02:13 740326892 57032 wikitext 2015-03-10 28.0 946.907523

Group by the "rev_date", compute the daily_revs, re-index to make the time continuous.

rev2_gb_date = rev2_df.groupby(['rev_date'])

daily_revs2 = rev2_gb_date.agg({'rev_id':pd.Series.nunique})

# Compute the first and last days revisions were observed
_min, _max = daily_revs2.index.min(), daily_revs2.index.max()

# Create a new index that is continuous over the whole date range
_ix = pd.date_range(_min,_max)

# Reindex the daily revisions over the continuous date range filling in the blanks with 0s
daily_revs2_reindexed = daily_revs2.reindex(_ix).fillna(0)

Combine both of the daily_revs series into a DataFrame by merging, renaming the columns, and filling the blank values with 0s.

daily_revs_compared_df = pd.merge(daily_revs_reindexed,daily_revs2_reindexed,
                                  how='outer',left_index=True,right_index=True)

# The left DF holds the Tropical_cyclone_scales revisions and the right DF holds the Cyclone_Pam revisions. Rename to that effect
article_name_1 = 'Tropical_cyclone_scales'
article_name_2 = 'Cyclone_Pam'
daily_revs_compared_df = daily_revs_compared_df.rename(columns={'rev_id_x':article_name_1,
                                                                'rev_id_y':article_name_2
                                                               })

# Fill the null values with 0s
daily_revs_compared_df.fillna(0,inplace=True)

# Look at the first 5 rows
daily_revs_compared_df.head()
Tropical_cyclone_scales Cyclone_Pam
2005-12-03 1.0 0.0
2005-12-04 0.0 0.0
2005-12-05 0.0 0.0
2005-12-06 0.0 0.0
2005-12-07 0.0 0.0
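
Before plotting, a quick sketch comparing overall activity: the total number of revisions made to each article over the combined date range.

# Sum each column to get the total revisions per article
daily_revs_compared_df.sum()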

Plot the comparison of both series in a single chart. Use a rolling average over 60 days to smooth out noise.

ax = daily_revs_compared_df.rolling(60).mean().plot(lw=3)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fb0242fa940>

Zoom in on a specific window of dates. Change these dates for your specific articles.

# Define two rough dates marking the start and stop of the window of interest
_min = pd.Timestamp('2004-01-01').date()
_max = pd.Timestamp('2016-10-04').date()

# Create a subset_df based on the window of dates
subset_df = daily_revs_compared_df.ix[_min:_max]

# Use a window of 28 days to compute the rolling average instead
ax = subset_df.rolling(28).mean().plot(lw=3)

# Draw a vertical lines for major inflection points
ax.axvline(pd.Timestamp('2010-01-01').date(),c='black',lw=1,ls='--')   # Midpoint from earliest stub to present
ax.axvline(pd.Timestamp('2011-12-20').date(),c='orange',lw=1,ls='--')  # 1997 Cyclone Pam article is redirected
ax.axvline(pd.Timestamp('2015-03-13').date(),c='red',lw=1,ls='--')     # 2015 Cyclone Pam makes landfall

ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fb01fafcb70>