Lab 1 - Revision Histories

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the first of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will use a combination of MySQL and Python to retrieve the revision histories for a selection of articles. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm basic Python commands work

a = 1
b = 2
a+b
3
list(range(0,5))
[0, 1, 2, 3, 4]
your_name = 'Brian Keegan'
print("Hello, {0}".format(your_name))
Hello, Brian Keegan

Import modules and setup environment

Load the libraries we'll need to connect to the database, retrieve information for analysis, and visualize the results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Setup the code environment to use plots with a white background and DataFrames show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Setup a connection to the English Wikipedia database

Retrieve the host name, username, and (encrypted) password associated with your account so you can log in to the database.

host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']

Connect to the system using the credentials from above, then select the database for the English Wikipedia.

conn = pymysql.connect(host=host,user=user,password=password,database='enwiki_p',connect_timeout=3600)
conn.cursor().execute('use enwiki_p');

What tables are available for use?

pd.read_sql_query('show tables',conn)['Tables_in_enwiki_p'].values
array(['abuse_filter', 'abuse_filter_action', 'abuse_filter_log',
       'aft_article_answer', 'aft_article_answer_text',
       'aft_article_feedback', 'aft_article_feedback_properties',
       'aft_article_feedback_ratings_rollup',
       'aft_article_feedback_select_rollup', 'aft_article_field',
       'aft_article_field_group', 'aft_article_field_option',
       'aft_article_filter_count',
       'aft_article_revision_feedback_ratings_rollup',
       'aft_article_revision_feedback_select_rollup', 'archive',
       'archive_userindex', 'article_feedback', 'article_feedback_pages',
       'article_feedback_properties', 'article_feedback_ratings',
       'article_feedback_revisions', 'article_feedback_stats',
       'article_feedback_stats_types', 'category', 'categorylinks',
       'change_tag', 'ep_articles', 'ep_cas', 'ep_courses', 'ep_events',
       'ep_instructors', 'ep_oas', 'ep_orgs', 'ep_revisions',
       'ep_students', 'ep_users_per_course', 'externallinks',
       'filearchive', 'filearchive_userindex', 'flaggedimages',
       'flaggedpage_config', 'flaggedpage_pending', 'flaggedpages',
       'flaggedrevs', 'flaggedrevs_promote', 'flaggedrevs_statistics',
       'flaggedrevs_stats', 'flaggedrevs_stats2', 'flaggedrevs_tracking',
       'flaggedtemplates', 'geo_tags', 'global_block_whitelist',
       'hitcounter', 'image', 'imagelinks', 'interwiki', 'ipblocks',
       'ipblocks_ipindex', 'iwlinks', 'l10n_cache', 'langlinks',
       'localisation', 'localisation_file_hash', 'logging',
       'logging_logindex', 'logging_userindex', 'mark_as_helpful', 'math',
       'module_deps', 'msg_resource_links', 'oldimage',
       'oldimage_userindex', 'page', 'page_props', 'page_restrictions',
       'pagelinks', 'pagetriage_log', 'pagetriage_page',
       'pagetriage_page_tags', 'pagetriage_tags', 'pif_edits',
       'povwatch_log', 'povwatch_subscribers', 'protected_titles',
       'recentchanges', 'recentchanges_userindex', 'redirect', 'revision',
       'revision_userindex', 'site_identifiers', 'site_stats', 'sites',
       'tag_summary', 'templatelinks', 'transcode', 'updatelog', 'updates',
       'user', 'user_former_groups', 'user_groups', 'user_properties',
       'user_properties_anon', 'valid_tag', 'watchlist_count',
       'wbc_entity_usage', 'wikilove_image_log', 'wikilove_log'], dtype=object)
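
Before writing queries against a table, it helps to inspect its columns. A quick sketch (DESCRIBE is standard MySQL and runs through the same connection):

# List the columns and data types of the revision table before querying it
pd.read_sql_query('describe revision',conn)

With the columns in hand, pull the five most recent entries from the revision table.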
s = """
        SELECT 
            *
        FROM 
            revision
        ORDER BY
            rev_timestamp DESC
        LIMIT
            5
    """
    
pd.read_sql_query(s,conn)
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format
0 741667858 11543122 0 b'/* Regular season */http→https for [[Go... 28903366 b'Bender the Bot' b'20160928233641' 1 0 15497 715723257 b'1bo8olwdicg6jcwr2yurupifcabchrf' None None
1 741667859 5610945 0 b'/* The UE Red Warriors Basketball Team* */' 19591522 b'Joseroyal' b'20160928233641' 0 0 42547 741667337 b'fyi1fm7zf2326cy96f8q8uiayox6xw1' None None
2 741667856 51783769 0 b"[[WP:AES|\xe2\x86\x90]]Created page with '{{... 7852030 b'Materialscientist' b'20160928233639' 0 0 441 0 b'gcv1xewvmnlcf78043zkzgtw05s1gvc' None None
3 741667857 22459546 0 b'Reverted to revision 740844275 by [[Special:... 11951198 b'Gzuufy' b'20160928233639' 0 0 8589 741660215 b'ojput91puh0wuw1sawfjhwf52121k4x' None None
4 741667855 50989734 0 b'/* Episodes */' 0 b'2607:FCC8:F6C7:4300:E0A5:328C:32B9:5B4C' b'20160928233638' 0 0 6848 741521075 b'oe7dxm4f1eggdkoqjxa3gjoy46vjiu7' None None

Important note

Article page titles need to be separated by underscores "_" rather than spaces. It's easy to replace spaces with underscores using the replace method called on the string.

page_title = 'Mitt Romney'.replace(' ','_')
page_title
'Mitt_Romney'

Get revisions for a single article

Write a function get_page_revisions that takes the name of a Wikipedia page and its namespace and uses the database connection to return a pandas DataFrame containing all the revisions for the article. Some of the fields need to be cleaned up by converting them to other data types to help with analysis in later steps.

def get_page_revisions(page_title,namespace,conn):
    """ Takes a page title and returns its revision history.
      page_title = a string for the page title to get its revisions
      namespace = an int to get the revisions from a specific namespace
      conn = a database connection
      
    Returns:
      A pandas DataFrame containing all the article revisions and page information
    """
    # In case you pass a page title with spaces in it, replace the spaces with underscores
    page_title = page_title.replace(' ','_').encode('utf8').decode('latin1')

    # The MySQL query string used to retrieve the data. By line, it is
    ## selecting all columns 
    ## from the "revisions" table
    ## joining the "page" table on it
    ## using the page_id and rev_page columns as keys
    ## limiting the results to entries that occur in the namespace and pagetitle
    ## and ordering the results by the timestamp
    s = """
            SELECT 
                *
            FROM 
                revision 
            JOIN 
                page ON page.page_id = revision.rev_page
            WHERE 
                page.page_namespace = {1} AND page.page_title = '{0}' 
            ORDER BY 
                revision.rev_timestamp ASC
        """.format(page_title,namespace)
    
    # Use the connection to run the query and return the results as a DataFrame
    _df = pd.read_sql_query(s,conn)

    # Some of the results have a "bytestring" format
    byte_columns = ['rev_comment','rev_user_text','rev_timestamp','rev_sha1','page_content_model',
                    'page_title','page_restrictions','page_touched','page_links_updated']
    
    # For each column, convert it from bytestring to a utf8 string
    for col in byte_columns:
        _df[col] = _df[col].str.decode('utf8')

    # Several of the columns are timestamps. Convert to datetimes
    _df['rev_timestamp'] = _df['rev_timestamp'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    _df['page_touched'] = _df['page_touched'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    _df['page_links_updated'] = _df['page_links_updated'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    
    # Return the data
    return _df
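
Note: newer releases of pandas no longer expose pd.datetime, so if you run this lab on a more recent version, the three timestamp conversions inside the function can be replaced with the vectorized pd.to_datetime. A minimal sketch, where _df is the DataFrame inside get_page_revisions:

# Vectorized alternative to the .apply(...strptime...) conversions above
for col in ['rev_timestamp','page_touched','page_links_updated']:
    _df[col] = pd.to_datetime(_df[col],format='%Y%m%d%H%M%S')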

Use the function to get the revisions for a single article and use the .head() method to inspect the first 5 rows in the DataFrame.

conn.ping()
conn.cursor().execute('use enwiki_p')

page_title = 'Mitt Romney'
rev_df = get_page_revisions(page_title,0,conn)
rev_df.head()

Count the total number of revisions in the article.

len(rev_df)
7409

Count the total number of unique users contributing to the article.

len(rev_df['rev_user_text'].unique())
3340

When was the first revision made to the article?

rev_df['rev_timestamp'].min()
Timestamp('2001-10-31 18:09:59')

When was the most recent revision made to the article?

rev_df['rev_timestamp'].max()
Timestamp('2016-09-26 14:23:26')

How has the size of the article changed over time?

ax = rev_df['rev_len'].plot()
ax.set_xlabel('Revisions')
ax.set_ylabel('Article size (bytes)')
<matplotlib.text.Text at 0x7fb8d31fbb00>

Who made the revision that left the article at its largest size?

# Use the idxmax() method to return the index value for the largest value in the series
loc_max_revision = rev_df['rev_len'].idxmax()

# Show the values of a single row
rev_df.loc[loc_max_revision]
rev_id                                      267973394
rev_page                                        30012
rev_text_id                                         0
rev_comment                                          
rev_user                                            0
rev_user_text                           206.116.119.8
rev_timestamp                     2009-02-02 04:13:33
rev_minor_edit                                      0
rev_deleted                                         0
rev_len                                        877891
rev_parent_id                               267972864
rev_sha1              neogjqbomo19jithikaowl9lp2tswcz
rev_content_model                                None
rev_content_format                               None
page_id                                         30012
page_namespace                                      0
page_title                                       Time
page_restrictions                                    
page_counter                                     1686
page_is_redirect                                    0
page_is_new                                         0
page_random                                  0.519609
page_touched                      2016-09-28 21:12:46
page_links_updated                2016-09-26 14:23:58
page_latest                                 741280708
page_len                                       113242
page_content_model                           wikitext
Name: 3860, dtype: object

Go inspect what this largest revision was. (Fair warning: this particular example is extremely NSFW.)

rev_id = rev_df.loc[loc_max_revision,'rev_id']
print('https://en.wikipedia.org/w/index.php?oldid={0}'.format(rev_id))
https://en.wikipedia.org/w/index.php?oldid=267973394

Group revisions by date

Multiple revisions can be made in a single day. We want to count the number of revisions per day and track how the article's length has changed over time.

# For each timestamp extract only its date (ignoring the time of day) 
# and save as a new column "rev_date"
rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())

# Compare the full "rev_timestamp" to the new "rev_date"
rev_df[['rev_timestamp','rev_date']].head()
rev_timestamp rev_date
0 2001-10-31 18:09:59 2001-10-31
1 2001-12-06 00:06:18 2001-12-06
2 2001-12-10 14:24:22 2001-12-10
3 2001-12-17 00:25:43 2001-12-17
4 2001-12-17 00:26:55 2001-12-17

What dates had the most revisions in a single day? A simple way of doing this is to count the number of times each "rev_date" occurs in the data with the value_counts method. Let's look at the top 10.

rev_df['rev_date'].value_counts().head(10)
2008-01-09    34
2012-01-24    31
2010-07-12    31
2007-10-26    29
2007-09-11    29
2007-09-27    26
2007-05-03    25
2008-01-10    24
2012-07-07    23
2005-06-16    23
Name: rev_date, dtype: int64

We can use the groupby functionality to split the data into "baby" DataFrames, each containing all the rows that share a particular value. For example, if we group by date, each baby DataFrame holds all the revisions that occurred on that date. Store the groupby in the "rev_gb_date" object for subsequent steps.

Inspect a single baby DataFrame by passing the key (the date values we grouped on) to the get_group method and look at the first 5 entries.

rev_gb_date = rev_df.groupby(['rev_date'])

_date = rev_df['rev_date'].value_counts().idxmax()
rev_gb_date.get_group(_date).head(5)
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date
2810 183068174 30012 0.0 longcase clock image fix 291880.0 Yamara 2008-01-09 00:00:30 0 0 56775.0 183010633 0f2miv8561jhpis71vay0l04v4zg0ms None None 30012 0 Time 1686 0 0 0.519609 2016-09-28 21:12:46 2016-09-26 14:23:58 741280708 113242 wikitext 2008-01-09
2811 183069109 30012 0.0 another longcase clock tweak to test in IE 291880.0 Yamara 2008-01-09 00:04:58 0 0 56781.0 183068174 0739ixqa0ml673pvmn86t8yhkvm35mj None None 30012 0 Time 1686 0 0 0.519609 2016-09-28 21:12:46 2016-09-26 14:23:58 741280708 113242 wikitext 2008-01-09
2812 183069355 30012 0.0 291880.0 Yamara 2008-01-09 00:06:09 0 0 56781.0 183069109 72u8dqks39pw7k8mu55lr31pzjcdz7k None None 30012 0 Time 1686 0 0 0.519609 2016-09-28 21:12:46 2016-09-26 14:23:58 741280708 113242 wikitext 2008-01-09
2813 183070343 30012 0.0 vanished in IE again. weird. trying it at 220 px 291880.0 Yamara 2008-01-09 00:11:00 0 0 56781.0 183069355 6d9jqkekg3u97jpsk9mz533ml4vxjm5 None None 30012 0 Time 1686 0 0 0.519609 2016-09-28 21:12:46 2016-09-26 14:23:58 741280708 113242 wikitext 2008-01-09
2814 183070625 30012 0.0 225 px... 291880.0 Yamara 2008-01-09 00:12:12 0 0 56781.0 183070343 hagdp4memczmvoh7pf2ihawr6lxkfl7 None None 30012 0 Time 1686 0 0 0.519609 2016-09-28 21:12:46 2016-09-26 14:23:58 741280708 113242 wikitext 2008-01-09

Compute unique revisions per day

We can also use aggregate functionality to perform computations efficiently across all the baby DataFrames.

daily_revs = rev_gb_date.agg({'rev_id':pd.Series.nunique})
daily_revs.head() 
rev_id
rev_date
2001-10-31 1
2001-12-06 1
2001-12-10 1
2001-12-17 3
2002-01-13 1

Reindex the axis to make it continuous over the date range.

# Compute the first and last days revisions were observed
_min, _max = daily_revs.index.min(), daily_revs.index.max()

# Create a new index that is continuous over the whole date range
_ix = pd.date_range(_min,_max)

# Reindex the daily revisions over the continuous date range filling in the blanks with 0s
daily_revs_reindexed = daily_revs.reindex(_ix).fillna(0)
daily_revs_reindexed.head()
rev_id
2001-10-31 1.0
2001-11-01 0.0
2001-11-02 0.0
2001-11-03 0.0
2001-11-04 0.0

What were the most active days of editing? Sorting the reindexed daily counts is another way of arriving at the same values as the value_counts above.

daily_revs_reindexed['rev_id'].sort_values(ascending=False).head(5)
2008-01-09    34.0
2012-01-24    31.0
2010-07-12    31.0
2007-09-11    29.0
2007-10-26    29.0
Name: rev_id, dtype: float64

Plot the number of revisions by day.

ax = daily_revs_reindexed['rev_id'].plot(lw=.5)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fb8d35cda58>

This is pretty noisy. We can clean it up by applying a rolling mean over a 60-day window.

smoothed_s = daily_revs_reindexed['rev_id'].rolling(60).mean()
ax = smoothed_s.plot(lw=3)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fb8dec7afd0>

Plot the cumulative sum of the number of revisions over time using the cumsum method.

ax = daily_revs_reindexed['rev_id'].cumsum().plot(lw=2)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative Revisions')
<matplotlib.text.Text at 0x7fb8da85b860>

Compute article length by day

Using the same "rev_gb_date" object, aggregate each of the baby DataFrames and compute the median article size ("rev_len"). Reindex again to make a continuous time axis.

daily_length = rev_gb_date.agg({'rev_len':np.median})

# Reindex
daily_length_reindexed = daily_length.reindex(_ix).fillna(method='ffill')
daily_length_reindexed.head()
rev_len
2001-10-31 1026.0
2001-11-01 1026.0
2001-11-02 1026.0
2001-11-03 1026.0
2001-11-04 1026.0

Plot the reindexed series, making the y-axis be on a log scale to pick up detail in the early days when the article was orders of magnitude smaller.

ax = daily_length_reindexed['rev_len'].plot(lw=2,logy=True)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Article size  (bytes)')
<matplotlib.text.Text at 0x7fb8d286c048>

Compute diffs on time and article size

The "rev_len" only records the length of the revision at that point in time. By taking the difference with the previous observation, we can compute how much net content the user added or removed from the previous revision. Using the diff method on the "rev_len" series, we store the resulting differences as "rev_len_diff".

Compare the first five elements of the two series to verify it's worked as intended.

# Compute the diff and store as a new series
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()
rev_len rev_len_diff
0 1026.0 NaN
1 1030.0 4.0
2 1086.0 56.0
3 1102.0 16.0
4 1101.0 -1.0

Do some more cleanup of the diffs. In particular, when a revision is deleted (which is rare) the "rev_len" value is null, which makes the diff null as well. Forward-fill the null "rev_len" values, recompute the diffs, and set the first revision's "rev_len_diff" equal to the article's initial size.

# Set the revision lengths for the null rows to be the previous value
rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')

# Compute the diffs again 
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# Set the value for the first rev_len_diff to be the size of the rev_len
rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']

# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()
rev_len rev_len_diff
0 1026.0 1026.0
1 1030.0 4.0
2 1086.0 56.0
3 1102.0 16.0
4 1101.0 -1.0

Similarly, the "rev_timestamp" series only records the time when each revision occurred. By taking the difference with the previous revision, we can compute how much time has elapsed since the last edit. Using the diff method again on the "rev_timestamp" series, we store the resulting differences as "rev_timestamp_diff".

The resulting values are Timedeltas, which we convert to floating-point numbers of days by dividing by a one-day np.timedelta64.

Compare the first five elements of the two series to verify it's worked as intended.

# Compute the diff and store as a new series
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

# Convert the values of the new series to floats by dividing by 1 day time delta.
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))

# Verify the differences are accurate.
rev_df[['rev_timestamp','rev_timestamp_diff']].head()
rev_timestamp rev_timestamp_diff
0 2001-10-31 18:09:59 NaN
1 2001-12-06 00:06:18 35.247442
2 2001-12-10 14:24:22 4.595880
3 2001-12-17 00:25:43 6.417604
4 2001-12-17 00:26:55 0.000833

Combine all the computations into one daily DataFrame

Because we added new series to "rev_df" we need to re-run the groupby to update it. We can also combine all the different aggregation functions we used before into one step by defining an "agg_dict" that maps an aggregation function to each column of interest. This returns a DataFrame containing only the columns whose aggregated values we specified.

# Groupby again
rev_gb_date = rev_df.groupby(['rev_date'])

# Define an aggregation mapping of functions to column names
agg_dict_date = {'rev_id':pd.Series.nunique,
                  'rev_len':np.median,
                  'rev_len_diff':np.median,
                  'rev_timestamp_diff':np.mean
                 }

# Execute the aggregation functions on the groupby object
rev_agg_date = rev_gb_date.agg(agg_dict_date)

# Reindex to continuous dates
rev_agg_date = rev_agg_date.reindex(_ix)

# Fill in the resulting values of the reindexed and aggregated "rev_len" from the last value
rev_agg_date[['rev_len']] = rev_agg_date[['rev_len']].fillna(method='ffill')

# Fill in the resulting values of the reindexed and aggregated other series as 0s.
rev_agg_date[['rev_id','rev_len_diff']] = rev_agg_date[['rev_id','rev_len_diff']].fillna(0)
#rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']] = rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']].fillna(0)

# Show the last 10 values to verify it's worked as intended
rev_agg_date.tail(10)
rev_id rev_timestamp_diff rev_len rev_len_diff
2016-09-17 0.0 NaN 113237.0 0.0
2016-09-18 2.0 0.890046 113233.5 0.0
2016-09-19 0.0 NaN 113233.5 0.0
2016-09-20 7.0 0.274578 113231.0 14.0
2016-09-21 0.0 NaN 113231.0 0.0
2016-09-22 2.0 0.922176 113238.0 5.5
2016-09-23 0.0 NaN 113238.0 0.0
2016-09-24 0.0 NaN 113238.0 0.0
2016-09-25 0.0 NaN 113238.0 0.0
2016-09-26 2.0 2.072106 56635.0 0.0

Plot the changes in the average time since the last revision by day.

_s = rev_agg_date['rev_timestamp_diff']
ax = _s.plot(logy=True)
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')
<matplotlib.text.Text at 0x7fb8d2701668>

That's pretty noisy. We can apply the rolling method over a 60-day window and get the mean to smooth out some of the noise.

_s = rev_agg_date['rev_timestamp_diff'].fillna(0).rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')
<matplotlib.text.Text at 0x7fb8d283ceb8>

Similarly, what was the median size of a change to the article over time?

_s = rev_agg_date['rev_len_diff']
ax = _s.plot(lw=.5)
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')
ax.set_yscale('symlog')

Again, pretty noisy. We can clean it up with a rolling mean over a 60-day window.

_s = rev_agg_date['rev_len_diff'].rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')
<matplotlib.text.Text at 0x7fb8d2563198>

Top editors on the article

Which editors made the most contributions to the article?

top_users = rev_df['rev_user_text'].value_counts()
top_users.head(10)
JimWae         776
DVdm           245
Stevertigo     136
ClueBot NG     109
Yamara          97
Steve Quinn     69
ClueBot         58
Banno           54
Favonian        50
Pfhorrest       48
Name: rev_user_text, dtype: int64

This extremely skewed distribution of activity is characteristic of many online participation patterns: a "long tail" in which a small fraction of users is responsible for the large majority of contributions.
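
We can put a rough number on that claim by computing what share of the article's revisions came from its most active editors. A minimal sketch, reusing the top_users counts from above:

# Number of editors that make up the most active 20%
top_20pct_n = int(np.ceil(len(top_users)*0.2))

# Share of all revisions made by those editors
top_20pct_share = top_users.head(top_20pct_n).sum()/top_users.sum()
print('The most active 20% of editors made {0:.1%} of the revisions'.format(top_20pct_share))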

rev_count_occurrences = top_users.value_counts().reset_index()
rev_count_occurrences = rev_count_occurrences.rename(columns={'index':'Revision count','rev_user_text':'Occurrences'})
rev_count_occurrences.head()
Revision count Occurrences
0 1 2410
1 2 526
2 3 168
3 4 79
4 6 32

Plot the resulting distribution on log-log axes to see the shape of the long-tail distribution.

rev_count_occurrences.plot(x='Revision count',y='Occurrences',kind='scatter',logy=True,logx=True,s=50)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb8de8a1b38>

Group revisions by users

We can also use the groupby-aggregate framework to group all the revisions together by the users who made them, rather than grouping revisions by the date as we did before.

rev_gb_user = rev_df.groupby(['rev_user_text'])

Look at a baby DataFrame within the groupby for a single user. These are all the revisions made by this user to this article.

rev_gb_user.get_group(top_users.idxmax()).head()
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date rev_len_diff rev_timestamp_diff
258 7528478 30012 0.0 /* Measurement of time */ 61906.0 JimWae 2004-11-16 10:43:23 0 0 12387.0 7519232 ln9ydmhgi7pto8v9qcjlbsmncfp6wnt None None 30012 0 Time 1686 0 0 0.519609 2016-09-28 21:12:46 2016-09-26 14:23:58 741280708 113242 wikitext 2004-11-16 2.0 5.915301
261 7588868 30012 0.0 /* Time in philosophy and theoretical physics */ 61906.0 JimWae 2004-11-17 20:24:26 0 0 11891.0 7588816 kd3hgeuek5epatpizjgbf7rmhf3iagf None None 30012 0 Time 1686 0 0 0.519609 2016-09-28 21:12:46 2016-09-26 14:23:58 741280708 113242 wikitext 2004-11-17 126.0 0.110417
262 7676282 30012 0.0 /* Time in philosophy and theoretical physics */ 61906.0 JimWae 2004-11-17 20:27:00 0 0 11891.0 7588868 q4apbvpcdfdzs7l5qkd4q9wvx2kt0ef None None 30012 0 Time 1686 0 0 0.519609 2016-09-28 21:12:46 2016-09-26 14:23:58 741280708 113242 wikitext 2004-11-17 0.0 0.001782
263 7588881 30012 0.0 /* Time in engineering and applied physics */ 61906.0 JimWae 2004-11-17 20:27:23 0 0 12017.0 7676282 h34mw9gf5tqv3u64r8hpa9hzr8k0pik None None 30012 0 Time 1686 0 0 0.519609 2016-09-28 21:12:46 2016-09-26 14:23:58 741280708 113242 wikitext 2004-11-17 126.0 0.000266
267 7761179 30012 0.0 61906.0 JimWae 2004-11-22 21:32:37 0 0 12292.0 7760881 fa1aain050sixrv81ax05ix4ieonkg2 None None 30012 0 Time 1686 0 0 0.519609 2016-09-28 21:12:46 2016-09-26 14:23:58 741280708 113242 wikitext 2004-11-22 125.0 0.296979

As before, we can aggregate the information in these baby DataFrames into a user-level DataFrame that summarizes different attributes.

# Define the aggregation function mappings to variables.
# We can apply multiple functions to the same column/series by nesting a second dictionary within
agg_dict_user = {'rev_id':{'revs_made':pd.Series.nunique},
                  'rev_timestamp':{'first_rev':np.min,'last_rev':np.max},
                  'rev_len_diff':{'rev_len_min':np.min,'rev_len_max':np.max,
                                  'rev_len_med':np.median,'rev_len_sum':np.sum},
                  'rev_timestamp_diff':{'rev_ts_min':np.min,'rev_ts_max':np.max,
                                        'rev_ts_avg':np.mean}
                 }

# Perform the aggregation
rev_agg_user = rev_gb_user.agg(agg_dict_user)

# Returns a multi-index column, simplify by dropping the top level
rev_agg_user.columns = rev_agg_user.columns.droplevel(0)

# Sort by number of revisions, and look at the first 5 rows
rev_agg_user.sort_values('revs_made',ascending=False).head()
revs_made first_rev last_rev rev_ts_max rev_ts_avg rev_ts_min rev_len_min rev_len_max rev_len_sum rev_len_med
rev_user_text
JimWae 776 2004-11-16 10:43:23 2014-06-15 02:13:06 10.637685 0.139416 0.000162 -52726.0 6081.0 -80368.0 0.0
DVdm 245 2008-06-28 08:55:39 2016-09-03 09:20:37 1.469109 0.075962 0.000208 -14377.0 104630.0 13759.0 -21.0
Stevertigo 136 2003-08-23 00:24:44 2012-09-24 06:30:47 18.232951 0.333091 0.000336 -437.0 886.0 8517.0 10.5
ClueBot NG 109 2010-11-03 12:37:22 2016-09-26 14:23:26 0.000706 0.000141 0.000035 -2122.0 113214.0 1263966.0 -26.0
Yamara 97 2008-01-02 20:19:28 2008-06-22 13:31:41 1.134942 0.150384 0.000359 -979.0 1561.0 -2504.0 -1.0
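
Note that newer versions of pandas deprecate (and eventually remove) this style of nested renaming dictionaries passed to agg. If you hit that error, an equivalent sketch using named aggregation (available in pandas 0.25 and later) is:

# Equivalent aggregation with named aggregation; the column names come from the
# keyword arguments, so there is no multi-index level to drop afterwards
rev_agg_user = rev_gb_user.agg(revs_made=('rev_id',pd.Series.nunique),
                               first_rev=('rev_timestamp','min'),
                               last_rev=('rev_timestamp','max'),
                               rev_len_min=('rev_len_diff','min'),
                               rev_len_max=('rev_len_diff','max'),
                               rev_len_med=('rev_len_diff','median'),
                               rev_len_sum=('rev_len_diff','sum'),
                               rev_ts_min=('rev_timestamp_diff','min'),
                               rev_ts_max=('rev_timestamp_diff','max'),
                               rev_ts_avg=('rev_timestamp_diff','mean'))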

Define some additional user-level statistics based on this aggregated data.

Tenure on article

# Compute tenure as difference between last and first revision, convert back to 1-day floats by dividing by timedelta64
rev_agg_user['tenure'] = (rev_agg_user['last_rev'] - rev_agg_user['first_rev'])/np.timedelta64(1,'D')

# Round values up to whole integers
rev_agg_user['tenure'] = rev_agg_user['tenure'].apply(np.ceil)

# Look at the longest-tenure editors
rev_agg_user['tenure'].sort_values(ascending=False).head()
rev_user_text
Patrick             4235.0
Charles Matthews    3897.0
Gogo Dodo           3604.0
Jiohdi              3582.0
JimWae              3498.0
Name: tenure, dtype: float64
Plot the distribution of tenures, using a log-scaled y-axis since the distribution is highly skewed.

ax = rev_agg_user['tenure'].plot('hist',logy=True,bins=25)
ax.set_xlabel('Tenure (Days)')
ax.set_ylabel('Number of editors')
<matplotlib.text.Text at 0x7fb8de9cf5c0>
Plot the same distribution again with logarithmically-spaced bins.

logspace_bins = np.logspace(start=0, stop=4, num=25)
ax = rev_agg_user['tenure'].plot('hist',logy=True,logx=True,bins=logspace_bins)
ax.set_xlabel('Tenure (Days)')
ax.set_ylabel('Number of editors')
<matplotlib.text.Text at 0x7fb8d22ddfd0>
Do longer-tenured editors make more revisions? Plot tenure against the number of revisions made, for editors with more than one day of tenure.

g = sb.lmplot(x='tenure',y='revs_made',data=rev_agg_user.query('tenure > 1'),aspect=1.5)
ax = g.axes[0][0]
ax.set_xlim((0,1e4))
ax.set_ylim((0,1e4))
ax.set_xscale('symlog')
ax.set_yscale('symlog')

Get user info

Define a function get_user_info that takes a revision DataFrame containing a "rev_user" column/series. For all the users in the revision history, get their user information such as registration date and current edit count.

 
def get_user_info(rev_df,conn):
    if 'rev_user' in rev_df.columns:
        
        # Get all the user IDs from the revision DF
        _ids = rev_df['rev_user'].unique().tolist() 
        
        # Remove the 0s and convert the rest to strings
        _ids = [str(i) for i in _ids if i != 0 and not np.isnan(i)]
        
        # Join the list of strings into a big comma-separated string
        user_string = ', '.join(_ids)
        
        # Write the query
        ## Select specific columns
        ## from the user table
        ## where the user_ids are among the users revising the article and have more than 0 edits
        s = """
                SELECT 
                    user_id, user_name, user_registration, user_editcount
                FROM 
                    user
                WHERE
                    user_id IN ({0}) AND user_editcount > 0
                ORDER BY
                    user_editcount DESC
            """.format(user_string)

        # Run the query
        _df = pd.read_sql_query(s,conn)
        
        # Some of the results have a "bytestring" format
        byte_columns = ['user_name','user_registration']
    
        # For each column, convert it from bytestring to a utf8 string
        for col in byte_columns:
            _df[col] = _df[col].str.decode('utf8')

        # Some of the values are missing, recode them as Jan 1, 2000
        _df['user_registration'] = _df['user_registration'].replace({None:'20000101000000'})
        
        # Convert timestamps to datetimes
        _df['user_registration'] = _df['user_registration'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
        
        # Return the results
        return _df

Run the get_user_info function on the "rev_df" DataFrame to get information about each of the users who contributed to the article.

# Refresh the database connection since too much time has passed since the last query
conn.ping()
conn.cursor().execute('use enwiki_p');

# Get the user information.
rev_user_info_df = get_user_info(rev_df,conn)
rev_user_info_df.head()
user_id user_name user_registration user_editcount
0 1215485 Cydebot 2006-04-07 01:24:50 5343935
1 7328338 Yobot 2008-06-17 22:33:17 4560883
2 13286072 ClueBot NG 2010-10-20 17:01:19 4097533
3 433328 SmackBot 2005-09-13 22:47:59 3734324
4 6569922 Addbot 2008-03-01 21:58:03 2838809

Merge the "rev_agg_user" and the "rev_user_info_df" DataFrames together by joining on the user name.

# Merge the two DataFrames together using the index of "rev_agg_user"
# and the "user_name" field from rev_user_info_df, joining on the left values which should be more inclusive
merged_rev_agg_user = pd.merge(rev_agg_user,rev_user_info_df,left_index=True,right_on='user_name',how='left')

# Check if the two DataFrames have the same number of users, if False then there will be NaNs
print(len(rev_agg_user) == len(rev_user_info_df))

# Chris discovered this exception: some users edited articles but have 0 user_editcount in the user table
# We should just delete them!
#bad_users = merged_rev_agg_user[merged_rev_agg_user['user_editcount'] == 0]

# Show the first five rows
merged_rev_agg_user.head()
False
revs_made first_rev last_rev rev_ts_max rev_ts_avg rev_ts_min rev_len_min rev_len_max rev_len_sum rev_len_med tenure user_id user_name user_registration user_editcount
615 3 2005-06-17 19:53:29 2005-07-10 22:54:08 7.031655 2.506998 0.000208 8.0 129.0 146.0 9.0 24.0 172235.0 -Ril- 2005-01-19 19:52:41 10465.0
1122 1 2008-02-15 17:52:05 2008-02-15 17:52:05 1.168808 1.168808 1.168808 275.0 275.0 275.0 275.0 0.0 1722914.0 .alyn.post. 2006-07-03 00:00:17 186.0
1514 3 2013-08-11 11:03:28 2013-08-11 11:21:46 2.307558 0.773032 0.002986 27.0 121.0 195.0 47.0 1.0 19509937.0 007naved 2013-08-11 10:59:46 3.0
757 1 2016-08-08 19:58:00 2016-08-08 19:58:00 1.081921 1.081921 1.081921 -10.0 -10.0 -10.0 -10.0 0.0 22197028.0 0xF8E8 2014-08-14 03:02:27 4948.0
1619 1 2013-08-17 01:29:45 2013-08-17 01:29:45 3.326806 3.326806 3.326806 12.0 12.0 12.0 12.0 0.0 NaN 1.120.150.75 NaT NaN

What fraction of the user's total edits are devoted to this single article?

merged_rev_agg_user['rev_fraction'] = (merged_rev_agg_user['revs_made']/merged_rev_agg_user['user_editcount']).fillna(.1)

# Plot the distribution. Use log-bins since the data is so skewed
bin_min = np.round(np.log10(merged_rev_agg_user['rev_fraction'].min()))
bin_max = np.round(np.log10(merged_rev_agg_user['rev_fraction'].max()))
logspace_bins = np.logspace(start=bin_min-1, stop=bin_max+1, num=25)
ax = merged_rev_agg_user['rev_fraction'].plot('hist',logy=False,logx=True,bins=logspace_bins)
ax.set_xlabel('Fraction of total edits')
ax.set_ylabel('Count')
<matplotlib.text.Text at 0x7fb8ddcd8c50>

How long after a user's account registration did they make their first edit to this article?

merged_rev_agg_user['first_rev_account_age'] = merged_rev_agg_user['first_rev'] - merged_rev_agg_user['user_registration']
merged_rev_agg_user['first_rev_account_age'] = merged_rev_agg_user['first_rev_account_age']/np.timedelta64(1,'D')

# Plot the distribution. Use log-bins since the data is so skewed
logspace_bins = np.logspace(start=-1, stop=4, num=25)
ax = merged_rev_agg_user['first_rev_account_age'].plot('hist',logy=False,logx=True,bins=logspace_bins)
ax.set_xlabel('Account age at first article revision')
ax.set_ylabel('Count')
<matplotlib.text.Text at 0x7fb8d8450ba8>

Do users editing the article earlier in their career end up contributing more to the article?

g = sb.lmplot(x='first_rev_account_age',y='revs_made',data=merged_rev_agg_user.query('tenure > 1'),aspect=1.5)
ax = g.axes[0][0]
ax.set_xlim((0,1e4))
ax.set_ylim((0,1e4))
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlabel('Account age at first article revision')
ax.set_ylabel('Total article revisions')

Compare two articles

# Refresh the database connection since too much time has passed since the last query
conn.ping()
conn.cursor().execute('use enwiki_p');

# Get the second set of revisions
rev2_df = get_page_revisions('John McCain',0,conn)

Do the data cleanup and feature engineering for later steps. Define a function to do all the feature engineering steps with one call.

def data_cleanup(rev_df):
    # Compute the rev_date from the timestamps
    rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())

    # Set the revision lengths for the null rows to be the previous value
    rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')

    # Compute the diffs and store as a new series 
    rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

    # Set the value for the first rev_len_diff to be the size of the rev_len
    rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']

    # Compute the diffs and store as a new series
    rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

    # Convert the values of the new series to floats by dividing by 1 day time delta.
    rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))
    
    return rev_df

Run the cleanup on rev2_df and confirm it has all the variables (columns) we need.
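
A minimal sketch of that check, assuming the rev2_df and data_cleanup defined above:

# Apply the same cleanup and feature engineering to the second article's revisions
rev2_df = data_cleanup(rev2_df)

# Confirm the engineered columns are present (each should be True)...
['rev_date' in rev2_df.columns, 'rev_len_diff' in rev2_df.columns, 'rev_timestamp_diff' in rev2_df.columns]

# ...and inspect the first few rows
rev2_df[['rev_timestamp','rev_date','rev_len','rev_len_diff','rev_timestamp_diff']].head()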