# Lab 1 - Revision Histories¶

Professor Brian Keegan
Department of Information Science, CU Boulder

This is the first of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will use a combination of MySQL and Python to retrieve the revision histories for a selection of articles. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

## Acknowledgements¶
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

## Confirm basic Python commands work¶

a = 1
b = 2
a+b

3
list(range(0,5))

[0, 1, 2, 3, 4]
your_name = 'Brian Keegan'
print("Hello, {0}".format(your_name))

Hello, Brian Keegan


## Import modules and setup environment¶

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize the results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Two packages for accessing the MySQL server
import pymysql                       # https://pymysql.readthedocs.io/
import os                            # https://docs.python.org/3.4/library/os.html

# Set up the environment so plots have a white background and DataFrames show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110


## Setup a connection to the English Wikipedia database¶

host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']


Connect to the system using the credentials from above, then select the database for the English Wikipedia.

conn = pymysql.connect(host=host,user=user,password=password,database='enwiki_p',connect_timeout=3600)
conn.cursor().execute('use enwiki_p');

# If a later query fails because the connection timed out, re-run conn.ping()
# and the 'use enwiki_p' statement to reconnect
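Several cells below repeat that ping-and-reselect pattern. A small convenience wrapper, a sketch that is not part of the original lab (the name refresh_connection is hypothetical), could bundle the two steps:

def refresh_connection(conn,database='enwiki_p'):
    """Reconnect if the server has dropped the connection, then re-select the database."""
    conn.ping(reconnect=True)                          # pymysql re-opens a lost connection
    conn.cursor().execute('use {0}'.format(database))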


## What tables are available for use?¶

pd.read_sql_query('show tables',conn)

Tables_in_enwiki_p
0 abuse_filter
1 abuse_filter_action
2 abuse_filter_log
5 aft_article_feedback
6 aft_article_feedback_properties
7 aft_article_feedback_ratings_rollup
8 aft_article_feedback_select_rollup
9 aft_article_field
10 aft_article_field_group
11 aft_article_field_option
12 aft_article_filter_count
13 aft_article_revision_feedback_ratings_rollup
14 aft_article_revision_feedback_select_rollup
15 archive
16 archive_userindex
17 article_feedback
18 article_feedback_pages
19 article_feedback_properties
20 article_feedback_ratings
21 article_feedback_revisions
22 article_feedback_stats
23 article_feedback_stats_types
24 category
26 change_tag
27 ep_articles
28 ep_cas
29 ep_courses
30 ep_events
31 ep_instructors
32 ep_oas
33 ep_orgs
34 ep_revisions
35 ep_students
36 ep_users_per_course
38 filearchive
39 filearchive_userindex
40 flaggedimages
41 flaggedpage_config
42 flaggedpage_pending
43 flaggedpages
44 flaggedrevs
45 flaggedrevs_promote
46 flaggedrevs_statistics
47 flaggedrevs_stats
48 flaggedrevs_stats2
49 flaggedrevs_tracking
50 flaggedtemplates
51 geo_tags
52 global_block_whitelist
53 hitcounter
54 image
56 interwiki
57 ipblocks
58 ipblocks_ipindex
60 l10n_cache
62 localisation
63 localisation_file_hash
64 logging
66 logging_userindex
68 math
69 module_deps
71 oldimage
72 oldimage_userindex
73 page
74 page_props
75 page_restrictions
77 pagetriage_log
78 pagetriage_page
79 pagetriage_page_tags
80 pagetriage_tags
81 pif_edits
82 povwatch_log
83 povwatch_subscribers
84 protected_titles
85 recentchanges
86 recentchanges_userindex
87 redirect
88 revision
89 revision_userindex
90 site_identifiers
91 site_stats
92 sites
93 tag_summary
95 transcode
96 updatelog
98 user
99 user_former_groups
100 user_groups
101 user_properties
102 user_properties_anon
103 valid_tag
104 watchlist_count
105 wbc_entity_usage
106 wikilove_image_log
107 wikilove_log
s = """
SELECT
*
FROM
wikilove_log
LIMIT
5
"""


0 1 b'20110630183430' 10796336 b'20091016234100' 92 10796336 b'20091016234100' 92 b'food-pie' b'A pie for you!' b'here is another pie for me' 0
1 2 b'20110630183437' 59944 b'20040419234618' 34358 59944 b'20040419234618' 34358 b'food-baklava' b'Some baklava for you!' b'testing' 0
2 3 b'20110630183525' 10796336 b'20091016234100' 93 10796336 b'20091016234100' 93 b'food-baklava' b'Some baklava for you!' b'here is another wikilove!' 0
3 4 b'20110630183537' 59944 b'20040419234618' 34359 59944 b'20040419234618' 34359 b'makeyourown' b'test' b'Howdy' 0
4 5 b'20110630183611' 12732325 b'20100714203639' 119 59944 b'20040419234618' 34360 b'kitten' b'A kitten for you!' b'Hurky hurky test' 0
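Note the b'...' values in the results: the replica database returns many text fields as bytestrings. As a minimal sketch (assuming the query result above is stored in a hypothetical DataFrame named wl_df), each bytestring column can be decoded to a utf8 string; the lab applies this same pattern inside its functions later.

wl_df = pd.read_sql_query(s,conn)

# Decode each object-dtype column (here, all bytestrings) to utf8 text
for col in wl_df.select_dtypes(include=['object']).columns:
    wl_df[col] = wl_df[col].str.decode('utf8')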

## Important note¶

Article page titles need to be separated by underscores "_" rather than spaces. It's easy to replace spaces with underscores using the replace method called on the string.

page_title = 'Mitt Romney'.replace(' ','_')
page_title

'Mitt_Romney'

## Get revisions for a single article¶

Write a function get_page_revisions that takes the name of a Wikipedia page and its namespace and uses the database connection to return a pandas DataFrame containing all the revisions for the article. Some of the fields need to be cleaned up by converting them to other data types to help with analysis in later steps.

def get_page_revisions(page_title,namespace,conn):
    """ Takes a page title and returns its revision history.
      page_title = a string for the page title to get its revisions
      namespace = an int to get the revisions from a specific namespace
      conn = a database connection

    Returns:
      A pandas DataFrame containing all the article revisions and page information
    """
    # In case you pass a page title with spaces in it, replace the spaces with underscores
    page_title = page_title.replace(' ','_')

    # The MySQL query string used to retrieve the data. By line, it is
    ## selecting all columns
    ## from the "revision" table
    ## joining the "page" table on it
    ## using the page_id and rev_page columns as keys
    ## limiting the results to entries that occur in the namespace and page title
    ## and ordering the results by the timestamp
    s = """
            SELECT
                *
            FROM
                revision
            JOIN
                page ON page.page_id = revision.rev_page
            WHERE
                page.page_namespace = {1} AND page.page_title = '{0}'
            ORDER BY
                revision.rev_timestamp ASC
        """.format(page_title,namespace)

    # Use the connection to run the query and return the results as a DataFrame
    _df = pd.read_sql_query(s,conn)

    # Some of the results have a "bytestring" format
    byte_columns = ['rev_comment','rev_user_text','rev_timestamp','rev_sha1',
                    'page_content_model','page_title','page_restrictions',
                    'page_touched','page_links_updated']

    # For each column, convert it from bytestring to a utf8 string
    for col in byte_columns:
        _df[col] = _df[col].str.decode('utf8')

    # Several of the columns are timestamps. Convert to datetimes
    _df['rev_timestamp'] = _df['rev_timestamp'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    _df['page_touched'] = _df['page_touched'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    _df['page_links_updated'] = _df['page_links_updated'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))

    # Return the data
    return _df


Use the function to get the revisions for a single article and use the .head() method to inspect the first 5 rows in the DataFrame.

# Refresh the database connection in case it has timed out since connecting
conn.ping()
conn.cursor().execute('use enwiki_p');

rev_df = get_page_revisions('Scientology',0,conn)
rev_df.head()

rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model
0 332850157 13118744 0.0 * 31 The Cunctator 2001-10-10 23:25:41 1 0 13482.0 332605804 2ykvg829c4i9klhrwr6esxfi3viez21 None None 13118744 0 Scientology 0 0 0 0.609479 2016-10-04 13:22:54 2016-10-04 12:04:55 741555064 218878 wikitext
1 332850158 13118744 0.0 *Changes instances of religion to cult 0 63.20.33.xxx 2001-11-19 23:39:39 0 0 13462.0 332605804 5hkkt3e36lrga14gugpx62tnzhdfv80 None None 13118744 0 Scientology 0 0 0 0.609479 2016-10-04 13:22:54 2016-10-04 12:04:55 741555064 218878 wikitext
2 332850159 13118744 0.0 changed back: it *is* considered a religion by... 32 Dreamyshade 2001-11-19 23:48:28 0 0 13506.0 332605804 3w7r29q10kmd7i8tuglehs6nodb3lyi None None 13118744 0 Scientology 0 0 0 0.609479 2016-10-04 13:22:54 2016-10-04 12:04:55 741555064 218878 wikitext
3 332850160 13118744 0.0 Just copyediting; note: this is one of our bes... 43 Lee Daniel Crocker 2001-11-19 23:59:43 0 0 13483.0 332605804 agiw12hes2mg6f40hswt07gsp1t2zf5 None None 13118744 0 Scientology 0 0 0 0.609479 2016-10-04 13:22:54 2016-10-04 12:04:55 741555064 218878 wikitext
4 332850161 13118744 0.0 *excerpts 32 Dreamyshade 2001-11-20 00:39:00 1 0 13484.0 332605804 deg5qn2il9yrq6k0doxxh35xvpub50e None None 13118744 0 Scientology 0 0 0 0.609479 2016-10-04 13:22:54 2016-10-04 12:04:55 741555064 218878 wikitext

Count the total number of revisions in the article.

len(rev_df)

15249

Count the total number of unique users contributing to the article.

len(rev_df['rev_user_text'].unique())

4838

When was the first revision made to the article?

rev_df['rev_timestamp'].min()

Timestamp('2001-10-10 23:25:41')

When was the most recent revision made to the article?

rev_df['rev_timestamp'].max()

Timestamp('2016-09-28 06:38:05')
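Combining the first and last timestamps, a quick sketch (not part of the original lab) computes how many days the article has existed, using the same one-day np.timedelta64 division the lab uses later for diffs:

# Elapsed days between the first and most recent revisions
(rev_df['rev_timestamp'].max() - rev_df['rev_timestamp'].min())/np.timedelta64(1,'D')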

How has the size of the article changed over time?

ax = rev_df['rev_len'].plot()
ax.set_xlabel('Revisions')
ax.set_ylabel('Article size (bytes)')


Who made the largest change to the article?

# Use the idxmax() method to return the index value for the largest value in the series
loc_max_revision = rev_df['rev_len'].idxmax()

# Show the values of a single row
rev_df.loc[loc_max_revision]

rev_id                                      119434366
rev_page                                     13118744
rev_text_id                                         0
rev_comment                             /* Origins */
rev_user                                      3841559
rev_user_text                             Paullong444
rev_timestamp                     2007-04-01 06:23:35
rev_minor_edit                                      0
rev_deleted                                         0
rev_len                                   1.70975e+06
rev_parent_id                               119434210
rev_sha1              2vvvalnzty26ceswvky10igij4o9rn6
rev_content_model                                None
rev_content_format                               None
page_id                                      13118744
page_namespace                                      0
page_title                                Scientology
page_restrictions
page_counter                                        0
page_is_redirect                                    0
page_is_new                                         0
page_random                                  0.609479
page_touched                      2016-10-04 13:22:54
page_latest                                 741555064
page_len                                       218878
page_content_model                           wikitext
Name: 8951, dtype: object

Go inspect what this largest revision was. Fair warning: this particular revision is extremely NSFW.

rev_id = rev_df.loc[loc_max_revision,'rev_id']
print('https://en.wikipedia.org/w/index.php?oldid={0}'.format(rev_id))

https://en.wikipedia.org/w/index.php?oldid=119434366


## Group revisions by date¶

Multiple revisions can be made in a single day. We want to be able to count the number of revisions per day and see how the article's length has changed over time.

# For each timestamp extract only its date (ignoring the time of day)
# and save as a new column "rev_date"
rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())

# Compare the full "rev_timestamp" to the new "rev_date"
rev_df[['rev_timestamp','rev_date']].head()
rev_timestamp rev_date
0 2001-10-10 23:25:41 2001-10-10
1 2001-11-19 23:39:39 2001-11-19
2 2001-11-19 23:48:28 2001-11-19
3 2001-11-19 23:59:43 2001-11-19
4 2001-11-20 00:39:00 2001-11-20

What dates had the most revisions in a single day? A simple way of doing this is to count the number of times each "rev_date" occurs in the data with the value_counts method. Let's look at the top 10, 6 of which fall in the last two weeks of April 2006.

rev_df['rev_date'].value_counts().head(10)

2008-11-24    80
2006-04-22    76
2007-04-13    75
2006-04-23    74
2006-04-28    74
2006-04-20    68
2006-03-20    67
2006-04-21    67
2005-11-18    67
2006-04-27    66
Name: rev_date, dtype: int64

We can use the groupby functionality to create "baby" DataFrames, each containing all the rows that share a particular value. For example, grouping by date means each baby DataFrame holds all the revisions that occurred on that date. Store the groupby in the "rev_gb_date" object for subsequent steps.
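To make the grouping concrete, here is a minimal sketch (not in the original lab) that iterates over the groupby object; each iteration yields a group key and its baby DataFrame:

# Each iteration yields a date and the DataFrame of that date's revisions
for _date, _group in rev_df.groupby('rev_date'):
    print(_date, len(_group))
    break   # stop after the first group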

Inspect a single baby DataFrame by passing the key (the date values we grouped on) to the get_group method and look at the first 5 entries.

rev_gb_date = rev_df.groupby(['rev_date'])

_date = pd.Timestamp('2007-09-21').date()
rev_gb_date.get_group(_date).head()

rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date
10740 159304172 13118744 0.0 /* Scientology as a cult */ 5146742 Kardif 2007-09-21 02:07:44 0 0 97904.0 158911937 q9eaxo6zblwi34h3fo4q3vabd8s3jy7 None None 13118744 0 Scientology 0 0 0 0.609479 2016-10-04 13:22:54 2016-10-04 12:04:55 741555064 218878 wikitext 2007-09-21
10741 159306884 13118744 0.0 [[WP:UNDO|Undid]] revision 159304172 by [[Spec... 4042147 Stan En 2007-09-21 02:24:50 1 0 97692.0 159304172 8ek31e4mwmvjldtrighuwg8z3s3n7qe None None 13118744 0 Scientology 0 0 0 0.609479 2016-10-04 13:22:54 2016-10-04 12:04:55 741555064 218878 wikitext 2007-09-21
10742 159307512 13118744 0.0 [[WP:UNDO|Undid]] revision 158884027 by [[Spec... 4042147 Stan En 2007-09-21 02:28:44 0 0 97869.0 159306884 j44iz2vm1yqo8fvvwi0g6gwcjt7nhxv None None 13118744 0 Scientology 0 0 0 0.609479 2016-10-04 13:22:54 2016-10-04 12:04:55 741555064 218878 wikitext 2007-09-21
10743 159309317 13118744 0.0 rv. myself I see now, you doubled it before...... 4042147 Stan En 2007-09-21 02:40:50 0 0 97692.0 159307512 8ek31e4mwmvjldtrighuwg8z3s3n7qe None None 13118744 0 Scientology 0 0 0 0.609479 2016-10-04 13:22:54 2016-10-04 12:04:55 741555064 218878 wikitext 2007-09-21
10744 159480017 13118744 0.0 /* Definition */ no such article 3564249 Shutterbug 2007-09-21 21:50:08 0 0 97661.0 159309317 rwi7ope3ocndovy7jbq4xjjx7qw6vtv None None 13118744 0 Scientology 0 0 0 0.609479 2016-10-04 13:22:54 2016-10-04 12:04:55 741555064 218878 wikitext 2007-09-21

### Compute unique revisions per day¶

We can also use aggregate functionality to perform computations efficiently across all the baby DataFrames.

daily_revs = rev_gb_date.agg({'rev_id':pd.Series.nunique})
daily_revs.head()

rev_id
rev_date
2001-10-10 1
2001-11-19 3
2001-11-20 1
2001-11-24 2
2001-11-25 3

Reindex the axis to make it continuous over the date range.

# Compute the first and last days revisions were observed
_min, _max = daily_revs.index.min(), daily_revs.index.max()

# Create a new index that is continuous over the whole date range
_ix = pd.date_range(_min,_max)

# Reindex the daily revisions over the continuous date range filling in the blanks with 0s
daily_revs_reindexed = daily_revs.reindex(_ix).fillna(0)
daily_revs_reindexed.head()

rev_id
2001-10-10 1.0
2001-10-11 0.0
2001-10-12 0.0
2001-10-13 0.0
2001-10-14 0.0

What was the most active day of editing?

daily_revs_reindexed['rev_id'].sort_values(ascending=False).head(5)

2008-11-24    80.0
2006-04-22    76.0
2007-04-13    75.0
2006-04-23    74.0
2006-04-28    74.0
Name: rev_id, dtype: float64

Plot the number of revisions by day.

ax = daily_revs_reindexed['rev_id'].plot(lw=.5)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')


This is pretty noisy. We can clean it up by applying a rolling mean over a 60-day window.

smoothed_s = daily_revs_reindexed['rev_id'].rolling(60).mean()
ax = smoothed_s.plot(lw=3)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
plt.savefig('ScientologyRevHist.png')


Plot the cumulative sum of the number of revisions over time using the cumsum method.

ax = daily_revs_reindexed['rev_id'].cumsum().plot(lw=2)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative Revisions')


### Compute article length by day¶

Using the same "rev_gb_date" object, aggregate each of the baby DataFrames and compute the median article size ("rev_len"). Reindex again to make a continuous time axis.

daily_length = rev_gb_date.agg({'rev_len':np.median})

# Reindex
daily_length_reindexed = daily_length.reindex(_ix).fillna(method='ffill')
daily_length_reindexed.head()

rev_len
2001-10-10 13482.0
2001-10-11 13482.0
2001-10-12 13482.0
2001-10-13 13482.0
2001-10-14 13482.0

Plot the reindexed series, putting the y-axis on a log scale to pick up detail in the early days when the article was orders of magnitude smaller.

ax = daily_length_reindexed['rev_len'].plot(lw=2,logy=True)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Article size (bytes)')


### Compute diffs on time and article size¶

The "rev_len" only records the length of the revision at that point in time. By taking the difference with the previous observation, we can compute how much net content the user added or removed from the previous revision. Using the diff method on the "rev_len" series, we store the resulting differences as "rev_len_diff".

Compare the first five elements of the two series to verify it's worked as intended.

# Compute the diff and store as a new series
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()

rev_len rev_len_diff
0 13482.0 NaN
1 13462.0 -20.0
2 13506.0 44.0
3 13483.0 -23.0
4 13484.0 1.0

Do some more cleanup of the diffs. In particular, when a revision is deleted (which is rare), its "rev_len" value is null, which makes the adjacent diffs null as well. Forward-fill the null "rev_len" values, recompute the diffs, and set the first revision's "rev_len_diff" equal to the initial size of the article.

# Set the revision lengths for the null rows to be the previous value
rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')

# Compute the diffs again
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# Set the value for the first rev_len_diff to be the size of the rev_len
rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']

# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()

rev_len rev_len_diff
0 13482.0 13482.0
1 13462.0 -20.0
2 13506.0 44.0
3 13483.0 -23.0
4 13484.0 1.0

Similarly, the "rev_timestamp" series only records the time when the revision occurred. By taking the difference with the previous revision, we can compute how much time has elapsed since the last edit. Using the diff method again on the "rev_timestamp" series, we store the resulting differences as "ref_timestamp_diff".

The resulting values are Timedeltas, which we convert back to floats (numbers of days) by dividing by a one-day np.timedelta64.

Compare the first five elements of the two series to verify it's worked as intended.

# Compute the diff and store as a new series
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

# Convert the values of the new series to floats by dividing by 1 day time delta.
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))

# Verify the differences are accurate.
rev_df[['rev_timestamp','rev_timestamp_diff']].head()

rev_timestamp rev_timestamp_diff
0 2001-10-10 23:25:41 NaN
1 2001-11-19 23:39:39 40.009699
2 2001-11-19 23:48:28 0.006123
3 2001-11-19 23:59:43 0.007812
4 2001-11-20 00:39:00 0.027280
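As an aside, an equivalent vectorized conversion (a stylistic alternative, not used in this lab) avoids the apply by using the .dt accessor on the Timedelta series:

# Total seconds divided by the number of seconds in a day gives the same day-floats
rev_df['rev_timestamp'].diff().dt.total_seconds()/(60*60*24)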

### Combine all the computations with one daily DataFrame¶

Because we added new series to the "rev_df" we need to re-run the groupby to update it. We can also combine all the different aggregate functions we did before into one step by defining an "agg_dict" that applies an aggregation function over the values in a specific column. This will return a DataFrame with only a subset of columns with the aggregated values we specified.

# Groupby again
rev_gb_date = rev_df.groupby(['rev_date'])

# Define an aggregation mapping of functions to column names
agg_dict_date = {'rev_id':pd.Series.nunique,
                 'rev_len':np.median,
                 'rev_len_diff':np.median,
                 'rev_timestamp_diff':np.mean
                }

# Execute the aggregation functions on the groupby object
rev_agg_date = rev_gb_date.agg(agg_dict_date)

# Reindex to continuous dates
rev_agg_date = rev_agg_date.reindex(_ix)

# Fill in the resulting values of the reindexed and aggregated "rev_len" from the last value
rev_agg_date[['rev_len']] = rev_agg_date[['rev_len']].fillna(method='ffill')

# Fill in the resulting values of the reindexed and aggregated other series as 0s.
rev_agg_date[['rev_id','rev_len_diff']] = rev_agg_date[['rev_id','rev_len_diff']].fillna(0)
#rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']] = rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']].fillna(0)

# Show the last 10 values to verify it's worked as intended
rev_agg_date.tail(10)

rev_len_diff rev_id rev_timestamp_diff rev_len
2016-09-19 0.0 0.0 NaN 220683.0
2016-09-20 0.0 0.0 NaN 220683.0
2016-09-21 0.0 0.0 NaN 220683.0
2016-09-22 0.0 0.0 NaN 220683.0
2016-09-23 0.0 0.0 NaN 220683.0
2016-09-24 0.0 0.0 NaN 220683.0
2016-09-25 0.0 0.0 NaN 220683.0
2016-09-26 0.0 0.0 NaN 220683.0
2016-09-27 0.0 0.0 NaN 220683.0
2016-09-28 -1805.0 1.0 16.694618 218878.0

Plot the changes in the average time since the last revision by day.

_s = rev_agg_date['rev_timestamp_diff']
ax = _s.plot(logy=True)
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')


That's pretty noisy. We can apply the rolling method over a 60-day window and get the mean to smooth out some of the noise.

_s = rev_agg_date['rev_timestamp_diff'].fillna(0).rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')


Similarly, what was the median size of a change to the article over time?

_s = rev_agg_date['rev_len_diff']
ax = _s.plot(lw=.5)
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')
ax.set_yscale('symlog')


Again, pretty noisy. We can clean it up with a rolling mean over a 60-day window.

_s = rev_agg_date['rev_len_diff'].rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')


## Top editors on the article¶

Which editors made the most contributions to the article?

top_users = rev_df['rev_user_text'].value_counts()
top_users.head(10)

Jayen466            593
Spidern             357
Bravehartbear       324
Antaeus Feldspar    257
AndroidCat          251
Raymond Hill        217
Shutterbug          202
JDPhD               180
Feoffer             160
Misou               116
Name: rev_user_text, dtype: int64

This extremely skewed distribution of activity is characteristic of many online participation patterns: a "long tail" in which a small fraction of users accounts for the large majority of contributions.
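A quick sketch (not part of the original lab) makes the skew concrete by computing the share of all revisions made by the most active 20% of editors:

# top_users is sorted in descending order, so its head is the most active editors
n_top = int(np.ceil(len(top_users)*0.2))
top_share = top_users.head(n_top).sum()/top_users.sum()
print('The most active 20% of editors made {0:.0%} of the revisions'.format(top_share))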

rev_count_occurrences = top_users.value_counts().reset_index()
rev_count_occurrences = rev_count_occurrences.rename(columns={'index':'Revision count','rev_user_text':'Occurrences'})
rev_count_occurrences.head()

Revision count Occurrences
0 1 3124
1 2 845
2 3 294
3 4 144
4 5 99

Plot the resulting distribution on log-log axes to see the shape of the long-tail distribution.

rev_count_occurrences.plot(x='Revision count',y='Occurrences',kind='scatter',logy=True,logx=True,s=50)


### Group revisions by users¶

We can also use the groupby-aggregate framework to group all the revisions together by the users who made them, rather than grouping revisions by the date as we did before.

rev_gb_user = rev_df.groupby(['rev_user_text'])


Look at a baby DataFrame within the groupby for a single user. These are all the revisions made by this user to this article.

rev_gb_user.get_group('Shutterbug').head()

As before we can aggregate all the information in these baby DFs into a user-level dataframe that summarizes different attributes.

# Define the aggregation function mappings to variables.
# We can apply multiple functions to the same column/series by nesting a second dictionary within
agg_dict_user = {'rev_id':{'revs_made':pd.Series.nunique},
                 'rev_timestamp':{'first_rev':np.min,'last_rev':np.max},
                 'rev_len_diff':{'rev_len_min':np.min,'rev_len_max':np.max,
                                 'rev_len_med':np.median,'rev_len_sum':np.sum},
                 'rev_timestamp_diff':{'rev_ts_min':np.min,'rev_ts_max':np.max,
                                       'rev_ts_avg':np.mean}
                }

# Perform the aggregation
rev_agg_user = rev_gb_user.agg(agg_dict_user)

# Returns a multi-index column, simplify by dropping the top level
rev_agg_user.columns = rev_agg_user.columns.droplevel(0)

# Sort by number of revisions, and look at the first 5 rows
rev_agg_user.sort_values('revs_made',ascending=False).head()


Define some additional user-level statistics based on this aggregated data.

#### Tenure on article¶

# Compute tenure as difference between last and first revision, convert back to 1-day floats by dividing by timedelta64
rev_agg_user['tenure'] = (rev_agg_user['last_rev'] - rev_agg_user['first_rev'])/np.timedelta64(1,'D')

# Round values up to whole integers
rev_agg_user['tenure'] = rev_agg_user['tenure'].apply(np.ceil)

# Look at the longest-tenured editors
rev_agg_user['tenure'].sort_values(ascending=False).head()

ax = rev_agg_user['tenure'].plot('hist',logy=True,bins=25)
ax.set_xlabel('Tenure (Days)')
ax.set_ylabel('Number of editors')

The same distribution plotted with logarithmically-spaced bins picks up more detail in the long tail.

logspace_bins = np.logspace(start=0, stop=4, num=25)
ax = rev_agg_user['tenure'].plot('hist',logy=True,logx=True,bins=logspace_bins)
ax.set_xlabel('Tenure (Days)')
ax.set_ylabel('Number of editors')

Do editors with longer tenures also make more revisions to the article?

g = sb.lmplot(x='tenure',y='revs_made',data=rev_agg_user.query('tenure > 1'),aspect=1.5)
ax = g.axes[0][0]
ax.set_xlim((0,1e4))
ax.set_ylim((0,1e4))
ax.set_xscale('symlog')
ax.set_yscale('symlog')


### Get user info¶

Define a function get_user_info that takes a revision DataFrame containing a "rev_user" column/series. For all the users in the revision history, get their user information, such as registration date and current edit count.

def get_user_info(rev_df,conn):
    if 'rev_user' in rev_df.columns:

        # Get all the user IDs from the revision DF
        _ids = rev_df['rev_user'].unique().tolist()

        # Remove the 0s (anonymous editors) and convert the rest to strings
        _ids = [str(i) for i in _ids if i != 0]

        # Join the list of strings into a big comma-separated string
        user_string = ', '.join(_ids)

        # Write the query
        ## Select specific columns
        ## from the user table
        ## where the user_ids are among the users revising the article
        s = """
                SELECT
                    user_id, user_name, user_registration, user_editcount
                FROM
                    user
                WHERE
                    user_id IN ({0})
                ORDER BY
                    user_editcount DESC
            """.format(user_string)

        # Run the query
        _df = pd.read_sql_query(s,conn)

        # Some of the results have a "bytestring" format
        byte_columns = ['user_name','user_registration']

        # For each column, convert it from bytestring to a utf8 string
        for col in byte_columns:
            _df[col] = _df[col].str.decode('utf8')

        # Some of the values are missing, recode them as Jan 1, 2000
        _df['user_registration'] = _df['user_registration'].replace({None:'20000101000000'})

        # Convert timestamps to datetimes
        _df['user_registration'] = _df['user_registration'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))

        # Return the results
        return _df


Run the get_user_info function on the "rev_df" DataFrame to get information about each of the users who contributed to the article.

# Refresh the database connection since too much time has passed since the last query
conn.ping()
conn.cursor().execute('use enwiki_p');

# Get the user information.
rev_user_info_df = get_user_info(rev_df,conn)


Merge the "rev_agg_user" and the "rev_user_info_df" DataFrames together by joining on the user name.

# Merge the two DataFrames together using the index of "reg_agg_user"
# and the "user_name" field from rev_user_info_df, joining on the left values which should be more inclusive
merged_rev_agg_user = pd.merge(rev_agg_user,rev_user_info_df,left_index=True,right_on='user_name',how='left')

# Check if the two DataFrames have the same number of users, if False then there will be NaNs
print(len(rev_agg_user) == len(rev_user_info_df))

# Show the first five rows
merged_rev_agg_user.head()


What fraction of each user's total edits is devoted to this single article?


merged_rev_agg_user['rev_fraction'] = merged_rev_agg_user['revs_made']/merged_rev_agg_user['user_editcount']

# Plot the distribution. Use log-scaled bins since the data is so skewed
logspace_bins = np.logspace(start=-7, stop=0, num=25)
ax = merged_rev_agg_user['rev_fraction'].plot('hist',logy=False,logx=True,bins=logspace_bins)
ax.set_xlabel('Fraction of total edits')
ax.set_ylabel('Count')


How long after a user's account registration did they make their first edit to this article?

merged_rev_agg_user['first_rev_account_age'] = merged_rev_agg_user['first_rev'] - merged_rev_agg_user['user_registration']
merged_rev_agg_user['first_rev_account_age'] = merged_rev_agg_user['first_rev_account_age']/np.timedelta64(1,'D')

# Plot the distribution. Use log-bins since the data is so skewed
logspace_bins = np.logspace(start=-1, stop=4, num=25)
ax = merged_rev_agg_user['first_rev_account_age'].plot('hist',logy=False,logx=True,bins=logspace_bins)
ax.set_xlabel('Account age at first article revision')
ax.set_ylabel('Count')


Do users editing the article earlier in their career end up contributing more to the article?

g = sb.lmplot(x='first_rev_account_age',y='revs_made',data=merged_rev_agg_user.query('tenure > 1'),aspect=1.5)
ax = g.axes[0][0]
ax.set_xlim((0,1e4))
ax.set_ylim((0,1e4))
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlabel('Account age at first article revision')
ax.set_ylabel('Total article revisions')


## Compare two articles¶

# Refresh the database connection since too much time has passed since the last query
conn.ping()
conn.cursor().execute('use enwiki_p');

# Get the second set of revisions
rev2_df = get_page_revisions('Tom Cruise',0,conn)


Do the data cleanup and feature engineering for later steps. Define a function to do all the feature engineering steps with one call.

def data_cleanup(rev_df):
    # Compute the rev_date from the timestamps
    rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())

    # Set the revision lengths for the null rows to be the previous value
    rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')

    # Compute the length diffs and store as a new series
    rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

    # Set the value for the first rev_len_diff to be the size of the rev_len
    rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']

    # Compute the timestamp diffs and store as a new series
    rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

    # Convert the values of the new series to floats by dividing by 1 day time delta.
    rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))

    return rev_df


Run the cleanup on rev2_df and confirm it has all the variables (columns) we need.

rev2_df = data_cleanup(rev2_df)
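As a quick confirmation (a sketch, not part of the original lab), assert that the engineered columns are all present:

# Raise an error if any engineered column is missing from the cleaned DataFrame
for _col in ['rev_date','rev_len_diff','rev_timestamp_diff']:
    assert _col in rev2_df.columns, '{0} is missing'.format(_col)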


Group by the "rev_date", compute the daily_revs, re-index to make the time continuous.

rev2_gb_date = rev2_df.groupby(['rev_date'])

daily_revs2 = rev2_gb_date.agg({'rev_id':pd.Series.nunique})

# Compute the first and last days revisions were observed
_min, _max = daily_revs2.index.min(), daily_revs2.index.max()

# Create a new index that is continuous over the whole date range
_ix = pd.date_range(_min,_max)

# Reindex the daily revisions over the continuous date range filling in the blanks with 0s
daily_revs2_reindexed = daily_revs2.reindex(_ix).fillna(0)


Combine both of the daily_revs series into a DataFrame by merging, renaming the columns, and filling the blank values with 0s.

daily_revs_compared_df = pd.merge(daily_revs_reindexed,daily_revs2_reindexed,
how='outer',left_index=True,right_index=True)

# The left DF was Scientology and the right DF was Tom Cruise. Rename to that effect
daily_revs_compared_df = daily_revs_compared_df.rename(columns={'rev_id_x':'Scientology','rev_id_y':'Tom Cruise'})

# Fill the null values with 0s
daily_revs_compared_df.fillna(0,inplace=True)

# Look at the first 5 rows
daily_revs_compared_df.head()


Plot the comparison of both results in a single chart. Use a rolling average of 60 days to smooth out noise.

ax = daily_revs_compared_df.rolling(60).mean().plot(lw=3)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')


Zoom in on the 2005-2007 window.

# Define two rough dates marking the start and end of the window
_min = pd.Timestamp('2005-01-01').date()
_max = pd.Timestamp('2008-01-01').date()

# Create a subset_df based on the three-year window of dates
subset_df = daily_revs_compared_df.loc[_min:_max]

# Use a window of 28 days to compute the rolling average instead
ax = subset_df.rolling(28).mean().plot(lw=3)

# Draw vertical lines marking dates of interest
ax.axvline(pd.Timestamp('2005-11-16').date(),c='black',lw=1,ls='--')
ax.axvline(pd.Timestamp('2006-03-16').date(),c='orange',lw=1,ls='--')
ax.axvline(pd.Timestamp('2006-5-2').date(),c='red',lw=1,ls='--')

ax.set_xlabel('Date')
ax.set_ylabel('Revisions')