Lab 1 - Revision Histories

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the first of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will use a combination of MySQL and Python to retrieve the revision histories for a selection of articles. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm basic Python commands work

a = 1
b = 3
a+b
4
list(range(0,5))
[0, 1, 2, 3, 4]
your_name = 'Pippy the cat'
print("Hello, {0}".format(your_name))
Hello, Pippy the cat

Import modules and setup environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize the results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Set up the environment so that plots have a white background and DataFrames show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Setup a connection to the English Wikipedia database

Retrieve the host name, username, and (encrypted) password associated with your account so you can log in to the database.

host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']

Connect to the system using the credentials from above, then select the database for the English Wikipedia.

conn = pymysql.connect(host=host,user=user,password=password,database='enwiki_p',connect_timeout=3600)
conn.cursor().execute('use enwiki_p');

What tables are available for use?

pd.read_sql_query('show tables',conn)
Tables_in_enwiki_p
0 abuse_filter
1 abuse_filter_action
2 abuse_filter_log
3 aft_article_answer
4 aft_article_answer_text
5 aft_article_feedback
6 aft_article_feedback_properties
7 aft_article_feedback_ratings_rollup
8 aft_article_feedback_select_rollup
9 aft_article_field
10 aft_article_field_group
11 aft_article_field_option
12 aft_article_filter_count
13 aft_article_revision_feedback_ratings_rollup
14 aft_article_revision_feedback_select_rollup
15 archive
16 archive_userindex
17 article_feedback
18 article_feedback_pages
19 article_feedback_properties
20 article_feedback_ratings
21 article_feedback_revisions
22 article_feedback_stats
23 article_feedback_stats_types
24 category
25 categorylinks
26 change_tag
27 ep_articles
28 ep_cas
29 ep_courses
30 ep_events
31 ep_instructors
32 ep_oas
33 ep_orgs
34 ep_revisions
35 ep_students
36 ep_users_per_course
37 externallinks
38 filearchive
39 filearchive_userindex
40 flaggedimages
41 flaggedpage_config
42 flaggedpage_pending
43 flaggedpages
44 flaggedrevs
45 flaggedrevs_promote
46 flaggedrevs_statistics
47 flaggedrevs_stats
48 flaggedrevs_stats2
49 flaggedrevs_tracking
50 flaggedtemplates
51 geo_tags
52 global_block_whitelist
53 image
54 imagelinks
... ...
56 ipblocks
57 ipblocks_ipindex
58 iwlinks
59 l10n_cache
60 langlinks
61 localisation
62 localisation_file_hash
63 logging
64 logging_logindex
65 logging_userindex
66 mark_as_helpful
67 math
68 module_deps
69 msg_resource_links
70 oldimage
71 oldimage_userindex
72 ores_classification
73 ores_model
74 page
75 page_assessments
76 page_assessments_projects
77 page_props
78 page_restrictions
79 pagelinks
80 pagetriage_log
81 pagetriage_page
82 pagetriage_page_tags
83 pagetriage_tags
84 pif_edits
85 povwatch_log
86 povwatch_subscribers
87 protected_titles
88 recentchanges
89 recentchanges_userindex
90 redirect
91 revision
92 revision_userindex
93 site_identifiers
94 site_stats
95 sites
96 tag_summary
97 templatelinks
98 transcode
99 updatelog
100 updates
101 user
102 user_former_groups
103 user_groups
104 user_properties
105 user_properties_anon
106 valid_tag
107 watchlist_count
108 wbc_entity_usage
109 wikilove_image_log
110 wikilove_log

111 rows × 1 columns
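
If you only care about tables whose names match a pattern, MySQL's SHOW TABLES statement also accepts a LIKE clause. A minimal sketch (the pattern here is just an example):

pd.read_sql_query("show tables like 'revision%'", conn)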

s = """
        SELECT 
            *
        FROM 
            geo_tags
        WHERE
            gt_lat <> 0.0
        LIMIT
            15
    """
    
pd.read_sql_query(s,conn)
gt_id gt_page_id gt_globe gt_primary gt_lat gt_lon gt_dim gt_type gt_name gt_country gt_region
0 32506976 39009140 b'earth' 1 22.224701 -159.548996 NaN None None None None
1 84519135 41934609 b'earth' 1 40.716099 -73.849998 1000.0 None b'' None None
2 97650400 43011740 b'earth' 1 45.416176 -122.658348 NaN None None None None
3 110174614 39060583 b'earth' 1 1.013428 35.005379 1000.0 None b'' None None
4 127825595 44201712 b'earth' 1 35.110321 32.879654 10000.0 b'city' b'' b'CY' None
5 137095446 35188758 b'earth' 1 34.064167 -118.359833 NaN None None None None
6 137095447 35752786 b'earth' 1 33.787628 -84.378311 NaN None None None None
7 137095488 41171788 b'earth' 1 34.776333 -84.977333 NaN None None None None
8 139471471 36243721 b'earth' 1 -15.427167 28.306000 NaN None None None None
9 139491084 37721268 b'earth' 1 -15.426167 28.310167 NaN None None None None
10 139491425 37791043 b'earth' 1 -15.393000 28.307833 NaN None None None None
11 139599425 39073876 b'earth' 1 51.505581 7.467708 NaN None None None None
12 142630171 37723580 b'earth' 1 -15.393000 28.307833 NaN None None None None
13 142632116 40663399 b'earth' 1 38.261600 -122.487106 NaN None None None None
14 145877790 36448491 b'earth' 1 19.178978 72.978683 NaN None None None None

Important note

Words in article page titles need to be separated by underscores ("_") rather than spaces. It's easy to replace the spaces with underscores using the string's replace method.

page_title = 'American Expeditionary Forces'.replace(' ','_')
page_title
'American_Expeditionary_Forces'

Get revisions for a single article

Write a function get_page_revisions that takes the name of a Wikipedia page and its namespace and uses the database connection to return a pandas DataFrame containing all the revisions for the article. Some of the fields need to be cleaned up by converting them to other data types to help with analysis in later steps.

def get_page_revisions(page_title,namespace,conn):
    """ Takes a page title and returns its revision history.
      page_title = a string for the page title to get its revisions
      namespace = an int to get the revisions from a specific namespace
      conn = a database connection
      
    Returns:
      A pandas DataFrame containing all the article revisions and page information
    """
    # In case you pass a page title with spaces in it, replace the spaces with underscores
    page_title = page_title.replace(' ','_')

    # The MySQL query string used to retrieve the data. By line, it is
    ## selecting all columns 
    ## from the "revisions" table
    ## joining the "page" table on it
    ## using the page_id and rev_page columns as keys
    ## limiting the results to entries that occur in the namespace and pagetitle
    ## and ordering the results by the timestamp
    s = """
            SELECT 
                *
            FROM 
                revision 
            JOIN 
                page ON page.page_id = revision.rev_page
            WHERE 
                page.page_namespace = {1} AND page.page_title = '{0}' 
            ORDER BY 
                revision.rev_timestamp ASC
        """.format(page_title,namespace)
    
    # Use the connection to run the query and return the results as a DataFrame
    _df = pd.read_sql_query(s,conn)

    # Some of the results have a "bytestring" format
    byte_columns = ['rev_comment','rev_user_text','rev_timestamp','rev_sha1','page_content_model',
                    'page_title','page_restrictions','page_touched','page_links_updated']
    
    # For each column, convert it from bytestring to a utf8 string
    for col in byte_columns:
        _df[col] = _df[col].str.decode('utf8')

    # Several of the columns are timestamps. Convert to datetimes
    _df['rev_timestamp'] = _df['rev_timestamp'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    _df['page_touched'] = _df['page_touched'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    _df['page_links_updated'] = _df['page_links_updated'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    
    # Return the data
    return _df

Use the function to get the revisions for a single article and use the .head() method to inspect the first 5 rows in the DataFrame.

rev_df = get_page_revisions('American Expeditionary Forces',0,conn)
rev_df.head()
On first execution this cell failed with a long traceback ending in:

OperationalError: (2006, "MySQL server has gone away (BrokenPipeError(32, 'Broken pipe'))")
DatabaseError: Execution failed on sql: ... unable to rollback

This happens when the database connection has sat idle too long and the server closes it. Refresh the connection and re-run the cell (see the sketch below); the results that follow come from a successful re-run.
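
A minimal sketch of the refresh-and-retry pattern, using the same conn.ping() idiom that appears later in this lab:

# Refresh the stale connection, reselect the database, and retry the query
conn.ping()
conn.cursor().execute('use enwiki_p');

rev_df = get_page_revisions('American Expeditionary Forces',0,conn)
rev_df.head()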

Count the total number of revisions in the article.

len(rev_df)
758

Count the total number of unique users contributing to the article.

len(rev_df['rev_user_text'].unique())
345

When was the first revision made to the article?

rev_df['rev_timestamp'].min()
Timestamp('2003-09-07 01:05:58')

When was the most recent revision made to the article?

rev_df['rev_timestamp'].max()
Timestamp('2016-12-01 22:54:59')
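
We can also express the article's age as the span between its first and last revisions, converting the Timedelta to days by dividing by a one-day np.timedelta64 (the same trick used later in the lab). A minimal sketch:

# Age of the article in days, as a float
age_days = (rev_df['rev_timestamp'].max() - rev_df['rev_timestamp'].min())/np.timedelta64(1,'D')
print('Article age: {0:.0f} days'.format(age_days))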

How has the size of the article changed over time?

ax = rev_df['rev_len'].plot()
ax.set_xlabel('Revisions')
ax.set_ylabel('Article size (bytes)')
<matplotlib.text.Text at 0x7fe01d138fd0>

Who made the largest change to the article?

# Use the idxmax() method to return the index value for the largest value in the series
loc_max_revision = rev_df['rev_len'].idxmax()

# Show the values of a single row
rev_df.loc[loc_max_revision]
rev_id                                      266458149
rev_page                                       313147
rev_text_id                                         0
rev_comment                                          
rev_user                                      5345967
rev_user_text                          Alicemfisher58
rev_timestamp                     2009-01-26 04:14:19
rev_minor_edit                                      0
rev_deleted                                         0
rev_len                                        493540
rev_parent_id                               266250098
rev_sha1              hoswgv8mnwd1mw50k7bmx8zrwa2a311
rev_content_model                                None
rev_content_format                               None
page_id                                        313147
page_namespace                                      0
page_title              American_Expeditionary_Forces
page_restrictions                                    
page_counter                                        0
page_is_redirect                                    0
page_is_new                                         0
page_random                                  0.916663
page_touched                      2016-12-01 22:54:59
page_links_updated                2016-12-10 02:08:55
page_latest                                 752554663
page_len                                        24578
page_content_model                           wikitext
Name: 287, dtype: object

Go inspect what this largest revision was. (Fair warning: this particular revision is extremely NSFW.)

rev_id = rev_df.loc[loc_max_revision,'rev_id']
print('https://en.wikipedia.org/w/index.php?oldid={0}'.format(rev_id))
https://en.wikipedia.org/w/index.php?oldid=266458149

Group revisions by date

Multiple revisions can be made in a single day. We want to be able to count the number of revisions per day and track how the article's length has changed over time.

# For each timestamp extract only its date (ignoring the time of day) 
# and save as a new column "rev_date"
rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())

# Compare the full "rev_timestamp" to the new "rev_date"
rev_df[['rev_timestamp','rev_date']].head()
rev_timestamp rev_date
0 2003-09-07 01:05:58 2003-09-07
1 2003-09-07 01:06:08 2003-09-07
2 2003-09-07 01:06:24 2003-09-07
3 2003-09-07 01:07:53 2003-09-07
4 2003-10-24 08:05:33 2003-10-24

What dates had the most revisions in a single day? A simple way to find out is to count the number of times each "rev_date" occurs in the data with the value_counts method. Let's look at the top 10; most of the busiest days fall in January and March of 2009.

rev_df['rev_date'].value_counts().head(10)
2009-03-19    46
2009-03-21    35
2009-01-25    33
2009-01-24    32
2009-03-18    26
2009-03-22    14
2007-02-06    10
2007-05-20     8
2013-06-17     8
2006-11-30     7
Name: rev_date, dtype: int64

We can use the groupby functionality to create "baby" DataFrames in which every row shares the same value of a particular column. For example, if we group by date, each baby DataFrame contains all the revisions that occurred on that date. Store the groupby in the "rev_gb_date" object for subsequent steps.

Inspect a single baby DataFrame by passing the key (the date values we grouped on) to the get_group method and look at the first 5 entries.

rev_gb_date = rev_df.groupby(['rev_date'])
# We need to use a date that we know exists in this dataset, otherwise get_group raises a KeyError
_date = pd.Timestamp('2013-06-17').date()
rev_gb_date.get_group(_date).head(5)
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date
611 560271420 313147 0 flags 12006778 Illegitimate Barrister 2013-06-17 09:17:30 0 0 20351 557106742 mksyoz8foykgy89j7s1evi7dsc6qlv4 None None 313147 0 American_Expeditionary_Forces 0 0 0 0.916663 2016-12-01 22:54:59 2016-12-10 02:08:55 752554663 24578 wikitext 2013-06-17
612 560361931 313147 0 looking 12006778 Illegitimate Barrister 2013-06-17 22:26:49 0 0 20377 560271420 gzxco4d1gohvzi077lhmy0q489pfgsv None None 313147 0 American_Expeditionary_Forces 0 0 0 0.916663 2016-12-01 22:54:59 2016-12-10 02:08:55 752554663 24578 wikitext 2013-06-17
613 560361972 313147 0 C.O. 12006778 Illegitimate Barrister 2013-06-17 22:27:14 0 0 20386 560361931 jk64mnsj2bpko9v8udjli295qqxo2fh None None 313147 0 American_Expeditionary_Forces 0 0 0 0.916663 2016-12-01 22:54:59 2016-12-10 02:08:55 752554663 24578 wikitext 2013-06-17
614 560362011 313147 0 typo 12006778 Illegitimate Barrister 2013-06-17 22:27:35 0 0 20383 560361972 njouapgjxs25sq6222krkctn3mskbdp None None 313147 0 American_Expeditionary_Forces 0 0 0 0.916663 2016-12-01 22:54:59 2016-12-10 02:08:55 752554663 24578 wikitext 2013-06-17
615 560362071 313147 0 added image 12006778 Illegitimate Barrister 2013-06-17 22:28:08 0 0 20341 560362011 ak7tgaz2sfb93i9kcyouvo937cgrrgj None None 313147 0 American_Expeditionary_Forces 0 0 0 0.916663 2016-12-01 22:54:59 2016-12-10 02:08:55 752554663 24578 wikitext 2013-06-17

Compute unique revisions per day

We can also use aggregate functionality to perform computations efficiently across all the baby DataFrames.

daily_revs = rev_gb_date.agg({'rev_id':pd.Series.nunique})
daily_revs.head() 
rev_id
rev_date
2003-09-07 4
2003-10-24 1
2004-02-16 2
2004-02-17 2
2004-02-18 4

Reindex the axis to make it continuous over the date range.

# Compute the first and last days revisions were observed
_min, _max = daily_revs.index.min(), daily_revs.index.max()

# Create a new index that is continuous over the whole date range
_ix = pd.date_range(_min,_max)

# Reindex the daily revisions over the continuous date range filling in the blanks with 0s
daily_revs_reindexed = daily_revs.reindex(_ix).fillna(0)
daily_revs_reindexed.head()
rev_id
2003-09-07 4.0
2003-09-08 0.0
2003-09-09 0.0
2003-09-10 0.0
2003-09-11 0.0

What was the most active day of editing?

daily_revs_reindexed['rev_id'].sort_values(ascending=False).head(5)
2009-03-19    46.0
2009-03-21    35.0
2009-01-25    33.0
2009-01-24    32.0
2009-03-18    26.0
Name: rev_id, dtype: float64
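
Equivalently, the idxmax method returns the single busiest day directly. A quick sketch:

daily_revs_reindexed['rev_id'].idxmax()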

Plot the number of revisions by day.

ax = daily_revs_reindexed['rev_id'].plot(lw=.5)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fe01d0599b0>

This is pretty noisy. We can clean it up by applying a rolling mean over a 60-day window.

smoothed_s = daily_revs_reindexed['rev_id'].rolling(60).mean()
ax = smoothed_s.plot(lw=3)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fe01cf5b6a0>

Plot the cumulative sum of the number of revisions over time using the cumsum method.

ax = daily_revs_reindexed['rev_id'].cumsum().plot(lw=2)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative Revisions')
<matplotlib.text.Text at 0x7fe01d0731d0>

Compute article length by day

Using the same "rev_gb_date" object, aggregate each of the baby DataFrames and compute the median article size ("rev_len"). Reindex again to make a continuous time axis.

daily_length = rev_gb_date.agg({'rev_len':np.median})

# Reindex
daily_length_reindexed = daily_length.reindex(_ix).fillna(method='ffill')
daily_length_reindexed.head()
rev_len
2003-09-07 65.5
2003-09-08 65.5
2003-09-09 65.5
2003-09-10 65.5
2003-09-11 65.5

Plot the reindexed series, making the y-axis be on a log scale to pick up detail in the early days when the article was orders of magnitude smaller.

ax = daily_length_reindexed['rev_len'].plot(lw=2,logy=True)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Article size  (bytes)')
<matplotlib.text.Text at 0x7fe01ce7f4e0>

Compute diffs on time and article size

The "rev_len" only records the length of the revision at that point in time. By taking the difference with the previous observation, we can compute how much net content the user added or removed from the previous revision. Using the diff method on the "rev_len" series, we store the resulting differences as "rev_len_diff".

Compare the first five elements of the two series to verify it's worked as intended.

# Compute the diff and store as a new series
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()
rev_len rev_len_diff
0 24 NaN
1 63 39.0
2 68 5.0
3 274 206.0
4 512 238.0

Do some more cleanup of the diffs. In particular, when a revision is deleted (which is rare) the "rev_len" value is null, which makes the diff null as well. Forward-fill the null "rev_len" values, recompute the diff, and set the first revision's "rev_len_diff" equal to the size of the article.

# Set the revision lengths for the null rows to be the previous value
rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')

# Compute the diffs again 
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# Set the value for the first rev_len_diff to be the size of the rev_len
rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']

# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()
rev_len rev_len_diff
0 24 24.0
1 63 39.0
2 68 5.0
3 274 206.0
4 512 238.0
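
To confirm the cleanup removed the missing values, we can count any nulls remaining in either series. A quick sanity check:

rev_df[['rev_len','rev_len_diff']].isnull().sum()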

Similarly, the "rev_timestamp" series only records the time when the revision occurred. By taking the difference with the previous revision, we can compute how much time has elapsed since the last edit. Using the diff method again on the "rev_timestamp" series, we store the resulting differences as "ref_timestamp_diff".

The resulting values are Timedeltas, which we convert to floating-point numbers of days by dividing by a one-day np.timedelta64.

Compare the first five elements of the two series to verify it's worked as intended.

# Compute the diff and store as a new series
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

# Convert the values of the new series to floats by dividing by 1 day time delta.
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))

# Verify the differences are accurate.
rev_df[['rev_timestamp','rev_timestamp_diff']].head()
rev_timestamp rev_timestamp_diff
0 2003-09-07 01:05:58 NaN
1 2003-09-07 01:06:08 0.000116
2 2003-09-07 01:06:24 0.000185
3 2003-09-07 01:07:53 0.001030
4 2003-10-24 08:05:33 47.290046
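
As a quick check on the new series, describe() summarizes the distribution of inter-edit times (in days). A minimal sketch:

rev_df['rev_timestamp_diff'].describe()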

Combine all the computations into one daily DataFrame

Because we added new series to the "rev_df" we need to re-run the groupby to update it. We can also combine all the different aggregate functions we did before into one step by defining an "agg_dict" that applies an aggregation function over the values in a specific column. This will return a DataFrame with only a subset of columns with the aggregated values we specified.

# Groupby again
rev_gb_date = rev_df.groupby(['rev_date'])

# Define an aggregation mapping of functions to columns names
agg_dict_date = {'rev_id':pd.Series.nunique,
                  'rev_len':np.median,
                  'rev_len_diff':np.median,
                  'rev_timestamp_diff':np.mean
                 }

# Execute the aggregation functions on the groupby object
rev_agg_date = rev_gb_date.agg(agg_dict_date)

# Reindex to continuous dates
rev_agg_date = rev_agg_date.reindex(_ix)

# Fill in the resulting values of the reindexed and aggregated "rev_len" from the last value
rev_agg_date[['rev_len']] = rev_agg_date[['rev_len']].fillna(method='ffill')

# Fill in the resulting values of the reindexed and aggregated other series as 0s.
rev_agg_date[['rev_id','rev_len_diff']] = rev_agg_date[['rev_id','rev_len_diff']].fillna(0)
#rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']] = rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']].fillna(0)

# Show the last 10 values to verify it's worked as intended
rev_agg_date.tail(10)
rev_timestamp_diff rev_len rev_len_diff rev_id
2016-11-22 NaN 24253.0 0.0 0.0
2016-11-23 NaN 24253.0 0.0 0.0
2016-11-24 NaN 24253.0 0.0 0.0
2016-11-25 NaN 24253.0 0.0 0.0
2016-11-26 NaN 24253.0 0.0 0.0
2016-11-27 NaN 24253.0 0.0 0.0
2016-11-28 NaN 24253.0 0.0 0.0
2016-11-29 NaN 24253.0 0.0 0.0
2016-11-30 NaN 24253.0 0.0 0.0
2016-12-01 42.106991 24578.0 325.0 1.0

Plot the changes in the average time since the last revision by day.

_s = rev_agg_date['rev_timestamp_diff']
ax = _s.plot(logy=True)
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')
<matplotlib.text.Text at 0x7fe01d0a9a90>
/srv/paws/lib/python3.4/site-packages/matplotlib/scale.py:101: RuntimeWarning: invalid value encountered in less_equal
  a[a <= 0.0] = 1e-300

That's pretty noisy. We can apply the rolling method over a 60-day window and get the mean to smooth out some of the noise.

_s = rev_agg_date['rev_timestamp_diff'].fillna(0).rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')
<matplotlib.text.Text at 0x7fe01cb92a90>

Similarly, what was the median size of a change to the article over time?

_s = rev_agg_date['rev_len_diff']
ax = _s.plot(lw=.5)
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')
ax.set_yscale('symlog')

Again, pretty noisy. We can cleanup with a rolling mean over a 60-day window.

_s = rev_agg_date['rev_len_diff'].rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')
<matplotlib.text.Text at 0x7fe01c845b00>

Top editors on the article

Which editors made the most contributions to the article?

top_users = rev_df['rev_user_text'].value_counts()
top_users.head(10)
71.84.8.103               129
98.204.190.17              39
Alicemfisher58             34
Rjensen                    18
Italia2006                 15
Illegitimate Barrister     12
AEF1918                    11
ClueBot NG                  9
DMorpheus                   8
216.229.201.7               7
Name: rev_user_text, dtype: int64

This extremely skewed distribution of activity is characteristic of many online participation patterns: a "long tail" in which fewer than 20% of users are responsible for more than 80% of contributions.

rev_count_occurrences = top_users.value_counts().reset_index()
rev_count_occurrences = rev_count_occurrences.rename(columns={'index':'Revision count','rev_user_text':'Occurrences'})
rev_count_occurrences.head()
Revision count Occurrences
0 1 259
1 2 42
2 3 18
3 4 8
4 6 5

Plot the resulting distribution on log-log axes to see the shape of the long-tail distribution.

rev_count_occurrences.plot(x='Revision count',y='Occurrences',kind='scatter',logy=True,logx=True,s=50)
<matplotlib.axes._subplots.AxesSubplot at 0x7fe01c6f49e8>
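
We can also check the 80/20 claim directly from the top_users series above. A minimal sketch (the exact share will vary from article to article):

# Number of editors in the top 20%, then their share of all revisions
top_20pct = int(np.ceil(len(top_users) * 0.2))
share = top_users.head(top_20pct).sum() / top_users.sum()
print('Top 20% of editors made {0:.1%} of all revisions'.format(share))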

Group revisions by users

We can also use the groupby-aggregate framework to group all the revisions together by the users who made them, rather than grouping revisions by the date as we did before.

rev_gb_user = rev_df.groupby(['rev_user_text'])

Look at a baby DataFrame within the groupby for a single user. These are all the revisions made by this user to this article.

rev_gb_user.get_group('71.84.8.103').head() # the most active contributor; all of their edits were made in 2009
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date rev_len_diff rev_timestamp_diff
296 278014135 313147 0 /* Overview */ 0 71.84.8.103 2009-03-18 02:09:44 0 0 14471 272520716 27ugspejjem752qu61mpdg0yucgkvji None None 313147 0 American_Expeditionary_Forces 0 0 0 0.916663 2016-12-01 22:54:59 2016-12-10 02:08:55 752554663 24578 wikitext 2009-03-18 307.0 23.395949
297 278014327 313147 0 /* Overview */ 0 71.84.8.103 2009-03-18 02:10:52 0 0 14475 278014135 jh974rvtwz96jv4ocnjxgqsyq6kb2du None None 313147 0 American_Expeditionary_Forces 0 0 0 0.916663 2016-12-01 22:54:59 2016-12-10 02:08:55 752554663 24578 wikitext 2009-03-18 4.0 0.000787
298 278014624 313147 0 /* Overview */ 0 71.84.8.103 2009-03-18 02:12:26 0 0 14472 278014327 1axe5513zji2pon46v1t0pn2v6pn6ru None None 313147 0 American_Expeditionary_Forces 0 0 0 0.916663 2016-12-01 22:54:59 2016-12-10 02:08:55 752554663 24578 wikitext 2009-03-18 -3.0 0.001088
299 278014782 313147 0 /* Overview */ 0 71.84.8.103 2009-03-18 02:13:16 0 0 14480 278014624 7cdnrnfguco9tnp7z7rvrf8x6hobvwd None None 313147 0 American_Expeditionary_Forces 0 0 0 0.916663 2016-12-01 22:54:59 2016-12-10 02:08:55 752554663 24578 wikitext 2009-03-18 8.0 0.000579
300 278015036 313147 0 /* Overview */ 0 71.84.8.103 2009-03-18 02:14:28 0 0 14472 278014782 1axe5513zji2pon46v1t0pn2v6pn6ru None None 313147 0 American_Expeditionary_Forces 0 0 0 0.916663 2016-12-01 22:54:59 2016-12-10 02:08:55 752554663 24578 wikitext 2009-03-18 -8.0 0.000833

As before, we can aggregate all the information in these baby DFs into a user-level DataFrame that summarizes different attributes.

# Define the aggregation function mappings to variables.
# We can apply multiple functions to the same column/series by nesting a second dictionary within
agg_dict_user = {'rev_id':{'revs_made':pd.Series.nunique},
                  'rev_timestamp':{'first_rev':np.min,'last_rev':np.max},
                  'rev_len_diff':{'rev_len_min':np.min,'rev_len_max':np.max,
                                  'rev_len_med':np.median,'rev_len_sum':np.sum},
                  'rev_timestamp_diff':{'rev_ts_min':np.min,'rev_ts_max':np.max,
                                        'rev_ts_avg':np.mean}
                 }

# Perform the aggregation
rev_agg_user = rev_gb_user.agg(agg_dict_user)

# Returns a multi-index column, simplify by dropping the top level
rev_agg_user.columns = rev_agg_user.columns.droplevel(0)

# Sort by number of revisions, and look at the first 5 rows
rev_agg_user.sort_values('revs_made',ascending=False).head()
rev_ts_min rev_ts_avg rev_ts_max rev_len_med rev_len_min rev_len_max rev_len_sum first_rev last_rev revs_made
rev_user_text
71.84.8.103 0.000208 0.260048 23.395949 3.0 -216.0 819.0 1986.0 2009-03-18 02:09:44 2009-03-29 04:45:58 129
98.204.190.17 0.000197 0.427413 16.378889 52.0 -53.0 176230.0 478326.0 2009-01-24 19:19:44 2009-01-25 04:17:43 39
Alicemfisher58 0.000498 0.729285 14.107975 16.5 -293.0 718.0 1458.0 2009-01-25 00:55:02 2009-06-17 22:33:46 34
Rjensen 0.001100 3.196289 20.016331 165.0 -583.0 2156.0 4736.0 2006-05-24 07:08:37 2015-04-28 08:48:55 18
Italia2006 0.000417 3.775796 20.365266 1.0 -11.0 150.0 457.0 2013-06-28 05:38:29 2015-02-10 20:46:39 15

Define some additional user-level statistics based on this aggregated data.

Tenure on article

# Compute tenure as difference between last and first revision, convert back to 1-day floats by dividing by timedelta64
rev_agg_user['tenure'] = (rev_agg_user['last_rev'] - rev_agg_user['first_rev'])/np.timedelta64(1,'D')

# Round values up to whole integers
rev_agg_user['tenure'] = rev_agg_user['tenure'].apply(np.ceil)

# Look at the longest-tenure editors
rev_agg_user['tenure'].sort_values(ascending=False).head()
rev_user_text
Rjensen         3262.0
Jrt989          2564.0
Anotherclown    2008.0
ClueBot NG      1996.0
Hmains          1946.0
Name: tenure, dtype: float64
ax = rev_agg_user['tenure'].plot('hist',logy=True,bins=25)
ax.set_xlabel('Tenure (Days)')
ax.set_ylabel('Number of editors')
<matplotlib.text.Text at 0x7fe01c7669e8>
logspace_bins = np.logspace(start=0, stop=4, num=25)
ax = rev_agg_user['tenure'].plot('hist',logy=True,logx=True,bins=logspace_bins)
ax.set_xlabel('Tenure (Days)')
ax.set_ylabel('Number of editors')
<matplotlib.text.Text at 0x7fe01ca325f8>
g = sb.lmplot(x='tenure',y='revs_made',data=rev_agg_user.query('tenure > 1'),aspect=1.5)
ax = g.axes[0][0]
ax.set_xlim((0,1e4))
ax.set_ylim((0,1e4))
ax.set_xscale('symlog')
ax.set_yscale('symlog')
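
To put a number on the relationship the plot suggests, one option is a rank correlation; pandas implements Spearman's rho through the corr method. A sketch:

rev_agg_user.query('tenure > 1')[['tenure','revs_made']].corr(method='spearman')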

Get user info

Define a function get_user_info that takes a revision DataFrame containing a "rev_user" column/series. For all the users in the revision history, get their user information such as registration date and current edit count.

def get_user_info(rev_df,conn):
    if 'rev_user' in rev_df.columns:
        
        # Get all the user IDs from the revision DF
        _ids = rev_df['rev_user'].unique().tolist() 
        
        # Remove the 0s and convert the rest to strings
        _ids = [str(i) for i in _ids if i != 0]
        
        # Join the list of strings into a big comma-separated string
        user_string = ', '.join(_ids)
        
        # Write the query
        ## Select specific columns
        ## from the user table
        ## where the user_ids are among the users revising the article
        s = """
                SELECT 
                    user_id, user_name, user_registration, user_editcount
                FROM 
                    user
                WHERE
                    user_id IN ({0})
                ORDER BY
                    user_editcount DESC
            """.format(user_string)

        # Run the query
        _df = pd.read_sql_query(s,conn)
        
        # Some of the results have a "bytestring" format
        byte_columns = ['user_name','user_registration']
    
        # For each column, convert it from bytestring to a utf8 string
        for col in byte_columns:
            _df[col] = _df[col].str.decode('utf8')

        # Some of the values are missing, recode them as Jan 1, 2000
        _df['user_registration'] = _df['user_registration'].replace({None:'20000101000000'})
        
        # Convert timestamps to datetimes
        _df['user_registration'] = _df['user_registration'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
        
        # Return the results
        return _df

Run the get_user_info function on the "rev_df" DataFrame to get information about each of the users who contributed to the article.

# Refresh the database connection since too much time has passed since the last query
conn.ping()
conn.cursor().execute('use enwiki_p');

# Get the user information.
rev_user_info_df = get_user_info(rev_df,conn)
rev_user_info_df.head()
user_id user_name user_registration user_editcount
0 7328338 Yobot 2008-06-17 22:33:17 4644858
1 13286072 ClueBot NG 2010-10-20 17:01:19 4228559
2 433328 SmackBot 2005-09-13 22:47:59 3734324
3 6569922 Addbot 2008-03-01 21:58:03 2838809
4 7611264 AnomieBOT 2008-08-07 17:09:33 2692037

Merge the "rev_agg_user" and the "rev_user_info_df" DataFrames together by joining on the user name.

# Merge the two DataFrames together using the index of "reg_agg_user"
# and the "user_name" field from rev_user_info_df, joining on the left values which should be more inclusive
merged_rev_agg_user = pd.merge(rev_agg_user,rev_user_info_df,left_index=True,right_on='user_name',how='left')

# Check if the two DataFrames have the same number of users, if False then there will be NaNs
print(len(rev_agg_user) == len(rev_user_info_df))

# Show the first five rows
merged_rev_agg_user.head()
False
rev_ts_min rev_ts_avg rev_ts_max rev_len_med rev_len_min rev_len_max rev_len_sum first_rev last_rev revs_made tenure user_id user_name user_registration user_editcount
152 30.065567 30.065567 30.065567 35.0 35.0 35.0 35.0 2007-09-11 11:50:24 2007-09-11 11:50:24 1 0.0 843479.0 -js- 2006-01-26 21:07:09 2409.0
202 2.281470 2.281470 2.281470 4.0 4.0 4.0 4.0 2015-03-11 06:45:31 2015-03-11 06:45:31 1 0.0 NaN 101.168.85.68 NaT NaN
202 0.000463 11.950642 23.900822 3.5 -5.0 12.0 7.0 2012-09-13 21:16:44 2012-09-13 21:17:24 2 1.0 NaN 108.235.32.10 NaT NaN
202 0.336678 0.336678 0.336678 42.0 42.0 42.0 42.0 2013-11-18 00:11:35 2013-11-18 00:11:35 1 0.0 NaN 108.254.130.98 NaT NaN
202 40.779699 40.779699 40.779699 245.0 245.0 245.0 245.0 2016-01-31 22:40:58 2016-01-31 22:40:58 1 0.0 NaN 108.79.192.79 NaT NaN
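
Since the lengths didn't match, some rows have missing user information: anonymous IP editors have no entry in the user table. A quick count of the unmatched rows (a minimal sketch):

merged_rev_agg_user['user_id'].isnull().sum()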

What fraction of each user's total edits is devoted to this single article?

merged_rev_agg_user['rev_fraction'] = merged_rev_agg_user['revs_made']/merged_rev_agg_user['user_editcount']

# Plot the distribution. Use log-bins since the data is so skewed
logspace_bins = np.logspace(start=-7, stop=0, num=25)
ax = merged_rev_agg_user['rev_fraction'].plot('hist',logy=False,logx=True,bins=logspace_bins)
ax.set_xlabel('Fraction of total edits')
ax.set_ylabel('Count')
<matplotlib.text.Text at 0x7fe01c553470>

How long after a user's account registration did they make their first edit to this article?

merged_rev_agg_user['first_rev_account_age'] = merged_rev_agg_user['first_rev'] - merged_rev_agg_user['user_registration']
merged_rev_agg_user['first_rev_account_age'] = merged_rev_agg_user['first_rev_account_age']/np.timedelta64(1,'D')

# Plot the distribution. Use log-bins since the data is so skewed
logspace_bins = np.logspace(start=-1, stop=4, num=25)
ax = merged_rev_agg_user['first_rev_account_age'].plot('hist',logy=False,logx=True,bins=logspace_bins)
ax.set_xlabel('Account age at first article revision')
ax.set_ylabel('Count')
<matplotlib.text.Text at 0x7fe01c36e588>

Do users editing the article earlier in their career end up contributing more to the article?

g = sb.lmplot(x='first_rev_account_age',y='revs_made',data=merged_rev_agg_user.query('tenure > 1'),aspect=1.5)
ax = g.axes[0][0]
ax.set_xlim((0,1e4))
ax.set_ylim((0,1e4))
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlabel('Account age at first article revision')
ax.set_ylabel('Total article revisions')
<matplotlib.text.Text at 0x7fe01c18c128>

Compare two articles

# Refresh the database connection since too much time has passed since the last query
conn.ping()
conn.cursor().execute('use enwiki_p');

# Get the second set of revisions
rev2_df = get_page_revisions('British_Expeditionary_Force_(World_War_I)',0,conn)

Do the data cleanup and feature engineering for later steps. Define a function to do all the feature engineering steps with one call.

def data_cleanup(rev_df):
    # Compute the rev_date from the timestamps
    rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())

    # Set the revision lengths for the null rows to be the previous value
    rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')

    # Compute the diffs and store as a new series 
    rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

    # Set the value for the first rev_len_diff to be the size of the rev_len
    rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']

    # Compute the diffs and store as a new series
    rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

    # Convert the values of the new series to floats by dividing by 1 day time delta.
    rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))
    
    return rev_df

Run the cleanup on rev2_df and confirm it has all the variables (columns) we need.

rev2_df = data_cleanup(rev2_df)
rev2_df.head()
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date rev_len_diff rev_timestamp_diff
0 810116 207150 0 0 81.131.3.94 2003-04-07 23:45:41 0 0 1573 0 eafkbqad2h8v4j7pyl01spn5bk0ltcz None None 207150 0 British_Expeditionary_Force_(World_War_I) 1 0 0 0.36539 2016-12-08 06:49:25 2016-12-10 01:58:20 753619404 39169 wikitext 2003-04-07 1573.0 NaN
1 810119 207150 0 4626 Michael Hardy 2003-04-07 23:46:14 0 0 1560 810116 fz5xye7n8wmm2lrlk7mwg35rv10cnxs None None 207150 0 British_Expeditionary_Force_(World_War_I) 1 0 0 0.36539 2016-12-08 06:49:25 2016-12-10 01:58:20 753619404 39169 wikitext 2003-04-07 -13.0 0.000382
2 1156784 207150 0 "Army" ---> "army". 4626 Michael Hardy 2003-04-07 23:46:48 0 0 1560 810119 ifgq1j4gwrgfar0vqau9scu1cadozhz None None 207150 0 British_Expeditionary_Force_(World_War_I) 1 0 0 0.36539 2016-12-08 06:49:25 2016-12-10 01:58:20 753619404 39169 wikitext 2003-04-07 0.0 0.000394
3 1415735 207150 0 Flander > Flanders 3295 Mintguy 2003-07-16 15:47:21 1 0 1555 1156784 9ujus8ns628sr1t0a6zedl72yhltrcn None None 207150 0 British_Expeditionary_Force_(World_War_I) 1 0 0 0.36539 2016-12-08 06:49:25 2016-12-10 01:58:20 753619404 39169 wikitext 2003-07-16 -5.0 99.667049
4 1787055 207150 0 13420 David Newton 2003-09-10 08:51:59 0 0 1566 1415735 iejrn3buz0jo3af0nbhabsin89pqcqj None None 207150 0 British_Expeditionary_Force_(World_War_I) 1 0 0 0.36539 2016-12-08 06:49:25 2016-12-10 01:58:20 753619404 39169 wikitext 2003-09-10 11.0 55.711551

Group by the "rev_date", compute the daily_revs, re-index to make the time continuous.

rev2_gb_date = rev2_df.groupby(['rev_date'])

daily_revs2 = rev2_gb_date.agg({'rev_id':pd.Series.nunique})

# Compute the first and last days revisions were observed
_min, _max = daily_revs2.index.min(), daily_revs2.index.max()

# Create a new index that is continuous over the whole date range
_ix = pd.date_range(_min,_max)

# Reindex the daily revisions over the continuous date range filling in the blanks with 0s
daily_revs2_reindexed = daily_revs2.reindex(_ix).fillna(0)

Combine both of the daily_revs series into a DataFrame by merging, renaming the columns, and filling the blank values with 0s.

daily_revs_compared_df = pd.merge(daily_revs_reindexed,daily_revs2_reindexed,
                                  how='outer',left_index=True,right_index=True)

# The left DF was American Expeditionary Forces and the right DF was British Expeditionary Force (World War I). Rename to that effect
daily_revs_compared_df = daily_revs_compared_df.rename(columns={'rev_id_x':'American Expeditionary Forces','rev_id_y':'British_Expeditionary_Force_(World_War_I)'})

# Fill the null values with 0s
daily_revs_compared_df.fillna(0,inplace=True)

# Look at the first 5 rows
daily_revs_compared_df.head()
American Expeditionary Forces British_Expeditionary_Force_(World_War_I)
2003-04-07 0.0 3.0
2003-04-08 0.0 0.0
2003-04-09 0.0 0.0
2003-04-10 0.0 0.0
2003-04-11 0.0 0.0

Plot the comparison of both series in a single chart. Use a rolling average over 60 days to smooth out the noise.

ax = daily_revs_compared_df.rolling(60).mean().plot(lw=3)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fe01c06b160>

Zoom in on the window from 2004 through 2008, which spans several World War I 90th-anniversary dates.

# Define two rough dates marking the start and stop of x axis
_min = pd.Timestamp('2004-01-01').date()
_max = pd.Timestamp('2009-01-01').date()

# Create a subset_df based on this window of dates
subset_df = daily_revs_compared_df.loc[_min:_max]

# Use a window of 28 days to compute the rolling average instead
ax = subset_df.rolling(28).mean().plot(lw=3)

# Draw a vertical lines for major campaign dates
ax.axvline(pd.Timestamp('2004-07-28').date(),c='black',lw=1,ls='--')   # 90th anniversary of the war's start in black
ax.axvline(pd.Timestamp('2006-01-01').date(),c='orange',lw=1,ls='--')  # Battle of the Somme in orange
ax.axvline(pd.Timestamp('2007-04-04').date(),c='purple',lw=1,ls='--')  # US declaration of war in purple
ax.axvline(pd.Timestamp('2008-11-11').date(),c='red',lw=1,ls='--')     # 90th anniversary of Armistice Day in red

ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fe01bebcf60>