Lab 1 - Revision Histories

Professor Brian Keegan
Department of Information Science, CU Boulder
This notebook is copyrighted and made available under the Apache License v2.0.

This is the first of five lab notebooks that will explore how to do some introductory data extraction and analysis from Wikipedia data. This lab will use a combination of MySQL and Python to retrieve the revision histories for a selection of articles. You do not need to be fluent in either to complete the lab, but there are many options for extending the analyses we do here by using more advanced queries and scripting methods.

Acknowledgements
I'd like to thank the Wikimedia Foundation for the PAWS system and related Wikitech infrastructure that this workbook runs within. Yuvi Panda, Aaron Halfaker, Jonathan Morgan, and Dario Taraborelli have all provided crucial support and feedback.

Confirm basic Python commands work

a = 1
b = 2
a+b
3
list(range(0,5))
[0, 1, 2, 3, 4]
your_name = 'Brian Keegan'
print("Hello, {0}".format(your_name))
Hello, Brian Keegan

Import modules and set up the environment

Load up all the libraries we'll need to connect to the database, retrieve information for analysis, and visualize the results.

# Makes the plots appear within the notebook
%matplotlib inline

# Two fundamental packages for doing data manipulation
import numpy as np                   # http://www.numpy.org/
import pandas as pd                  # http://pandas.pydata.org/

# Two related packages for plotting data
import matplotlib.pyplot as plt      # http://matplotlib.org/
import seaborn as sb                 # https://stanford.edu/~mwaskom/software/seaborn/

# Two packages for accessing the MySQL server
import pymysql                       # http://pymysql.readthedocs.io/en/latest/
import os                            # https://docs.python.org/3.4/library/os.html

# Setup the code environment to use plots with a white background and DataFrames show more columns and rows
sb.set_style('whitegrid')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 110

Set up a connection to a Wikipedia database

Retrieve the host name, username, and (encrypted) password associated with your account so you can log in to the database.

host, user, password = os.environ['MYSQL_HOST'], os.environ['MYSQL_USERNAME'], os.environ['MYSQL_PASSWORD']

Connect to the system using the credentials from above, then select the database to use.

conn = pymysql.connect(host=host,user=user,password=password,database='trwiki_p',connect_timeout=3600)
conn.cursor().execute('use trwiki_p');

What tables are available for use?

pd.read_sql_query('show tables',conn)
Tables_in_trwiki_p
0 abuse_filter
1 abuse_filter_action
2 abuse_filter_log
3 archive
4 archive_userindex
5 category
6 categorylinks
7 change_tag
8 externallinks
9 filearchive
10 filearchive_userindex
11 flaggedimages
12 flaggedpage_config
13 flaggedpage_pending
14 flaggedpages
15 flaggedrevs
16 flaggedrevs_promote
17 flaggedrevs_statistics
18 flaggedrevs_stats
19 flaggedrevs_stats2
20 flaggedrevs_tracking
21 flaggedtemplates
22 geo_tags
23 global_block_whitelist
24 image
25 imagelinks
26 interwiki
27 ipblocks
28 ipblocks_ipindex
29 iwlinks
30 l10n_cache
31 langlinks
32 logging
33 logging_logindex
34 logging_userindex
35 mark_as_helpful
36 math
37 module_deps
38 msg_resource_links
39 oldimage
40 oldimage_userindex
41 ores_classification
42 ores_model
43 page
44 page_props
45 page_restrictions
46 pagelinks
47 pif_edits
48 povwatch_log
49 povwatch_subscribers
50 protected_titles
51 recentchanges
52 recentchanges_userindex
53 redirect
54 revision
55 revision_userindex
56 site_identifiers
57 site_stats
58 sites
59 tag_summary
60 templatelinks
61 transcode
62 updatelog
63 updates
64 user
65 user_former_groups
66 user_groups
67 user_properties
68 user_properties_anon
69 valid_tag
70 watchlist_count
71 wbc_entity_usage
72 wikilove_image_log
73 wikilove_log
s = """
        SELECT 
            *
        FROM 
            wikilove_log
        LIMIT
            5
    """
    
pd.read_sql_query(s,conn)
wll_id wll_timestamp wll_sender wll_sender_registration wll_sender_editcount wll_receiver wll_receiver_registration wll_receiver_editcount wll_type wll_subject wll_message wll_email
0 1 b'20120706215949' 389435 b'20120219234014' 514 407174 b'20120531165047' 564 b'food-baklava' b'Bir y\xc4\xb1ld\xc4\xb1z da sizin i\xc3\xa7in!' b'Dada\xc5\x9f maddesinde biriken katk\xc4\xb1... 0
1 2 b'20120706221523' 407174 b'20120531165047' 564 389435 b'20120219234014' 515 b'barnstar-special' b'Bir y\xc4\xb1ld\xc4\xb1z da sizin i\xc3\xa7in!' b'Zahmet ne kelime efendim. \xc3\x87ok te\xc5\... 0
2 3 b'20120707052346' 181110 b'20081016175229' 52479 26493 b'20060803230508' 52488 b'food-coffee' b'Bir bardak kahve de sizin i\xc3\xa7in!' b'sabah sabah iyi gider :D denemi\xc5\x9f oldu... 0
3 4 b'20120707083440' 265550 b'20100131150347' 6389 26493 b'20060803230508' 52491 b'barnstar-surreal' b'Bir y\xc4\xb1ld\xc4\xb1z da sizin i\xc3\xa7in!' b"Of be Vito! Viki'ye d\xc3\xb6nd\xc3\xbcn d\x... 0
4 5 b'20120707102220' 168116 b'20080804163912' 1526 159334 b'20080609220448' 581 b'barnstar-surreal' b'Bir y\xc4\xb1ld\xc4\xb1z da sizin i\xc3\xa7in!' b'Tebrik ederim' 0

Important note

Article page titles need to be separated by underscores "_" rather than spaces. It's easy to replace spaces with underscores using the replace method called on the string.

page_title = "2016 Türkiye askerî darbe girişimi".replace(' ','_')
page_title
'2016_Türkiye_askerî_darbe_girişimi'

Get revisions for a single article

Write a function get_page_revisions that takes the name of a Wikipedia page and its namespace and uses the database connection to return a pandas DataFrame containing all the revisions for the article. Some of the fields need to be cleaned up by converting them to other data types to help with analysis in later steps.

def get_page_revisions(page_title,namespace,conn):  # namespace: 0 for articles, 1 for talk pages
    """ Takes a page title and returns its revision history.
      page_title = a string for the page title to get its revisions
      namespace = an int to get the revisions from a specific namespace
      conn = a database connection
      
    Returns:
      A pandas DataFrame containing all the article revisions and page information
    """
    # In case you pass a page title with spaces in it, replace the spaces with underscores
    page_title = page_title.replace(' ','_')

    # The MySQL query string used to retrieve the data. By line, it is
    ## selecting all columns 
    ## from the "revisions" table
    ## joining the "page" table on it
    ## using the page_id and rev_page columns as keys
    ## limiting the results to entries that occur in the namespace and pagetitle
    ## and ordering the results by the timestamp
    s = """
            SELECT 
                *
            FROM 
                revision 
            JOIN 
                page ON page.page_id = revision.rev_page
            WHERE 
                page.page_namespace = {1} AND page.page_title = '{0}' 
            ORDER BY 
                revision.rev_timestamp ASC
        """.format(page_title,namespace)
    
    # Use the connection to run the query and return the results as a DataFrame
    _df = pd.read_sql_query(s,conn)

    # Some of the results have a "bytestring" format
    byte_columns = ['rev_comment','rev_user_text','rev_timestamp','rev_sha1','page_content_model',
                    'page_title','page_restrictions','page_touched','page_links_updated']
    
    # For each column, convert it from bytestring to a utf8 string
    for col in byte_columns:
        _df[col] = _df[col].str.decode('utf8')

    # Several of the columns are timestamps. Convert to datetimes
    _df['rev_timestamp'] = _df['rev_timestamp'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    _df['page_touched'] = _df['page_touched'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    _df['page_links_updated'] = _df['page_links_updated'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
    
    # Return the data
    return _df

Use the function to get the revisions for a single article and use the .head() method to inspect the first 5 rows in the DataFrame.

rev_df = get_page_revisions('2016 Türkiye askerî darbe girişimi',0,conn)
rev_df.head()
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/srv/paws/lib/python3.4/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1400             else:
-> 1401                 cur.execute(*args)
   1402             return cur

/srv/paws/lib/python3.4/site-packages/pymysql/cursors.py in execute(self, query, args)
    165 
--> 166         result = self._query(query)
    167         self._executed = query

/srv/paws/lib/python3.4/site-packages/pymysql/cursors.py in _query(self, q)
    321         self._last_executed = q
--> 322         conn.query(q)
    323         self._do_get_result()

/srv/paws/lib/python3.4/site-packages/pymysql/connections.py in query(self, sql, unbuffered)
    849             else:
--> 850                 sql = sql.encode(self.encoding, 'surrogateescape')
    851         self._execute_command(COMMAND.COM_QUERY, sql)

UnicodeEncodeError: 'latin-1' codec can't encode character '\u015f' in position 270: ordinal not in range(256)

During handling of the above exception, another exception occurred:

DatabaseError                             Traceback (most recent call last)
<ipython-input-25-3e9607f05551> in <module>()
----> 1 rev_df = get_page_revisions('2016 Türkiye askerî darbe girişimi',0,conn)
      2 rev_df.head()

<ipython-input-14-458f26638916> in get_page_revisions(page_title, namespace, conn)
     32 
     33     # Use the connection to run the query and return the results as a DataFrame
---> 34     _df = pd.read_sql_query(s,conn)
     35 
     36     # Some of the results have a "bytestring" format

/srv/paws/lib/python3.4/site-packages/pandas/io/sql.py in read_sql_query(sql, con, index_col, coerce_float, params, parse_dates, chunksize)
    329     return pandas_sql.read_query(
    330         sql, index_col=index_col, params=params, coerce_float=coerce_float,
--> 331         parse_dates=parse_dates, chunksize=chunksize)
    332 
    333 

/srv/paws/lib/python3.4/site-packages/pandas/io/sql.py in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
   1434 
   1435         args = _convert_params(sql, params)
-> 1436         cursor = self.execute(*args)
   1437         columns = [col_desc[0] for col_desc in cursor.description]
   1438 

/srv/paws/lib/python3.4/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1411             ex = DatabaseError(
   1412                 "Execution failed on sql '%s': %s" % (args[0], exc))
-> 1413             raise_with_traceback(ex)
   1414 
   1415     @staticmethod

/srv/paws/lib/python3.4/site-packages/pandas/compat/__init__.py in raise_with_traceback(exc, traceback)
    338         if traceback == Ellipsis:
    339             _, _, traceback = sys.exc_info()
--> 340         raise exc.with_traceback(traceback)
    341 else:
    342     # this version of raise is a syntax error in Python 3

/srv/paws/lib/python3.4/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1399                 cur.execute(*args, **kwargs)
   1400             else:
-> 1401                 cur.execute(*args)
   1402             return cur
   1403         except Exception as exc:

/srv/paws/lib/python3.4/site-packages/pymysql/cursors.py in execute(self, query, args)
    164         query = self.mogrify(query, args)
    165 
--> 166         result = self._query(query)
    167         self._executed = query
    168         return result

/srv/paws/lib/python3.4/site-packages/pymysql/cursors.py in _query(self, q)
    320         conn = self._get_db()
    321         self._last_executed = q
--> 322         conn.query(q)
    323         self._do_get_result()
    324         return self.rowcount

/srv/paws/lib/python3.4/site-packages/pymysql/connections.py in query(self, sql, unbuffered)
    848                 sql = sql.encode(self.encoding)
    849             else:
--> 850                 sql = sql.encode(self.encoding, 'surrogateescape')
    851         self._execute_command(COMMAND.COM_QUERY, sql)
    852         self._affected_rows = self._read_query_result(unbuffered=unbuffered)

DatabaseError: Execution failed on sql '
            SELECT 
                *
            FROM 
                revision 
            JOIN 
                page ON page.page_id = revision.rev_page
            WHERE 
                page.page_namespace = 0 AND page.page_title = '2016_Türkiye_askerî_darbe_girişimi' 
            ORDER BY 
                revision.rev_timestamp ASC
        ': 'latin-1' codec can't encode character '\u015f' in position 270: ordinal not in range(256)
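
The query failed because the database connection's default character set (latin-1) cannot encode the Turkish characters in the page title. A minimal fix, sketched below under the assumption that the replica server accepts a utf8 connection, is to reconnect with an explicit charset and rerun the query.

# Reconnect with an explicit utf8 character set so that non-ASCII page titles
# (e.g. Turkish characters) can be encoded into the query string
conn = pymysql.connect(host=host,user=user,password=password,database='trwiki_p',charset='utf8',connect_timeout=3600)

rev_df = get_page_revisions('2016 Türkiye askerî darbe girişimi',0,conn)
rev_df.head()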

Count the total number of revisions in the article.

len(rev_df)
117

Count the total number of unique users contributing to the article.

len(rev_df['rev_user_text'].unique())
49

When was the first revision made to the article?

rev_df['rev_timestamp'].min()
Timestamp('2007-01-31 21:42:36')

When was the most recent revision made to the article?

rev_df['rev_timestamp'].max()
Timestamp('2017-04-28 11:20:49')

How has the size of the article changed over time?

ax = rev_df['rev_len'].plot()
ax.set_xlabel('Revisions')
ax.set_ylabel('Article size (bytes)')
<matplotlib.text.Text at 0x7f2353ea75f8>

Who made the largest change to the article?

# Use the idxmax() method to return the index value for the largest value in the series
loc_max_revision = rev_df['rev_len'].idxmax()

# Show the values of a single row
rev_df.ix[loc_max_revision]
rev_id                                          14277181
rev_page                                          206379
rev_text_id                                            0
rev_comment                                             
rev_user                                           17481
rev_user_text                            Dünya vatandaşı
rev_timestamp                        2014-04-02 17:44:26
rev_minor_edit                                         0
rev_deleted                                            0
rev_len                                             9104
rev_parent_id                                   14277179
rev_sha1                 s79kro3dvgak6wyk4br7rajr88loalt
rev_content_model                                   None
rev_content_format                                  None
page_id                                           206379
page_namespace                                         0
page_title                                   Mitt_Romney
page_restrictions                                       
page_counter                                           0
page_is_redirect                                       0
page_is_new                                            0
page_random                                     0.677916
page_touched                         2017-04-28 11:20:49
page_links_updated                   2017-04-28 11:20:51
page_latest                                     18544631
page_len                                            7111
page_no_title_convert                                  0
page_content_model                              wikitext
page_lang                                           None
Name: 94, dtype: object

Go inspect what this largest revision was. Be warned that this specific example is extremely NSFW.

rev_id = rev_df.ix[loc_max_revision,'rev_id']
print('https://en.wikipedia.org/w/index.php?oldid={0}'.format(rev_id))
https://en.wikipedia.org/w/index.php?oldid=173926275

Group revisions by date

Multiple revisions can be made in a single day. We want to be able to count the number of revisions per day and track how the article's length has changed over time.

# For each timestamp extract only its date (ignoring the time of day) 
# and save as a new column "rev_date"
rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())

# Compare the full "rev_timestamp" to the new "rev_date"
rev_df[['rev_timestamp','rev_date']].head()
rev_timestamp rev_date
0 2004-01-09 03:00:24 2004-01-09
1 2004-01-09 03:01:26 2004-01-09
2 2004-01-09 03:02:28 2004-01-09
3 2004-01-09 03:57:22 2004-01-09
4 2004-01-09 06:03:31 2004-01-09

Which dates had the most revisions in a single day? A simple way to find out is to count the number of times each "rev_date" occurs in the data with the value_counts method. Let's look at the top 10, five of which are in October 2012, in the weeks before Election Day.

rev_df['rev_date'].value_counts().head(10)
2012-10-07    77
2012-10-06    73
2012-10-13    71
2012-05-13    65
2007-12-13    58
2012-07-27    54
2012-10-03    50
2012-10-05    50
2012-05-30    50
2012-07-20    49
Name: rev_date, dtype: int64

We can use the groupby functionality to create "baby" DataFrames in which every row shares the same value of the grouping variable. For example, if we group by date, each baby DataFrame contains all the revisions that occurred on that date. Store the groupby in the "rev_gb_date" object for subsequent steps.

Inspect a single baby DataFrame by passing the key (the date values we grouped on) to the get_group method and look at the first 5 entries.

rev_gb_date = rev_df.groupby(['rev_date'])

_date = pd.Timestamp('2012-10-07').date()
rev_gb_date.get_group(_date).head(5)
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date
9915 516397493 426208 0.0 /* Political positions */ ce, improve prose 11243047 GabeMc 2012-10-07 00:11:21 0 0 215302.0 516387652 sylmwj2deuw63nk04da8uv1u2bgt8t3 None None 426208 0 Mitt_Romney 0 0 0 0.487268 2016-12-15 02:42:57 2016-12-15 02:43:34 754102119 256456 wikitext 2012-10-07
9916 516398755 426208 0.0 /* Political positions */ ce, avoid redundancy... 11243047 GabeMc 2012-10-07 00:22:12 0 0 215311.0 516397493 lsh4gi37dd000s67qsulfv35uf0i16i None None 426208 0 Mitt_Romney 0 0 0 0.487268 2016-12-15 02:42:57 2016-12-15 02:43:34 754102119 256456 wikitext 2012-10-07
9917 516403792 426208 0.0 clarify that church positions were not full time 61960 Anythingyouwant 2012-10-07 01:06:39 0 0 215338.0 516398755 pe5aabxcd8lvll3wv9y5c3n8bbhrc51 None None 426208 0 Mitt_Romney 0 0 0 0.487268 2016-12-15 02:42:57 2016-12-15 02:43:34 754102119 256456 wikitext 2012-10-07
9918 516405141 426208 0.0 /* University, France mission, marriage, and c... 61960 Anythingyouwant 2012-10-07 01:19:16 0 0 215346.0 516403792 d5lo1cm9nzg2oh7ply53xj9hdq7cua0 None None 426208 0 Mitt_Romney 0 0 0 0.487268 2016-12-15 02:42:57 2016-12-15 02:43:34 754102119 256456 wikitext 2012-10-07
9919 516405398 426208 0.0 /* University, France mission, marriage, and c... 61960 Anythingyouwant 2012-10-07 01:21:48 0 0 215357.0 516405141 2mbiudrlxffpz90zx4pi5cq33xywkpb None None 426208 0 Mitt_Romney 0 0 0 0.487268 2016-12-15 02:42:57 2016-12-15 02:43:34 754102119 256456 wikitext 2012-10-07

Compute unique revisions per day

We can also use aggregate functionality to perform computations efficiently across all the baby DataFrames.

daily_revs = rev_gb_date.agg({'rev_id':pd.Series.nunique})
daily_revs.head() 
rev_id
rev_date
2004-01-10 1
2004-01-25 1
2004-03-07 1
2004-03-17 1
2004-04-05 1

Reindex the axis to make it continuous over the date range.

# Compute the first and last days revisions were observed
_min, _max = daily_revs.index.min(), daily_revs.index.max()

# Create a new index that is continuous over the whole date range
_ix = pd.date_range(_min,_max)

# Reindex the daily revisions over the continuous date range filling in the blanks with 0s
daily_revs_reindexed = daily_revs.reindex(_ix).fillna(0)
daily_revs_reindexed.head()
rev_id
2004-01-10 1.0
2004-01-11 0.0
2004-01-12 0.0
2004-01-13 0.0
2004-01-14 0.0

What was the most active day of editing?

daily_revs_reindexed['rev_id'].sort_values(ascending=False).head(5)
2012-10-07    77.0
2012-10-06    73.0
2012-10-13    71.0
2012-05-13    65.0
2007-12-13    58.0
Name: rev_id, dtype: float64

Plot the number of revisions by day.

ax = daily_revs_reindexed['rev_id'].plot(lw=.5)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fda55fb0550>

This is pretty noisy. We can clean it up by applying a rolling mean over a 60-day window.

smoothed_s = daily_revs_reindexed['rev_id'].rolling(60).mean()
ax = smoothed_s.plot(lw=3)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fda53ab74a8>

Plot the cumulative sum of the number of revisions over time using the cumsum method.

ax = daily_revs_reindexed['rev_id'].cumsum().plot(lw=2)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative Revisions')
<matplotlib.text.Text at 0x7fda55fc6390>

Compute article length by day

Using the same "rev_gb_date" object, aggregate each of the baby DataFrames and compute the median article size ("rev_len"). Reindex again to make a continuous time axis.

daily_length = rev_gb_date.agg({'rev_len':np.median})

# Reindex
daily_length_reindexed = daily_length.reindex(_ix).fillna(0).diff()
daily_length_reindexed.tail()
rev_len
2016-12-06 0.0
2016-12-07 256512.0
2016-12-08 150.0
2016-12-09 -74.0
2016-12-10 -132.0
daily_length = rev_gb_date.agg({'rev_len':np.median})

# Reindex
daily_length_reindexed = daily_length.reindex(_ix).fillna(method='ffill')
daily_length_reindexed.head()
rev_len
2004-01-10 237.0
2004-01-11 237.0
2004-01-12 237.0
2004-01-13 237.0
2004-01-14 237.0

Plot the reindexed series, putting the y-axis on a log scale to pick up detail in the early days when the article was orders of magnitude smaller.

ax = daily_length_reindexed['rev_len'].plot(lw=2,logy=True)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Article size  (bytes)')
<matplotlib.text.Text at 0x7fda539126d8>

Compute diffs on time and article size

The "rev_len" only records the length of the revision at that point in time. By taking the difference with the previous observation, we can compute how much net content the user added or removed from the previous revision. Using the diff method on the "rev_len" series, we store the resulting differences as "rev_len_diff".

Compare the first five elements of the two series to verify it's worked as intended.

# Compute the diff and store as a new series
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()
rev_len rev_len_diff
0 237.0 NaN
1 369.0 132.0
2 972.0 603.0
3 1139.0 167.0
4 1152.0 13.0

Do some more cleanup of the diffs. In particular, when a revision is deleted (which is rare) the "rev_len" value is null, which makes the diff null as well. Forward-fill the null "rev_len" values, recompute the diffs, and set the first revision's "rev_len_diff" equal to the size of the article.

# Set the revision lengths for the null rows to be the previous value
rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')

# Compute the diffs again 
rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

# Set the value for the first rev_len_diff to be the size of the rev_len
rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']

# Verify the differences are accurate
rev_df[['rev_len','rev_len_diff']].head()
rev_len rev_len_diff
0 237.0 237.0
1 369.0 132.0
2 972.0 603.0
3 1139.0 167.0
4 1152.0 13.0

Similarly, the "rev_timestamp" series only records the time when the revision occurred. By taking the difference with the previous revision, we can compute how much time has elapsed since the last edit. Using the diff method again on the "rev_timestamp" series, we store the resulting differences as "ref_timestamp_diff".

The resulting values are Timedeltas, which we convert to floats (number of days) by dividing by a one-day np.timedelta64.

Compare the first five elements of the two series to verify it's worked as intended.

# Compute the diff and store as a new series
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

# Convert the values of the new series to floats by dividing by 1 day time delta.
rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))

# Verify the differences are accurate.
rev_df[['rev_timestamp','rev_timestamp_diff']].head()
rev_timestamp rev_timestamp_diff
0 2004-01-10 07:37:48 NaN
1 2004-01-25 21:23:55 15.573692
2 2004-03-07 22:46:47 42.057546
3 2004-03-17 08:37:03 9.409907
4 2004-04-05 06:17:41 18.903218

Combine all the computations with one daily DataFrame

Because we added new series to the "rev_df" we need to re-run the groupby to update it. We can also combine all the different aggregate functions we did before into one step by defining an "agg_dict" that applies an aggregation function over the values in a specific column. This will return a DataFrame with only a subset of columns with the aggregated values we specified.

# Groupby again
rev_gb_date = rev_df.groupby(['rev_date'])

# Define an aggregation mapping of functions to columns names
agg_dict_date = {'rev_id':pd.Series.nunique,
                  'rev_len':np.median,
                  'rev_len_diff':np.median,
                  'rev_timestamp_diff':np.mean
                 }

# Execute the aggregation functions on the groupby object
rev_agg_date = rev_gb_date.agg(agg_dict_date)

# Reindex to continuous dates
rev_agg_date = rev_agg_date.reindex(_ix)

# Fill in the resulting values of the reindexed and aggregated "rev_len" from the last value
rev_agg_date[['rev_len']] = rev_agg_date[['rev_len']].fillna(method='ffill')

# Fill in the resulting values of the reindexed and aggregated other series as 0s.
rev_agg_date[['rev_id','rev_len_diff']] = rev_agg_date[['rev_id','rev_len_diff']].fillna(0)
#rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']] = rev_agg_date[['rev_id','rev_len_diff','rev_timestamp_diff']].fillna(0)

# Show the last 10 values to verify it's worked as intended
rev_agg_date.tail(10)
rev_len rev_id rev_len_diff rev_timestamp_diff
2016-12-01 256701.0 5.0 0.0 0.322317
2016-12-02 256414.0 1.0 13.0 1.864572
2016-12-03 256414.0 0.0 0.0 NaN
2016-12-04 256414.0 0.0 0.0 NaN
2016-12-05 256414.0 0.0 0.0 NaN
2016-12-06 256414.0 0.0 0.0 NaN
2016-12-07 256512.0 13.0 60.0 0.389046
2016-12-08 256662.0 8.0 0.0 0.005595
2016-12-09 256588.0 10.0 -25.0 0.175712
2016-12-10 256456.0 3.0 -2.0 0.376655

Plot the changes in the average time since the last revision by day.

_s = rev_agg_date['rev_timestamp_diff']
ax = _s.plot(logy=True)
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')
<matplotlib.text.Text at 0x7fda53b8a630>
/srv/paws/lib/python3.4/site-packages/matplotlib/scale.py:101: RuntimeWarning: invalid value encountered in less_equal
  a[a <= 0.0] = 1e-300

That's pretty noisy. We can apply the rolling method over a 60-day window and get the mean to smooth out some of the noise.

_s = rev_agg_date['rev_timestamp_diff'].fillna(0).rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Average time since last edit (Days)')
<matplotlib.text.Text at 0x7fda55cbe1d0>

Similarly, what was the median size of a change to the article over time?

_s = rev_agg_date['rev_len_diff']
ax = _s.plot(lw=.5)
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')
ax.set_yscale('symlog')

Again, pretty noisy. We can clean this up with a rolling mean over a 60-day window.

_s = rev_agg_date['rev_len_diff'].rolling(60).mean()
ax = _s.plot()
ax.set_xlabel('Date')
ax.set_ylabel('Median size of revision')
<matplotlib.text.Text at 0x7fda534c4c88>

Top editors on the article

Which editors made the most contributions to the article?

top_users = rev_df['rev_user_text'].value_counts()
top_users.head(10)
Wasted Time R              1394
Anythingyouwant             754
GabeMc                      421
Notmyrealname               196
Yellowdesk                  190
Hodgdon's secret garden     152
Dezastru                    133
72Dino                      126
Cwobeel                     113
Eustress                    102
Name: rev_user_text, dtype: int64

This extremely skewed distribution of activity is characteristic of many online participation patterns: a "long tail" in which a small fraction of users is responsible for the large majority of contributions.
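
We can put a rough number on this concentration (a quick sketch using the top_users series computed above): what share of all revisions comes from the most active 20% of editors?

# top_users is sorted in descending order, so the first rows are the most active editors
n_top = int(len(top_users) * 0.2)
top_share = top_users.iloc[:n_top].sum() / top_users.sum()
print('The most active 20% of editors made {0:.1%} of all revisions'.format(top_share))

We can also tabulate how many editors made exactly one revision, exactly two revisions, and so on.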

rev_count_occurrences = top_users.value_counts().reset_index()
rev_count_occurrences = rev_count_occurrences.rename(columns={'index':'Revision count','rev_user_text':'Occurrences'})
rev_count_occurrences.head()
Revision count Occurrences
0 1 1676
1 2 407
2 3 165
3 4 121
4 5 69

Plot the resulting distribution on log-log axes to see the shape of the long-tail distribution.

rev_count_occurrences.plot(x='Revision count',y='Occurrences',kind='scatter',logy=True,logx=True,s=50)
<matplotlib.axes._subplots.AxesSubplot at 0x7fda5170e048>

Group revisions by users

We can also use the groupby-aggregate framework to group all the revisions together by the users who made them, rather than grouping revisions by the date as we did before.

rev_gb_user = rev_df.groupby(['rev_user_text'])

Look at a baby DataFrame within the groupby for a single user. These are all the revisions made by this user to this article.

rev_gb_user.get_group('Eustress').head()
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date rev_len_diff rev_timestamp_diff
5002 186165588 426208 0.0 /* Campaign for United States President, 2008 ... 5781611 Eustress 2008-01-22 18:53:17 0 0 62602.0 186147853 3s2mhxwf7zre265xnoclini8x2j5vuu None None 426208 0 Mitt_Romney 0 0 0 0.487268 2016-12-15 02:42:57 2016-12-15 02:43:34 754102119 256456 wikitext 2008-01-22 32.0 0.059618
5003 186166154 426208 0.0 /* Religious background */ Citation or not, th... 5781611 Eustress 2008-01-22 18:56:07 0 0 62258.0 186165588 kb0myuxejjcq13abgth1qu1n91uje8c None None 426208 0 Mitt_Romney 0 0 0 0.487268 2016-12-15 02:42:57 2016-12-15 02:43:34 754102119 256456 wikitext 2008-01-22 -344.0 0.001968
5278 191953237 426208 0.0 /* Religious background */ Neither of the sour... 5781611 Eustress 2008-02-16 23:44:01 0 0 65452.0 191953125 39djnfp4f0y7d3d01gd2r726ilh5m98 None None 426208 0 Mitt_Romney 0 0 0 0.487268 2016-12-15 02:42:57 2016-12-15 02:43:34 754102119 256456 wikitext 2008-02-16 -59.0 0.000440
5600 259363469 426208 0.0 [[WP:UNDO|Undid]] revision 259358412 by [[Spec... 5781611 Eustress 2008-12-21 16:32:31 0 0 70668.0 259358412 cp87oft75dawqa62dszytdtkh8biu48 None None 426208 0 Mitt_Romney 0 0 0 0.487268 2016-12-15 02:42:57 2016-12-15 02:43:34 754102119 256456 wikitext 2008-12-21 6.0 0.026030
5610 263025217 426208 0.0 /* External links */ Removed {{Private equity ... 5781611 Eustress 2009-01-09 19:19:24 0 0 70376.0 262816306 5132oqqvmva8v1e82r07k2ctctq672k None None 426208 0 Mitt_Romney 0 0 0 0.487268 2016-12-15 02:42:57 2016-12-15 02:43:34 754102119 256456 wikitext 2009-01-09 -72.0 0.961887

As before, we can aggregate all the information in these baby DataFrames into a user-level DataFrame that summarizes different attributes.

# Define the aggregation function mappings to variables.
# We can apply multiple functions to the same column/series by nesting a second dictionary within
agg_dict_user = {'rev_id':{'revs_made':pd.Series.nunique},
                  'rev_timestamp':{'first_rev':np.min,'last_rev':np.max},
                  'rev_len_diff':{'rev_len_min':np.min,'rev_len_max':np.max,
                                  'rev_len_med':np.median,'rev_len_sum':np.sum},
                  'rev_timestamp_diff':{'rev_ts_min':np.min,'rev_ts_max':np.max,
                                        'rev_ts_avg':np.mean}
                 }

# Perform the aggregation
rev_agg_user = rev_gb_user.agg(agg_dict_user)

# Returns a multi-index column, simplify by dropping the top level
rev_agg_user.columns = rev_agg_user.columns.droplevel(0)

# Sort by number of revisions, and look at the first 5 rows
rev_agg_user.sort_values('revs_made',ascending=False).head()
revs_made rev_len_max rev_len_min rev_len_sum rev_len_med last_rev first_rev rev_ts_max rev_ts_avg rev_ts_min
rev_user_text
Wasted Time R 1394 6293.0 -22982.0 41593.0 4.5 2016-05-07 00:41:14 2007-09-23 12:14:47 18.718762 0.254017 0.000370
Anythingyouwant 754 18098.0 -18098.0 -2653.0 2.0 2016-08-17 18:22:41 2007-08-25 23:08:07 11.942674 0.121582 0.000359
GabeMc 421 393.0 -286.0 2632.0 0.0 2012-10-25 02:40:46 2012-05-06 00:57:07 1.225752 0.028450 0.000313
Notmyrealname 196 2123.0 -4377.0 -10955.0 -14.0 2008-04-16 02:04:28 2006-08-01 03:27:55 1.837187 0.117915 0.000035
Yellowdesk 190 1359.0 -2746.0 7755.0 19.0 2012-09-25 15:55:01 2006-10-23 23:18:33 2.608600 0.128679 0.000127

Define some additional user-level statistics based on this aggregated data.

Tenure on article

# Compute tenure as difference between last and first revision, convert back to 1-day floats by dividing by timedelta64
rev_agg_user['tenure'] = (rev_agg_user['last_rev'] - rev_agg_user['first_rev'])/np.timedelta64(1,'D')

# Round values up to whole integers
rev_agg_user['tenure'] = rev_agg_user['tenure'].apply(np.ceil)

# Look at the longest-tenure editors
rev_agg_user['tenure'].sort_values(ascending=False).head()
rev_user_text
BD2412             3795.0
GoodDay            3694.0
Anythingyouwant    3280.0
Gadget850          3181.0
Johnpacklambert    3179.0
Name: tenure, dtype: float64
ax = rev_agg_user['tenure'].plot('hist',logy=True,bins=25)
ax.set_xlabel('Tenure (Days)')
ax.set_ylabel('Number of editors')
<matplotlib.text.Text at 0x7fda5144d550>
logspace_bins = np.logspace(start=0, stop=4, num=25)
ax = rev_agg_user['tenure'].plot('hist',logy=True,logx=True,bins=logspace_bins)
ax.set_xlabel('Tenure (Days)')
ax.set_ylabel('Number of editors')
<matplotlib.text.Text at 0x7fda534a2978>
g = sb.lmplot(x='tenure',y='revs_made',data=rev_agg_user.query('tenure > 1'),aspect=1.5)
ax = g.axes[0][0]
ax.set_xlim((0,1e4))
ax.set_ylim((0,1e4))
ax.set_xscale('symlog')
ax.set_yscale('symlog')

Get user info

Define a function get_user_info that takes a revision DataFrame containing a "rev_user" column/series. For all the users in the revision history, get their user information like registration date and current edit count.

def get_user_info(rev_df,conn):
    if 'rev_user' in rev_df.columns:
        
        # Get all the user IDs from the revision DF
        _ids = rev_df['rev_user'].unique().tolist() 
        
        # Remove the 0s and convert the rest to strings
        _ids = [str(i) for i in _ids if i != 0]
        
        # Join the list of strings into a big comma-separated string
        user_string = ', '.join(_ids)
        
        # Write the query
        ## Select specific columns
        ## from the user table
        ## where the user_ids are among the users revising the article
        s = """
                SELECT 
                    user_id, user_name, user_registration, user_editcount
                FROM 
                    user
                WHERE
                    user_id IN ({0})
                ORDER BY
                    user_editcount DESC
            """.format(user_string)

        # Run the query
        _df = pd.read_sql_query(s,conn)
        
        # Some of the results have a "bytestring" format
        byte_columns = ['user_name','user_registration']
    
        # For each column, convert it from bytestring to a utf8 string
        for col in byte_columns:
            _df[col] = _df[col].str.decode('utf8')

        # Some of the values are missing, recode them as Jan 1, 2000
        _df['user_registration'] = _df['user_registration'].replace({None:'20000101000000'})
        
        # Convert timestamps to datetimes
        _df['user_registration'] = _df['user_registration'].apply(lambda x:pd.datetime.strptime(x,'%Y%m%d%H%M%S'))
        
        # Return the results
        return _df

Run the get_user_info function on the "rev_df" DataFrame to get information about each of the users who contributed to the article.

# Refresh the database connection since too much time has passed since the last query
conn.ping()
conn.cursor().execute('use enwiki_p');

# Get the user information.
rev_user_info_df = get_user_info(rev_df,conn)
rev_user_info_df.head()
user_id user_name user_registration user_editcount
0 1215485 Cydebot 2006-04-07 01:24:50 5573534
1 7328338 Yobot 2008-06-17 22:33:17 4644858
2 13286072 ClueBot NG 2010-10-20 17:01:19 4230275
3 433328 SmackBot 2005-09-13 22:47:59 3734324
4 6569922 Addbot 2008-03-01 21:58:03 2838809

Merge the "rev_agg_user" and the "rev_user_info_df" DataFrames together by joining on the user name.

# Merge the two DataFrames together using the index of "reg_agg_user"
# and the "user_name" field from rev_user_info_df, joining on the left values which should be more inclusive
merged_rev_agg_user = pd.merge(rev_agg_user,rev_user_info_df,left_index=True,right_on='user_name',how='left')

# Check if the two DataFrames have the same number of users, if False then there will be NaNs
print(len(rev_agg_user) == len(rev_user_info_df))

# Show the first five rows
merged_rev_agg_user.head()
False
revs_made rev_len_max rev_len_min rev_len_sum rev_len_med last_rev first_rev rev_ts_max rev_ts_avg rev_ts_min tenure user_id user_name user_registration user_editcount
641 1 0.0 0.0 0.0 0.0 2016-07-29 04:45:04 2016-07-29 04:45:04 3.742072 3.742072 3.742072 0.0 22197028.0 0xF8E8 2014-08-14 03:02:27 6674.0
1718 1 3.0 3.0 3.0 3.0 2012-07-27 18:21:26 2012-07-27 18:21:26 0.000405 0.000405 0.000405 0.0 NaN 108.16.207.9 NaT NaN
1209 4 2468.0 28.0 7021.0 2262.5 2012-09-09 02:29:06 2012-08-31 03:06:10 0.235266 0.070113 0.002083 9.0 12697706.0 111Alpha 2010-07-08 14:42:31 243.0
1718 1 42.0 42.0 42.0 42.0 2012-07-22 12:09:15 2012-07-22 12:09:15 0.291343 0.291343 0.291343 0.0 NaN 117.18.231.7 NaT NaN
1718 1 -112.0 -112.0 -112.0 -112.0 2012-07-25 01:35:02 2012-07-25 01:35:02 0.010289 0.010289 0.010289 0.0 NaN 119.11.151.33 NaT NaN

What fraction of each user's total edits is devoted to this single article?

merged_rev_agg_user['rev_fraction'] = merged_rev_agg_user['revs_made']/merged_rev_agg_user['user_editcount']

# Plot the distribution. Use log-bins since the data is so skewed
logspace_bins = np.logspace(start=-7, stop=0, num=25)
ax = merged_rev_agg_user['rev_fraction'].plot('hist',logy=False,logx=True,bins=logspace_bins)
ax.set_xlabel('Fraction of total edits')
ax.set_ylabel('Count')
<matplotlib.text.Text at 0x7fda5630db38>

How long after a user's account registration did they make their first edit to this article?

merged_rev_agg_user['first_rev_account_age'] = merged_rev_agg_user['first_rev'] - merged_rev_agg_user['user_registration']
merged_rev_agg_user['first_rev_account_age'] = merged_rev_agg_user['first_rev_account_age']/np.timedelta64(1,'D')

# Plot the distribution. Use log-bins since the data is so skewed
logspace_bins = np.logspace(start=-1, stop=4, num=25)
ax = merged_rev_agg_user['first_rev_account_age'].plot('hist',logy=False,logx=True,bins=logspace_bins)
ax.set_xlabel('Account age at first article revision')
ax.set_ylabel('Count')
<matplotlib.text.Text at 0x7fda53558240>

Do users editing the article earlier in their career end up contributing more to the article?

g = sb.lmplot(x='first_rev_account_age',y='revs_made',data=merged_rev_agg_user.query('tenure > 1'),aspect=1.5)
ax = g.axes[0][0]
ax.set_xlim((0,1e4))
ax.set_ylim((0,1e4))
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.set_xlabel('Account age at first article revision')
ax.set_ylabel('Total article revisions')
<matplotlib.text.Text at 0x7fda56c0b668>

Compare two articles

# Refresh the database connection since too much time has passed since the last query
conn.ping()
conn.cursor().execute('use enwiki_p');

# Get the second set of revisions
rev2_df = get_page_revisions('John McCain',0,conn)

Do the data cleanup and feature engineering for later steps. Define a function to do all the feature engineering steps with one call.

def data_cleanup(rev_df):
    # Compute the rev_date from the timestamps
    rev_df['rev_date'] = rev_df['rev_timestamp'].apply(lambda x:x.date())

    # Set the revision lengths for the null rows to be the previous value
    rev_df['rev_len'] = rev_df['rev_len'].fillna(method='ffill')

    # Compute the diffs and store as a new series 
    rev_df['rev_len_diff'] = rev_df['rev_len'].diff()

    # Set the value for the first rev_len_diff to be the size of the rev_len
    rev_df.loc[0,'rev_len_diff'] = rev_df.loc[0,'rev_len']

    # Compute the diffs and store as a new series
    rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp'].diff()

    # Convert the values of the new series to floats by dividing by 1 day time delta.
    rev_df['rev_timestamp_diff'] = rev_df['rev_timestamp_diff'].apply(lambda x:x/np.timedelta64(1,'D'))
    
    return rev_df

Run the cleanup on rev2_df and confirm it has all the variables (columns) we need.

rev2_df = data_cleanup(rev2_df)
rev2_df.head()
rev_id rev_page rev_text_id rev_comment rev_user rev_user_text rev_timestamp rev_minor_edit rev_deleted rev_len rev_parent_id rev_sha1 rev_content_model rev_content_format page_id page_namespace page_title page_restrictions page_counter page_is_redirect page_is_new page_random page_touched page_links_updated page_latest page_len page_content_model rev_date rev_len_diff rev_timestamp_diff
0 27424 43715 0 * 0 128.12.181.70 2002-03-11 01:34:10 1 0 29 0 qfczsomdfu7wdhck9unyvohs0op9rgh None None 43715 0 John_McCain 288 0 0 0.526428 2016-12-15 19:00:52 2016-12-15 04:09:18 754904803 197328 wikitext 2002-03-11 29.0 NaN
1 27841 43715 0 * 0 128.12.181.70 2002-03-11 01:34:45 0 0 37 27424 4vswcxsbepcaoum3ake4zd6pmynqjrt None None 43715 0 John_McCain 288 0 0 0.526428 2016-12-15 19:00:52 2016-12-15 04:09:18 754904803 197328 wikitext 2002-03-11 8.0 0.000405
2 379846 43715 0 still a stub 0 140.247.197.26 2002-03-11 12:26:34 0 0 272 27841 0tjumke88mrriwguue9cfwz5wghj94o None None 43715 0 John_McCain 288 0 0 0.526428 2016-12-15 19:00:52 2016-12-15 04:09:18 754904803 197328 wikitext 2002-03-11 235.0 0.452650
3 379849 43715 0 Added Brief Historical overview. 4956 Oscar Arias 2002-10-23 16:03:48 0 0 1197 379846 fibxhtkmlxca5jpgmd35awojd70jfm2 None None 43715 0 John_McCain 288 0 0 0.526428 2016-12-15 19:00:52 2016-12-15 04:09:18 754904803 197328 wikitext 2002-10-23 925.0 226.150856
4 379854 43715 0 4956 Oscar Arias 2002-10-23 16:05:04 0 0 1205 379849 9prvxhxo01fum80jormhbky1dwo68zi None None 43715 0 John_McCain 288 0 0 0.526428 2016-12-15 19:00:52 2016-12-15 04:09:18 754904803 197328 wikitext 2002-10-23 8.0 0.000880

Group by the "rev_date", compute the daily_revs, re-index to make the time continuous.

rev2_gb_date = rev2_df.groupby(['rev_date'])

daily_revs2 = rev2_gb_date.agg({'rev_id':pd.Series.nunique})

# Compute the first and last days revisions were observed
_min, _max = daily_revs2.index.min(), daily_revs2.index.max()

# Create a new index that is continuous over the whole date range
_ix = pd.date_range(_min,_max)

# Reindex the daily revisions over the continuous date range filling in the blanks with 0s
daily_revs2_reindexed = daily_revs2.reindex(_ix).fillna(0)

Combine both of the daily_revs series into a DataFrame by merging, renaming the columns, and filling the blank values with 0s.

daily_revs_compared_df = pd.merge(daily_revs_reindexed,daily_revs2_reindexed,
                                  how='outer',left_index=True,right_index=True)

# The left DF was Mitt Romney and the right DF was John McCain. Rename to that effect
daily_revs_compared_df = daily_revs_compared_df.rename(columns={'rev_id_x':'Mitt Romney','rev_id_y':'John McCain'})

# Fill the null values with 0s
daily_revs_compared_df.fillna(0,inplace=True)

# Look at the first 5 rows
daily_revs_compared_df.head()
Mitt Romney John McCain
2002-03-11 0.0 3.0
2002-03-12 0.0 0.0
2002-03-13 0.0 0.0
2002-03-14 0.0 0.0
2002-03-15 0.0 0.0

Compare both series in a single chart. First inspect the raw daily counts over the 2008 election year, then plot them (see the sketch after the table below); a rolling average over a 60-day window smooths out the noise.

daily_revs_compared_df.ix['2008-01-01':'2009-01-01']#.mean()#.plot(lw=3)

# Label the axes
#ax.set_xlabel('Date')
#ax.set_ylabel('Revisions')
Mitt Romney John McCain
2008-01-01 41.0 16.0
2008-01-02 17.0 2.0
2008-01-03 4.0 16.0
2008-01-04 18.0 9.0
2008-01-05 4.0 12.0
2008-01-06 3.0 10.0
2008-01-07 20.0 8.0
2008-01-08 36.0 16.0
2008-01-09 21.0 51.0
2008-01-10 3.0 11.0
2008-01-11 11.0 13.0
2008-01-12 3.0 0.0
2008-01-13 6.0 0.0
2008-01-14 20.0 0.0
2008-01-15 17.0 8.0
2008-01-16 15.0 11.0
2008-01-17 16.0 12.0
2008-01-18 4.0 12.0
2008-01-19 8.0 21.0
2008-01-20 4.0 20.0
2008-01-21 4.0 0.0
2008-01-22 12.0 1.0
2008-01-23 7.0 5.0
2008-01-24 7.0 9.0
2008-01-25 8.0 6.0
2008-01-26 5.0 3.0
2008-01-27 3.0 2.0
2008-01-28 7.0 4.0
2008-01-29 9.0 10.0
2008-01-30 2.0 48.0
2008-01-31 6.0 25.0
2008-02-01 10.0 15.0
2008-02-02 6.0 33.0
2008-02-03 12.0 27.0
2008-02-04 12.0 18.0
2008-02-05 13.0 18.0
2008-02-06 12.0 32.0
2008-02-07 48.0 16.0
2008-02-08 12.0 15.0
2008-02-09 3.0 57.0
2008-02-10 10.0 17.0
2008-02-11 8.0 15.0
2008-02-12 4.0 1.0
2008-02-13 8.0 13.0
2008-02-14 25.0 9.0
2008-02-15 9.0 2.0
2008-02-16 26.0 9.0
2008-02-17 11.0 13.0
2008-02-18 10.0 8.0
2008-02-19 16.0 7.0
2008-02-20 5.0 12.0
2008-02-21 11.0 76.0
2008-02-22 9.0 48.0
2008-02-23 4.0 24.0
2008-02-24 8.0 23.0
... ... ...
2008-11-08 0.0 2.0
2008-11-09 0.0 3.0
2008-11-10 2.0 4.0
2008-11-11 1.0 13.0
2008-11-12 2.0 3.0
2008-11-13 4.0 6.0
2008-11-14 0.0 1.0
2008-11-15 2.0 0.0
2008-11-16 3.0 0.0
2008-11-17 1.0 0.0
2008-11-18 0.0 33.0
2008-11-19 2.0 2.0
2008-11-20 1.0 2.0
2008-11-21 0.0 3.0
2008-11-22 0.0 3.0
2008-11-23 0.0 5.0
2008-11-24 1.0 2.0
2008-11-25 0.0 2.0
2008-11-26 0.0 1.0
2008-11-27 1.0 1.0
2008-11-28 0.0 8.0
2008-11-29 2.0 6.0
2008-11-30 4.0 9.0
2008-12-01 0.0 1.0
2008-12-02 0.0 1.0
2008-12-03 0.0 6.0
2008-12-04 0.0 1.0
2008-12-05 0.0 0.0
2008-12-06 0.0 1.0
2008-12-07 0.0 0.0
2008-12-08 0.0 0.0
2008-12-09 3.0 0.0
2008-12-10 0.0 0.0
2008-12-11 0.0 0.0
2008-12-12 0.0 0.0
2008-12-13 1.0 0.0
2008-12-14 0.0 0.0
2008-12-15 0.0 0.0
2008-12-16 0.0 0.0
2008-12-17 0.0 2.0
2008-12-18 0.0 1.0
2008-12-19 0.0 12.0
2008-12-20 0.0 3.0
2008-12-21 2.0 0.0
2008-12-22 2.0 0.0
2008-12-23 0.0 0.0
2008-12-24 1.0 1.0
2008-12-25 0.0 0.0
2008-12-26 0.0 1.0
2008-12-27 0.0 2.0
2008-12-28 0.0 0.0
2008-12-29 0.0 0.0
2008-12-30 0.0 1.0
2008-12-31 1.0 7.0
2009-01-01 2.0 3.0

367 rows × 2 columns
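
Below is a sketch of the smoothed comparison described above: a 60-day rolling mean of the daily revisions for both articles on the same axes.

# Smooth both series with a 60-day rolling mean and plot them together
ax = daily_revs_compared_df.rolling(60).mean().plot(lw=3)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions (60-day rolling mean)')

We can also plot the cumulative number of revisions over the 2008 election year for both articles.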

ax = daily_revs_compared_df.ix['2008-01-01':'2009-01-01'].cumsum().plot(lw=3)

# Label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Revisions')
<matplotlib.text.Text at 0x7fda50e359e8>