Querying A Wikipedia's Newly Registered Accounts via SQL

by J. Nathan Matias, May 2018

Based on example code at: https://wikitech.wikimedia.org/wiki/PAWS

Queries can be tested at: https://quarry.wmflabs.org/

For a given language wikipedia, set database to LANG + "wiki_p"

List of queryable databases: https://quarry.wmflabs.org/query/278

For a list of example queries, see this page, which had been edited as late as March 2018, but may not be fully up to date: https://wikitech.wikimedia.org/wiki/Help:MySQL_queries

See key information about when to use the alternative views provided in Toolforge and when to use the original tables: https://wikitech.wikimedia.org/wiki/Help:MySQL_queries#Alternative_Views

import os, time, datetime
import pymysql
import pprint
from collections import defaultdict
import matplotlib.pyplot as plt   # Matplotlib for plotting
pp = pprint.PrettyPrinter(depth=6)

wiki_database = "eswiki_p"

conn = pymysql.connect(
    host=os.environ['MYSQL_HOST'],
    user=os.environ['MYSQL_USERNAME'],
    password=os.environ['MYSQL_PASSWORD'],
    database= wiki_database,
    charset='utf8'
)

Query Recently Registered Users

Users table documentation: https://www.mediawiki.org/wiki/Manual:User_table

## IF YOU WANT TO AVOID GETTING KEYS, WHICH TAKE UP MEMORY
## YOU CAN CONNECT USING THE LINE THAT DOESN'T RETURN COLUMN INFO.
## COLUMNS APPEAR TO BE ALPHABETICALLY SORTED, BUT NOT SURE
# with conn.cursor() as cur:

# with conn.cursor(pymysql.cursors.DictCursor) as cur:
#     cur.execute('USE ' + wiki_database + ";")
#     cur.execute("""
#         SELECT * from user 
#         WHERE user_registration > 20180501000000 LIMIT 1;""")
#     newcomer = cur.fetchone()
#     pp.pprint(newcomer)

Query Recently Registered Users with their Edits

First Step: Query the schema for the revision_userindex view on toolforge

with conn.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute('USE ' + wiki_database + ";")
    cur.execute("""SHOW FIELDS IN revision_userindex;""")
    print("======= revision_userindex fields =======")
    for field in cur.fetchall():
        print("  {0}: \t {1} {2}".format(field['Field'], field['Type'], field['Key']).expandtabs(25))
        
#     cur.execute("""SELECT * from revision_userindex LIMIT 1;""")
#     print("\n")
#     print("Example row from revision_userindex:\n")
#     pp.pprint(cur.fetchone())
======= revision_userindex fields =======
  rev_id:                 int(8) unsigned 
  rev_page:               int(8) unsigned 
  rev_text_id:            decimal(20,0) 
  rev_comment_id:         decimal(20,0) 
  rev_comment:            blob 
  rev_user:               int(5) unsigned 
  rev_user_text:          varbinary(255) 
  rev_timestamp:          varbinary(14) 
  rev_minor_edit:         tinyint(1) unsigned 
  rev_deleted:            tinyint(1) unsigned 
  rev_len:                bigint(10) unsigned 
  rev_parent_id:          int(8) unsigned 
  rev_sha1:               varbinary(32) 
  rev_content_model:      varbinary(64) 
  asrev_content_format:   varbinary(64) 

Second Step: Query Newcomer -> Revision Information

newcomer_revisions is a dict of lists where the dict key is a user_id pointing to a list of all revisions made by the editor.

IMPORTANT: on very active wikipedias, or in cases where the query goes back very far, this dict could consume very large amounts of memory. Use with caution.

newcomer_revisions = defaultdict(list)
record_counter = 0

start_time = datetime.datetime.now()


with conn.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute('USE ' + wiki_database + ";")
    cur.execute("""
        SELECT user.*, revision_userindex.* 
        FROM user 
            JOIN revision_userindex 
            ON user.user_id = revision_userindex.rev_user
        WHERE user_registration > 20180501000000 ORDER BY rev_timestamp ASC;""")
    for revision in cur.fetchall():
        newcomer_revisions[revision['user_id']].append(revision)
        record_counter += 1
        
end_time = datetime.datetime.now()    

print("Queried {0} records for {1} newcomers".format(
    record_counter, len(newcomer_revisions)))
print("Query duration: {0}".format((end_time-start_time).total_seconds()))
Queried 25070 records for 6792 newcomers
Query duration: 2640.84095

Plotting Revision Count Per Newcomer In May 2016

Querying and returning actual revisions is an inefficient way to get counts of the number of revisions per person, but I use it here to at least observe some basics about this dataset

newcomer_revision_counts = []
for user_id, revisions in newcomer_revisions.items():
    newcomer_revision_counts.append({"user_id": user_id, "revision_count": len(revisions)})
    
plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.hist([x['revision_count'] for x in newcomer_revision_counts])
plt.title("")
Text(0.5,1,'')
#list(newcomer_revisions.values())[0]