Wikipedia Edits: Burstiness Analysis

Christian Bouwense

Introduction

The world can be thought of as a set of systems. These consist of components and interactions between them. Think of two people friending each other on Facebook, or internet routers exchanging packets of information. On the surface, these systems may seem very different, but the underlying principle of actors and actions is the same.

If we can understand the behavior of these systems - what they have in common and what they do not - we could respond to our world more efficiently. We could have an algorithm that allocates FEMA funds more accurately based on hurricane patterns before a disaster strikes, or that tells us when to administer which type of treatment based on cancer metabolization patterns.

What is Burstiness?

This is not a new idea. Mathematicians have tried to model systems like these for decades. However, classical statistics have not given very good results. The usual assumption is that human interaction is random, and if that were the case we should be able to model interactions with random (Poisson) statistical models. Unfortunately, the world is not that simple, and different systems behave in very dissimilar ways.

Fortunately, there is a newer way of modelling system behavior: burstiness. Burstiness is a measure of how regular, random, or "bursty" the interactions in a system are. If system behavior is "bursty," there are long periods of inactivity followed by short bursts of activity. On the opposite side of the spectrum, systems whose interactions are very regularly timed are "anti-bursty," such as the time between heartbeats at rest. The middle of the scale is completely random behavior, such as the time between wins on a slot machine.
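
To make the three regimes a bit more concrete, here is a tiny sketch (the numbers are made up purely for illustration) of what the gaps between events might look like in each case:

# Invented interevent times (in seconds) illustrating the three regimes.
regular_gaps = [60, 61, 59, 60, 60, 61]          # anti-bursty: resting heartbeats
random_gaps  = [12, 95, 31, 160, 7, 88]          # random: waits between slot machine wins
bursty_gaps  = [2, 3, 1, 4, 7200, 2, 1, 10800]   # bursty: rapid activity, then long silences

for name, gaps in [('regular', regular_gaps), ('random', random_gaps), ('bursty', bursty_gaps)]:
    print(name, gaps)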

Methodology

We can do better than just describing burstiness in English: the paper by Barabási that introduced the idea of burstiness also provides mathematical ways of measuring it.

For this experiment, we will study the system of Wikipedia article editing. This is a very interesting case because it involves both human and non-human interaction (bots can make edits as well), across many different subjects with varying degrees of public interest. Wikipedia also has a ton of data.

In order to measure the burstiness of Wikipedia editing (I'll be using "edits" and "revisions" interchangeably), we need a program to download the data and then measure its burstiness. For this I'll be using Python, notably the mwapi module, which lets you query Wikipedia's web API.

The Code

Above every code cell, I will be putting descriptions of what the code is doing, and how the code is doing it. If you do not understand Python code, then reading these descriptions should give you a good idea of what's going on. If you are interested in the nitty-gritty of the Python, then reading the "how" should give you some insight. (I have marked the high level explanations with "Description" and nitty-gritty explanations with "Technical" to save people time!)

Importing Modules

Technical: Here we just import some modules that we will need in the program. The most notable one, as I mentioned earlier, is "mwapi", which is an interface between Python and the REST API that Wikipedia exposes. An interesting point is that mwapi is actually a wrapper around a lower-level module called mw, in which you need to craft the REST calls yourself.

"""
Author: Christian Bouwense

Program that gets the revision for a page by user for a certain time 
interval and measures burstiness.
"""

import time
import random
import datetime as dt
import mwapi
import operator
import numpy as np
import dateutil.parser as dup
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches

Getting the Data

Description: Here I am creating a bit of code that will ask Wikipedia for the revision data of any article we want. You can specify which article you want data for, a time period, and what kind of data you want about the revisions (e.g. who made each revision, when each revision was made, etc.).

Technical: get_revisions() is a wrapper for the mwapi get() call. We are always going to want to query, get revisions, and request the maximum number of revisions per call. However, you can also specify things like the revision properties you want and the date range you wish to see (and if you don't really care, they have default values). The values are returned as JSON and stored in a global dictionary called article_data, so even though we are getting data, this function does not return anything.

You may notice that there is a for loop that assigns a variable called page_id. We will need this value to traverse the JSON later when we parse its contents, and also for continuing the query. Speaking of continuing the query, any single get request returns at most 500 revisions. However, if there are more than 500 revisions, there will be a field in the JSON called 'continue.' If you put its 'rvcontinue' value into your next get call, the API returns the next 500. So we have a while loop that checks for this continue value and, while it's there, keeps making get calls until it's gone.
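
For reference, a response that still has more revisions to return is shaped roughly like this (the page id and rvcontinue token below are made-up placeholders):

# Rough shape of a query response that still has more revisions to return.
# The page id and rvcontinue token are made-up placeholders.
example_response = {
    'continue': {'rvcontinue': '20170101000000|123456789', 'continue': '||'},
    'query': {
        'pages': {
            '12345': {
                'title': 'George W. Bush',
                'revisions': [
                    {'user': 'ExampleUser', 'timestamp': '2017-01-01T00:00:00Z'}
                    # up to 500 of these per request
                ]
            }
        }
    }
}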

def get_revisions(title, rv_prop='timestamp|user', rv_start='today', rv_end='2000-01-01T00:00:00Z'):
    
    # We're always going to want these parameters to be the same
    action = 'query'
    prop = 'revisions'
    rv_limit = 'max'
    
    # Current UTC time in the timestamp format the API expects
    today = dt.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ')
    # User can just give the string "today" instead of a timestamp;
    # otherwise, keep whatever rv_start was passed in
    if rv_start == "today":
        rv_start = today
    
    # Temporary dictionary holding amount of revisions for each user
    revisions_by_user = {}
    
    # Connect to Wikipedia
    session = mwapi.Session('https://en.wikipedia.org', user_agent='cbouwense')

    # Query Wikipedia for revisions on the supplied article
    # The result is stored into the dictionary "rev_dict"
    rev_dict = session.get(action=action,
                           prop=prop,
                           rvprop=rv_prop,
                           titles=title,
                           rvlimit=rv_limit,
                           rvstart=rv_start,
                           rvend=rv_end)
    
    # Find page_id for selected article
    for keys in rev_dict['query']['pages'].keys():
        page_id = keys
        
    # Go through the timestamps for each revision made.
    # If the timestamp is already a key in our dictionary, increment that key value by 1.
    # Else, create a new key for that year in our dictionary and set it to 1
    rev_timestamps = []
    for props in rev_dict['query']['pages'][str(page_id)]['revisions']:
        if 'user' in props:
            if (props['user'] not in revisions_by_user):
                revisions_by_user[props['user']] = 1
            else:
                revisions_by_user[props['user']] += 1

        timestamp = dup.parse(props['timestamp'])
        rev_timestamps.append(timestamp)
        
    # Check if there is a section named "continue".
    # If there is, that means the query did not get all the data
    # because of the per-request result limit.
    print ("Retrieving data on %s from Wikipedia..." % title)
    while 'continue' in rev_dict:
        continue_val = rev_dict['continue']['rvcontinue']
        rev_dict = session.get(action=action,
                               prop=prop,
                               rvprop=rv_prop,
                               titles=title,
                               rvlimit=rv_limit,
                               rvstart=rv_start,
                               rvend=rv_end,
                               rvcontinue=continue_val)
        for props in rev_dict['query']['pages'][str(page_id)]['revisions']:
            if 'user' in props:
                if (props['user'] not in revisions_by_user):
                    revisions_by_user[props['user']] = 1
                else:
                    revisions_by_user[props['user']] += 1
            timestamp = dup.parse(props['timestamp'])
            rev_timestamps.append(timestamp)

    # List of tuples of revisions made by user for page
    sorted_user_revisions = sorted(revisions_by_user.items(), key=operator.itemgetter(1))[::-1]
    
    # Enumerate the times between events into a list.
    # Timestamps come back newest-first, so each difference is non-negative.
    interevent_times = []
    for i in range(0, len(rev_timestamps)-1):
        interevent_times.append((rev_timestamps[i] - rev_timestamps[i+1]).total_seconds())
    
    # Add data to global dictionaries
    article_data[title] = {}
    article_data[title]['user_revs'] = sorted_user_revisions
    article_data[title]['revision_times'] = rev_timestamps
    article_data[title]['interevent_times'] = interevent_times
    # Compute the burstiness measures (defined below) for this article
    get_B(title)
    get_M(title)
    
    print ("Data received successfully!")

Storing Article Data

Description: We need a place to store all of this data that Wikipedia is giving us, so the one line of code below is where we will keep it. When we ask Wikipedia for the revision data of an article, we will put the number of edits per user, the times each revision was created, and how much time passes between revisions into article_data.

Technical: This is simply a global dictionary to store the data parsed by get_revisions(). It has a key for each article we query. Each article has a key for 'user_revs' (a list of tuples containing a username and how many edits they created), 'revision_times' (a list of timestamps for each edit), and 'interevent_times' (a list of floats indicating the amount of seconds between each edit).

article_data = {}
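
For example, after one call to get_revisions() the dictionary would hold something shaped like the sketch below (the usernames, counts, and times are invented placeholders):

# Hypothetical shape of one entry in article_data after get_revisions() runs;
# every value here is an invented placeholder.
example_entry = {
    'user_revs': [('SomeEditor', 412), ('AnotherEditor', 198)],   # (username, edit count)
    'revision_times': [dt.datetime(2017, 5, 1, 12, 3, 7),
                       dt.datetime(2017, 5, 1, 11, 54, 24)],      # newest first
    'interevent_times': [523.0],                                  # seconds between edits
    'B': 0.62,                                                    # filled in by get_B()
    'M': 0.05                                                     # filled in by get_M()
}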

Analyzing Burstiness: B

Description: Now that we have a way of getting data from Wikipedia, we need to create a way to analyze its burstiness. One of the ways of measuring burstiness is looking at the distribution of the time between events in the system (interevent times). In order to do this, we need to find their mean and standard deviation and plug them into one of the two burstiness equations.

The measure of burstiness by interevent times is defined as follows:

$$B \ \triangleq\ \frac{\sigma_{\tau}-m_{\tau}}{\sigma_{\tau}+m_{\tau}}$$

If you're not a math person, essentially we are just taking some facts about the data and plugging them into an equation. You can think of this equation as a "Burstiness-o-meter." If we take some data and put it into the "Burstiness-o-meter" it spits out a number. This output is simply a number from -1 to 1; let's call it B. The lower the number is, the less bursty it is, and the higher it is, the burstier. If it is close to 0, that means that the data is random.

Technical: We use the Numpy module's built-in functions to find the mean and standard deviation of the data. Since we already stored the interevent times in the article_data, we can just give those to the Numpy functions.

B is essentially measuring how different the distribution of interevent times is from the Poisson distribution. The math here is pretty simple: we're just taking the difference between the standard deviation and the mean over the sum of the standard deviation and the mean. The range of this formula is [-1, 1]. In the limit $\sigma_{\tau}\to 0$, $B\to-1$, which makes sense: as the data deviates less and less, there are fewer "bursty" situations. As $B\to0$, the distribution of interevent times more closely resembles the Poisson distribution, meaning the times are random. Finally, as $B\to1$, the data becomes burstier.

def get_B(title):
    # Calculate interevent time mean and standard deviation
    interevent_mean = (np.mean(article_data[title]['interevent_times']))
    interevent_std_dev = (np.std(article_data[title]['interevent_times']))
    B = ((interevent_std_dev - interevent_mean) / (interevent_std_dev + interevent_mean))
    article_data[title]['B'] = B
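
As a quick sanity check, here is a minimal sketch that applies the same formula get_B uses to some hand-made interevent sequences (the sequences themselves are invented purely for illustration):

# Sanity check: apply the B formula to invented interevent sequences.
for name, times in [('regular', [60.0, 61.0, 59.0, 60.0, 60.0, 61.0]),
                    ('random', list(np.random.exponential(60.0, 1000))),
                    ('bursty', [1.0, 2.0, 1.0, 86400.0, 2.0, 1.0, 172800.0, 3.0])]:
    m, s = np.mean(times), np.std(times)
    print(name, round((s - m) / (s + m), 2))

# We expect B near -1 for the regular sequence, near 0 for the random one,
# and clearly positive for the bursty one.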

Analyzing Burstiness: M

Description: There is a different angle from which we can measure burstiness: memory. At first it may sound the same, but it is actually completely independent of the "Burstiness-o-meter" we used before. Memory means that short interevent times tend to follow short ones, and long ones tend to follow long ones.

In order to measure this, we will make a "Burstiness-o-meter 2.0". The way that this new measurement will calculate burstiness is different, but the numbers it can output will be the same. So we're still looking at a scale that goes from -1 to 1; let's call this output M.

This isn't a better measurement of the data's burstiness, just a different way of measuring it. Just to be clear, say some data has a $B=0.75$ and $M=0.10$ and some other data has $B=0.75$ and $M=0.65$: just because the second data set had both numbers higher doesn't mean it was "burstier," it just means it is bursty in both ways instead of just one.

Technical: Memory is most easily thought of in the context of Markov chains. The state machine consists of two states: "Short Interevent Time" and "Long Interevent Time." If a system is very bursty, the chance of changing states is very small; and vice versa for a system with very regular interactions.

We will need the mean and standard deviation of interevent times $\tau_{1},\dots,\tau_{n-1}$, and of $\tau_{2},\dots,\tau_{n}$ (assuming a total of $n$ interevent times).

The coefficient of burstiness due to memory is defined as

$$M \ \triangleq\ \frac{1}{n_{\tau} - 1}\sum_{i=1}^{n_{\tau}-1} \frac{(\tau_{i}-m_{1})(\tau_{i+1}-m_{2})}{\sigma_{1}\sigma_{2}}$$

def get_M(title):
    # Store times in this variable with a much shorter name
    times = article_data[title]['interevent_times']
    mean_1 = np.mean(times[:-1])
    mean_2 = np.mean(times[1:])
    std_dev_1 = np.std(times[:-1])
    std_dev_2 = np.std(times[1:])
    
    summation = 0
    for i in range(0, len(times)-1):
        tau_i = times[i]
        tau_i_plus_one = times[i+1]
        summation_term = (((tau_i - mean_1) * (tau_i_plus_one - mean_2)) / (std_dev_1 * std_dev_2)) 
        summation += summation_term

    M = (1 / (len(times) - 1)) * summation
    
    article_data[title]['M'] = M
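
To get a feel for M, here is a small sketch that runs get_M on two invented interevent sequences: one that alternates short and long gaps (so a short gap is always followed by a long one), and one sorted so that similar gaps sit next to each other. The toy entries are deleted afterwards so they don't show up in the real analysis below.

# Rough check of get_M() on invented interevent sequences.
examples = {
    'alternating': [1.0, 100.0] * 50,                        # short always followed by long
    'persistent': sorted(np.random.exponential(60.0, 100))   # similar gaps next to each other
}
for name, times in examples.items():
    article_data[name] = {'interevent_times': list(times)}
    get_M(name)
    print(name, round(article_data[name]['M'], 2))
    del article_data[name]  # clean up so these toy entries don't appear in later plots

# We expect a strongly negative M for the alternating sequence and a clearly
# positive M for the sorted one.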

Bringing it All Together

Let's summarize a little bit. We now have code that gets revision data from Wikipedia, extracts some interesting facts about that data, and can analyze its burstiness in a couple of different ways. Now that we have these tools at our disposal, let's use them on a bunch of different articles! I've Googled the most edited Wikipedia articles, and the 16 below should be a good place to start (I love the fact that "List of WWE personnel" is the 2nd most edited Wikipedia article).

articles = [
'George W. Bush',
'List of WWE personnel',
'United States',
'Wikipedia',
'Michael Jackson',
'Jesus',
'Catholic Church',
'List of programs broadcast by ABS-CBN',
'Donald Trump',
'Barack Obama',
'Adolf Hitler',
'Britney Spears',
'World War II',
'The Beatles',
'India',
'Game of Thrones'
]
print ("Getting data for 15 articles.")
print ("This may take a while!")
for article in articles:
    get_revisions(article)
Getting data for 16 articles.
This may take a while!
Retrieving data on George W. Bush from Wikipedia...
Data received successfully!
Retrieving data on List of WWE personnel from Wikipedia...
Data received successfully!
Retrieving data on United States from Wikipedia...
Data received successfully!
Retrieving data on Wikipedia from Wikipedia...
Data received successfully!
Retrieving data on Michael Jackson from Wikipedia...
Data received successfully!
Retrieving data on Jesus from Wikipedia...
Data received successfully!
Retrieving data on Catholic Church from Wikipedia...
Data received successfully!
Retrieving data on List of programs broadcast by ABS-CBN from Wikipedia...
Data received successfully!
Retrieving data on Donald Trump from Wikipedia...
Data received successfully!
Retrieving data on Barack Obama from Wikipedia...
Data received successfully!
Retrieving data on Adolf Hitler from Wikipedia...
Data received successfully!
Retrieving data on Britney Spears from Wikipedia...
Data received successfully!
Retrieving data on World War II from Wikipedia...
Data received successfully!
Retrieving data on The Beatles from Wikipedia...
Data received successfully!
Retrieving data on India from Wikipedia...
Data received successfully!
Retrieving data on Game of Thrones from Wikipedia...
Data received successfully!

Visualizing the Results

Now let's take a look at the data that we've collected! We're going to use a bar graph of the B and M values for each article we queried. Remember that B and M both range from -1 to 1, and the closer they are to 1, the burstier the data is along that dimension.

ind = np.arange(len(article_data.keys()))
B_vals = []
M_vals = []
for article in article_data.keys():
    B_vals.append(article_data[article]['B'])
    M_vals.append(article_data[article]['M'])
    
width = 0.4
fig, ax = plt.subplots()
plt.title('B & M of Article Inter-Revision Times', fontsize=26)
plt.yticks(ind, article_data.keys(), rotation=0, fontsize=18)
plt.xticks(fontsize=18)
plt.xlim(-1, 1)
plt.xlabel('Burstiness', fontsize=20)

ax.barh(ind, B_vals, width, color='#FF9933', label='B')
ax.barh(ind + width, M_vals, width, color='#0099CC', label='M')
for i in range(-10, 10):
    ax.axvline(i/10, linestyle='dotted', zorder=0)
    

orange_patch = mpatches.Patch(color='#FF9933', label='B')
blue_patch = mpatches.Patch(color='#0099CC', label='M')
plt.legend(handles=[orange_patch, blue_patch])

fig.set_size_inches(10, 10)

plt.show()

Digging Deeper

The further right the bars are in the graph, the burstier the data is: as you can see, Wikipedia edits are pretty bursty! In fact, human activities generally tend to have a B coefficient of 0.5 to 0.8, and an M of 0 to 0.1, so these results make a lot of sense.

Now that we know that Wikipedia edits are bursty, let's dig a little deeper. It would be interesting to find out if we can correlate burstiness with other variables, such as the number of users editing the page, or the number of overall revisions. We already have this data from earlier, so let's add it to the graph.

ind = np.arange(len(article_data.keys()))
article_editors = []
article_edits = []

article_count = 0
for article in article_data.keys():
    article_editors.append(0)
    article_edits.append(0)
    for user_rev in article_data[article]['user_revs']:
        article_editors[article_count] += 1
        article_edits[article_count] += user_rev[1]
    article_count += 1
    
fig, ax = plt.subplots()
plt.title('Edit Stats for Articles', fontsize=26)
plt.yticks(ind, article_data.keys(), rotation=0, fontsize=18)
plt.xticks(fontsize=18)

ax.barh(ind, article_editors, width, color='#990033', label='Editors')
ax.barh(ind + width, article_edits, width, color='#669900', label='Edits')    

red_patch = mpatches.Patch(color='#990033', label='Editors')
green_patch = mpatches.Patch(color='#669900', label='Edits')
plt.legend(handles=[red_patch, green_patch])

fig.set_size_inches(10, 10)

plt.show()
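
One possible next step, sketched below, is to put a number on the relationship the bar charts only hint at. NumPy's corrcoef gives the Pearson correlation between the burstiness measures and the edit statistics we just computed; with only 16 articles, though, these numbers are suggestive at best.

# Rough check of how B and M relate to the number of editors and edits.
# With only 16 articles, treat these correlations as suggestive at best.
B_arr = np.array(B_vals)
M_arr = np.array(M_vals)
editors_arr = np.array(article_editors, dtype=float)
edits_arr = np.array(article_edits, dtype=float)

print("corr(B, editors):", round(np.corrcoef(B_arr, editors_arr)[0, 1], 2))
print("corr(B, edits):  ", round(np.corrcoef(B_arr, edits_arr)[0, 1], 2))
print("corr(M, editors):", round(np.corrcoef(M_arr, editors_arr)[0, 1], 2))
print("corr(M, edits):  ", round(np.corrcoef(M_arr, edits_arr)[0, 1], 2))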