WikiProject Computational Biology Quality Dynamics

1. Setup

I first queried the dataset to obtain the predicted quality data for all articles tagged with the "WikiProject Computational Biology" template and for all English Wikipedia articles. I then installed and loaded the necessary libraries and data sets for analysis.

%matplotlib inline
import requests, csv
import matplotlib.pyplot as plt
import time
import datetime
allwikiresponse = requests.get('https://quarry.wmflabs.org/run/195914/output/0/csv?download=true', stream = True)
allwikirows = csv.DictReader(allwikiresponse.iter_lines(decode_unicode='utf8'))

cbwikireponse = requests.get('https://quarry.wmflabs.org/run/199100/output/0/csv?download=true', stream = True)
cbwikirows = csv.DictReader(cbwikireponse.iter_lines(decode_unicode='utf8'))
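
As a small illustration of what the readers above yield, the sketch below parses an in-memory CSV with `csv.DictReader`. The sample rows and their values are made up for illustration and are not taken from the Quarry output; only the column names mirror those used later in this notebook.

```python
import csv
import io

# Hypothetical two-row sample; the real data comes from the Quarry queries.
sample_csv = (
    "timestamp,weighted_sum,stub_n\n"
    "20010201000000,12,10\n"
    "20010301000000,30,22\n"
)

# DictReader maps each data row to a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
first = rows[0]
```

Every value arrives as a string, which is why the later cells wrap each count in `float()` before dividing.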

2. Generate Aggregated Quality Measures

I then calculated two aggregated quality measures: 1) the mean weighted sum and 2) the proportion of articles in each prediction class. The mean weighted sum was calculated by incrementing the weighted sum measurement by 1 and dividing it by the total number of articles in the aggregate, max(n). I used the total number of articles in the aggregate in the last month, max(n), as the denominator for each of the quality measures.
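
As a minimal sketch of the normalization described above, the helper below converts one row of monthly counts into the two measures using a fixed denominator. The function name `normalize_row`, the sample row, and the total of 200 are hypothetical and chosen only so the arithmetic is easy to check.

```python
# Count columns for the six prediction classes.
CLASS_COLUMNS = ["stub_n", "start_n", "c_n", "b_n", "ga_n", "fa_n"]

def normalize_row(row, total):
    """Return the mean weighted sum and per-class proportions for one row.

    `total` plays the role of max(n): the article count in the last month.
    """
    measures = {"mean_weighted_sum": (float(row["weighted_sum"]) + 1) / total}
    for col in CLASS_COLUMNS:
        measures[col] = float(row[col]) / total
    return measures

# Hypothetical row; values arrive as strings, as they do from csv.DictReader.
sample_row = {
    "weighted_sum": "99",
    "stub_n": "100", "start_n": "50", "c_n": "25",
    "b_n": "15", "ga_n": "6", "fa_n": "4",
}
result = normalize_row(sample_row, total=200)
```

Because the denominator is fixed at the final month's total, the class proportions for earlier months sum to less than 1 and reach 1 only in the last month.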

#create lists of each aggregated measure to create plots of all English Wikipedia quality trends. 
allwiki_ts = []
allwiki_ws = []
allwiki_stub = []
allwiki_start = []
allwiki_c = []
allwiki_b = []
allwiki_ga = []
allwiki_fa = []
allwiki_total = 5206553
 
for row in allwikirows:
    allwiki_ts.append(time.mktime(datetime.datetime.strptime(row['timestamp'], "%Y%m%d%H%M%S").timetuple()))
    allwiki_ws.append((float(row['weighed_sum'])+ 1)/(allwiki_total))
    allwiki_stub.append((float(row['stub_n']))/(allwiki_total))
    allwiki_start.append((float(row['start_n']))/(allwiki_total))
    allwiki_c.append((float(row['c_n']))/(allwiki_total))
    allwiki_b.append((float(row['b_n']))/(allwiki_total))
    allwiki_ga.append((float(row['ga_n']))/(allwiki_total))
    allwiki_fa.append((float(row['fa_n']))/(allwiki_total))
    
#create lists of each aggregated measure to create plots of WikiProject Computational Biology quality trends.
cbwiki_ts = []
cbwiki_ws = []
cbwiki_stub = []
cbwiki_start = []
cbwiki_c = []
cbwiki_b = []
cbwiki_ga = []
cbwiki_fa = []
cbwiki_total = 1312
 
for row in cbwikirows:
    cbwiki_ts.append(time.mktime(datetime.datetime.strptime(row['timestamp'], "%Y%m%d%H%M%S").timetuple()))
    cbwiki_ws.append((float(row['weighted_sum']) + 1)/(cbwiki_total))
    cbwiki_stub.append((float(row['stub_n']))/(cbwiki_total))
    cbwiki_start.append((float(row['start_n']))/(cbwiki_total))
    cbwiki_c.append((float(row['c_n']))/(cbwiki_total))
    cbwiki_b.append((float(row['b_n']))/(cbwiki_total))
    cbwiki_ga.append((float(row['ga_n']))/(cbwiki_total))
    cbwiki_fa.append((float(row['fa_n']))/(cbwiki_total))

3. Plot mean weighted sum

plt.plot(allwiki_ts,allwiki_ws, '-', label = "All Wiki")
plt.plot(cbwiki_ts,cbwiki_ws, '--', label = "WikiProject Computational Biology")
plt.xlabel('Time(s)')
plt.ylabel('Mean Weighted Sum')
plt.legend(loc = 'upper left')

4. Plot proportion of articles in each prediction class

I then plotted the quality trends for each prediction class for WikiProject Computational Biology articles to help determine where the quality changes are occurring.

fig, ax = plt.subplots(nrows=2,ncols=3)

#solid line: all English Wikipedia; dashed line: WikiProject Computational Biology
plt.subplot(2,3,1)
plt.plot(allwiki_ts,allwiki_stub, '-', label = "Stub")
plt.plot(cbwiki_ts,cbwiki_stub, '--', label = "Stub")
plt.title("Stub")
plt.subplot(2,3,2)
plt.plot(allwiki_ts,allwiki_start, '-', label = "Start")
plt.plot(cbwiki_ts,cbwiki_start, '--', label = "Start")
plt.title("Start")
plt.subplot(2,3,3)
plt.plot(allwiki_ts,allwiki_c, '-', label = "C")
plt.plot(cbwiki_ts,cbwiki_c, '--', label = "C")
plt.title("C")
plt.subplot(2,3,4)
plt.plot(allwiki_ts,allwiki_b, '-', label = "B")
plt.plot(cbwiki_ts,cbwiki_b, '--', label = "B")
plt.title("B")
plt.subplot(2,3,5)
plt.plot(allwiki_ts,allwiki_ga, '-', label = "GA")
plt.plot(cbwiki_ts,cbwiki_ga, '--', label = "GA")
plt.title("GA")
plt.subplot(2,3,6)
plt.plot(allwiki_ts,allwiki_fa, '-', label = "FA")
plt.plot(cbwiki_ts,cbwiki_fa, '--', label = "FA")
plt.title("FA")

plt.tight_layout()

Observations

  • The stub plot shows that the growth of stub articles for WikiProject India seems to fluctuate more than for English Wikipedia overall, with several upticks and declines.
  • WikiProject India articles saw faster growth in "Start" and "C" quality articles, surpassing the growth rate of Wikipedia overall around 2012.
  • The "B" class graph shows that all English Wikipedia and WikiProject India articles follow a similar trend that differs from the other classes. Around 2007, "B" class article growth stays stable for all English Wikipedia and slightly declines for WikiProject India articles. Why?
  • WikiProject India articles follow a growth trend very similar to English Wikipedia overall for the higher-quality "GA" and "FA" classes, with the growth rate of India articles slightly declining for "FA" articles. This may be due to coverage bias in source material, or because it is more difficult to get WikiProject India articles through the "FA" review process.

5. Plot quality gap

I also plotted the difference in mean weighted sums (the quality gap) between WikiProject Computational Biology and all English Wikipedia articles.

from operator import sub
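#map() stops at the shorter list, so trim the timestamps to the same length
wsdiff = list(map(sub, cbwiki_ws, allwiki_ws))
plt.plot(cbwiki_ts[:len(wsdiff)], wsdiff, '-')
plt.axhline(0, color = "black")
plt.xlabel('Time(s)')
plt.ylabel('WikiProject Computational Biology Quality Gap')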

The plot shows the quality of WikiProject India articles decreasing from 2001 until about late 2009, when quality begins to increase.
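
Subtracting with `map(sub, ...)` pairs the two series by list position, which assumes both cover the same months in the same order. If that assumption might not hold, one option is to match values on their timestamps instead. The sketch below shows this with a hypothetical helper, `quality_gap`, and made-up sample values; it is not part of the original analysis.

```python
def quality_gap(project_ts, project_ws, baseline_ts, baseline_ws):
    """Return (timestamps, gaps) for timestamps present in both series."""
    baseline = dict(zip(baseline_ts, baseline_ws))
    gap_ts, gap = [], []
    for ts, ws in zip(project_ts, project_ws):
        if ts in baseline:
            gap_ts.append(ts)
            gap.append(ws - baseline[ts])
    return gap_ts, gap

# Made-up series: the project series starts one period after the baseline.
baseline_ts = [1, 2, 3, 4]
baseline_ws = [0.5, 1.0, 1.5, 2.0]
project_ts = [2, 3, 4]
project_ws = [0.75, 1.5, 2.5]

gap_ts, gap = quality_gap(project_ts, project_ws, baseline_ts, baseline_ws)
```

With these sample values, a positional subtraction would compare the project's first period against the baseline's first period even though they are different timestamps; matching on the timestamp avoids that offset.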

Next Steps

  1. Is it possible to convert the timestamp unit to months so the graphs are easier to read?
  2. Look at quality trends for "empty" articles.
  3. Compare WikiProject India to similar projects such as WikiProject Africa to see whether similar trends appear across other country-based articles, and perhaps to other India-related WikiProjects.
  4. Do some research to see whether the increase in WikiProject India article quality growth around 2009 coincides with a community event or initiative (e.g. the start of WikiProject India [July 2006] or the start of the WikiProject India Nurturing (WIN) project [started in August 2015]).
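
For the first item, one possible approach is to keep the parsed `datetime` objects rather than converting them to epoch seconds with `time.mktime`. The sketch below uses two hypothetical helpers, `parse_month` and `month_label`, and assumes the timestamps use the same "%Y%m%d%H%M%S" format as in the cells above.

```python
import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"

def parse_month(timestamp):
    """Parse a timestamp string such as '20090301000000' into a datetime."""
    return datetime.datetime.strptime(timestamp, TIMESTAMP_FORMAT)

def month_label(timestamp):
    """Return a 'YYYY-MM' label for a timestamp string."""
    return parse_month(timestamp).strftime("%Y-%m")

# Hypothetical sample timestamps in the query's format.
sample = ["20090301000000", "20090401000000"]
parsed = [parse_month(ts) for ts in sample]
labels = [month_label(ts) for ts in sample]
```

Matplotlib accepts `datetime` objects as x values, so appending `parse_month(row['timestamp'])` to the timestamp lists should give date-labelled axes without further conversion.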