WikiProject India Quality Dynamics

In this notebook, I compare the article quality trend plots of all WikiProject India articles to all English Wikipedia articles using the predicted data quality dataset.

1. Setup

I first queried the dataset to obtain the predicted quality data for all articles tagged with the "WikiProject India" template and for all English Wikipedia articles. I then installed and loaded the necessary libraries and datasets for analysis.

%matplotlib inline
import requests, csv
import matplotlib.pyplot as plt
import time
import datetime
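# Stream the Quarry query results (CSV) for all English Wikipedia articles.
allwikiresponse = requests.get('https://quarry.wmflabs.org/run/195914/output/0/csv?download=true', stream=True)
allwikiresponse.encoding = 'utf-8'  # so iter_lines() yields str rather than bytes
allwikirows = csv.DictReader(allwikiresponse.iter_lines(decode_unicode=True))

# Stream the Quarry query results (CSV) for WikiProject India articles.
indiawikiresponse = requests.get('https://quarry.wmflabs.org/run/195707/output/0/csv?download=true', stream=True)
indiawikiresponse.encoding = 'utf-8'
indiawikirows = csv.DictReader(indiawikiresponse.iter_lines(decode_unicode=True))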

2. Generate Aggregated Quality Measures

I then calculated two aggregated quality measures: 1) the mean weighted sum and 2) the proportion of articles in each prediction class. The mean weighted sum was calculated by incrementing the weighted sum measurement by 1 and dividing it by the total number of articles in the aggregate. For both measures, I used the total number of articles in the aggregate at the last month, max(n), as the denominator.

Note: I revised the initial queries to also find the total number of articles in the aggregate (n) because it seemed this was needed to calculate the proportions.
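As a minimal sketch of the two measures described above, the snippet below computes them for a single monthly row. The column names follow the query output used in this notebook; the helper name aggregate_measures and the sample row values are made up for illustration and are not real data.

```python
CLASSES = ["stub", "start", "c", "b", "ga", "fa"]

def aggregate_measures(row, total):
    """Return (mean weighted sum, {class: proportion}) for one monthly row.

    `row` maps column names to strings, as csv.DictReader yields them.
    `total` is the article count at the last month, max(n).
    """
    mean_ws = (float(row["weighed_sum"]) + 1) / total
    proportions = {c: float(row[c + "_n"]) / total for c in CLASSES}
    return mean_ws, proportions

# Hypothetical row, chosen so the arithmetic is easy to check by hand.
sample_row = {"weighed_sum": "149", "stub_n": "50", "start_n": "30",
              "c_n": "10", "b_n": "6", "ga_n": "3", "fa_n": "1"}
mean_ws, proportions = aggregate_measures(sample_row, total=100)
print(mean_ws)              # 1.5
print(proportions["stub"])  # 0.5
```

Because the denominator is fixed at max(n), the class proportions only sum to 1 in the last month; in earlier months they sum to the fraction of the final article count that existed at that time.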

#create lists of each aggregated measure to create plots of all English Wikipedia quality trends. 
allwiki_ts = []
allwiki_ws = []
allwiki_stub = []
allwiki_start = []
allwiki_c = []
allwiki_b = []
allwiki_ga = []
allwiki_fa = []
allwiki_total = 5206553  # max(n): total articles in the aggregate at the last month
 
for row in allwikirows:
    allwiki_ts.append(time.mktime(datetime.datetime.strptime(row['timestamp'], "%Y%m%d%H%M%S").timetuple()))
    allwiki_ws.append((float(row['weighed_sum'])+ 1)/(allwiki_total))
    allwiki_stub.append((float(row['stub_n']))/(allwiki_total))
    allwiki_start.append((float(row['start_n']))/(allwiki_total))
    allwiki_c.append((float(row['c_n']))/(allwiki_total))
    allwiki_b.append((float(row['b_n']))/(allwiki_total))
    allwiki_ga.append((float(row['ga_n']))/(allwiki_total))
    allwiki_fa.append((float(row['fa_n']))/(allwiki_total))
    
#create lists of each variable to create plot of all WikiProject India quality trends.
indiawiki_ts = []
indiawiki_ws = []
indiawiki_stub = []
indiawiki_start = []
indiawiki_c = []
indiawiki_b = []
indiawiki_ga = []
indiawiki_fa = []
indiawiki_total = 134043  # max(n): total articles in the aggregate at the last month
 
for row in indiawikirows:
    indiawiki_ts.append(time.mktime(datetime.datetime.strptime(row['timestamp'], "%Y%m%d%H%M%S").timetuple()))
    indiawiki_ws.append((float(row['weighed_sum']) + 1)/(indiawiki_total))
    indiawiki_stub.append((float(row['stub_n']))/(indiawiki_total))
    indiawiki_start.append((float(row['start_n']))/(indiawiki_total))
    indiawiki_c.append((float(row['c_n']))/(indiawiki_total))
    indiawiki_b.append((float(row['b_n']))/(indiawiki_total))
    indiawiki_ga.append((float(row['ga_n']))/(indiawiki_total))
    indiawiki_fa.append((float(row['fa_n']))/(indiawiki_total))

3. Plot mean weighted sum

plt.plot(allwiki_ts,allwiki_ws, '-', label = "All Wiki")
plt.plot(indiawiki_ts,indiawiki_ws, '--', label = "WikiProject India")
plt.xlabel('Time(s)')
plt.ylabel('Mean Weighted Sum')
plt.legend(loc = 'upper left')

The above plot shows the quality trend of WikiProject India articles compared with the trend across all of English Wikipedia. Both show roughly linear growth in quality, with a slight bend in the curve for all English Wikipedia articles. From about 2004 to late 2012, WikiProject India articles are generally of lower quality than English Wikipedia overall; however, their quality growth appears to accelerate around 2009 (the exact date is hard to read off the plot) and eventually surpasses English Wikipedia overall around 2012.
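Rather than reading the crossover date off the plot, it could be located programmatically. The following is a rough sketch, assuming both series have one value per month and cover the same months in the same order; the function name first_crossover and the sample numbers are hypothetical.

```python
def first_crossover(timestamps, series_a, series_b):
    """Return the first timestamp at which series_a exceeds series_b
    after having been at or below it, or None if that never happens."""
    was_below = False
    for ts, a, b in zip(timestamps, series_a, series_b):
        if a <= b:
            was_below = True
        elif was_below:
            return ts
    return None

# Hypothetical values, not taken from the dataset.
ts = [0, 1, 2, 3]
india = [0.1, 0.2, 0.4, 0.6]
allwiki = [0.2, 0.3, 0.35, 0.5]
print(first_crossover(ts, india, allwiki))  # 2
```

With the lists built in section 2, the call would be first_crossover(indiawiki_ts, indiawiki_ws, allwiki_ws); since the timestamps were produced with time.mktime, datetime.datetime.fromtimestamp converts the result back to a local calendar date.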

4. Plot proportion of articles in each prediction class

I then plotted the quality trends for each prediction class, for WikiProject India articles and for all English Wikipedia articles, to help determine where the quality changes are occurring.

# One panel per predicted quality class.
# Solid line: all English Wikipedia. Dashed line: WikiProject India.
panels = [
    ("Stub", allwiki_stub, indiawiki_stub),
    ("Start", allwiki_start, indiawiki_start),
    ("C", allwiki_c, indiawiki_c),
    ("B", allwiki_b, indiawiki_b),
    ("GA", allwiki_ga, indiawiki_ga),
    ("FA", allwiki_fa, indiawiki_fa),
]

fig, axes = plt.subplots(nrows=2, ncols=3)
for ax, (title, allwiki_series, indiawiki_series) in zip(axes.flat, panels):
    ax.plot(allwiki_ts, allwiki_series, '-', label="All Wiki")
    ax.plot(indiawiki_ts, indiawiki_series, '--', label="WikiProject India")
    ax.set_title(title)

plt.tight_layout()

Observations

  • The stub plot shows that the growth of stub articles for WikiProject India fluctuates more than for English Wikipedia overall, with several upticks and declines.
  • WikiProject India articles saw faster growth in "Start" and "C" quality articles, surpassing the growth rate of English Wikipedia overall around 2012.
  • The "B" class plot shows that all English Wikipedia and WikiProject India articles follow a similar trend, one that differs from the other classes. From around 2007, "B" class article growth stays flat for all English Wikipedia and declines slightly for WikiProject India articles. Why?
  • WikiProject India articles follow a growth trend very similar to English Wikipedia overall for the higher-quality "GA" and "FA" classes, with the growth rate of India articles declining slightly for "FA". This may be due to coverage bias in source material, or because it is more difficult to get WikiProject India articles through the "FA" review process.
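The observations above are about growth rates, while the plots show levels. One way to inspect growth directly is to plot the change from one month to the next. This is only a sketch, assuming the rows are evenly spaced (one per month); monthly_change is a hypothetical helper and the sample values are invented.

```python
def monthly_change(values):
    """Return the difference between each value and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Hypothetical proportions for four consecutive months.
sample = [0.25, 0.5, 0.75, 0.5]
print(monthly_change(sample))  # [0.25, 0.25, -0.25]
```

The result is one element shorter than the input, so it would be plotted against the timestamps from the second month onward, for example indiawiki_ts[1:] with monthly_change(indiawiki_b).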

5. Plot quality gap

I also plotted the difference in mean weighted sums (quality gap) between WikiProject India and all English Wiki articles.

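from operator import sub
# Element-wise difference (India minus all wiki). map() pairs values by position,
# so this assumes both series cover the same months in the same order.
wsdiff = list(map(sub, indiawiki_ws, allwiki_ws))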
plt.plot(indiawiki_ts, wsdiff, '-')
plt.axhline(0, color = "black")
plt.xlabel('Time(s)')
plt.ylabel('WikiProject India Quality Gap')

The plot shows the quality of WikiProject India articles declining relative to all English Wikipedia articles from 2001 until about late 2009.
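The subtraction used for the gap pairs values by position, which is only valid if both query results contain the same months. If that is not guaranteed, the two series can be matched on timestamp first. A minimal sketch, with a hypothetical helper name and invented numbers:

```python
def quality_gap(ts_a, values_a, ts_b, values_b):
    """Return (timestamps, a - b) for the timestamps present in both series."""
    b_by_ts = dict(zip(ts_b, values_b))
    shared = [(t, a - b_by_ts[t]) for t, a in zip(ts_a, values_a) if t in b_by_ts]
    return [t for t, _ in shared], [gap for _, gap in shared]

# Hypothetical series: the second one starts a month earlier.
india_ts, india_ws = [2, 3, 4], [1.0, 1.5, 2.5]
all_ts, all_ws = [1, 2, 3, 4], [0.5, 1.25, 1.5, 2.0]
gap_ts, gap = quality_gap(india_ts, india_ws, all_ts, all_ws)
print(gap_ts)  # [2, 3, 4]
print(gap)     # [-0.25, 0.0, 0.5]
```

Negative values mean the first series (here, WikiProject India) is below the second; months that appear in only one series are dropped.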

Next Steps

  1. Convert the timestamp unit to months, if possible, so the graphs are easier to read.
  2. Look at quality trends for "empty" articles.
  3. Compare WikiProject India to similar projects, such as WikiProject Africa, to see if there are similar trends across other country-based articles. Also, maybe other India-related WikiProjects.
  4. Do some research to see if the increase in WikiProject India article quality growth around 2009 coincides with a community event or initiative (e.g. the start of WikiProject India [July 2006], the WikiProject India Nurturing (WIN) project [started in August 2015], or the WikiProject Asia 10,000 project [2016]).
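For the first item, one possible approach is to keep the parsed timestamps as datetime objects instead of converting them to epoch seconds. A small sketch, assuming the timestamp column uses the same 14-digit format parsed in section 2; parse_timestamp and the sample strings are hypothetical.

```python
from datetime import datetime

def parse_timestamp(raw):
    """Parse a 14-digit timestamp string such as '20090101000000'."""
    return datetime.strptime(raw, "%Y%m%d%H%M%S")

# Hypothetical values in the same format as the 'timestamp' column.
sample = ["20090101000000", "20090201000000"]
dates = [parse_timestamp(raw) for raw in sample]
print(dates[1].strftime("%Y-%m"))  # 2009-02
```

matplotlib accepts datetime objects as x values, so passing lists built this way to plt.plot would label the axis with calendar dates rather than seconds; the 'Time(s)' axis label would then need updating.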