Comparing WikiProject Women Scientists to the rest of Wikipedia

In this notebook, we'll be exploring the article quality trends of the entirity of Wikipedia with those articles that fall under the perview of WikiProject Women Scientists.

This notebook proceeds in 3 main stages.

  1. Setup and data loading
  2. Comparing "weighted_sum" scores overall
  3. Comparing quality predictions

1: Setup and data loading

In this step, we'll ensure that we have a ggplot2 and data.table libraries installed and loaded. Then we'll move on to loading in the aggregate monthly statistics for Wikipedia and WikiProject Women Scientists. This section concludes with small snippet of data from the datasets.

# install.packages("ggplot2")
# install.packages("data.table")
library(ggplot2)
library(data.table)
allwiki.mq = data.table(read.table("enwiki.monthly_wiki_quality.tsv", sep="\t", quote="", header=T))
ws.mq = data.table(read.table("enwiki.monthly_women_scientist_quality.tsv", sep="\t", quote="", header=T))

allwiki.mq$group = "all wiki"
ws.mq$group = "women scientist"

mq = rbind(allwiki.mq, ws.mq)
mq$month = as.Date(format(mq$month, scientific=F), format="%Y%m%d")
mq[1:3]
monthpossible_nstub_nstart_nc_nb_nga_nfa_nmean_weighted_sumgeo_mean_weighted_sumgroup
2015-01-015206553 2218765 1546679 657522 168087 113715 28797 0.910254070.5153308 all wiki
2014-02-015206553 2117526 1462730 585046 166175 96728 25798 0.839045030.5245081 all wiki
2003-12-015206553 93449 69620 43 9919 0 0 0.029071660.9904793 all wiki

2: Comparing "weighted_sum" scores overall

OK. Now we'll compare the aggregated "weighted_sum" measures for all articles in the different sets. Here's how the weighted sum is calculated:

CLASS_WEIGHTS = {
  'Stub': 0,
  'Start': 1,
  'C': 2,
  'B': 3,
  'GA': 4,
  'FA': 5
}
def weighted_sum(probabilities):
    return sum(CLASS_WEIGHTS[cls] * proba 
               for cls, proba in probabilities.items())

In our previous analysis, we generated a simple arithmetic mean as a centrality measure for article quality predictions for each month.

options(repr.plot.height=3)
ggplot(
    mq, aes(month, mean_weighted_sum, linetype=group, color=group)
) + 
theme_bw() + 
geom_line()

Here, we can see a clear trend in the expected "weighted_sum" quality of articles in each set. While overall, Wikipedia shows a linear growth since 2005, WikiProject Women Scientists articles shows a sudden shift around mid-2013. After this point, it seems that the quality level of articles about women scientists grows much more quickly than the rest of the encyclopedia.

3: Comparing quality predictions

The arithmetic mean provides a nice way to look at overall quality, but where were the quality changes happening? Where Stubs being converted to Start? Are GAs becoming FAs? In this section, we'll compare the proportion of articles that fall into each quality class prediction over time.

mq.by_prediction = rbind(
    mq[,list(month=month, group=group, possible_n=possible_n, prediction="Empty", n=possible_n-(stub_n+start_n+c_n+b_n+ga_n+fa_n)),],
    mq[,list(month=month, group=group, possible_n=possible_n, prediction="Stub", n=stub_n),],
    mq[,list(month=month, group=group, possible_n=possible_n, prediction="Start", n=start_n),],
    mq[,list(month=month, group=group, possible_n=possible_n, prediction="C", n=c_n),],
    mq[,list(month=month, group=group, possible_n=possible_n, prediction="B", n=b_n),],
    mq[,list(month=month, group=group, possible_n=possible_n, prediction="GA", n=ga_n),],
    mq[,list(month=month, group=group, possible_n=possible_n, prediction="FA", n=fa_n),]
)
mq.by_prediction$prediction = ordered(mq.by_prediction$prediction, levels=c("Empty", "Stub", "Start", "C", "B", "GA", "FA"))
options(repr.plot.height=5, repr.plot.width=10)
plot = ggplot(
    mq.by_prediction[prediction %in% c("Empty", "Stub", "Start", "C"),],
    aes(x=month, y=n/possible_n, linetype=group, color=group)
) + 
theme_bw() + 
facet_wrap(~prediction, nrow=1) + 
geom_line() + 
scale_y_continuous("Proportion of all 'possible articles'")
print(plot)

In the plot above, we can see some clear differences between Wikipedia overall and articles about Women Scientists. The "Empty" cell shows the proportion of articles that exist at the end of our observations ("possible_n") that were not yet created. Here, we can see that the rate of creation of articles for all of Wikipedia is much faster than the rate of creation of Women Scientist articles until mid-2013.

Surprisingly, the "Stub" cell show that Wikipedia generally saw a steady growth of Stubs whereas Women Scientists saw period growth. Between 2004 and 2009, Women Scientist Stubs grew steadily, but then stopped until mid-2013 when there was a sudden up-tick that continues until the end of our data (Aug 2016).

The creation (or promotion) of Women Scientist articles to Start class runs much more in-step with the rest of Wikipedia for most of its history, but it shows a suddent growth around beginning of 2013. A similar trend plays out in "C" class articles.

plot = ggplot(
    mq.by_prediction[prediction %in% c("B", "GA", "FA"),],
    aes(x=month, y=n/possible_n, linetype=group, color=group)
) + 
theme_bw() + 
facet_wrap(~prediction, nrow=1) + 
geom_line() + 
scale_y_continuous("Proportion of all 'possible articles'")
print(plot)

The proportion of articles falling to "B" class shows a dramatically different trend for all Wikipedia (exponential growth with an abrupt stop in 2007) and articles about women scientists (eponential growth until 2007, followed by a sharp decline to a valley between 2011 and 2014 and recovery). The proportion of GA articles largely tracks overall Wikipedia, though there's a smaller proportion of GA class women scientist articles between 2007 and 2013 followed by a dramatically higher proportion from 2013 forward. The FA class cell shows the proportion of FA (Wikipedia's highest quality class) articles about women scientists has been consistently lower than the rest of the encyclopedia.