Exploring the draft quality of the unreviewed New Page Patrol backlog

ORES draft quality analysis of the backlog: 22,029 OK, 27 spam, 0 attack, 0 vandalism → 99.9% of the existing backlog is not easily classifiable as spam/attack/vandalism. This makes intuitive sense: we expect the most blatant bad pages to be the easiest for patrollers to identify, so they get dealt with first and don't linger in the backlog.
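
As a quick check of the 99.9% figure, the class shares can be recomputed from the four ORES counts quoted above. This is a minimal sketch; the names `ores_counts` and `class_shares` are ours for illustration and do not appear in the original analysis.

```python
# ORES draft quality counts for the backlog, as quoted in the text above.
ores_counts = {"OK": 22029, "spam": 27, "attack": 0, "vandalism": 0}


def class_shares(counts):
    """Return each class's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = class_shares(ores_counts)
for label, share in shares.items():
    # Print each class as a percentage with one decimal place.
    print(f"{label}: {share:.1%}")
```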

We used WikiLabels to hand-code a random selection of 190 pages from the ~22,000 pages that were automatically evaluated via the ORES draft quality model. The sample contains more probable promo spam pages (21/190, or 11%) than ORES identified, but there also seems to be a healthy chunk of "time-consuming judgement calls": pages about which we think an overworked reviewer without extensive domain knowledge would have a hard time making a binary keep/delete judgement.
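
To put a rough uncertainty range on the hand-coded spam rate, one option is a Wilson score interval for a binomial proportion. This is a sketch under our own assumptions, not part of the original analysis: the function name `wilson_interval` is hypothetical, and it treats the 190 pages as a simple random sample at roughly 95% confidence (z = 1.96).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hand-coded sample: 21 probable spam pages out of 190 (from the text above).
hand_low, hand_high = wilson_interval(21, 190)

# ORES draft quality model: 27 spam out of 22,029 + 27 scored pages.
ores_rate = 27 / (22029 + 27)

print(f"hand-coded spam rate: {21 / 190:.1%} "
      f"(approx. 95% interval {hand_low:.1%} to {hand_high:.1%})")
print(f"ORES spam rate: {ores_rate:.2%}")
```

If the ORES rate falls below the lower bound of the interval, that is consistent with the observation that hand-coding finds more probable spam than the model flags.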

import csv
import json
from pprint import pprint
import requests
from scipy.stats import chisquare
import time
pages = []
# Full list of unreviewed pages as of May 20
with open("npp_unreviewed_20170519000000.csv", "rt") as fin:
    reader = csv.reader(fin)
    for row in reader:
        # Assumed loop body: collect each CSV row in the `pages` list
        pages.append(row)
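def api_call(url):
    # Fetch a URL and return the parsed JSON, or None if the call fails
    try:
        call = requests.get(url)
        response = call.json()
    except (requests.RequestException, ValueError):
        response = None
    return response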
#see https://meta.wikimedia.org/wiki/Wiki_labels#Machine-readable_paths
call = requests.get('http://labels.wmflabs.org/campaigns/enwiki/58/?tasks')
response = call.json()
labels = response['tasks']
# pprint(labels)
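# Tally the first-pass judgement for each labelled task
# (keys assumed from the two categories in the output shown below)
scores = {
    'OK': 0,
    'spam': 0,
}

for l in labels:
    if len(l['labels']) > 0:
        if l['labels'][0]['data']['firstpass'] in scores.keys():
            scores[l['labels'][0]['data']['firstpass']] += 1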
# How many were tagged as spam?
scores  # output: {'OK': 169, 'spam': 21}
spam = [x for x in labels if len(x['labels']) > 0 and x['labels'][0]['data']['firstpass'] == 'spam']
#which ones were spam?
# spam
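
The tally of first-pass labels above can also be written with `collections.Counter`. The sketch below uses a small hypothetical `sample_tasks` list that imitates the shape the notebook code assumes for the `tasks` field (each task has a `labels` list whose first entry carries `data['firstpass']`). It is illustrative data, not actual Wiki Labels output, and `tally_firstpass` is a name of our own.

```python
from collections import Counter

# Hypothetical stand-in for response['tasks']; the real data comes from the
# Wiki Labels campaign queried above.
sample_tasks = [
    {"labels": [{"data": {"firstpass": "OK", "interesting": False}}]},
    {"labels": [{"data": {"firstpass": "spam", "interesting": True}}]},
    {"labels": []},  # task that nobody has labelled yet
    {"labels": [{"data": {"firstpass": "OK", "interesting": True}}]},
]


def tally_firstpass(tasks):
    """Count 'firstpass' values, using only the first label of each task."""
    return Counter(
        task["labels"][0]["data"]["firstpass"]
        for task in tasks
        if task["labels"]  # skip unlabelled tasks
    )


sample_scores = tally_firstpass(sample_tasks)
print(dict(sample_scores))
```

Unlike a dictionary with fixed keys, a `Counter` also picks up any label value that was not anticipated, which makes unexpected categories visible in the tally.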

Which ones did we mark as both 'probably spam' and 'interesting'? The 'interesting' flag means we think the case should be examined in greater detail and discussed, either because it is an edge case where we aren't sure of our judgement or because it is a particularly relevant example of the type.

interesting_spam = [x for x in labels
                    if len(x['labels']) > 0
                    and x['labels'][0]['data']['interesting'] == True
                    and x['labels'][0]['data']['firstpass'] == 'spam']
# interesting_spam

Which ones did we mark as 'interesting' but 'OK'? These might be edge cases where we lean towards 'don't delete right away', or they might be cases where we see clear grounds to keep the page but think that making that determination would require special expertise.

interesting_nonspam = [x for x in labels
                       if len(x['labels']) > 0
                       and x['labels'][0]['data']['interesting'] == True
                       and x['labels'][0]['data']['firstpass'] == 'OK']
# interesting_nonspam
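
The three list comprehensions above share the same pattern, so they could be folded into one helper. This is a minimal sketch: `select_tasks` and `sample_tasks` are hypothetical names, and the sample data only imitates the structure the notebook code relies on. Note one deliberate difference, which is our choice rather than the original behavior: the helper uses `.get()` for the 'interesting' flag, so a label without that field is treated as not interesting instead of raising a `KeyError`.

```python
def select_tasks(tasks, firstpass, interesting=None):
    """Return labelled tasks whose first label matches the given criteria.

    interesting=None means "do not filter on the 'interesting' flag".
    """
    selected = []
    for task in tasks:
        if not task["labels"]:
            continue  # skip tasks that have not been labelled
        data = task["labels"][0]["data"]
        if data["firstpass"] != firstpass:
            continue
        if interesting is not None and bool(data.get("interesting")) != interesting:
            continue
        selected.append(task)
    return selected


# Hypothetical sample in the same shape as response['tasks'].
sample_tasks = [
    {"id": 1, "labels": [{"data": {"firstpass": "OK", "interesting": False}}]},
    {"id": 2, "labels": [{"data": {"firstpass": "spam", "interesting": True}}]},
    {"id": 3, "labels": []},
    {"id": 4, "labels": [{"data": {"firstpass": "OK", "interesting": True}}]},
]

interesting_ok_ids = [t["id"] for t in select_tasks(sample_tasks, "OK", interesting=True)]
all_spam_ids = [t["id"] for t in select_tasks(sample_tasks, "spam")]
print(interesting_ok_ids, all_spam_ids)
```
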
def label_lists(fname,fdata):
    with open(fname, "w") as fout:
        w = csv.writer(fout, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        for row in fdata: