Basic Recent Changes Statistics

An IRC discussion reminded me of the oft-quoted statistic that "Most vandalism comes from IPs but most IPs are not vandals." It turns out that this statistic comes from a study of 250 mainspace edits in February 2007. I wanted to repeat the experiment with the aid of modern technology. This script grabs an arbitrary number of recent changes and sends them to ORES for scoring. Using the ORES scores, each revision is categorized as constructive (predicted to be good faith and not damaging) or nonconstructive. Contributions outside of mainspace, bot edits, page creations, and other logged items that show up in the recent changes feed are ignored.

You can see the code below or you can skip to the bottom to see the results.

Setup and configuration

import platform

import IPython.display
import pywikibot
import requests

# Target wiki and number of recent changes to sample
site = pywikibot.Site('en', 'wikipedia')
revs = 250

Grab revs recent changes with revid, username, and user type

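# Mainspace edits only; bot edits, page creations, and log entries are left out
rc = site.recentchanges(namespaces=0, total=revs, changetype='edit', bot=False)

# Maps each revision ID to what is known about that revision
data = {}

for change in rc:
    revid = change['revid']
    user = change['user']
    timestamp = change['timestamp']

    # True for a registered account, False for an unregistered (IP) editor
    userPage = pywikibot.page.User(site, user)
    userType = userPage.isRegistered()

    data[revid] = {'revid': revid, 'user': user, 'registered': userType, 'timestamp': timestamp}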
    
print('Done')
Done
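
The user type can also be estimated locally from the username alone. This is a minimal sketch, assuming that unregistered editors appear in the feed under their IP address (the "IPs" of the quoted statistic). `is_ip_user` is a hypothetical helper, not part of pywikibot, and it was not used to produce the results below.

```python
import ipaddress


def is_ip_user(username: str) -> bool:
    """Return True if the username parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(username)
    except ValueError:
        return False
    return True


# Documentation-range addresses and a regular account name
print(is_ip_user('192.0.2.17'))
print(is_ip_user('2001:DB8::1'))
print(is_ip_user('AntiCompositeNumber'))
```

Under that assumption, `not is_ip_user(user)` would stand in for the `registered` value stored above.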

Query ORES for classification

To be nice, queries are sent with up to 50 revisions each.

ids = list(data)
db = site.dbName()
headers = {'user-agent': f'AntiCompositeBot/Basic RC statistics on PAWS (en:User:AntiCompositeNumber) Requests/{requests.__version__} Python/{platform.python_version()}'}

# Walk through the revision IDs in batches of up to 50
for i in range(0, len(ids), 50):
    batch = ids[i:i+50]
    # ORES takes the revision IDs as a pipe-separated list
    revids = '|'.join(str(revid) for revid in batch)
    url = f'https://ores.wmflabs.org/v3/scores/{db}/?models=damaging|goodfaith&revids={revids}'
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    scores = r.json()[db]['scores']
    # Store each model's boolean prediction on the matching revision
    for rev, models in scores.items():
        data[int(rev)]['damaging'] = models['damaging']['score']['prediction']
        data[int(rev)]['goodfaith'] = models['goodfaith']['score']['prediction']
        
print('Done')
Done
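
The batching can be tried on its own, without touching the network. The sketch below splits a list of made-up revision IDs into groups of at most 50 and builds the same kind of ORES URL as the cell above. `chunked` and `build_url` are hypothetical helper names, and no request is actually sent.

```python
from typing import Iterator, List


def chunked(items: List[int], size: int = 50) -> Iterator[List[int]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_url(db: str, revids: List[int]) -> str:
    """Build a scores URL with the revision IDs joined by pipes."""
    joined = '|'.join(str(revid) for revid in revids)
    return f'https://ores.wmflabs.org/v3/scores/{db}/?models=damaging|goodfaith&revids={joined}'


sample_ids = list(range(1000, 1120))  # 120 made-up revision IDs
batches = list(chunked(sample_ids))
print([len(batch) for batch in batches])  # [50, 50, 20]
print(build_url('enwiki', [1, 2, 3]))
```

With 250 revisions this works out to five requests of 50 IDs each.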

Make some basic calculations

regCons = 0
regDest = 0
ipCons = 0
ipDest = 0
ts = []

# Sort revisions: constructive means predicted good faith and not damaging
for rev, info in data.items():
    if not info['damaging'] and info['goodfaith']:
        if info['registered']:
            regCons += 1
        else:
            ipCons += 1
    elif info['registered']:
        regDest += 1
    else:
        ipDest += 1
    ts.append(info['timestamp'])

# Calculate totals and percentages
regTot = regCons + regDest
ipTot = ipCons + ipDest

regTotP = regTot / revs * 100
ipTotP = ipTot / revs * 100

regConsP = regCons / regTot * 100
regDestP = regDest / regTot * 100
regConsTP = regCons / revs * 100
regDestTP = regDest / revs * 100

ipConsP = ipCons / ipTot * 100
ipDestP = ipDest / ipTot * 100
ipConsTP = ipCons / revs * 100
ipDestTP = ipDest / revs * 100


# Calculate revision timestamp range
tss = sorted(ts)
tsStart = tss[0]
tsEnd = tss[-1]

print('Done')
Done
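
The percentages above divide by `regTot` and `ipTot`, so a small sample with no registered or no unregistered edits would raise a `ZeroDivisionError`. Below is a minimal sketch of the same tally with a guarded percentage, run on made-up rows rather than real ORES output. `is_constructive` and `pct` are hypothetical helpers, and returning 0.0 for an empty group is my own choice.

```python
from collections import Counter


def is_constructive(damaging: bool, goodfaith: bool) -> bool:
    """Constructive means predicted good faith and not damaging."""
    return goodfaith and not damaging


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, or 0.0 when `whole` is zero."""
    return part / whole * 100 if whole else 0.0


# Made-up rows in the same shape as the values stored in `data`
sample = [
    {'registered': True, 'damaging': False, 'goodfaith': True},
    {'registered': True, 'damaging': True, 'goodfaith': True},
    {'registered': False, 'damaging': False, 'goodfaith': True},
    {'registered': False, 'damaging': True, 'goodfaith': False},
]

# Count revisions by (registered, constructive)
counts = Counter(
    (row['registered'], is_constructive(row['damaging'], row['goodfaith']))
    for row in sample
)
reg_total = counts[(True, True)] + counts[(True, False)]
print(pct(counts[(True, True)], reg_total))  # 50.0
print(pct(0, 0))                             # 0.0
```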

Output

IPython.display.display_markdown(f'''\
Out of {revs} revisions scanned between {tsStart} and {tsEnd}:
- Registered editors made {regTot} contributions ({regTotP:.3}%)
  - {regCons} were likely constructive ({regConsP:.3}% of registered contributions, {regConsTP:.3}% of all contributions)
  - {regDest} were likely nonconstructive ({regDestP:.3}% of registered contributions, {regDestTP:.3}% of all contributions)
- Unregistered editors made {ipTot} contributions ({ipTotP:.3}%)
  - {ipCons} were likely constructive ({ipConsP:.3}% of unregistered contributions, {ipConsTP:.3}% of all contributions)
  - {ipDest} were likely nonconstructive ({ipDestP:.3}% of unregistered contributions, {ipDestTP:.3}% of all contributions)
''', raw=True)

Out of 250 revisions scanned between 2019-08-14T18:25:13Z and 2019-08-14T18:27:55Z:

  • Registered editors made 174 contributions (69.6%)
    • 168 were likely constructive (96.6% of registered contributions, 67.2% of all contributions)
    • 6 were likely nonconstructive (3.45% of registered contributions, 2.4% of all contributions)
  • Unregistered editors made 76 contributions (30.4%)
    • 53 were likely constructive (69.7% of unregistered contributions, 21.2% of all contributions)
    • 23 were likely nonconstructive (30.3% of unregistered contributions, 9.2% of all contributions)
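
To line these numbers up against the quoted statistic, the two halves of the claim can be worked out from the counts reported above. Keep in mind that these are ORES predictions rather than human review, that "nonconstructive" is broader than vandalism, and that this is a single sample covering under three minutes of edits.

```python
# Counts copied from the output above
reg_cons, reg_dest = 168, 6
ip_cons, ip_dest = 53, 23

# "Most vandalism comes from IPs": share of nonconstructive edits made by IPs
nonconstructive = reg_dest + ip_dest
ip_share = ip_dest / nonconstructive * 100

# "Most IPs are not vandals": share of IP edits that were constructive
ip_good = ip_cons / (ip_cons + ip_dest) * 100

print(f'{ip_dest} of {nonconstructive} nonconstructive edits came from IPs ({ip_share:.1f}%)')
print(f'{ip_cons} of {ip_cons + ip_dest} IP edits were constructive ({ip_good:.1f}%)')
```

In this sample, 23 of the 29 likely nonconstructive edits (79.3%) came from unregistered editors, while 53 of their 76 edits (69.7%) were likely constructive.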

Licensing

Copyright 2019 AntiCompositeNumber

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.