Fixing redirects with pywikibot

This is a small notebook that you can import to fix redirects in statements (that is, change the value of the statement to the redirect target instead of the redirect itself). You can either fix individual statements or fix all statements returned by a query.

All edits are logged to the file fix-redirect.log.

Usage

from paws.TweetsFactsAndQueries.fixRedirect import fixRedirectInMainSnak, fixRedirectsInMainSnaksFromQuery

# fix a single statement
statementId = "Q0$1234abcd-1234-abcd-5678-1a2b3d4e5f"
fixRedirectInMainSnak(statementId, summary="Fix all redirects of fictional horses")

# fix statements returned by a query
query = """
SELECT ?statement WHERE {
  ?property a wikibase:Property;
            wikibase:claim ?p;
            wikibase:statementProperty ?ps.
  ?item ?p ?statement.
  ?statement ?ps ?redirect.
  ?redirect owl:sameAs ?value.
}
LIMIT 1
"""
fixRedirectsInMainSnaksFromQuery(query) # but please use a better query than the above one ☺

TODO

  • fixRedirectInMainSnak should probably refuse to fix redirects that are younger than some minimum age (one day, probably, to match KrBot). (Until I implement it, you can do it in the query by checking the schema:dateModified of the redirect.)

  • As we’re already logging – how about logging in QuickStatements syntax, so you can easily revert any redirect fixes that turned out to be incorrect (when the items had to be unmerged again)?
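
The age check from the first TODO item could look roughly like this. This is only a minimal sketch, assuming the time of the redirect's last modification is already available as a timezone-aware datetime; the name isOldEnough, the minAge parameter and its one-day default are hypothetical and not part of this notebook. How to obtain the timestamp from Pywikibot is deliberately left open.

```python
from datetime import datetime, timedelta, timezone

def isOldEnough(lastModified, now=None, minAge=timedelta(days=1)):
    """Return True if lastModified lies at least minAge before now.

    Hypothetical helper; both timestamps must be timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - lastModified >= minAge

# Fixed reference time so the example is reproducible.
now = datetime(2020, 1, 2, 12, 0, tzinfo=timezone.utc)
print(isOldEnough(datetime(2020, 1, 1, 11, 0, tzinfo=timezone.utc), now))  # redirect is 25 hours old
print(isOldEnough(datetime(2020, 1, 2, 11, 0, tzinfo=timezone.utc), now))  # redirect is 1 hour old
```

fixRedirectInMainSnak could then return early, without editing, whenever such a check fails.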


Some boilerplate code and imports.

from sys import stderr
import pywikibot as pwb
from pywikibot.data import sparql
site = pwb.Site("wikidata", "wikidata")
repo = site.data_repository()
sparqlQuery = sparql.SparqlQuery(repo=repo)

By default, Pywikibot logs to stderr whenever it sleeps after an API response told it to try again in n seconds. When doing lots of edits (e.g. with fixRedirectsInMainSnaksFromQuery), this produces a lot of output, which makes the notebook slow to load. Suppress this output as long as the sleep time isn’t overly long.

pwb.config.noisysleep = 10 # seconds

For the same reason, we don’t want to print lots of log lines to stdout, so instead redirect all our logging to a file.

log = open("fix-redirect.log", mode="a")

Helper function to extract the item ID part from a statement ID (which consists of the item ID, a $ sign, and then a UUID).

def itemIdFromStatementId(statementId):
    index = statementId.find("$")
    if index > 0:
        return statementId[:index]
    else:
        raise ValueError("{} is not a valid statement ID.".format(statementId))

Helper function to find a statement by its statement ID in an item object. I’m not aware of any Pywikibot function offering this functionality, but if there is one, please let me know!

def findStatementById(item, statementId):
    for propertyId in item.claims:
        for statement in item.claims[propertyId]:
            # Pywikibot keeps the statement ID (GUID) in the claim's `snak` attribute.
            if statement.snak == statementId:
                return statement
    return None

The main worker function. Given a statement ID, load the item, statement and value, and if the value is a redirect, change the statement to point to the redirect target instead.

def fixRedirectInMainSnak(statementId, summary="Fix redirect"):
    itemId = itemIdFromStatementId(statementId)
    item = pwb.ItemPage(repo, itemId)
    try:
        item.get()
    except pwb.NoPage:
        print("Item {} does not exist.".format(itemId), file=stderr)
        return
    statement = findStatementById(item, statementId)
    if statement is None:
        print("Statement {} not found on item {}.".format(statementId, itemId), file=stderr)
        return
    if statement.type != 'wikibase-item':
        print("Value of statement {} is not an item.".format(statementId), file=stderr)
        return
    value = statement.getTarget()
    if value.isRedirectPage():
        target = value.getRedirectTarget()
        print("Changing value of statement {} from redirect {} to target {}.".format(statementId, value.id, target.id), file=log, flush=True)
        statement.changeTarget(target, summary=summary)

Helper function to turn a statement URI from the Wikidata Query Service (or from an RDF dump, I suppose) into a statement ID.

def statementIdFromStatementUri(statementUri):
    baseUri = "http://www.wikidata.org/entity/statement/"
    if statementUri.startswith(baseUri):
        return statementUri[len(baseUri):].replace("-", "$", 1)
    else:
        raise ValueError("{} is not a valid statement URI.".format(statementUri))
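
As a quick sanity check, the two string helpers can be chained: a statement URI from the query service becomes a statement ID, which in turn yields the item ID. The functions are repeated here only so that the snippet runs on its own, and the UUID is made up for illustration.

```python
def itemIdFromStatementId(statementId):
    # Everything before the "$" is the item ID.
    index = statementId.find("$")
    if index > 0:
        return statementId[:index]
    raise ValueError("{} is not a valid statement ID.".format(statementId))

def statementIdFromStatementUri(statementUri):
    # In the URI, the first "-" stands in for the "$" of the statement ID.
    baseUri = "http://www.wikidata.org/entity/statement/"
    if statementUri.startswith(baseUri):
        return statementUri[len(baseUri):].replace("-", "$", 1)
    raise ValueError("{} is not a valid statement URI.".format(statementUri))

# Made-up example URI; only the first hyphen is turned into "$".
uri = "http://www.wikidata.org/entity/statement/Q42-1234abcd-1234-abcd-5678-1a2b3c4d5e6f"
statementId = statementIdFromStatementUri(uri)
print(statementId)                         # Q42$1234abcd-1234-abcd-5678-1a2b3c4d5e6f
print(itemIdFromStatementId(statementId))  # Q42
```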

Auxiliary worker function. Run the given query, read the ?statement variable of each result, and run fixRedirectInMainSnak on it.

def fixRedirectsInMainSnaksFromQuery(query, summary="Fix redirect"):
    for result in sparqlQuery.select(query):
        fixRedirectInMainSnak(statementIdFromStatementUri(result['statement']), summary)
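
When a query returns many statements, a single exception (a malformed URI, or an error raised during an edit) would abort the loop above. One possible way around that is sketched below, assuming it is acceptable to skip failures and look at them later. fixEach and its parameters are hypothetical names, not part of this notebook, and the fix function is passed in so that the sketch runs without Pywikibot.

```python
from sys import stderr

def fixEach(statementIds, fix, errorLog=stderr):
    """Call fix(statementId) for each ID and return the IDs for which fix raised."""
    failed = []
    for statementId in statementIds:
        try:
            fix(statementId)
        except Exception as error:
            # Record the failure and carry on with the next statement.
            print("Skipping statement {}: {}".format(statementId, error), file=errorLog)
            failed.append(statementId)
    return failed

# Stand-in for fixRedirectInMainSnak that rejects malformed statement IDs.
def fakeFix(statementId):
    if "$" not in statementId:
        raise ValueError("{} is not a valid statement ID.".format(statementId))

failed = fixEach(["Q1$a", "broken", "Q2$b"], fakeFix)
print(failed)  # ['broken']
```

In the notebook, fix could be something like lambda statementId: fixRedirectInMainSnak(statementId, summary).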