SPARQL as a Wikidata Pywikibot generator

In Wikidata, complex queries can be performed because the data is stored in a structured way. SPARQL is the query language used by Wikibase, the software that drives Wikidata.

SPARQL is designed to query data stored as (key, value)-like pairs, which is how Wikidata stores its data: as (property, value) tuples attached to an item. More generally, it is a query language for RDF. RDF (Resource Description Framework) is a W3C specification for modelling metadata as graphs, i.e. it specifies a way to write certain kinds of relational diagrams.

To run and test SPARQL queries on Wikidata, a query service is available at https://query.wikidata.org. Use it while going through this tutorial.

1. Turtle

The basic building block of a SPARQL query is an RDF triple written in Turtle syntax; this tutorial calls such a triple a "turtle". Turtle stands for "Terse RDF Triple Language". A triple is a 3-tuple whose items represent a subject, a predicate and an object. In Wikidata terms, the three items are subject, property and value.
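To make the 3-tuple idea concrete, here is a minimal sketch that models a triple in plain Python and renders it as a single statement. The names Triple and to_pattern are made up for this illustration and are not part of Pywikibot; the wd: and wdt: prefixes used in the values are explained in the text that follows.

```python
from collections import namedtuple

# A triple has three parts: subject, property (predicate) and value (object).
# "Triple" and "to_pattern" are hypothetical helper names for this sketch.
Triple = namedtuple("Triple", ["subject", "prop", "value"])


def to_pattern(triple):
    """Render a triple as one statement, terminated by ' .'."""
    return f"{triple.subject} {triple.prop} {triple.value} ."


# "Douglas Adams (Q42) is an instance of (P31) human (Q5)"
adams_is_human = Triple("wd:Q42", "wdt:P31", "wd:Q5")

# "Douglas Adams (Q42) was educated at (P69) something": the unknown
# value is written as a variable, which starts with a question mark.
adams_school = Triple("wd:Q42", "wdt:P69", "?val")

print(to_pattern(adams_is_human))
print(to_pattern(adams_school))
```

A triple with a variable in one position is exactly the kind of pattern that later appears inside the WHERE { ... } part of a query.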

For example, in Wikidata, we can write the following turtles:

In Wikidata, SPARQL has some special definitions (prefixes) which have been given predefined meanings. The two most common ones are the wdt: and wd: prefixes:

  • wdt: - The wdt prefix is used for a property. For example, wdt:P856 refers to the P856 property in Wikidata.
  • wd: - The wd prefix is used for entities or items. For example, wd:Q42 refers to the Q42 item in Wikidata.

These names are only conventions: they can be changed, and other prefixes can be defined with the PREFIX keyword (Turtle files use the equivalent @prefix form). Hence, you can simply assume that the following two lines are added to every query by default:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>


Hence, the above-mentioned turtles are written as follows in SPARQL:

The standard prefixes used by Wikidata are:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
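A prefix is only an abbreviation for the start of a full IRI. The following sketch shows what the abbreviation expands to, and how a prefix header for a SPARQL query could be generated. The names PREFIXES, expand and prefix_header are made up for this illustration; only the two IRIs are taken from the list above.

```python
# Two of the standard Wikidata prefixes, copied from the list above.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'wd:Q42' into its full IRI."""
    prefix, _, local = name.partition(":")
    return prefixes[prefix] + local


def prefix_header(prefixes=PREFIXES):
    """Build the PREFIX lines that could be placed in front of a query."""
    return "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in prefixes.items())


print(expand("wd:Q42"))
print(expand("wdt:P856"))
print(prefix_header())
```

On the Wikidata query service these prefixes are already defined, so the header is only needed when sending queries to an endpoint that does not know them.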

2. Writing a simple SPARQL query

Using turtles, we can define a basic query which fetches all items with a specific property value. The syntax for this is:

SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . } LIMIT 100

Here ?item is a variable; any name that starts with ? is one. The query above means "Return all items that are an instance of (P31) human (Q5), limited to 100 items". Let us try fetching this data in Pywikibot:

import pywikibot
import pywikibot.pagegenerators as pagegen
from pprint import pprint

# Connect to Wikidata and run the query; the generator yields ItemPage objects.
wikidata = pywikibot.Site("wikidata", "wikidata")
human_list = list(pagegen.WikidataSPARQLPageGenerator("SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . } LIMIT 5", site=wikidata))
pprint(human_list)
for human in human_list:
    # Not every item has an English label, so use .get() to avoid a KeyError.
    print(human, human.get()['labels'].get('en'))
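Since the query follows a fixed pattern, the query string can be generated instead of typed by hand. The helper below is a minimal sketch of that idea; items_with_value is a made-up name, not a Pywikibot function, and it only builds the string without contacting Wikidata.

```python
import re


def items_with_value(prop, value, limit=100):
    """Build 'all items whose property `prop` has the item `value`'.

    Hypothetical helper for this tutorial: it only formats a string.
    """
    # Reject anything that is not a plain property or item ID, so that
    # arbitrary text cannot end up inside the query.
    if not re.fullmatch(r"P[1-9]\d*", prop):
        raise ValueError(f"not a property ID: {prop!r}")
    if not re.fullmatch(r"Q[1-9]\d*", value):
        raise ValueError(f"not an item ID: {value!r}")
    # Doubled braces in an f-string produce literal { and }.
    return f"SELECT ?item WHERE {{ ?item wdt:{prop} wd:{value} . }} LIMIT {int(limit)}"


# Instance of (P31) human (Q5), limited to 5 results.
print(items_with_value("P31", "Q5", limit=5))
```

The returned string can be passed to pagegen.WikidataSPARQLPageGenerator in place of the literal query used above.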

The pywikibot.pagegenerators.WikidataSPARQLPageGenerator function is restricted: it only accepts queries whose result is a single column of items, each of which becomes an ItemPage. By default it also expects the variable to be named ?item. SPARQL itself is considerably more flexible, as it can generate many different types of output.

For example, try the following query which should list all the places Douglas Adams (Q42) was educated at (P69):

pprint(list(pagegen.WikidataSPARQLPageGenerator("SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5", site=wikidata)))

This raises a KeyError saying that item was not found. Running the same query on https://query.wikidata.org gives the expected result. Run the next code block and click the "Run" button to see the query result:

from IPython.display import IFrame
IFrame('https://query.wikidata.org/#SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5', width="100%", height="400px")
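One way around the KeyError is to rename the variable to ?item before handing the query to the page generator. The sketch below does this with a regular expression; rename_variable and educated_at_pages are made-up names for this illustration. Recent Pywikibot versions also appear to accept an item_name argument on WikidataSPARQLPageGenerator for the same purpose, but treat that as an assumption and check your installed version.

```python
import re


def rename_variable(query, old, new="item"):
    """Rename the SPARQL variable ?old to ?new everywhere in the query."""
    # \b keeps ?val from matching inside a longer name such as ?value.
    return re.sub(r"\?" + re.escape(old) + r"\b", "?" + new, query)


query = "SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5"
item_query = rename_variable(query, "val")
print(item_query)


def educated_at_pages(site):
    """Run the renamed query; not called here since it needs network access."""
    import pywikibot.pagegenerators as pagegen
    return list(pagegen.WikidataSPARQLPageGenerator(item_query, site=site))
```

With the wikidata site object created earlier, educated_at_pages(wikidata) should then return ItemPage objects instead of raising a KeyError.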

3. Running generic SPARQL queries in Pywikibot

Pywikibot can also run arbitrary SPARQL queries using the SparqlQuery class:

from pywikibot.data.sparql import SparqlQuery

wikiquery = SparqlQuery()
wikiquery.query('SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5')

The result returned by SparqlQuery.query() is a bit raw: it is just the raw query result converted to JSON. Hence the Pywikibot API based on ItemPage and Claim is normally an easier way to get data from the pages, after creating the appropriate page generator.

If you're sure that the query is a SELECT query, then the .select() method is a much cleaner way to get the data, as it parses the JSON and sanitizes it:
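To get a feel for what "raw" means here, the sketch below walks through a dictionary shaped like the standard SPARQL JSON results format (a "head" with the variable names and "results" with one binding per row). The sample is hand-written for illustration and is not actual output of the query above; flatten is a made-up helper name.

```python
# Hand-written sample in the shape of the SPARQL JSON results format.
# It is NOT real output of the Douglas Adams query.
sample = {
    "head": {"vars": ["val"]},
    "results": {
        "bindings": [
            {"val": {"type": "uri", "value": "http://www.wikidata.org/entity/Q5"}},
            {"val": {"type": "uri", "value": "http://www.wikidata.org/entity/Q42"}},
        ]
    },
}


def flatten(raw):
    """Reduce each result row to a plain {variable: value} dictionary."""
    variables = raw["head"]["vars"]
    return [
        # A variable can be unbound in a row, hence the membership check.
        {var: row[var]["value"] for var in variables if var in row}
        for row in raw["results"]["bindings"]
    ]


for row in flatten(sample):
    print(row)
```

This is roughly the kind of clean-up that would otherwise have to be done by hand on the raw result.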

wikiquery.select('SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5')

But the data here still contains the entity URL given by RDF rather than an ItemPage, hence it is rather limited in functionality.
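The gap between a URL and an ItemPage can be bridged by cutting the item ID out of the URL. The following is a minimal sketch, assuming the values are entity URLs that start with the wd: prefix shown earlier; qid_from_uri and item_pages are made-up names for this illustration.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def qid_from_uri(uri):
    """Extract the item ID (for example 'Q42') from a Wikidata entity URL."""
    if not uri.startswith(ENTITY_PREFIX):
        raise ValueError(f"not a Wikidata entity URL: {uri!r}")
    return uri[len(ENTITY_PREFIX):]


def item_pages(rows, variable, repo):
    """Turn rows such as {'val': '<entity URL>'} into ItemPage objects.

    Not called here, because it needs Pywikibot and a site object (repo).
    """
    import pywikibot
    return [pywikibot.ItemPage(repo, qid_from_uri(row[variable])) for row in rows]


print(qid_from_uri("http://www.wikidata.org/entity/Q42"))
```

With the objects created earlier, item_pages(wikiquery.select(...), 'val', wikidata) would give pages that can be used with the usual ItemPage and Claim API.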

Resources

For a more elaborate guide to RDF and SPARQL, check out https://commons.wikimedia.org/wiki/File:Wikidata%27s_SPARQL_introduction_presentation.pdf

For the complete guide to Wikidata's SPARQL, check out https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries

Also, check out the example queries in https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples#Cats and https://query.wikidata.org/ to understand more complex queries.