SPARQL as a Wikidata pywikibot generator

In Wikidata, complex queries can be performed because the data is stored in a structured way. SPARQL is the query language used by the Wikibase software that powers Wikidata.

SPARQL is designed for querying what is generally called key-value-like data, which is exactly how Wikidata stores its data: (property, value) tuples. More precisely, it is a query language for RDF. RDF (Resource Description Framework) is a W3C specification for writing metadata model graphs, i.e. it specifies a way to express certain kinds of relational diagrams.

To run and test SPARQL queries against Wikidata, a query service is available at https://query.wikidata.org - use it while going through this tutorial.

1. Turtle

The basic building block of a SPARQL query is an RDF triple, usually written in Turtle ("Terse RDF Triple Language") syntax. A triple is a 3-tuple whose items represent a subject, a predicate and an object. In Wikidata terms, the three items are item (subject), property (predicate) and value (object).

For example, in Wikidata we can express the following triples (in plain language for now):

  • (Douglas Adams, instance of, human)
  • (Douglas Adams, educated at, St John's College)

In Wikidata's SPARQL service, some prefixes have pre-defined meanings. The two most common ones are wdt: and wd::

  • wdt: - The wdt: prefix is used for properties. For example, wdt:P856 refers to property P856 (official website) in Wikidata.
  • wd: - The wd: prefix is used for entities or items. For example, wd:Q42 refers to item Q42 (Douglas Adams) in Wikidata.

These prefixes can be changed, and further prefixes can be defined, using the PREFIX keyword (written @prefix in Turtle). The query service declares the standard ones automatically, so you can treat the following two lines as implicitly added to every query:

@prefix wd: <http://www.wikidata.org/entity/>
@prefix wdt: <http://www.wikidata.org/prop/direct/>


Hence, the triples mentioned above are written as follows in SPARQL:

wd:Q42 wdt:P31 wd:Q5 .        # Douglas Adams is an instance of human
wd:Q42 wdt:P69 wd:Q691283 .   # Douglas Adams was educated at St John's College

The full set of standard prefixes declared by Wikidata is:

@prefix wd: <http://www.wikidata.org/entity/> 
@prefix wdt: <http://www.wikidata.org/prop/direct/>
@prefix wikibase: <http://wikiba.se/ontology#>
@prefix p: <http://www.wikidata.org/prop/>
@prefix ps: <http://www.wikidata.org/prop/statement/>
@prefix pq: <http://www.wikidata.org/prop/qualifier/>
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
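Prefixes are simply abbreviations for URI namespaces. As an illustration (a plain-Python sketch, not part of pywikibot), the two most common prefixes can be expanded into the full URIs declared above like this:

```python
# Sketch: expanding prefixed names into the full URIs declared above.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}

def expand(term):
    """Expand a prefixed name such as 'wd:Q42' into its full URI."""
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local

print(expand("wd:Q42"))   # http://www.wikidata.org/entity/Q42
print(expand("wdt:P31"))  # http://www.wikidata.org/prop/direct/P31
```

This is exactly the substitution the query service performs before matching triples against its RDF store.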

2. Writing a simple SPARQL query

Using triples, we can write a basic query that fetches all items with a specific property value. The syntax is:

SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . } LIMIT 100

The word ?item behaves like a variable. The query above means "return items that are an instance of (P31) human (Q5), limited to 100 results". Let us try fetching this data in pywikibot:

import pywikibot
import pywikibot.pagegenerators as pagegen
from pprint import pprint

wikidata = pywikibot.Site("wikidata", "wikidata")
human_list = list(pagegen.WikidataSPARQLPageGenerator("SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . } LIMIT 5", site=wikidata))
pprint(human_list)
[ItemPage('Q42'),
 ItemPage('Q80'),
 ItemPage('Q91'),
 ItemPage('Q76'),
 ItemPage('Q23')]
for human in human_list:
    print(human, human.get()['labels']['en'])
WARNING: API error mwoauth-invalid-authorization-invalid-user: The authorization headers in your request are for a user that does not exist here
---------------------------------------------------------------------------
NoUsername                                Traceback (most recent call last)
<ipython-input-6-a473a1ead5b0> in <module>()
      1 for human in human_list:
----> 2     print(human, human.get()['labels']['en'])

/srv/paws/lib/python3.4/site-packages/pywikibot/page.py in <lambda>(value, site)
   4275             ItemPage(site, 'Q' + str(value['numeric-id'])),
   4276         'commonsMedia': lambda value, site:
-> 4277             FilePage(pywikibot.Site('commons', 'commons'), value),
   4278         'globe-coordinate': pywikibot.Coordinate.fromWikibase,
   4279         'time': lambda value, site: pywikibot.WbTime.fromWikibase(value),

[... intermediate frames elided ...]

NoUsername: Failed OAuth authentication for commons:commons: The authorization headers in your request are for a user that does not exist here

Note: this failure is specific to the environment, not to the query. As the traceback shows, resolving a commonsMedia claim makes pywikibot contact Wikimedia Commons, where the OAuth credentials of this session are not valid. With working credentials, the loop simply prints each item's English label.

The pywikibot.pagegenerators.WikidataSPARQLPageGenerator function is restricted: it only accepts queries that return item pages, and it expects the variable to be named ?item. SPARQL itself is considerably more flexible and can produce other kinds of output.

For example, try the following query which should list all the places Douglas Adams (Q42) was educated at (P69):

pprint(list(pagegen.WikidataSPARQLPageGenerator("SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5", site=wikidata)))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-f83cd9dda198> in <module>()
----> 1 pprint(list(pagegen.WikidataSPARQLPageGenerator("SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5", site=wikidata)))

/srv/paws/lib/python3.4/site-packages/pywikibot/pagegenerators.py in WikidataSPARQLPageGenerator(query, site, item_name, endpoint, result_type)
   2746     data = query_object.get_items(query,
   2747                                   item_name=item_name,
-> 2748                                   result_type=result_type)
   2749     items_pages = (pywikibot.ItemPage(repo, item) for item in data)
   2750     if isinstance(site, pywikibot.site.DataSite):

/srv/paws/lib/python3.4/site-packages/pywikibot/data/sparql.py in get_items(self, query, item_name, result_type)
    121         res = self.select(query, full_data=True)
    122         if res:
--> 123             return result_type(r[item_name].getID() for r in res)
    124         return result_type()
    125 

/srv/paws/lib/python3.4/site-packages/pywikibot/data/sparql.py in <genexpr>(.0)
    121         res = self.select(query, full_data=True)
    122         if res:
--> 123             return result_type(r[item_name].getID() for r in res)
    124         return result_type()
    125 

KeyError: 'item'

This gives a KeyError because the result has no variable named item. (The signature shown in the traceback reveals an item_name parameter, so passing item_name='val' would point the generator at the right variable.) Running the same query on https://query.wikidata.org gives the appropriate result. Run the next code block and click the "Run" button to see the query:

from IPython.display import IFrame
IFrame('https://query.wikidata.org/#SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5', width="100%", height="400px")
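The KeyError itself is easy to reproduce without pywikibot: internally, get_items indexes each result row by the expected variable name (item by default, per the item_name parameter visible in the traceback), so rows keyed val fail the lookup. A minimal sketch of that lookup:

```python
# Sketch: result rows keyed by the variable name 'val', as our query returns
rows = [{"val": "http://www.wikidata.org/entity/Q691283"}]

item_name = "item"  # the default variable name the generator expects
caught = None
try:
    ids = [row[item_name] for row in rows]
except KeyError as err:
    caught = str(err)
print("KeyError:", caught)  # KeyError: 'item'
```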

3. Running generic SPARQL queries in Pywikibot

Pywikibot can also run arbitrary SPARQL queries through the SparqlQuery class:

from pywikibot.data.sparql import SparqlQuery

wikiquery = SparqlQuery()
wikiquery.query('SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5')

The result given by SparqlQuery.query() is a bit raw: it is simply the RDF result converted to JSON. Hence the pywikibot API, using ItemPage and Claim after creating the appropriate page generator, is normally an easier way to get data from pages.

If you're sure the query is a SELECT query, the .select() method is a much cleaner way to get the data, as it parses the JSON and sanitizes it:

wikiquery.select('SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5')

But the values here are still the URIs from the RDF result rather than ItemPage objects, so this approach is rather limited in functionality.
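Extracting entity IDs from those URIs is straightforward, though. Here is a small plain-Python helper, with hypothetical sample rows shaped like a .select() result for the query above:

```python
# Hypothetical sample rows, shaped like the output of .select() above
rows = [
    {"val": "http://www.wikidata.org/entity/Q691283"},
    {"val": "http://www.wikidata.org/entity/Q4961791"},
]

def to_qid(uri):
    """Strip a Wikidata entity URI down to its bare Q-id."""
    return uri.rsplit("/", 1)[-1]

qids = [to_qid(row["val"]) for row in rows]
print(qids)  # ['Q691283', 'Q4961791']
```

The resulting IDs can then be handed to pywikibot.ItemPage to construct full item objects.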

Resources

For a more elaborate RDF guide to SPARQL, check out https://commons.wikimedia.org/wiki/File:Wikidata%27s_SPARQL_introduction_presentation.pdf

For the complete guide to Wikidata's SPARQL service, check out https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries

Also, check out the example queries at https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples#Cats and https://query.wikidata.org/ to understand more complex queries.