Wikidata is one of the newer families added to the Wikimedia projects. It acts as central storage for structured data for all the Wikimedia projects. It solves two major problems that Wikimedia projects used to face:
import pywikibot

wikidata = pywikibot.Site('wikidata', 'wikidata')
wikidata
testwikidata = pywikibot.Site('test', 'wikidata')
testwikidata
In Wikidata, every page is either an item or a property.
Items are used to represent all the things in human knowledge, including topics, concepts, and objects. For example, color (Q1075), Albert Einstein (Q937), Earth (Q2), and cat (Q146) are all items in Wikidata.
Properties are the things that describe and define an item. Each piece of data related to an item is attached to it through a property. Different types of items use different properties. Examples of properties for Python (Q28865) are: license (P275), bug tracking system (P1401), official website (P856), Stack Exchange tag (P1482).
The Wikidata API helps to query data from Wikidata using SPARQL to filter properties. So, for example, you can find all countries in the world which have a population between 10 million and 300 million with just one query. With the earlier category interface, there would have to be a category for this information, or you would have to parse every country's page to find the population using natural language processing!
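As a rough sketch of the population example above, the query could be written and turned into a request URL as follows. The property IDs P31 (instance of) and P1082 (population) appear elsewhere in this tutorial; using Q6256 as the "country" class and the helper names `population_query` and `query_url` are assumptions made for illustration. The code only builds the query and the URL, it does not send anything.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def population_query(low, high):
    """Return a SPARQL query for countries with low <= population <= high."""
    return f"""
SELECT ?country ?countryLabel ?population WHERE {{
  ?country wdt:P31 wd:Q6256 ;        # instance of: country (assumed class)
           wdt:P1082 ?population .   # population
  FILTER(?population >= {int(low)} && ?population <= {int(high)})
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def query_url(query):
    """Build a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = population_query(10_000_000, 300_000_000)
print(query)
print(query_url(query))
```

The printed query can also be pasted directly into the query editor at https://query.wikidata.org to see the results.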
Find an item on Wikidata and edit it to add some additional information. Some tips on finding an item:
The Wikidata game is an example of a tool which helps users contribute better to Wikidata. You can check out the Wikidata game at https://tools.wmflabs.org/wikidata-game
The Wikidata game uses structured queries to find pages which lack a certain type of information (for example, human items which have no gender) and shows the Wikipedia page related to that item. The user is then expected to identify a specific property of the item (for example, male or female for the property gender).
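To make the idea of "items of some type that lack a property" concrete, here is a minimal sketch of such a structured query. It is not the game's actual query; the helper name `missing_property_query` is hypothetical, and Q5 (human) and P21 (sex or gender) are the IDs assumed for the gender example.

```python
def missing_property_query(class_id, property_id, limit=10):
    """Return a SPARQL query for items of a class that lack a property."""
    return f"""
SELECT ?item WHERE {{
  ?item wdt:P31 wd:{class_id} .
  FILTER NOT EXISTS {{ ?item wdt:{property_id} ?value . }}
}}
LIMIT {int(limit)}
""".strip()


# Humans (Q5) without a "sex or gender" (P21) statement.
query = missing_property_query("Q5", "P21")
print(query)
```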
Reasonator is a script which pulls data from Wikidata and joins all the property data of an item to form a descriptive paragraph about the item. You can check it out at https://tools.wmflabs.org/reasonator/
Other than forming a descriptive paragraph about the item, it also groups similar properties into groups such as "Relative" and "External sources", based on some simple conditions. Where possible, it creates a timeline of the item from the properties that have a datetime data type. It also generates a QR code for the related Wikipedia page and shows related images pulled from the related commons.wikimedia page!
The first thing we're going to do is figure out how to get data from Wikidata using pywikibot. Unlike a generic MediaWiki website, where the text of the page is pulled, the data here is structured. We use the ItemPage class in pywikibot, which handles these items better:
itempage = pywikibot.ItemPage(wikidata, "Q42")  # Q42 is Douglas Adams
itempage
In Wikidata, page.text won't work as it does on other MediaWiki websites, where it gives a string of the whole content of the page. The data and properties are stored in a Python dictionary structure:
If you want to get the data using the title of the page rather than the item ID, you can get the Wikidata item associated with a Wikipedia page using:
itempage == pywikibot.ItemPage.fromPage(pywikibot.Page(pywikibot.Site('en', 'wikipedia'), 'Douglas Adams'))
Let's take a closer look at the data items given by an item page. It is a dictionary with the following keys:
itemdata = itempage.get()
itemdata.keys()
Labels are the name or title of the Wikidata item.
Aliases are alternate labels for the same item in the same language. For example, "Python (Q28865)", the programming language, has the aliases "Python language", "Python programming language", "/usr/bin/python", etc.
Descriptions are useful statements which can help distinguish items with similar labels. Wikidata items are unique only by their item ID (Qxx), hence the description helps differentiate between "Python (Q271218)" the genus of reptiles, "Python (Q28865)" the programming language, and "Python (Q15728)" the family of missiles!
As Wikidata does not have a specific language code (the code we use is "wikidata"), it has data for all languages. Hence, the same item can have a different label in English, Arabic, or French. So, these fields in the data are dictionaries with the language code as the key and the label in that language as the value.
For convenience, after itempage.get() is called, the data is also stored on the page object:
itemdata['labels'] == itempage.labels
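Because labels, descriptions, and aliases are keyed by language code, a common need is to pick a value with a fallback order. The following is a minimal sketch on a plain dictionary; the helper name `pick_text` is hypothetical, and the sample labels for Earth (Q2) are written out by hand rather than fetched.

```python
def pick_text(by_language, preferred=("en",)):
    """Return (language, text) for the first preferred language present, else None."""
    for code in preferred:
        if code in by_language:
            return code, by_language[code]
    return None


# Labels of Earth (Q2) in three languages, typed in by hand for illustration.
labels = {"en": "Earth", "fr": "Terre", "de": "Erde"}

print(pick_text(labels, ("es", "fr", "en")))  # falls back from Spanish to French
print(pick_text(labels, ("ja",)))             # no Japanese label in this sample
```

The same function should work on itempage.labels or itempage.descriptions, since those are also mappings from language code to text.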
Claims are other Wikidata pages that are linked to the given item using properties. The 'claims' are stored as another dictionary with the property IDs (P1003, P1005, P1006, etc.) as keys and a list of claim objects as values.
Hence, the claims are the values of all the properties that have been set for the given item.
# Similarly, this is available in the page object using:
itemdata['claims'] == itempage.claims
There are multiple claims for the property "P800", and we can ask pywikibot to resolve a claim and fetch the data about it:
So, we notice that the claim for the first "notable work (P800)" of "Douglas Adams (Q42)" is the item Q25169. As this is another ItemPage, we can fetch the English label for this item by doing:
p800_claim_target = itempage.claims['P800'][0].getTarget()  # first of several claims
p800_claim_target.get()
p800_claim_target.labels['en']
So, finally we were able to find one of the most notable works of Douglas Adams using the Wikidata API exposed by pywikibot. Imagine doing the same on the English Wikipedia!
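Since a property such as P800 can hold several claims, a small loop can collect the labels of all of them. This is a sketch under the assumption that every claim of the property points to another item; the helper name `target_labels` is hypothetical, while `get()`, `claims`, `getTarget()`, `labels`, and `getID()` are the pywikibot members it relies on.

```python
def target_labels(item, property_id, language="en"):
    """Return the labels of all item-valued claims of one property."""
    item.get()  # make sure the labels and claims are loaded
    labels = []
    for claim in item.claims.get(property_id, []):
        target = claim.getTarget()
        if target is None:  # "unknown value" or "no value" claims have no target
            continue
        target.get()  # load the target item's data
        # Fall back to the item ID when no label exists in this language.
        labels.append(target.labels.get(language, target.getID()))
    return labels


# With the objects from this tutorial (needs network access):
# target_labels(itempage, "P800")
```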
Thought exercise: How would you figure out the most notable work of the author using the chunk of text given by an English Wikipedia page?
Sometimes, it is important to be able to fetch data about the property itself, for example if we want to list the English label of the property and its value in tabular form.
To do this, we use a PropertyPage object to deal with properties:
propertypage = pywikibot.PropertyPage(wikidata, 'P512')
propertypage
With a PropertyPage, we can again access the data in the same way as it was accessed with the ItemPage.
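The tabular listing of property labels and values mentioned above could be sketched like this. The helper names `claim_rows` and `print_table` are hypothetical, and the lookup of a property's label is passed in as a function so that the sketch does not fix how labels are fetched; one possible pywikibot-based lookup is shown in the trailing comment.

```python
def claim_rows(item, label_of):
    """Return (property ID, property label, value) rows for every claim.

    Assumes item.get() has already been called so that item.claims is loaded.
    """
    rows = []
    for property_id, claims in item.claims.items():
        for claim in claims:
            rows.append((property_id, label_of(property_id), claim.getTarget()))
    return rows


def print_table(rows):
    """Print the rows in three left-aligned columns."""
    for property_id, label, value in rows:
        print(f"{property_id:<8}{label:<30}{value}")


# One possible label_of, using the site object from this tutorial:
# def label_of(property_id):
#     page = pywikibot.PropertyPage(wikidata, property_id)
#     page.get()
#     return page.labels["en"]
#
# print_table(claim_rows(itempage, label_of))
```

Each call to such a `label_of` fetches a page, so caching the results is worth considering when an item has many claims.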
On Wikidata, we've already seen ItemPages and PropertyPages. But sometimes a claim's value is not another ItemPage, and is some other data type like text, number, datetime, etc. Pywikibot provides a class for each of these data types, for easier access to the value when resolving the claim.
The data types available in Wikidata can be seen at: https://www.wikidata.org/wiki/Special:ListDatatypes
The classes provided by pywikibot and the corresponding Wikidata data types are:
pywikibot.page.ItemPage - Link to other items at the project.
pywikibot.page.PropertyPage - Link to properties at the project.
pywikibot.Coordinate - Literal data for a geographical position given as a latitude-longitude pair in gms or decimal degrees.
pywikibot.WbTime - Literal data field for a point in time.
pywikibot.WbQuantity - Literal data field for a quantity that relates to some kind of well-defined unit.
pywikibot.WbMonolingualText - Literal data field for a string that is not translated into other languages.
Some Wikidata data types exist only so that the value can be displayed differently, for example as a link. They all map to the Python str type. They are:
str - Literal data field for a string of glyphs. Generally do not depend on language of reader.
str - Literal data field for a URL.
str - Literal data field for an external identifier. External identifiers may automatically be linked to an authoritative resource for display.
str - Literal data field for mathematical expressions, formula, equations and such, expressed in a variant of LaTeX.
# Item
item = pywikibot.ItemPage(wikidata, "Q42").get()['claims']['P31'][0].getTarget()
print("Type:", type(item))
print("Instance of Douglas Adams:", item, '(', item.get()['labels']['en'], ')')

# Property
_property = pywikibot.PropertyPage(wikidata, "Property:P31")
_property.get()
print("Type:", type(_property))
print("Property 'instance of':", _property, '(', _property.labels['en'], ')')

# Global Coordinate
coord = pywikibot.ItemPage(wikidata, "Q668").get()['claims']['P625'][0].getTarget()
print("Type:", type(coord))
print("Coordinate location of India:", coord)

# Time
_time = pywikibot.ItemPage(wikidata, "Q28865").get()['claims']['P571'][0].getTarget()
print("Type:", type(_time))
print("Inception of Python (programming language):", _time)

# Quantity
qty = pywikibot.ItemPage(wikidata, "Q668").get()['claims']['P1082'][0].getTarget()
print("Type:", type(qty))
print("Population in India:", qty)

# Monolingual text
monolingual_text = pywikibot.ItemPage(wikidata, "Q42").get()['claims']['P1477'][0].getTarget()
print("Type:", type(monolingual_text))
print("Birth name of Douglas Adams:", monolingual_text)

# String
_string = pywikibot.ItemPage(wikidata, "Q28865").get()['claims']['P348'][0].getTarget()
print("Type:", type(_string))
print("Version of Python:", _string)

# URL
website = pywikibot.ItemPage(wikidata, "Q28865").get()['claims']['P856'][0].getTarget()
print("Type:", type(website))
print("Official website of Python:", website)

# External Identifier
iden = pywikibot.ItemPage(wikidata, "Q28865").get()['claims']['P646'][0].getTarget()
print("Type:", type(iden))
print("Freebase identifier of Python:", iden)

# Mathematical Formula
formula = pywikibot.ItemPage(wikidata, "Q11518").get()['claims']['P2534'][0].getTarget()
print("Type:", type(formula))
print("Formula of Pythagorean theorem:", formula)
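Going the other way, from a resolved value back to the Wikidata data type, is only partly possible because several data types share the Python str type. The lookup below is a sketch of that mapping keyed by class name, so it runs without importing pywikibot; the table and the helper name `possible_datatypes` are illustrative, and the table covers only the types listed in this section.

```python
# Class name of the resolved value -> Wikidata data type identifiers it may have.
DATATYPES_BY_CLASS = {
    "ItemPage": ["wikibase-item"],
    "PropertyPage": ["wikibase-property"],
    "Coordinate": ["globe-coordinate"],
    "WbTime": ["time"],
    "WbQuantity": ["quantity"],
    "WbMonolingualText": ["monolingualtext"],
    # Four data types all resolve to a plain Python string.
    "str": ["string", "url", "external-id", "math"],
}


def possible_datatypes(target):
    """Return the data types a resolved claim value could have (empty if unknown)."""
    return DATATYPES_BY_CLASS.get(type(target).__name__, [])


print(possible_datatypes("https://www.python.org/"))
print(possible_datatypes(3.14))  # not a type covered by the table
```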
Frequently, a property value may require additional data. For example, consider the property educated at (P69) in the data fetched earlier for Douglas Adams (Q42). Wikidata lists "St John's College" as one of the places where he was educated, with a "start time" of 1971 and an "end time" of 1974. It also says that his "academic major" is English literature and his "academic degree" is Bachelor of Arts.
None of this information could be stored if Wikidata were restricted to a (property, value) storage structure. Hence, Wikidata also allows qualifiers. Qualifiers expand on the data provided by a (property, value) pair by giving it context. Qualifiers themselves also consist of a (property, value) pair!
In the above example, the properties which are being used as qualifiers are start time, end time, academic major, and academic degree.
The qualifiers are again claims, as they are similar to the (property, value) pair for item pages. Let us see what the value of the qualifier is by resolving the claim:
# Fetch the label of the P512 (academic degree) qualifier
claim = itempage.claims['P69'][0]
claim.qualifiers['P512'][0].getTarget().get()
claim.qualifiers['P512'][0].getTarget().labels['en']
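To look at every qualifier of a claim at once, the qualifiers mapping can be walked in a loop. This is a minimal sketch that assumes `claim.qualifiers` maps each property ID to a list of qualifier claims, as the indexing above suggests; the helper name `qualifier_values` is hypothetical.

```python
def qualifier_values(claim):
    """Map each qualifier property ID to the list of its resolved values."""
    return {
        property_id: [qualifier.getTarget() for qualifier in qualifiers]
        for property_id, qualifiers in claim.qualifiers.items()
    }


# With the objects from this tutorial (needs network access):
# qualifier_values(itempage.claims['P69'][0])
```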
Some qualifiers may have a value which is not another item, for example the "start time" qualifier. In such cases, we need to check the type of the value:
claim = itempage.claims['P69'][0]
claim.qualifiers['P580'][0].getTarget()
WbTime is a pywikibot class which handles the Wikibase time format. Wikibase is the underlying technology that powers the structured data storage and editing of Wikidata.
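A WbTime value carries a precision alongside the date, so a "start time" of 1971 is a year and not 1 January 1971. The sketch below formats a value according to that precision. It relies only on the `year`, `month`, `day`, and `precision` attributes of WbTime and on the Wikibase precision codes 9 (year), 10 (month), and 11 (day); the helper name `format_time` is hypothetical.

```python
# Wikibase precision codes for the three most common cases.
YEAR, MONTH, DAY = 9, 10, 11


def format_time(value):
    """Format a WbTime-like value no more precisely than it was recorded."""
    if value.precision >= DAY:
        return f"{value.year:04d}-{value.month:02d}-{value.day:02d}"
    if value.precision == MONTH:
        return f"{value.year:04d}-{value.month:02d}"
    if value.precision == YEAR:
        return f"{value.year:04d}"
    # Decades, centuries and coarser: show the year with its precision code.
    return f"{value.year} (precision {value.precision})"


# With the objects from this tutorial (needs network access):
# format_time(itempage.claims['P69'][0].qualifiers['P580'][0].getTarget())
```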
Other functions to modify qualifiers are:
Other than the qualifier, we would also want the source of the data. Hence, the reference or source field helps in adding, removing, editing these:
Again, the source is a (property, value) where the property describes what type of source it is. Some popular properties are:
A source is again a list of (property, value) tuples. It can have additional properties like: "original language of work", "publisher", "author", "title", "retrieved", etc. if necessary.
Let us take a look at a source here:
source = itempage.claims['P69'][0].getSources()[0]  # first source of the first claim
source
# Get the claims stored under the first property in the source:
source['P248']
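A claim can have several sources, and each source can carry several properties. The following sketch resolves all of them, assuming `getSources()` returns a list of mappings from property ID to a list of claims; the helper name `source_values` is hypothetical.

```python
def source_values(claim):
    """Return one {property ID: [values]} dictionary per source of the claim."""
    return [
        {
            property_id: [reference.getTarget() for reference in references]
            for property_id, references in source.items()
        }
        for source in claim.getSources()
    ]


# With the objects from this tutorial (needs network access):
# source_values(itempage.claims['P69'][0])
```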
To read about more ways of using pywikibot to access Wikidata, go to https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial
SPARQL queries are SQL-like queries that can be run on Wikidata to fetch data from it. To try out SPARQL queries and visualize the data using nice plots, you can use https://query.wikidata.org. It has a lot of example SPARQL queries which can be useful for learning SPARQL.
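When a SPARQL endpoint is asked for JSON, the answer follows the standard SPARQL JSON results layout: a list of variable names under head.vars and one binding per row under results.bindings. The sketch below flattens such a response into plain dictionaries. The helper name `rows_from_response` is hypothetical, and the sample response is written by hand to show the layout; it is not output captured from the service.

```python
import json


def rows_from_response(text):
    """Turn a SPARQL JSON results document into a list of {variable: value} rows."""
    data = json.loads(text)
    variables = data["head"]["vars"]
    return [
        # Unbound variables are simply absent from a binding, so skip them.
        {name: binding[name]["value"] for name in variables if name in binding}
        for binding in data["results"]["bindings"]
    ]


# Hand-written sample in the standard layout (illustrative, not fetched).
sample = json.dumps({
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {
            "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q42"},
            "itemLabel": {"type": "literal", "xml:lang": "en", "value": "Douglas Adams"},
        },
        {
            "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"},
        },
    ]},
})

print(rows_from_response(sample))
```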