# Pywikibot Introduction

Pywikibot is a Python library that makes it much easier to make automated edits on MediaWiki sites.

**Warning**: You are accountable for every edit you or your Python script makes. Be careful and don't get banned!

import pywikibot


# 1. Using a MediaWiki site

The first thing pywikibot needs to know is which MediaWiki website to target. There are many official sites, such as en.wikipedia.org, commons.wikimedia.org, en.wiktionary.org, en.wikiquote.org, en.wikinews.org and en.wikisource.org. Most of them exist in several languages, for example ml.wikipedia.org and ml.wiktionary.org.

The default website on PAWS is test.wikipedia.org. To check it out, go to https://test.wikipedia.org

testwiki = pywikibot.Site()
testwiki

APISite("test", "wikipedia")

A MediaWiki website is identified by two parts: the code and the family. Pywikibot supports a large number of official families and codes, and can also be pointed at a local instance or a personal deployment of MediaWiki.

The family tells pywikibot which type of MediaWiki site to use, so that it can read and write data specific to that family. Examples of families are wikipedia, wiktionary and wikisource.

The code tells pywikibot which variant of the family to use. Common examples of codes are en, es and ml. The available codes depend on the family, though. For example, the "commons" family has only the "commons" code.

enwiki = pywikibot.Site(code="en", fam="wikipedia")
enwiki

APISite("en", "wikipedia")

commons = pywikibot.Site(code="commons", fam="commons")
commons

APISite("commons", "commons")

wikidata = pywikibot.Site(code="wikidata", fam="wikidata")
wikidata

DataSite("wikidata", "wikidata")

testwikidata = pywikibot.Site(code="test", fam="wikidata")
testwikidata

DataSite("test", "wikidata")
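
To make the code/family distinction concrete, here is a rough sketch of how a (code, family) pair corresponds to a hostname. The `site_hostname` helper and the `SPECIAL_HOSTS` table are hypothetical and written only for illustration; pywikibot resolves sites through its own family definitions, not like this.

```python
# Illustrative only: a hand-written lookup, not how pywikibot resolves sites.
SPECIAL_HOSTS = {
    ("commons", "commons"): "commons.wikimedia.org",
    ("wikidata", "wikidata"): "www.wikidata.org",
}


def site_hostname(code, family):
    """Return the hostname for a (code, family) pair (hypothetical helper)."""
    # Single-site families use a fixed hostname.
    if (code, family) in SPECIAL_HOSTS:
        return SPECIAL_HOSTS[(code, family)]
    # Other families follow the <code>.<family>.org pattern.
    return "{}.{}.org".format(code, family)


for code, family in [("test", "wikipedia"), ("en", "wikipedia"),
                     ("ml", "wiktionary"), ("commons", "commons"),
                     ("test", "wikidata")]:
    print(code, family, "->", site_hostname(code, family))
```

The same family name with a different code simply points at a different site, which is why `pywikibot.Site` needs both values.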

# 2. Logging in

In the PAWS interface, the user is set by default to the account that was used to log in to PAWS. In a local script, we would need to modify the user-config.py file to add the username and password. We will see this later.

We tell pywikibot to log in with the login() function. Then we check which user is logged in:

testwiki.login()
print('Logged in user is:', testwiki.user())

Logged in user is: Rahul24kde


# 3. Reading data on Pages

To pull data from a wiki, we use pywikibot's Page class, which holds information about a single page on the MediaWiki website.

First, we create a Page object using the name of the page. Here, we use the page "User:Rahul24kde/Sandbox" as an example:

demo_page = pywikibot.Page(testwiki, 'User:Rahul24kde/Sandbox')
demo_page

Page('User:Rahul24kde/Sandbox')

Now we use the class to fetch other information about the page. For example, to get the text of the page:

print(demo_page.text)

== About Me ==

Hello!

My name is '''rahul'''.

I am from kannur and am learning how to use pywikibot !



You can get a lot of other information about the page by using various helper functions provided by pywikibot:

print("Check if page exists:", demo_page.exists())
print("Title of the page:", demo_page.title())
print("Contributors of the page:", demo_page.contributors())
print("Last edit made on page:", demo_page.editTime())
print("Full URL to page:", demo_page.full_url())

Check if page exists: True
Title of the page: User:Rahul24kde/Sandbox
Contributors of the page: Counter({'Rahul24kde': 1})
Last edit made on page: 2016-10-01T09:16:05Z
Full URL to page: https://test.wikipedia.org/wiki/User%3ARahul24kde/Sandbox


# 4. Writing data to Pages

In general, use the test Wikipedia website for writing data, and make your changes in your own user space (pages starting with User:<Your user name>), as these pages are meant for personal use such as testing these scripts :)

For example, let's create the object for your personal Sandbox page on test wiki:

sandbox = pywikibot.Page(testwiki, 'User:' + testwiki.user() + '/Sandbox')
sandbox

Page('User:Rahul24kde/Sandbox')
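
Before saving anything, a script can check that the target title really is inside your own user space. The sketch below is plain Python and does not use pywikibot; `in_own_userspace` is a hypothetical helper, and it only handles the underscore/space equivalence in titles, not every normalization MediaWiki performs.

```python
def in_own_userspace(title, username):
    """Return True if title is the user's page or one of its subpages."""
    # MediaWiki treats underscores and spaces in titles as equivalent.
    title = title.replace("_", " ")
    root = "User:" + username.replace("_", " ")
    # Require an exact match or a "/" so "User:AbcX" does not pass for "Abc".
    return title == root or title.startswith(root + "/")


print(in_own_userspace("User:Rahul24kde/Sandbox", "Rahul24kde"))        # True
print(in_own_userspace("User:Rahul24kdeX/Sandbox", "Rahul24kde"))       # False
print(in_own_userspace("Python (programming language)", "Rahul24kde"))  # False
```

A check like `in_own_userspace(sandbox.title(), testwiki.user())` placed in front of a save would be one way to avoid editing the wrong page by accident.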

Here, let's try writing some wiki markup to the page. For example, let's try making your profile !

sandbox.text = """

Hello!

My name is '''{name}'''.

I am from {hometown} and am learning how to use pywikibot !

""".format(name='rahul',
           hometown='kannur')
sandbox.save()

VERBOSE:pywiki:Found 1 wikipedia:test processes running, including this one.
Page [[User:Rahul24kde/Sandbox]] saved
INFO:pywiki:Page [[User:Rahul24kde/Sandbox]] saved


Let's open up the webpage and see if our changes have been added there.

Using Jupyter and IPython, we can even embed the webpage into the notebook:

from IPython.display import IFrame
IFrame(sandbox.full_url(), width="100%", height="400px")


# 5. Textlib functions

Once you can read content and save new content, you will often want to get the list of categories or templates used on a page.

A category lives in a special namespace (similar to the user space) and is used to classify pages. For example, the "Python (programming language)" page on Wikipedia has the categories "Category:Class-based programming languages", "Category:Cross-platform free software", "Category:Dynamically typed programming languages" and so on.

To add a category to a page, a link to the category must be added to the page's text. Hence, something like [[Category:<name of category>]] should be added to the wiki markup.

A template is a snippet of text which can be included into multiple other pages (Something like a #include or import). The wiki markup to add a template is {{<template name>}} and it can also take in arguments, for example {{<template name>|arg1|arg2}}.
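
To see what this markup looks like to a program, here is a small sketch that pulls category links and simple templates out of a string with regular expressions. It is a deliberately simplified illustration, not textlib's parser: the `find_categories` and `find_templates` helpers are hypothetical, the sample text is made up, and nested templates or localized namespace names are not handled.

```python
import re

# Simplified patterns: no nested templates, no localized namespace names.
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")


def find_categories(wikitext):
    """Return the category names linked from the wikitext."""
    return [name.strip() for name in CATEGORY_RE.findall(wikitext)]


def find_templates(wikitext):
    """Return (template name, [positional args]) pairs."""
    found = []
    for name, raw_args in TEMPLATE_RE.findall(wikitext):
        # raw_args looks like "|arg1|arg2", so drop the empty first piece.
        args = raw_args.split("|")[1:] if raw_args else []
        found.append((name.strip(), args))
    return found


sample = (
    "{{Example template|arg1|arg2}}\n"
    "Python is a programming language.\n"
    "[[Category:Dynamically typed programming languages]]\n"
    "[[Category:Cross-platform free software]]\n"
)
print(find_categories(sample))
print(find_templates(sample))
```

Real pages are messier than this sample, which is why a dedicated parser is preferable to ad hoc regular expressions in actual bots.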

python = pywikibot.Page(enwiki, 'Python_(programming_language)')
python

VERBOSE:pywiki:Found 1 wikipedia:en processes running, including this one.
WARNING: API error mwoauth-invalid-authorization-invalid-user: The authorization headers in your request are for a user that does not exist here
WARNING:pywiki:API error mwoauth-invalid-authorization-invalid-user: The authorization headers in your request are for a user that does not exist here

---------------------------------------------------------------------------
NoUsername                                Traceback (most recent call last)
<ipython-input-45-6f5e49fc5909> in <module>()
----> 1 python = pywikibot.Page(enwiki, 'Python_(programming_language)')
2 python

/srv/paws/lib/python3.4/site-packages/pywikibot/tools/__init__.py in wrapper(*__args, **__kw)
1445                              cls, depth)
1446                     del __kw[old_arg]
-> 1447             return obj(*__args, **__kw)
1448
1449         if not __debug__:

/srv/paws/lib/python3.4/site-packages/pywikibot/tools/__init__.py in wrapper(*__args, **__kw)
1445                              cls, depth)
1446                     del __kw[old_arg]
-> 1447             return obj(*__args, **__kw)
1448
1449         if not __debug__:

/srv/paws/lib/python3.4/site-packages/pywikibot/page.py in __init__(self, source, title, ns)
2176                 raise ValueError(u'Title must be specified and not empty '
2177                                  'if source is a Site.')
-> 2178         super(Page, self).__init__(source, title, ns)
2179
2180     @deprecate_arg("get_redirect", None)

/srv/paws/lib/python3.4/site-packages/pywikibot/page.py in __init__(self, source, title, ns)
158
159         if isinstance(source, pywikibot.site.BaseSite):
161             self._revisions = {}
162         elif isinstance(source, Page):

/srv/paws/lib/python3.4/site-packages/pywikibot/page.py in __init__(self, text, source, defaultNamespace)
4942         # See bug T104864, defaultNamespace might have been deleted.
4943         try:
-> 4944             self._defaultns = self._source.namespaces[defaultNamespace]
4945         except KeyError:
4946             self._defaultns = defaultNamespace

/srv/paws/lib/python3.4/site-packages/pywikibot/site.py in namespaces(self)
1012         """Return dict of valid namespaces on this wiki."""
1013         if not hasattr(self, '_namespaces'):
-> 1014             self._namespaces = NamespacesDict(self._build_namespaces())
1015         return self._namespaces
1016

/srv/paws/lib/python3.4/site-packages/pywikibot/site.py in _build_namespaces(self)
2608         # For versions lower than 1.14, APISite needs to override
2609         # the defaults defined in Namespace.
-> 2610         is_mw114 = MediaWikiVersion(self.version()) >= MediaWikiVersion('1.14')
2611
2612         for nsdata in self.siteinfo.get('namespaces', cache=False).values():

/srv/paws/lib/python3.4/site-packages/pywikibot/site.py in version(self)
2715         if not version:
2716             try:
-> 2717                 version = self.siteinfo.get('generator', expiry=1).split(' ')[1]
2718             except pywikibot.data.api.APIError:
2719                 # May occur if you are not logged in (no API read permissions).

/srv/paws/lib/python3.4/site-packages/pywikibot/site.py in get(self, key, get_default, cache, expiry)
1674                 elif not Siteinfo._is_expired(cached[1], expiry):
1675                     return copy.deepcopy(cached[0])
-> 1676         preloaded = self._get_general(key, expiry)

/srv/paws/lib/python3.4/site-packages/pywikibot/site.py in _get_general(self, key, expiry)
1620                         u"', '".join(props)), _logger)
1621             props += ['general']
-> 1622             default_info = self._get_siteinfo(props, expiry)
1623             for prop in props:
1624                 self._cache[prop] = default_info[prop]

/srv/paws/lib/python3.4/site-packages/pywikibot/site.py in _get_siteinfo(self, prop, expiry)
1546             # warnings are handled later
1547             request._warning_handler = warn_handler
-> 1548             data = request.submit()
1549         except api.APIError as e:
1550             if e.code == 'siunknown_siprop':

/srv/paws/lib/python3.4/site-packages/pywikibot/data/api.py in submit(self)
2341         if not cached_available:
-> 2342             self._data = super(CachedRequest, self).submit()
2343             self._write_cache(self._data)
2344         else:

/srv/paws/lib/python3.4/site-packages/pywikibot/data/api.py in submit(self)
2173                     continue
2174                 raise NoUsername('Failed OAuth authentication for %s: %s'
-> 2175                                  % (self.site, info))
2176             # raise error
2177             try:

NoUsername: Failed OAuth authentication for wikipedia:en: The authorization headers in your request are for a user that does not exist here

Let's get a list of all categories added to the page:

list(python.categories())


The textlib functions help to modify the text content of a page for specific needs like adding or removing categories. Textlib has its own parsers, which read through the text and pull out all the category links they find in the wiki markup.

pywikibot.textlib.getCategoryLinks(python.text)

print("Text categories in page:", len(pywikibot.textlib.getCategoryLinks(python.text)))
print("All categories associated with page:", len(list(python.categories())))


Let's try removing the category links from the text using the textlib functions:

new_text = pywikibot.textlib.removeCategoryLinks(python.text)
print("List of categories after the remove function:", pywikibot.textlib.getCategoryLinks(new_text))


### Other useful methods

Textlib contains many other functions that make editing the text in MediaWiki pages easier. For example:

• TimeStripper() - Helps to pull out all time strings in the text and converts it into python time object
• does_text_contain_section() - Checks whether the section with given name exists in the text
• extract_templates_and_params() - Fetches a list of templates with the arguments used in the template markup
• glue_template_and_params() - Takes a template and arguments and creates the appropriate wiki markup for it.
• removeHTMLParts() - Cleans the data by removing all HTML code in the page
• replaceCategoryInPlace() - If a category needs to be modified to another category, this replaces it inplace
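
To give a feel for what a helper such as does_text_contain_section() has to do, here is a simplified stand-in written with the standard `re` module. The `has_section` function is hypothetical and is not textlib's implementation; it only recognizes plain headings of the form `== Name ==`.

```python
import re


def has_section(wikitext, section_name):
    """Return True if a heading with this name appears in the wikitext."""
    # A heading is a line like "== Name ==" with 1 to 6 matching equals signs.
    pattern = (r"^(={1,6})[ \t]*" + re.escape(section_name)
               + r"[ \t]*\1[ \t]*$")
    return re.search(pattern, wikitext, flags=re.MULTILINE) is not None


page_text = "== About Me ==\n\nHello!\n\nMy name is '''rahul'''.\n"
print(has_section(page_text, "About Me"))  # True
print(has_section(page_text, "Contact"))   # False
```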

# 6. Page Generators

There are many situations where it is useful to create a "page generator", which iterates over multiple pages that share a common property. For example, suppose you want to find all pages in the "Wikimedia projects" category:

import pywikibot.pagegenerators

wiki_projects = pywikibot.pagegenerators.CategorizedPageGenerator(
    pywikibot.Category(enwiki, 'Category:Wikimedia projects'),
    recurse=False)

from pprint import pprint
pprint(list(wiki_projects))


We use the python module pprint (pretty print) to format the output in a better way rather than dumping it as a list.
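
Because a page generator yields pages one at a time, you do not have to turn it into a full list. The sketch below shows the idea with `itertools.islice`. To stay runnable without a wiki connection it uses `FakePage` and `fake_category_members`, which are made-up stand-ins for a pywikibot Page and a page generator; `first_titles` is likewise a hypothetical helper.

```python
from itertools import islice


class FakePage:
    """Stand-in for a pywikibot Page so the sketch runs without a wiki."""

    def __init__(self, title):
        self._title = title

    def title(self):
        return self._title


def fake_category_members():
    """Yield pages one at a time, like a page generator would."""
    for name in ["Wikipedia", "Wiktionary", "Wikiquote", "Wikinews"]:
        print("fetching", name)
        yield FakePage(name)


def first_titles(generator, limit):
    """Return the titles of at most `limit` pages from the generator."""
    # islice stops pulling from the generator once `limit` items are taken.
    return [page.title() for page in islice(generator, limit)]


print(first_titles(fake_category_members(), 2))
```

Only the first two "fetching" lines are printed, which shows that the remaining pages are never requested. With a real generator this can save many API calls on large categories.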

# 7. Exercises

### Exercise 1 - Write a script to remove trailing whitespace from a given page

In many mediawiki pages, we see that editors leave trailing whitespace at the bottom of the page. While this does not matter when the page is rendered for viewing, it adds unnecessary length to the article when downloading the text and raw wikicode.

Write a script to remove the trailing whitespace and keep only 1 newline at the end of the page. (Test this on a testwiki !)
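
As a starting point, the text transformation itself can be sketched in plain Python. The helper names here are hypothetical, and `FakePage` is a made-up stand-in with the same `text` attribute and `save()` method as a pywikibot Page, so the example runs offline. Treat it as one possible approach under those assumptions, not the reference solution.

```python
def normalize_trailing_whitespace(text):
    """Strip trailing whitespace and end the text with one newline."""
    return text.rstrip() + "\n"


def clean_page(page, summary="Remove trailing whitespace"):
    """Save the page only if its text actually changes."""
    new_text = normalize_trailing_whitespace(page.text)
    if new_text == page.text:
        return False  # nothing to fix, so avoid a pointless edit
    page.text = new_text
    page.save(summary=summary)
    return True


class FakePage:
    """Stand-in for pywikibot.Page so the sketch runs offline."""

    def __init__(self, text):
        self.text = text
        self.saved = False

    def save(self, summary=None):
        self.saved = True


page = FakePage("Hello!  \n\n\n   \n")
print(clean_page(page), repr(page.text))
```

To try it on a real page, pass a Page object from the test wiki (for example your own Sandbox page) instead of `FakePage`.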

### Exercise 2 - Write a script to find the number of devices using Android Operating System

Find the number of pages in the category for devices that use the Android operating system.

# 8. Setting up pywikibot locally

PAWS provides a way to run pywikibot and related commands through Jupyter notebooks. It already has the requirements needed for pywikibot scripts installed, so it is an easy way to get started. However, PAWS is a single shared server, so it gets crowded and slow when many people use it at once. In such cases, it may be easier to run these scripts locally on your own desktop or laptop.

### Installing basic requirements

First, install the basic requirements. This depends on your specific OS.

### Installing pywikibot

Pywikibot is currently still a release candidate, so rather than installing rc5 from pip, we will get the latest source code from the master branch using git. To do this, run the following command in your terminal or command prompt:

/home/user/git_repos/$ git clone https://github.com/wikimedia/pywikibot-core.git

You will find that the folder pywikibot-core has been created in the current working directory. If you wish to keep the folder elsewhere, simply move it to another directory, or use the cd command to change directory before running the above git command. Once the git repository has been downloaded, cd into the directory and run:

/home/user/git_repos/pywikibot-core/$ pip install .

This installs pywikibot from the repository into your Python installation. The . (dot) is required, as it tells pip to look for the Python package in the current directory. Pywikibot also has many optional dependencies, which are used to run specific scripts and unit tests. To install all of these (to avoid errors later), run:

/home/user/git_repos/pywikibot-core/$ pip install -r dev-requirements.txt -r requirements.txt

### Configuring pywikibot

Once the pywikibot library has been installed, simply use the pwb.py script provided in the git repo:

/home/user/git_repos/pywikibot-core/$ python pwb.py login

Then answer the questions to create a user-config.py, which holds your configuration information.
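
For orientation, a minimal user-config.py usually comes down to a few assignments. The sketch below writes such a file into a throwaway directory so you can see its shape. The `write_user_config` helper and the account name "ExampleUser" are placeholders invented for this example, and the file generated by the interactive questions may contain more settings than shown here.

```python
import os
import tempfile

CONFIG_TEMPLATE = """\
# Minimal pywikibot configuration (illustrative values).
family = '{family}'
mylang = '{code}'
usernames['{family}']['{code}'] = '{username}'
"""


def write_user_config(directory, family, code, username):
    """Write a minimal user-config.py into directory and return its path."""
    path = os.path.join(directory, "user-config.py")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(CONFIG_TEMPLATE.format(
            family=family, code=code, username=username))
    return path


# Demo in a throwaway directory; "ExampleUser" is a placeholder account name.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_user_config(tmp, "wikipedia", "test", "ExampleUser")
    with open(config_path, encoding="utf-8") as handle:
        print(handle.read())
```

Here `family` and `mylang` select the default site, matching the family and code described in section 1, and the `usernames` entry records which account to use on that site.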