Pywikibot Introduction

Pywikibot is a Python library that makes it much easier to make automated edits on MediaWiki sites.

**Warning**: You are accountable for every edit you or your python script makes. Be careful and don't get banned!

import pywikibot

1. Using a mediawiki site

The first thing pywikibot needs to know is which MediaWiki website to target. There are many official sites, such as en.wikipedia.org, commons.wikimedia.org, en.wiktionary.org, en.wikiquote.org, en.wikinews.org, and en.wikisource.org. Most of these also have versions in other languages, such as ml.wikipedia.org and ml.wiktionary.org.

The default website on PAWS is test.wikipedia.org. To check the website out, go to https://test.wikipedia.org

testwiki = pywikibot.Site()
testwiki
APISite("test", "wikipedia")

A MediaWiki website is identified by two parts: the code and the family. Pywikibot supports a large number of official families and codes, and a local instance or personal deployment of MediaWiki can also be added.

The family tells pywikibot which type of MediaWiki site to use, so that it can read and write data specific to that family. Examples of families are: wikipedia, wiktionary, wikisource, etc.

The code tells pywikibot which variant of the family should be used. Common examples of codes are: en, es, ml, etc. The code depends on the family though. For example, the "commons" family has only the "commons" code.
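To make the code/family pairing concrete, here is a small sketch of how a (code, family) pair corresponds to a hostname for the sites named in this section. The `hostname_for` helper and its lookup table are hypothetical and written only for illustration; pywikibot resolves sites through its own family definitions, not through this function.

```python
# Hypothetical helper, for illustration only.
# pywikibot uses its own family definitions to resolve sites.

# Sites whose hostname does not follow the <code>.<family>.org pattern.
SPECIAL_HOSTS = {
    ("commons", "commons"): "commons.wikimedia.org",
    ("wikidata", "wikidata"): "www.wikidata.org",
}


def hostname_for(code, family):
    """Return the hostname for a (code, family) pair."""
    if (code, family) in SPECIAL_HOSTS:
        return SPECIAL_HOSTS[(code, family)]
    # Most sites follow the <code>.<family>.org pattern.
    return "{}.{}.org".format(code, family)


print(hostname_for("en", "wikipedia"))     # en.wikipedia.org
print(hostname_for("test", "wikipedia"))   # test.wikipedia.org
print(hostname_for("commons", "commons"))  # commons.wikimedia.org
```

The same pairs are what `pywikibot.Site(code=..., fam=...)` takes in the cells below.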

enwiki = pywikibot.Site(code="en", fam="wikipedia")
enwiki
APISite("en", "wikipedia")
commons = pywikibot.Site(code="commons", fam="commons")
commons
APISite("commons", "commons")
wikidata = pywikibot.Site(code="wikidata", fam="wikidata")
wikidata
DataSite("wikidata", "wikidata")
testwikidata = pywikibot.Site(code="test", fam="wikidata")
testwikidata
DataSite("test", "wikidata")

2. Logging in

In the PAWS interface, the user is set by default to the account used to log in to PAWS. In a local script, we would need to modify the user-config.py file to add the username and password. We will see this later.

We tell pywikibot to log in with the login() function. Then we check which user is logged in:

testwiki.login()
print('Logged in user is:', testwiki.user())
Logged in user is: Krisananthu

3. Reading data on Pages

To pull data from a wiki, we use the Page class, which holds information about a page on the MediaWiki website.

First, we create a Page object using the name of the page. Here, we use the page "User:AbdealiJK/Pywikibot_Tutorial" as an example:

demo_page = pywikibot.Page(testwiki, 'User:AbdealiJK/Pywikibot_Tutorial')
demo_page
Page('User:AbdealiJK/Pywikibot Tutorial')

Now we can use this object to fetch information about the page. For example, to get the text of the page:

print(demo_page.text)
Hi !

This is a example page for the pywikibot tutorial users to see.

You can get a lot of other information about the page by using various helper functions provided by pywikibot:

print("Check if page exists:", demo_page.exists())
print("Title of the page:", demo_page.title())
print("Contributors of the page:", demo_page.contributors())
print("Last edit made on page:", demo_page.editTime())
print("Full URL to page:", demo_page.full_url())
Check if page exists: True
Title of the page: User:AbdealiJK/Pywikibot Tutorial
Contributors of the page: Counter({'AbdealiJK': 2})
Last edit made on page: 2016-09-13T11:43:35Z
Full URL to page: https://test.wikipedia.org/wiki/User%3AAbdealiJK/Pywikibot_Tutorial
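The page title and the full URL above differ in two ways: the space became an underscore and the colon became %3A. A rough sketch of that transformation, using only the standard library, is shown here. The `title_to_url` helper and the `/wiki/` path layout are assumptions made for illustration and are not pywikibot's implementation; in real scripts, use `full_url()`.

```python
from urllib.parse import quote


def title_to_url(hostname, title):
    """Build a page URL from a hostname and a page title (illustrative only)."""
    # MediaWiki treats spaces and underscores in titles as equivalent,
    # and URLs use the underscore form.
    path = title.replace(" ", "_")
    # Percent-encode everything except "/" so subpage separators survive.
    return "https://{}/wiki/{}".format(hostname, quote(path, safe="/"))


print(title_to_url("test.wikipedia.org", "User:AbdealiJK/Pywikibot Tutorial"))
# https://test.wikipedia.org/wiki/User%3AAbdealiJK/Pywikibot_Tutorial
```

This is also why the page created from 'User:AbdealiJK/Pywikibot_Tutorial' is displayed with a space in its title.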

4. Writing data to Pages

In general, use the test Wikipedia website for writing data, and make changes only in your user space (pages starting with User:<Your user name>), as these pages are meant for personal use such as testing these scripts :)

For example, let's create the object for your personal Sandbox page on test wiki:

sandbox = pywikibot.Page(testwiki, 'User:' + testwiki.user() + '/Sandbox')
sandbox
Page('User:Krisananthu/Sandbox')

Here, let's try writing some wiki markup to the page. For example, let's make your profile!

Note: For more information about wiki markup, see Help:Wiki markup.

sandbox.text ="""
== About Me ==

Hello!

My name is '''{name}'''.

I am from {hometown} and am learning how to use pywikibot !

This page has been written using the pywikibot API.
""".format(name='Your Name',          # placeholder: fill in your own name
           hometown='Your Hometown')  # placeholder: fill in your own hometown
sandbox.save()

Let's open up the webpage and see if our changes have been added there.

Using Jupyter and IPython, we can even embed the webpage into the notebook:

from IPython.display import IFrame
IFrame(sandbox.full_url(), width="100%", height="400px")

5. Textlib functions

Once you can read and save page content, you will often want to get a list of the categories or templates used on a page.

A category lives in a special namespace (similar to the user space) and is used to classify pages. For example, the "Python (programming language)" page on Wikipedia has the categories "Category:Class-based programming languages", "Category:Cross-platform free software", "Category:Dynamically typed programming languages" and so on.

To add a category to a page, a link to the category must be added to the page's wiki markup, in the form [[Category:<name of category>]].

A template is a snippet of text which can be included into multiple other pages (Something like a #include or import). The wiki markup to add a template is {{<template name>}} and it can also take in arguments, for example {{<template name>|arg1|arg2}}.
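As a small illustration of the two markup forms just described, the sketch below builds a category link and a template call as plain strings. Both helper names are hypothetical and are not part of pywikibot; textlib offers its own functions for this, listed later in this section.

```python
def build_category_link(category_name):
    """Return the wiki markup that places a page in a category."""
    return "[[Category:{}]]".format(category_name)


def build_template(name, *args):
    """Return the wiki markup for a template call with positional arguments."""
    parts = [name] + [str(arg) for arg in args]
    # Template name and arguments are separated by "|" inside {{ }}.
    return "{{" + "|".join(parts) + "}}"


print(build_category_link("Cross-platform free software"))
# [[Category:Cross-platform free software]]
print(build_template("template name", "arg1", "arg2"))
# {{template name|arg1|arg2}}
```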

python = pywikibot.Page(enwiki, 'Python_(programming_language)')
python

Let's get a list of all categories added to the page:

list(python.categories())

The textlib functions help modify the text of a page for specific needs such as adding or removing categories. Textlib has its own parsers, which read through the text and pull out all the category links they find in the wiki markup.

pywikibot.textlib.getCategoryLinks(python.text)
print("Text categories in page:", len(pywikibot.textlib.getCategoryLinks(python.text)))
print("All categories associated with page:", len(list(python.categories())))
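The two counts can differ: the textlib function only sees category links written directly in the page text, while categories() asks the wiki, which also reports categories added through templates. To show what "parsing category links out of the text" means, here is a deliberately simplified sketch using a regular expression. The pattern and the `find_category_names` helper are assumptions for illustration; the real textlib parser is more thorough (for example, it handles localized namespace names).

```python
import re

# Simplified pattern for [[Category:Name]] and [[Category:Name|sort key]].
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def find_category_names(wikitext):
    """Return the category names linked directly in the given wikitext."""
    return [match.group(1) for match in CATEGORY_LINK.finditer(wikitext)]


sample = (
    "Python is a programming language.\n"
    "{{Some template}}\n"
    "[[Category:Class-based programming languages]]\n"
    "[[Category:Cross-platform free software|Python]]\n"
)
print(find_category_names(sample))
# ['Class-based programming languages', 'Cross-platform free software']
```

Any category that {{Some template}} adds would not appear in this list, because it is not written in the text itself.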

Let's try removing all category links from the text using the textlib functions:

new_text = pywikibot.textlib.removeCategoryLinks(python.text)
print("List of categories after the remove function:", pywikibot.textlib.getCategoryLinks(new_text))

Other useful methods

Textlib contains many other functions that make editing the text of MediaWiki pages easier. For example:

  • TimeStripper() - Helps to pull out time strings in the text and convert them into Python time objects
  • does_text_contain_section() - Checks whether the section with given name exists in the text
  • extract_templates_and_params() - Fetches a list of templates with the arguments used in the template markup
  • glue_template_and_params() - Takes a template and arguments and creates the appropriate wiki markup for it.
  • removeHTMLParts() - Cleans the data by removing all HTML code in the page
  • replaceCategoryInPlace() - If a category needs to be changed to another category, this replaces it in place

6. Page Generators

There are many situations where it is useful to create a "page generator", which iterates over multiple pages that share a common property. For example, suppose you want to find all pages in the "Wikimedia projects" category:

import pywikibot.pagegenerators

wiki_projects = pywikibot.pagegenerators.CategorizedPageGenerator(
    pywikibot.Category(enwiki, 'Category:Wikimedia projects'),
    recurse=False)

from pprint import pprint
pprint(list(wiki_projects))

We use the python module pprint (pretty print) to format the output in a better way rather than dumping it as a list.
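Page generators behave like ordinary Python generators, which has two practical consequences: items are produced lazily as you iterate, and a generator can be consumed only once. The stand-alone sketch below demonstrates both with invented in-memory data. `FAKE_CATEGORIES` and `pages_in_category` are hypothetical stand-ins for a wiki and for a page generator, and the page titles are example values only.

```python
from itertools import islice

# Invented data standing in for a wiki category.
FAKE_CATEGORIES = {
    "Category:Wikimedia projects": [
        "Wikipedia", "Wiktionary", "Wikiquote", "Wikinews",
    ],
}


def pages_in_category(category_name):
    """Yield page titles in a category one at a time."""
    for title in FAKE_CATEGORIES.get(category_name, []):
        yield title


gen = pages_in_category("Category:Wikimedia projects")
first_two = list(islice(gen, 2))  # take only the first two items
rest = list(gen)                  # continues where islice stopped
print(first_two)  # ['Wikipedia', 'Wiktionary']
print(rest)       # ['Wikiquote', 'Wikinews']
print(list(gen))  # [] (the generator is now exhausted)
```

If you need the pages more than once, store `list(...)` of the generator in a variable, or create the generator again.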

For more information on page generators, check the pywikibot documentation on pagegenerators.

7. Exercises

Exercise 1 - Write a script to remove trailing whitespace from a given page

In many mediawiki pages, we see that editors leave trailing whitespace at the bottom of the page. While this does not matter when the page is rendered for viewing, it adds unnecessary length to the article when downloading the text and raw wikicode.

Write a script to remove the trailing whitespace and keep only 1 newline at the end of the page. (Test this on a testwiki !)
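As a hint, the text-handling half of this exercise can be written and tested without touching any wiki. The helper name below is hypothetical; reading the page text, assigning the result back and saving it (as in section 4) is left as the exercise.

```python
def normalize_trailing_whitespace(text):
    """Remove whitespace at the end of the text and end it with one newline."""
    # rstrip() with no arguments drops trailing spaces, tabs and newlines.
    return text.rstrip() + "\n"


print(repr(normalize_trailing_whitespace("Hello!\n\n  \n\n")))
# 'Hello!\n'
```

Comparing the result with the original text before saving avoids making an edit when nothing would change.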

Exercise 2 - Write a script to find the number of devices using Android Operating System

Count the pages in the category for devices that use the Android operating system.

8. Setting up pywikibot locally

PAWS provides a way to run pywikibot and related commands through Jupyter notebooks. It comes with the requirements for pywikibot scripts already installed, which makes it an easy way to get started. However, PAWS is a single server on the internet, so it gets crowded and slow if everyone uses it. In such cases, it may be easier to run these scripts locally on your own desktop or laptop.

Installing basic requirements

First, install the basic requirements. This depends on your specific OS.

Installing pywikibot

At the time of writing, Pywikibot is still a release candidate. Rather than installing rc5 from pip, we will get the latest source code from the master branch using git. To do this, run the following command in your terminal or command prompt:

/home/user/git_repos/$ git clone https://github.com/wikimedia/pywikibot-core.git

You will find that the folder pywikibot-core has been created in the current working directory. If you want the folder elsewhere, simply move it to another directory, or use the cd command to change directory before running the git command above.

Once the git repository has been downloaded, cd into the directory and run:

/home/user/git_repos/pywikibot-core/$ pip install .

This installs pywikibot from the repository into your Python installation. The . (dot) is required, as it tells pip to look for the Python package in the current directory. Pywikibot also has a lot of optional dependencies, which are used to run specific scripts and unit tests. To install all of these (and avoid errors later), run:

/home/user/git_repos/pywikibot-core/$ pip install -r dev-requirements.txt -r requirements.txt

Configuring pywikibot

Once the pywikibot library has been installed, use the pwb.py script provided in the git repo:

/home/user/git_repos/pywikibot-core/$ python pwb.py login

Then answer the questions to create a user-config.py, which holds your configuration information.
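For orientation, a user-config.py for the test wiki typically contains settings along the following lines. This is a hand-written sketch, not the exact file the script produces. 'YourUserName' is a placeholder, and `usernames` is a dictionary that pywikibot provides when it loads the file, so the fragment is not meant to be run on its own.

```python
# user-config.py (sketch; the values are placeholders)
family = 'wikipedia'  # default family used by pywikibot.Site()
mylang = 'test'       # default site code used by pywikibot.Site()

# Account to use on the site with this family and code.
usernames['wikipedia']['test'] = 'YourUserName'
```

With these defaults, calling pywikibot.Site() with no arguments targets test.wikipedia.org, as it does on PAWS.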

References

For more information about the configuration and other aspects of pywikibot, check the Pywikibot manual