mwapi Example

In this notebook, we'll show you the basics of using mwapi to get data out of MediaWiki APIs like those available for Wikipedia, Wiktionary, Commons, and Wikidata. The mwapi library is very basic. It provides a thin wrapper and some simple convenience functions around the basic MediaWiki API structure.

This notebook will proceed in 3 parts that perform increasingly advanced actions.

  1. Running a basic query
  2. Using query continuation
  3. Connecting your bot via OAuth

Part 1: Running a basic query

We'll start by constructing a session object.

In [1]:
import mwapi

session = mwapi.Session("")
Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.

Note that the library complains that a user_agent argument wasn't provided. This is OK and you'll be allowed to continue, but it's highly recommended that you set user_agent to describe what you are doing and who you are, so that the operations engineers can contact you about your API usage.

In [2]:
session = mwapi.Session("",
                        user_agent="Demo mwapi <>")

OK. No more warning. :) Now to actually perform a query. In the request below, we're going to get the user, IDs, and timestamp of the last 5 edits to my talk page.

In [3]:
doc = session.get(action='query', prop='revisions', titles='User talk:EpochFail', 
                  rvlimit=5, rvprop=['user', 'ids', 'timestamp'], rvdir="older")
{'continue': {'continue': '||', 'rvcontinue': '20160812193850|734203084'},
 'query': {'pages': {'15661779': {'ns': 3,
    'pageid': 15661779,
    'revisions': [{'parentid': 747555491,
      'revid': 747644815,
      'timestamp': '2016-11-03T15:01:17Z',
      'user': 'EpochFail'},
     {'parentid': 746624827,
      'revid': 747555491,
      'timestamp': '2016-11-03T01:14:56Z',
      'user': 'Esquivalience'},
     {'parentid': 737161460,
      'revid': 746624827,
      'timestamp': '2016-10-28T14:37:01Z',
      'user': 'EpochFail'},
     {'parentid': 737085080,
      'revid': 737161460,
      'timestamp': '2016-09-01T03:10:48Z',
      'user': 'Lowercase sigmabot III'},
     {'parentid': 734203084,
      'revid': 737085080,
      'timestamp': '2016-08-31T17:34:43Z',
      'user': 'Funcrunch'}],
    'title': 'User talk:EpochFail'}}}}

As you can see, the library will give you back a JSON-style Python dict. Let's list out the fields we got back.

In [4]:
page_docs = doc['query']['pages'].values()
rev_docs = list(page_docs)[0]['revisions']

for rev_doc in rev_docs:
    print(rev_doc['revid'], rev_doc['timestamp'], rev_doc['user'])
747644815 2016-11-03T15:01:17Z EpochFail
747555491 2016-11-03T01:14:56Z Esquivalience
746624827 2016-10-28T14:37:01Z EpochFail
737161460 2016-09-01T03:10:48Z Lowercase sigmabot III
737085080 2016-08-31T17:34:43Z Funcrunch
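
The cell above takes only the first page in the response, which is fine when a single title is requested. If several titles were passed, a small helper could walk every page instead. Here is a minimal sketch, assuming the response shape printed earlier; `iter_revisions` is a hypothetical helper name (not part of mwapi), and `sample_doc` is an abbreviated copy of the output above so the example runs without network access.

```python
def iter_revisions(doc):
    """Yield (page title, revision dict) pairs from a query response."""
    # 'pages' is keyed by page ID; a page with no revisions lacks the key.
    for page_doc in doc.get('query', {}).get('pages', {}).values():
        for rev_doc in page_doc.get('revisions', []):
            yield page_doc['title'], rev_doc


# Abbreviated copy of the response printed in Part 1.
sample_doc = {
    'query': {'pages': {'15661779': {
        'ns': 3,
        'pageid': 15661779,
        'title': 'User talk:EpochFail',
        'revisions': [
            {'revid': 747644815, 'timestamp': '2016-11-03T15:01:17Z',
             'user': 'EpochFail'},
            {'revid': 747555491, 'timestamp': '2016-11-03T01:14:56Z',
             'user': 'Esquivalience'},
        ]}}}
}

for title, rev_doc in iter_revisions(sample_doc):
    print(title, rev_doc['revid'], rev_doc['user'])
```

Using `.get()` with empty defaults means a response without a `query` block yields nothing rather than raising a `KeyError`.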

Part 2: Using query continuation

The example we worked through in part 1 is great when we only want a few items out of the API, but what about when we want to read the entire history of a page? The API returns only a limited number of revisions per request and provides a continuation strategy that lets a sequence of queries retrieve a large result. mwapi provides some nice utilities for automating continuation. We'll explore this by passing the continuation=True parameter to get() and using the continuation to analyze the entire history of my user talk page.

In [5]:
docs = session.get(action='query', prop='revisions', titles='User talk:EpochFail', 
                   rvlimit=50, rvprop=['user', 'ids', 'timestamp'], rvdir="newer", continuation=True)

The docs variable now contains a generator of query results that make up this continuation. We can process them in a loop to generate some stats.

Note: The next code block is broken because PAWS uses mwapi 0.3.1 instead of 0.4.0+

In [6]:
revisions = 0
posting_users = set()

for doc in docs:
    page_docs = doc['query']['pages'].values()
    rev_docs = list(page_docs)[0]['revisions']

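    for rev_doc in rev_docs:
        revisions += 1
        # Record who made the edit so we can count distinct users below.
        posting_users.add(rev_doc['user'])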

print("Revisions:", revisions)
print("Users:", len(posting_users))
Revisions: 836
Users: 145
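
For reference, the continuation strategy that `continuation=True` automates can be sketched by hand. Each response that has more data carries a `continue` block (as in the Part 1 output), and merging its entries into the next request's parameters picks up where the previous request stopped. The code below is an illustration only, not mwapi's implementation: `iter_continued` and `fake_get` are hypothetical names, and `fake_get` stands in for `session.get` with made-up revision IDs so the example runs without network access.

```python
def iter_continued(get, **params):
    """Yield response docs, following 'continue' blocks until none is left."""
    request = dict(params)
    while True:
        doc = get(**request)
        yield doc
        if 'continue' not in doc:
            break  # No continuation block: this was the last batch.
        # Start again from the original parameters and add the
        # continuation values from the latest response.
        request = dict(params)
        request.update(doc['continue'])


# Stand-in for session.get: serves three batches of made-up revision IDs.
BATCHES = [[1, 2], [3, 4], [5]]

def fake_get(rvcontinue=0, **params):
    index = int(rvcontinue)
    revs = [{'revid': revid} for revid in BATCHES[index]]
    doc = {'query': {'pages': {'1': {'revisions': revs}}}}
    if index + 1 < len(BATCHES):
        doc['continue'] = {'rvcontinue': str(index + 1), 'continue': '||'}
    return doc


revids = [rev_doc['revid']
          for doc in iter_continued(fake_get, action='query', prop='revisions')
          for page_doc in doc['query']['pages'].values()
          for rev_doc in page_doc['revisions']]
print(revids)  # prints [1, 2, 3, 4, 5]
```

In principle the same loop could be pointed at a real session's `get` method, but `continuation=True` already handles this for you, so the sketch is mainly useful for seeing what happens between requests.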