Extracting pages containing keywords from a dump

This notebook extracts the pages of a given dump that contain at least one keyword from a given set.

import mwxml
import re

Define paths to visit and keywords to search for

import glob

paths = glob.glob('/public/dumps/public/eswiki/20190301/eswiki-20190301-pages-meta-current.xml.bz2')
keywords_raw = "inundación,inundacion,inundaciones,inundado,inundada,anegado,anegada,sistema de drenaje,urbano de drenaje,muy lluvioso,muy lluviosa,fuertes lluvias,fuerte lluvia,desbordamiento,desborde alcantarillas,desborde colector,desborde rio,desborde río,desborde lluvia,desborde agua"
keywords = keywords_raw.split(',')
['inundación', 'inundacion', 'inundaciones', 'inundado', 'inundada', 'anegado', 'anegada', 'sistema de drenaje', 'urbano de drenaje', 'muy lluvioso', 'muy lluviosa', 'fuertes lluvias', 'fuerte lluvia', 'desbordamiento', 'desborde alcantarillas', 'desborde colector', 'desborde rio', 'desborde río', 'desborde lluvia', 'desborde agua']

Find keywords function

Returns a boolean indicating whether any of the keywords was found in the text

def find_keywords(text, keywords):
    # Case-sensitive substring test: True as soon as one keyword occurs in text.
    return any(k in text for k in keywords)
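
The `in` test used by find_keywords is a case-sensitive substring match, so a page that only contains "Inundación" with a capital letter would not match the lowercase keyword. Below is a minimal sketch of a case-insensitive variant. The helper names `build_pattern` and `find_keywords_ci` are hypothetical and not part of the original notebook.

```python
import re

def build_pattern(keywords):
    # Escape each keyword so it is matched literally, then join them
    # into a single alternation compiled with IGNORECASE.
    return re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

def find_keywords_ci(text, pattern):
    # True if any keyword occurs anywhere in the text, ignoring case.
    return pattern.search(text) is not None

pattern = build_pattern(["inundación", "fuertes lluvias"])
print(find_keywords_ci("Inundación de 2013 en La Plata", pattern))  # True
print(find_keywords_ci("Un verano seco", pattern))                  # False
```

Compiling the pattern once and reusing it avoids rebuilding the alternation for every revision.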

XML Processor on path

results = []

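def process_dump(dump, path):
    # mwxml calls this once per dump file with the parsed dump and its path.
    for page in dump:
        # Namespace 0 holds the articles; skip talk, user and other pages.
        if page.namespace == 0:
            for revision in page:
                # revision.text can be None (e.g. deleted text), so fall back to "".
                if find_keywords(revision.text or "", keywords):
                    yield page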
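
Because process_dump yields inside the revision loop, a page with several matching revisions is yielded once per match. A pages-meta-current dump should carry only the current revision of each page, so this is unlikely to matter here, but it would with a full-history dump. Below is a minimal sketch of a one-yield-per-page variant. `FakePage`, `FakeRevision` and `matching_pages` are hypothetical stand-ins that mimic only the attributes used in this notebook, so the example runs without mwxml or a dump file.

```python
class FakeRevision:
    # Hypothetical stand-in for a revision: only the text attribute.
    def __init__(self, text):
        self.text = text

class FakePage:
    # Hypothetical stand-in for a page: title, namespace, iterable revisions.
    def __init__(self, title, namespace, texts):
        self.title = title
        self.namespace = namespace
        self._revisions = [FakeRevision(t) for t in texts]

    def __iter__(self):
        return iter(self._revisions)

def matching_pages(dump, keywords):
    for page in dump:
        if page.namespace != 0:
            continue
        for revision in page:
            text = revision.text or ""
            if any(k in text for k in keywords):
                yield page
                break  # one yield per page, even if later revisions also match

dump = [
    FakePage("A", 0, ["fuertes lluvias", "fuertes lluvias otra vez"]),
    FakePage("Talk:A", 1, ["fuertes lluvias"]),
    FakePage("B", 0, [None, "sin coincidencias"]),
]
titles = [p.title for p in matching_pages(dump, ["fuertes lluvias"])]
print(titles)  # ['A']
```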

OK. Now that everything is defined, it's time to run the code. mwxml has a map() function that applies the process_dump function to each of the XML dump files in paths, in parallel, using Python's multiprocessing library, and collects all of the yielded values in a generator. As the code below demonstrates, it's easy to collect this output and write it to a new output file or print it to the console (not recommended for large amounts of output).

count = 0
pages = []

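# mwxml.map takes the processing function and the paths (plus an optional
# thread count); process_dump reads keywords from the global defined above.
for page in mwxml.map(process_dump, paths):
    # Use the page's title attribute directly instead of parsing str(page).
    pages.append(page.title)
    count += 1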
import csv

# Write all matching titles as a single quoted CSV row.
with open('./pages_es.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerow(pages)
titles_list = []

# Read the row back with csv.reader so quoted titles that contain commas stay intact.
with open('pages_es.csv', mode='r', newline='') as csv_file:
    for row in csv.reader(csv_file):
        titles_list = row
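
Reading the row back with csv.reader rather than str.split(',') matters once a title contains a comma, and it also removes the quotes added by QUOTE_ALL. Below is a small sketch of the difference. It uses an in-memory buffer in place of pages_es.csv, and the two titles are illustrative values, not results from the dump.

```python
import csv
import io

# Illustrative titles only; the second one contains a comma on purpose.
titles = ["Inundación de 2013", "Riada de Valencia, 1957"]

# Write one fully quoted row, as the notebook does for pages_es.csv.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(titles)
line = buffer.getvalue()

# Naive split: cuts the second title in two and keeps the quote characters.
naive = line.strip().split(',')

# csv.reader: honours the quoting and returns the original titles.
buffer.seek(0)
parsed = next(csv.reader(buffer))

print(len(naive))        # 3
print(parsed == titles)  # True
```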