Extracting pages containing keyword from a dump

This notebook scans a Wikipedia XML dump and extracts the pages whose text contains at least one keyword from a given list.

! pip install -U spacy
! python -m spacy download es_core_news_sm
Requirement already up-to-date: spacy in /srv/paws/lib/python3.6/site-packages
Requirement already satisfied: es_core_news_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-2.1.0/es_core_news_sm-2.1.0.tar.gz#egg=es_core_news_sm==2.1.0 in /srv/paws/lib/python3.6/site-packages
✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')
import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()  # load the small Spanish spaCy model
import mwxml  # streaming reader for MediaWiki XML dumps
import re

Define paths to visit

import glob

paths = glob.glob('/public/dumps/public/eswiki/20190301/eswiki-20190301-pages-meta-history1.xml-*.bz2')
paths
['/public/dumps/public/eswiki/20190301/eswiki-20190301-pages-meta-history1.xml-p39935p51316.bz2',
 '/public/dumps/public/eswiki/20190301/eswiki-20190301-pages-meta-history1.xml-p115848p135417.bz2',
 '/public/dumps/public/eswiki/20190301/eswiki-20190301-pages-meta-history1.xml-p135418p143635.bz2',
 '/public/dumps/public/eswiki/20190301/eswiki-20190301-pages-meta-history1.xml-p51317p65596.bz2',
 '/public/dumps/public/eswiki/20190301/eswiki-20190301-pages-meta-history1.xml-p97974p115847.bz2',
 '/public/dumps/public/eswiki/20190301/eswiki-20190301-pages-meta-history1.xml-p65597p81728.bz2',
 '/public/dumps/public/eswiki/20190301/eswiki-20190301-pages-meta-history1.xml-p81729p97973.bz2']
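glob does not guarantee any particular ordering, so if reproducible runs matter, the path list can be sorted first (optional):

# Optional: sort the dump files so runs are reproducible (glob order is arbitrary)
paths = sorted(paths)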
keywords_raw = "inundación,inundacion,inundaciones,inundado,inundada,anegado,anegada,sistema de drenaje,urbano de drenaje,muy lluvioso,muy lluviosa,fuertes lluvias,fuerte lluvia,desbordamiento,desborde alcantarillas,desborde colector,desborde rio,desborde río,desborde lluvia,desborde agua"
keywords = keywords_raw.split(',')
print(keywords)
['inundación', 'inundacion', 'inundaciones', 'inundado', 'inundada', 'anegado', 'anegada', 'sistema de drenaje', 'urbano de drenaje', 'muy lluvioso', 'muy lluviosa', 'fuertes lluvias', 'fuerte lluvia', 'desbordamiento', 'desborde alcantarillas', 'desborde colector', 'desborde rio', 'desborde río', 'desborde lluvia', 'desborde agua']

Find keywords function

Returns True if any of the keywords appears (as a substring) in the given text.

def find_keywords(text, keywords):
    # True as soon as any keyword appears as a substring of the text
    return any(k in text for k in keywords)
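A quick sanity check with made-up sample texts. Note that the match is a plain, case-sensitive substring test; the second function below is an optional case-insensitive variant (a sketch, not used in the run further down).

# Hypothetical sample texts, only for checking the matcher
sample_hit = "Las fuertes lluvias provocaron una inundación en la ciudad."
sample_miss = "El clima fue templado durante todo el año."

print(find_keywords(sample_hit, keywords))   # True
print(find_keywords(sample_miss, keywords))  # False

# Optional case-insensitive variant (sketch)
def find_keywords_ci(text, keywords):
    text = text.lower()
    return any(k.lower() in text for k in keywords)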

XML processor applied to each dump file path

def process_dump(dump, path):
    # For every main-namespace page, yield the page together with the
    # timestamp of the first revision whose text contains a keyword,
    # then move on to the next page.
    for page in dump:
        if page.namespace == 0:
            for revision in page:
                if find_keywords(revision.text or "", keywords):
                    yield page, revision.timestamp
                    break
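Before launching the parallel run it can help to smoke-test process_dump on a single dump file. The snippet below is only a sketch: it assumes mwxml.Dump.from_file accepts a file-like object opened with bz2.open, and it stops after the first few matches.

import bz2
from itertools import islice

# Sketch: run process_dump directly on the first compressed dump file
with bz2.open(paths[0]) as f:
    dump = mwxml.Dump.from_file(f)
    for page, timestamp in islice(process_dump(dump, paths[0]), 5):
        print(page.title, timestamp)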

OK. Now that everything is defined, it's time to run the code. mwxml has a map() function that applies the process_dump function to each of the XML dump files in paths -- in parallel -- using Python's multiprocessing library, and collects all of the yielded values in a single generator. As the code below demonstrates, it is easy to collect this output and write it to an output file, or to print it to the console (not recommended for large amounts of output).

count = 0
pages = []

# Collect one [title, first matching revision timestamp] pair per page
for page, rev_timestamp in mwxml.map(process_dump, paths):
    pages.append([page.title, str(rev_timestamp)])
    count += 1
KeyboardInterrupt detected.  Finishing...
(KeyboardInterrupt tracebacks from the four mapper processes, interrupted while decompressing and parsing the bz2 dump files)
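Because a full pass over the pages-meta-history files can take hours and may be interrupted (as above), it can be safer to stream results to disk as they arrive instead of keeping everything only in memory. The sketch below writes one tab-separated line per match to a hypothetical pages_es.tsv file; it is an alternative to the in-memory loop, not part of the original run.

# Sketch: stream matches to a TSV file so an interrupted run keeps partial results
with open('./pages_es.tsv', 'w') as out:
    for page, rev_timestamp in mwxml.map(process_dump, paths):
        out.write("{}\t{}\n".format(page.title, rev_timestamp))
        out.flush()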
import csv

# Write one quoted (title, timestamp) row per matching page
with open('./pages_es.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    wr.writerows(pages)
titles_list = []

# Read the CSV back and collect the page titles (first column of each row)
with open('pages_es.csv', mode='r', newline='') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        titles_list.append(row[0])
print(len(titles_list))
11377
print(pages)
[['Río Miño', '2010-06-27T12:18:23Z'], ['Archipiélago de Revillagigedo', '2008-12-05T07:45:56Z']]