Extracting pages containing keywords from a dump

This notebook extracts the pages of a given dump whose text contains at least one keyword from a given set.

! pip install -U spacy
! python -m spacy download es_core_news_sm
! pip install dateparser
! pip install datefinder
Successfully installed blis-0.2.4 certifi-2019.3.9 cymem-2.0.2 idna-2.8 murmurhash-1.0.2 numpy-1.16.2 plac-0.9.6 preshed-2.0.1 requests-2.21.0 spacy-2.1.3 srsly-0.0.5 thinc-7.0.4 tqdm-4.31.1 urllib3-1.24.2 wasabi-0.2.1
Successfully installed es-core-news-sm-2.1.0
✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')
Requirement already satisfied: dateparser in /srv/paws/lib/python3.6/site-packages
Requirement already satisfied: datefinder in /srv/paws/lib/python3.6/site-packages
import re

import datefinder
import es_core_news_sm
import mwxml
import spacy

# Load the small Spanish spaCy model installed above.
nlp = es_core_news_sm.load()

# Spanish month names, used to spot date mentions attached to a keyword.
meses = ['enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre']

# Flood-related keywords in Spanish, given as one comma-separated string.
keywords_raw = "inundación,inundacion,inundaciones,inundado,inundada,anegado,anegada,sistema de drenaje,urbano de drenaje,muy lluvioso,muy lluviosa,fuertes lluvias,fuerte lluvia,desbordamiento,desborde alcantarillas,desborde colector,desborde rio,desborde río,desborde lluvia,desborde agua"
keywords = keywords_raw.split(',')

# Day-of-month numbers as strings: '1' to '31'.
numeros = [str(i) for i in range(1, 32)]
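The cells shown here define the keywords but never run them against page text. The block below is a minimal sketch of that filtering step, not the notebook's own method. It assumes pages arrive as `(title, text)` pairs, which in the notebook would presumably come from the `mwxml` dump reader that is imported above. The names `build_keyword_pattern`, `matching_pages` and the sample `PAGES` are hypothetical and introduced only for illustration.

```python
import re

# Keyword list copied from the notebook cell above.
KEYWORDS = (
    "inundación,inundacion,inundaciones,inundado,inundada,anegado,anegada,"
    "sistema de drenaje,urbano de drenaje,muy lluvioso,muy lluviosa,"
    "fuertes lluvias,fuerte lluvia,desbordamiento,desborde alcantarillas,"
    "desborde colector,desborde rio,desborde río,desborde lluvia,desborde agua"
).split(",")

# Made-up sample pages (hypothetical titles and texts, for illustration only).
PAGES = [
    ("Página A", "La ciudad sufrió grandes inundaciones en 1788, provocadas al desbordarse el río Esgueva."),
    ("Página B", "Un texto sin relación con el tema."),
    ("Página C", "El sistema de drenaje quedó anegado tras las fuertes lluvias."),
]


def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex matching any keyword as a whole word or phrase."""
    escaped = [re.escape(k.strip()) for k in keywords if k.strip()]
    # Longest alternatives first, so longer phrases are tried before shorter ones.
    escaped.sort(key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def matching_pages(pages, pattern):
    """Yield (title, sorted distinct keywords found) for every page with at least one hit."""
    for title, text in pages:
        found = sorted({m.group(0).lower() for m in pattern.finditer(text or "")})
        if found:
            yield title, found


pattern = build_keyword_pattern(KEYWORDS)
for title, found in matching_pages(PAGES, pattern):
    print(title, found)
# Página A ['inundaciones']
# Página C ['anegado', 'fuertes lluvias', 'sistema de drenaje']
```

Matching on the raw text, rather than token by token, is what lets multi-word keywords such as `sistema de drenaje` match.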
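doc = nlp('La ciudad sufrió grandes inundaciones en 1788, provocadas al desbordarse el río Esgueva.')

# One row per token: 1-based position in the sentence, text, an empty column,
# dependency label, and the 1-based position of the token's head.
for sent in doc.sents:
    for token in sent:
        index = token.i - sent.start + 1
        head_index = token.head.i - sent.start + 1
        print(f"{index}\t{token.text}\t\t{token.dep_}\t{head_index}")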
1	La		det	2
2	ciudad		nsubj	3
3	sufrió		ROOT	3
4	grandes		amod	5
5	inundaciones		obj	3
6	en		case	7
7	1788		obl	3
8	,		punct	9
9	provocadas		obj	3
10	al		mark	11
11	desbordarse		acl	9
12	el		det	13
13	río		nsubj	11
14	Esgueva		appos	13
15	.		punct	3
doc = nlp('Tras la inundación de mayo de 1933 ocurrida en Inglaterra')
for sent in doc.sents:
    for token in sent:
        index = token.i - sent.start + 1
        head_index = token.head.i - sent.start + 1
        print(f"{index}\t{token.text}\t\t{token.dep_}\t{head_index}")
        # Single-token comparison: multi-word keywords such as 'sistema de drenaje' never match here.
        if token.text.lower() in keywords:
            print(token.text)
            # Print each direct dependent of the keyword; month names are printed a second time.
            for child1 in token.children:
                print(child1.text)
                if child1.text.lower() in meses:
                    print(child1.text)
1	Tras		case	3
2	la		det	3
3	inundación		ROOT	3
inundación
Tras
la
mayo
mayo
ocurrida
4	de		case	5
5	mayo		nmod	3
6	de		case	7
7	1933		compound	5
8	ocurrida		amod	3
9	en		case	10
10	Inglaterra		obl	8
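In the parse above, `mayo` is a direct dependent of `inundación`, but `1933` hangs from `mayo` (row 7 has head 5), so a loop over the keyword's direct children alone never reaches the year. The sketch below shows one way to collect every descendant instead, using only the `(index, text, head)` triples printed above so that it runs without spaCy. The helper names `descendants` and `date_parts` are hypothetical. With a live spaCy token, iterating `token.subtree` would serve a similar purpose.

```python
import re

MESES = ['enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio',
         'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre']

# (index, text, head index) copied from the parse output above.
# The root token points to itself.
ROWS = [
    (1, 'Tras', 3), (2, 'la', 3), (3, 'inundación', 3), (4, 'de', 5),
    (5, 'mayo', 3), (6, 'de', 7), (7, '1933', 5), (8, 'ocurrida', 3),
    (9, 'en', 10), (10, 'Inglaterra', 8),
]


def descendants(rows, root_index):
    """Return the sorted indices of all tokens below root_index in the tree."""
    children = {}
    for index, _text, head in rows:
        if head != index:  # skip the root's self-reference
            children.setdefault(head, []).append(index)
    found, stack = [], [root_index]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            found.append(child)
            stack.append(child)
    return sorted(found)


def date_parts(rows, keyword_index):
    """Collect month names and four-digit years found anywhere below the keyword."""
    text_by_index = {index: text for index, text, _head in rows}
    months, years = [], []
    for index in descendants(rows, keyword_index):
        text = text_by_index[index]
        if text.lower() in MESES:
            months.append(text.lower())
        elif re.fullmatch(r"\d{4}", text):
            years.append(int(text))
    return months, years


print(date_parts(ROWS, 3))
# (['mayo'], [1933])
```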
string = "entries are due by January 4th, 2017 at 8:00pm"
# find_dates yields a datetime object for each date expression it recognises.
matches = datefinder.find_dates(string)
for match in matches:
    print(match)
2017-01-04 20:00:00
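The `datefinder` example above uses an English sentence, while the pages of interest are in Spanish. As a hedged alternative for the simple pattern seen earlier (`mayo de 1933`, optionally preceded by a day), the sketch below uses only the standard library. The function `find_spanish_dates` is hypothetical, and defaulting a missing day to 1 is an assumption made here so that a `date` object can be built.

```python
import re
from datetime import date

MESES = ['enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio', 'julio',
         'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre']

# Optional "<day> de", then "<month> de <year>", e.g. "4 de enero de 2017" or "mayo de 1933".
DATE_PATTERN = re.compile(
    r"\b(?:(\d{1,2})\s+de\s+)?(" + "|".join(MESES) + r")\s+de\s+(\d{4})\b",
    re.IGNORECASE,
)


def find_spanish_dates(text):
    """Return a datetime.date for each '<day> de <month> de <year>' style mention."""
    results = []
    for day, month, year in DATE_PATTERN.findall(text):
        month_number = MESES.index(month.lower()) + 1
        day_number = int(day) if day else 1  # assumption: missing day defaults to 1
        try:
            results.append(date(int(year), month_number, day_number))
        except ValueError:
            # Impossible calendar dates (e.g. 31 de febrero) are skipped.
            continue
    return results


print(find_spanish_dates("Tras la inundación de mayo de 1933 ocurrida en Inglaterra"))
# [datetime.date(1933, 5, 1)]
```

This covers only one phrasing. Other forms, such as bare years like `en 1788`, would need their own handling.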