Setup

  1. Install Gensim a Python NLP package.
  2. Process the simple English Wikipedia.
  3. Fix jupter logging
  4. Fix paws issue with graphs

The dump I used (simplewiki-20160720-pages-articles.xml.bz2) has 78550 articles and may be small enough to allow working with a notebook in Jupyter without having to wait for hours for models to be buit.

Note: NLP and ML models often take a long time to train or update even on a strong cluster. However, once trained they can be stored compressed and reused much more efficiently. The Gensym Wikipedia tutorial mentions the LDA model took over 6 hours to train for the English Wikipedia, which would time out. Also Gensym is not so great with cleaning wikitext.

Sources

!pip install gensim
!pip install pattern
Requirement already satisfied (use --upgrade to upgrade): gensim in /srv/paws/lib/python3.4/site-packages
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.3 in /srv/paws/lib/python3.4/site-packages (from gensim)
Requirement already satisfied (use --upgrade to upgrade): scipy>=0.7.0 in /srv/paws/lib/python3.4/site-packages (from gensim)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5.0 in /srv/paws/lib/python3.4/site-packages (from gensim)
Requirement already satisfied (use --upgrade to upgrade): smart_open>=1.2.1 in /srv/paws/lib/python3.4/site-packages (from gensim)
Requirement already satisfied (use --upgrade to upgrade): boto>=2.32 in /srv/paws/lib/python3.4/site-packages (from smart_open>=1.2.1->gensim)
Requirement already satisfied (use --upgrade to upgrade): bz2file in /srv/paws/lib/python3.4/site-packages (from smart_open>=1.2.1->gensim)
Requirement already satisfied (use --upgrade to upgrade): requests in /srv/paws/lib/python3.4/site-packages (from smart_open>=1.2.1->gensim)
Collecting pattern
  Using cached pattern-2.6.zip
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-mofv1blg/pattern/setup.py", line 40
        print n
              ^
    SyntaxError: Missing parentheses in call to 'print'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-mofv1blg/pattern/
%matplotlib inline
import logging
import gensim
import bz2
logging.root.handlers = []  # Jupyter messes up logging so needs a reset
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from smart_open import smart_open
import pandas as pd
import numpy as np
import gensim
import nltk
from sklearn.cross_validation import train_test_split
from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model

%matplotlib inline

The corpus

# load id->word mapping (the dictionary), one of the results of step 2 above
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt.bz2')
print("There are " + str(len(id2word)) + " words in the dictionaty\n")
print(id2word[1001])

# load corpus iterator
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
# mm = gensim.corpora.MmCorpus(bz2.BZ2File('wiki_en_tfidf.mm.bz2')) # use this if you compressed the TFIDF output (recommended)
2016-08-06 07:27:50,586 : INFO : loaded corpus index from wiki_en_tfidf.mm.index
2016-08-06 07:27:50,587 : INFO : initializing corpus reader from wiki_en_tfidf.mm
There are 30736 words in the dictionaty

revolution
2016-08-06 07:27:50,876 : INFO : accepted corpus with 78550 documents, 30736 features, 6839040 non-zero entries
import pandas as pd

print(mm)
print("\nThe documents match the dump article count, features match the dictionary size and there are 6839040 unique words in the simple english articles.\n")
print("There are " + str(len(mm)) + " entries in the model")

article_id=10001

print("There are " + str(len(mm[article_id])) + " words in the article\n")
MmCorpus(78550 documents, 30736 features, 6839040 non-zero entries)

The documents match the dump article count, features match the dictionary size and there are 6839040 unique words in the simple english articles.

There are 78550 entries in the model
There are 38 words in the article

tf–idf

let us examine the format of the data for an article in the tdidf index

print(mm[article_id])
[(150, 0.06936617573014087), (1710, 0.22416658592143549), (2787, 0.08443660156370959), (2862, 0.44486566876708034), (4631, 0.0757920511716846), (5322, 0.0608258231463494), (6041, 0.1620482351438078), (6354, 0.29351989404939766), (6584, 0.05787503281891094), (7658, 0.33407482710795816), (7916, 0.13204876200715654), (8885, 0.11196817651650032), (10640, 0.1446370036092855), (10692, 0.0829325311204531), (11253, 0.12974888661758452), (12287, 0.04897041172960398), (12455, 0.08788320083247018), (14565, 0.05399304433648005), (16282, 0.10674586894658063), (16819, 0.04779394298131224), (17736, 0.07910312419031695), (18187, 0.1992682965874072), (18494, 0.05559723515496268), (18953, 0.08169259677425418), (19603, 0.053857682683241495), (19829, 0.09404290244248324), (21304, 0.2861281909473468), (21664, 0.0716706756482601), (21903, 0.08832091576422335), (24495, 0.11572271515857237), (24966, 0.13396880991882096), (25157, 0.05800778420828681), (25174, 0.4168763688404558), (25928, 0.11361635174690378), (26106, 0.04803630755050057), (26595, 0.05439687185950228), (26937, 0.08094783798754876), (29943, 0.07917257916866322)]
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)

Since the above is not a human readable format and accordingly difficult to understand. Let's put this entry into a dataframe where rows are named using the word id's from the dictionary show the word.

table = [["id","tf-idf"]]
words = []
for wf in mm[article_id] : 
    print('{:>12}'.format(id2word[wf[0]]) + ":\t" + str(wf[1]) )
    table.append([wf[0],wf[1]])
    words.append([id2word[wf[0]]])
    

df = pd.DataFrame(table, columns=table.pop(0),index=words)

print("\nand if we sort - all terms are:\n")

df.sort_values(by='tf-idf').head(n=len(mm[article_id]))
       civil:	0.06936617573014087
       cargo:	0.22416658592143549
 temperature:	0.08443660156370959
   frankfurt:	0.44486566876708034
     biggest:	0.0757920511716846
         air:	0.0608258231463494
    distance:	0.1620482351438078
       rhein:	0.29351989404939766
     germany:	0.05787503281891094
    terminal:	0.33407482710795816
     hottest:	0.13204876200715654
     million:	0.11196817651650032
     station:	0.1446370036092855
     railway:	0.0829325311204531
      tonnes:	0.12974888661758452
    december:	0.04897041172960398
        walk:	0.08788320083247018
        five:	0.05399304433648005
        flew:	0.10674586894658063
        july:	0.04779394298131224
        base:	0.07910312419031695
  passengers:	0.1992682965874072
      german:	0.05559723515496268
     minutes:	0.08169259677425418
       third:	0.053857682683241495
        main:	0.09404290244248324
      trains:	0.2861281909473468
    recorded:	0.0716706756482601
    stations:	0.08832091576422335
       align:	0.11572271515857237
    commuter:	0.13396880991882096
      europe:	0.05800778420828681
     airport:	0.4168763688404558
     largest:	0.11361635174690378
       under:	0.04803630755050057
       right:	0.05439687185950228
       train:	0.08094783798754876
       terms:	0.07917257916866322
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/srv/paws/lib/python3.4/site-packages/pandas/core/internals.py in create_block_manager_from_arrays(arrays, names, axes)
   3999         blocks = form_blocks(arrays, names, axes)
-> 4000         mgr = BlockManager(blocks, axes)
   4001         mgr._consolidate_inplace()

/srv/paws/lib/python3.4/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
   2593         if do_integrity_check:
-> 2594             self._verify_integrity()
   2595 

/srv/paws/lib/python3.4/site-packages/pandas/core/internals.py in _verify_integrity(self)
   2803             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
-> 2804                 construction_error(tot_items, block.shape[1:], self.axes)
   2805         if len(self.items) != tot_items:

/srv/paws/lib/python3.4/site-packages/pandas/core/internals.py in construction_error(tot_items, block_shape, axes, e)
   3969     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 3970         passed, implied))
   3971 

ValueError: Shape of passed values is (2, 38), indices imply (2, 1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-7-5e06dd373e9f> in <module>()
      7 
      8 
----> 9 df = pd.DataFrame(table, columns=table.pop(0),index=words)
     10 
     11 print("\nand if we sort - all terms are:\n")

/srv/paws/lib/python3.4/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    274 
    275                     mgr = _arrays_to_mgr(arrays, columns, index, columns,
--> 276                                          dtype=dtype)
    277                 else:
    278                     mgr = self._init_ndarray(data, index, columns, dtype=dtype,

/srv/paws/lib/python3.4/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5239     axes = [_ensure_index(columns), _ensure_index(index)]
   5240 
-> 5241     return create_block_manager_from_arrays(arrays, arr_names, axes)
   5242 
   5243 

/srv/paws/lib/python3.4/site-packages/pandas/core/internals.py in create_block_manager_from_arrays(arrays, names, axes)
   4002         return mgr
   4003     except ValueError as e:
-> 4004         construction_error(len(arrays), arrays[0].shape, axes, e)
   4005 
   4006 

/srv/paws/lib/python3.4/site-packages/pandas/core/internals.py in construction_error(tot_items, block_shape, axes, e)
   3968         raise ValueError("Empty data passed with indices specified.")
   3969     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 3970         passed, implied))
   3971 
   3972 

ValueError: Shape of passed values is (2, 38), indices imply (2, 1)

looking just at the top terms

cutoff_freq=0.2
#all_terms.query('freq>'+str(cutoff_freq)).sort_values(by='freq').head()

def mask(df,key,value):
    return df[df[key] == cutoff_freq]

top_terms=df[df['tf-idf']>cutoff_freq].sort_values(by='tf-idf')
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-15-ab872a51ee63> in <module>()
      5     return df[df[key] == cutoff_freq]
      6 
----> 7 top_terms=df[df['tf-idf']>cutoff_freq].sort_values(by='tf-idf')

NameError: name 'df' is not defined

this means that in article 1001 the words:

lets plot this as a graph:

import matplotlib.pyplot as plt

my_plot = top_terms.plot(kind='bar',figsize=(9, 7))
my_plot.set_title("top tf-idf weights")
my_plot.set_xlabel("Word")
my_plot.set_ylabel("tf-idf weights")
<matplotlib.text.Text at 0x7f8d5a6d0be0>
counter=0
for doc in mm:
    counter=counter+1
    if counter> 20:
        break
    print(len(doc))
   #print(doc)
    
679
330
277
92
73
93
188
181
62
50
286
64
107
958
165
20
86
230
100
66
# LSI (Latent Semantic Indexing)
import os 

lsi_model_name = 'wiki_en.lsi'

if os.path.isfile( './' + lsi_model_name ):
    lsi = lsi.load(lsi_model_name)
else:
    # extract 400 LSI topics; use the default one-pass algorithm
    lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
    lsi.save(lsi_model_name)
2016-07-30 23:05:57,310 : INFO : using serial LSI version on this node
2016-07-30 23:05:57,312 : INFO : updating model with new documents
2016-07-30 23:06:10,307 : INFO : preparing a new chunk of documents
2016-07-30 23:06:11,043 : INFO : using 100 extra samples and 2 power iterations
2016-07-30 23:06:11,045 : INFO : 1st phase: constructing (30736, 500) action matrix
2016-07-30 23:06:14,114 : INFO : orthonormalizing (30736, 500) action matrix
2016-07-30 23:06:40,262 : INFO : 2nd phase: running dense svd on (500, 20000) matrix
2016-07-30 23:06:51,156 : INFO : computing the final decomposition
2016-07-30 23:06:51,164 : INFO : keeping 400 factors (discarding 8.649% of energy spectrum)
2016-07-30 23:06:52,268 : INFO : processed documents up to #20000
2016-07-30 23:06:52,288 : INFO : topic #0(16.087): 0.186*"footballer" + 0.181*"actor" + 0.171*"english" + 0.171*"german" + 0.171*"politician" + 0.163*"actress" + 0.158*"singer" + 0.155*"french" + 0.143*"writer" + 0.126*"british"
2016-07-30 23:06:52,293 : INFO : topic #1(11.110): -0.181*"footballer" + -0.160*"politician" + -0.153*"actor" + -0.141*"actress" + 0.136*"music" + -0.119*"singer" + -0.115*"writer" + -0.094*"german" + -0.090*"italian" + -0.088*"french"
2016-07-30 23:06:52,296 : INFO : topic #2(8.283): 0.282*"music" + -0.211*"district" + 0.204*"band" + 0.173*"album" + -0.151*"coat" + -0.147*"arms" + -0.139*"king" + -0.124*"municipalities" + -0.121*"capital" + -0.117*"population"
2016-07-30 23:06:52,299 : INFO : topic #3(7.742): 0.335*"king" + 0.174*"emperor" + 0.157*"england" + -0.156*"district" + 0.140*"music" + -0.135*"footballer" + 0.134*"ii" + 0.128*"henry" + 0.123*"pope" + -0.119*"coat"
2016-07-30 23:06:52,301 : INFO : topic #4(7.567): -0.296*"district" + -0.271*"music" + -0.235*"coat" + -0.225*"arms" + -0.211*"band" + -0.178*"municipalities" + 0.177*"jupiter" + -0.170*"album" + -0.118*"districts" + -0.109*"municipality"
2016-07-30 23:06:59,819 : INFO : preparing a new chunk of documents
2016-07-30 23:07:00,420 : INFO : using 100 extra samples and 2 power iterations
2016-07-30 23:07:00,421 : INFO : 1st phase: constructing (30736, 500) action matrix
2016-07-30 23:07:03,017 : INFO : orthonormalizing (30736, 500) action matrix
2016-07-30 23:07:26,952 : INFO : 2nd phase: running dense svd on (500, 20000) matrix
2016-07-30 23:07:37,175 : INFO : computing the final decomposition
2016-07-30 23:07:37,177 : INFO : keeping 400 factors (discarding 8.104% of energy spectrum)
2016-07-30 23:07:38,007 : INFO : merging projections: (30736, 400) + (30736, 400)
2016-07-30 23:07:43,227 : INFO : keeping 400 factors (discarding 14.278% of energy spectrum)
2016-07-30 23:07:44,735 : INFO : processed documents up to #40000
2016-07-30 23:07:44,738 : INFO : topic #0(19.159): 0.125*"english" + 0.118*"german" + 0.116*"league" + 0.110*"actor" + 0.105*"french" + 0.105*"player" + 0.102*"footballer" + 0.102*"singer" + 0.101*"war" + 0.098*"politician"
2016-07-30 23:07:44,741 : INFO : topic #1(14.116): -0.660*"league" + -0.274*"football" + -0.223*"premier" + -0.222*"statistics" + -0.214*"division" + -0.184*"club" + -0.175*"team" + -0.169*"career" + -0.109*"player" + -0.095*"liga"
2016-07-30 23:07:44,745 : INFO : topic #2(13.204): -0.209*"footballer" + -0.196*"politician" + -0.191*"actor" + -0.173*"actress" + -0.151*"german" + -0.147*"writer" + -0.145*"singer" + -0.138*"french" + -0.129*"english" + -0.121*"italian"
2016-07-30 23:07:44,748 : INFO : topic #3(11.642): 0.323*"album" + 0.293*"band" + 0.218*"music" + -0.177*"district" + -0.162*"river" + 0.161*"song" + 0.159*"released" + -0.142*"county" + 0.131*"rock" + 0.126*"guitar"
2016-07-30 23:07:44,751 : INFO : topic #4(10.317): -0.271*"district" + -0.242*"album" + -0.231*"county" + -0.224*"band" + -0.198*"river" + -0.171*"hurricane" + -0.142*"town" + -0.125*"province" + -0.119*"tropical" + -0.117*"storm"
2016-07-30 23:07:53,414 : INFO : preparing a new chunk of documents
2016-07-30 23:07:53,832 : INFO : using 100 extra samples and 2 power iterations
2016-07-30 23:07:53,833 : INFO : 1st phase: constructing (30736, 500) action matrix
2016-07-30 23:07:56,261 : INFO : orthonormalizing (30736, 500) action matrix
2016-07-30 23:08:17,619 : INFO : 2nd phase: running dense svd on (500, 20000) matrix
2016-07-30 23:08:27,141 : INFO : computing the final decomposition
2016-07-30 23:08:27,144 : INFO : keeping 400 factors (discarding 8.114% of energy spectrum)
2016-07-30 23:08:28,036 : INFO : merging projections: (30736, 400) + (30736, 400)
2016-07-30 23:08:33,935 : INFO : keeping 400 factors (discarding 12.351% of energy spectrum)
2016-07-30 23:08:35,616 : INFO : processed documents up to #60000
2016-07-30 23:08:35,621 : INFO : topic #0(22.428): 0.100*"league" + 0.097*"english" + 0.095*"actor" + 0.094*"music" + 0.093*"movie" + 0.087*"war" + 0.087*"german" + 0.087*"album" + 0.085*"player" + 0.084*"british"
2016-07-30 23:08:35,628 : INFO : topic #1(15.565): -0.577*"league" + -0.292*"football" + -0.231*"team" + -0.172*"club" + -0.172*"statistics" + -0.168*"premier" + -0.166*"division" + -0.148*"career" + -0.135*"played" + -0.131*"hockey"
2016-07-30 23:08:35,633 : INFO : topic #2(14.463): 0.359*"album" + 0.263*"band" + 0.211*"song" + 0.192*"released" + 0.185*"music" + -0.179*"river" + -0.140*"county" + 0.137*"albums" + 0.133*"chart" + 0.117*"guitar"
2016-07-30 23:08:35,639 : INFO : topic #3(14.305): -0.223*"river" + 0.197*"actor" + 0.174*"actress" + -0.166*"county" + 0.162*"footballer" + 0.160*"politician" + 0.145*"singer" + 0.132*"writer" + 0.119*"german" + 0.115*"french"
2016-07-30 23:08:35,642 : INFO : topic #4(12.701): -0.445*"river" + -0.368*"county" + -0.239*"album" + -0.174*"band" + -0.147*"district" + -0.127*"song" + -0.113*"province" + -0.108*"town" + -0.099*"chart" + -0.097*"released"
2016-07-30 23:08:41,669 : INFO : preparing a new chunk of documents
2016-07-30 23:08:42,171 : INFO : using 100 extra samples and 2 power iterations
2016-07-30 23:08:42,173 : INFO : 1st phase: constructing (30736, 500) action matrix
2016-07-30 23:08:44,141 : INFO : orthonormalizing (30736, 500) action matrix
2016-07-30 23:09:05,033 : INFO : 2nd phase: running dense svd on (500, 18550) matrix
2016-07-30 23:09:14,333 : INFO : computing the final decomposition
2016-07-30 23:09:14,338 : INFO : keeping 400 factors (discarding 8.538% of energy spectrum)
2016-07-30 23:09:15,179 : INFO : merging projections: (30736, 400) + (30736, 400)
2016-07-30 23:09:20,345 : INFO : keeping 400 factors (discarding 10.738% of energy spectrum)
2016-07-30 23:09:21,856 : INFO : processed documents up to #78550
2016-07-30 23:09:21,861 : INFO : topic #0(25.214): 0.117*"movie" + 0.105*"album" + 0.097*"actor" + 0.091*"music" + 0.091*"league" + 0.085*"english" + 0.084*"actress" + 0.083*"released" + 0.081*"singer" + 0.081*"war"
2016-07-30 23:09:21,864 : INFO : topic #1(17.118): -0.434*"league" + 0.272*"album" + -0.235*"football" + -0.220*"team" + 0.176*"song" + 0.164*"band" + -0.161*"hockey" + 0.154*"released" + -0.145*"club" + -0.144*"nhl"
2016-07-30 23:09:21,867 : INFO : topic #2(16.677): 0.331*"album" + 0.237*"league" + 0.203*"song" + 0.199*"band" + 0.192*"released" + -0.153*"river" + 0.133*"chart" + 0.127*"team" + 0.122*"albums" + 0.120*"played"
2016-07-30 23:09:21,871 : INFO : topic #3(15.612): -0.213*"river" + 0.210*"actor" + 0.190*"movie" + 0.180*"actress" + 0.155*"politician" + -0.153*"county" + -0.132*"league" + 0.117*"footballer" + 0.117*"writer" + 0.111*"president"
2016-07-30 23:09:21,876 : INFO : topic #4(14.108): -0.241*"county" + -0.237*"river" + 0.237*"movie" + -0.217*"album" + 0.137*"game" + -0.132*"band" + -0.120*"district" + 0.117*"series" + -0.114*"song" + -0.106*"chart"
2016-07-30 23:09:22,024 : INFO : saving Projection object under wiki_en.lsi.projection, separately None
2016-07-30 23:09:22,027 : INFO : storing numpy array 'u' to wiki_en.lsi.projection.u.npy
2016-07-30 23:09:37,104 : INFO : saving LsiModel object under wiki_en.lsi, separately None
2016-07-30 23:09:37,105 : INFO : not storing attribute dispatcher
2016-07-30 23:09:37,106 : INFO : not storing attribute projection
# print the most contributing words (both positively and negatively) for each of the first ten topics
lsi.print_topics(10)
2016-07-30 23:11:30,855 : INFO : topic #0(25.214): 0.117*"movie" + 0.105*"album" + 0.097*"actor" + 0.091*"music" + 0.091*"league" + 0.085*"english" + 0.084*"actress" + 0.083*"released" + 0.081*"singer" + 0.081*"war"
2016-07-30 23:11:30,862 : INFO : topic #1(17.118): -0.434*"league" + 0.272*"album" + -0.235*"football" + -0.220*"team" + 0.176*"song" + 0.164*"band" + -0.161*"hockey" + 0.154*"released" + -0.145*"club" + -0.144*"nhl"
2016-07-30 23:11:30,867 : INFO : topic #2(16.677): 0.331*"album" + 0.237*"league" + 0.203*"song" + 0.199*"band" + 0.192*"released" + -0.153*"river" + 0.133*"chart" + 0.127*"team" + 0.122*"albums" + 0.120*"played"
2016-07-30 23:11:30,871 : INFO : topic #3(15.612): -0.213*"river" + 0.210*"actor" + 0.190*"movie" + 0.180*"actress" + 0.155*"politician" + -0.153*"county" + -0.132*"league" + 0.117*"footballer" + 0.117*"writer" + 0.111*"president"
2016-07-30 23:11:30,877 : INFO : topic #4(14.108): -0.241*"county" + -0.237*"river" + 0.237*"movie" + -0.217*"album" + 0.137*"game" + -0.132*"band" + -0.120*"district" + 0.117*"series" + -0.114*"song" + -0.106*"chart"
2016-07-30 23:11:30,880 : INFO : topic #5(13.577): 0.391*"county" + 0.379*"river" + 0.362*"movie" + 0.134*"award" + 0.129*"television" + 0.111*"movies" + 0.101*"series" + 0.098*"district" + 0.093*"town" + 0.093*"actor"
2016-07-30 23:11:30,886 : INFO : topic #6(13.072): -0.385*"championship" + -0.385*"wrestling" + -0.364*"match" + -0.315*"wwe" + -0.253*"defeated" + 0.218*"league" + -0.212*"tag" + -0.153*"singles" + -0.137*"heavyweight" + -0.123*"team"
2016-07-30 23:11:30,889 : INFO : topic #7(12.694): -0.575*"county" + 0.526*"river" + -0.178*"university" + -0.137*"party" + 0.100*"de" + 0.099*"movie" + 0.083*"saint" + -0.082*"college" + -0.079*"president" + -0.076*"school"
2016-07-30 23:11:30,892 : INFO : topic #8(12.624): -0.387*"nhl" + -0.348*"hockey" + 0.278*"league" + -0.215*"river" + 0.197*"movie" + 0.189*"football" + -0.157*"ice" + 0.157*"county" + 0.144*"premier" + 0.143*"club"
2016-07-30 23:11:30,898 : INFO : topic #9(12.171): -0.455*"river" + -0.256*"party" + -0.182*"university" + 0.171*"de" + 0.163*"saint" + -0.146*"election" + 0.144*"county" + -0.139*"president" + 0.138*"emperor" + 0.132*"province"
[(0,
  '0.117*"movie" + 0.105*"album" + 0.097*"actor" + 0.091*"music" + 0.091*"league" + 0.085*"english" + 0.084*"actress" + 0.083*"released" + 0.081*"singer" + 0.081*"war"'),
 (1,
  '-0.434*"league" + 0.272*"album" + -0.235*"football" + -0.220*"team" + 0.176*"song" + 0.164*"band" + -0.161*"hockey" + 0.154*"released" + -0.145*"club" + -0.144*"nhl"'),
 (2,
  '0.331*"album" + 0.237*"league" + 0.203*"song" + 0.199*"band" + 0.192*"released" + -0.153*"river" + 0.133*"chart" + 0.127*"team" + 0.122*"albums" + 0.120*"played"'),
 (3,
  '-0.213*"river" + 0.210*"actor" + 0.190*"movie" + 0.180*"actress" + 0.155*"politician" + -0.153*"county" + -0.132*"league" + 0.117*"footballer" + 0.117*"writer" + 0.111*"president"'),
 (4,
  '-0.241*"county" + -0.237*"river" + 0.237*"movie" + -0.217*"album" + 0.137*"game" + -0.132*"band" + -0.120*"district" + 0.117*"series" + -0.114*"song" + -0.106*"chart"'),
 (5,
  '0.391*"county" + 0.379*"river" + 0.362*"movie" + 0.134*"award" + 0.129*"television" + 0.111*"movies" + 0.101*"series" + 0.098*"district" + 0.093*"town" + 0.093*"actor"'),
 (6,
  '-0.385*"championship" + -0.385*"wrestling" + -0.364*"match" + -0.315*"wwe" + -0.253*"defeated" + 0.218*"league" + -0.212*"tag" + -0.153*"singles" + -0.137*"heavyweight" + -0.123*"team"'),
 (7,
  '-0.575*"county" + 0.526*"river" + -0.178*"university" + -0.137*"party" + 0.100*"de" + 0.099*"movie" + 0.083*"saint" + -0.082*"college" + -0.079*"president" + -0.076*"school"'),
 (8,
  '-0.387*"nhl" + -0.348*"hockey" + 0.278*"league" + -0.215*"river" + 0.197*"movie" + 0.189*"football" + -0.157*"ice" + 0.157*"county" + 0.144*"premier" + 0.143*"club"'),
 (9,
  '-0.455*"river" + -0.256*"party" + -0.182*"university" + 0.171*"de" + 0.163*"saint" + -0.146*"election" + 0.144*"county" + -0.139*"president" + 0.138*"emperor" + 0.132*"province"')]

LDA

import logging, gensim, bz2
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# load id->word mapping (the dictionary), one of the results of step 2 above
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt.bz2')
# load corpus iterator
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
# mm = gensim.corpora.MmCorpus(bz2.BZ2File('wiki_en_tfidf.mm.bz2')) # use this if you compressed the TFIDF output

print(mm)
2016-07-30 23:11:42,441 : INFO : loaded corpus index from wiki_en_tfidf.mm.index
2016-07-30 23:11:42,443 : INFO : initializing corpus reader from wiki_en_tfidf.mm
2016-07-30 23:11:42,577 : INFO : accepted corpus with 78550 documents, 30736 features, 6839040 non-zero entries
MmCorpus(78550 documents, 30736 features, 6839040 non-zero entries)
lda_model_name = 'wiki_en.lda'

if os.path.isfile( './' + lda_model_name ):
    lda = gensim.models.ldamodel.LdaModel.load(lda_model_name)
else:
    # extract 100 LDA topics, using 1 pass and updating once every 1 chunk (10,000 documents)
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
    lda.save(lda_model_name)
2016-07-30 23:20:52,517 : INFO : using symmetric alpha at 0.01
2016-07-30 23:20:52,524 : INFO : using symmetric eta at 0.01
2016-07-30 23:20:52,525 : INFO : using serial LDA version on this node
2016-07-30 23:21:40,990 : INFO : running online LDA training, 100 topics, 1 passes over the supplied corpus of 78550 documents, updating model once every 10000 documents, evaluating perplexity every 78550 documents, iterating 50x with a convergence threshold of 0.001000
2016-07-30 23:21:40,993 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2016-07-30 23:21:49,460 : INFO : PROGRESS: pass 0, at document #10000/78550
2016-07-30 23:23:11,710 : INFO : merging changes from 10000 documents into a model of 78550 documents
2016-07-30 23:23:17,829 : INFO : topic #30 (0.010): 0.002*river + 0.002*cadmium + 0.002*word + 0.002*example + 0.002*soprano + 0.002*kerry + 0.002*genocide + 0.002*water + 0.002*ebay + 0.002*allende
2016-07-30 23:23:17,838 : INFO : topic #22 (0.010): 0.003*king + 0.002*ii + 0.002*nations + 0.002*axis + 0.002*county + 0.002*england + 0.002*de + 0.002*player + 0.002*aircraft + 0.002*bc
2016-07-30 23:23:17,849 : INFO : topic #72 (0.010): 0.002*february + 0.002*movie + 0.002*horse + 0.001*king + 0.001*march + 0.001*actor + 0.001*british + 0.001*french + 0.001*league + 0.001*english
2016-07-30 23:23:17,860 : INFO : topic #19 (0.010): 0.002*king + 0.002*bilbao + 0.002*kansas + 0.002*england + 0.002*pope + 0.001*day + 0.001*mythology + 0.001*english + 0.001*water + 0.001*league
2016-07-30 23:23:17,871 : INFO : topic #11 (0.010): 0.002*english + 0.002*person + 0.002*president + 0.001*yes + 0.001*day + 0.001*afi + 0.001*famous + 0.001*primes + 0.001*chapters + 0.001*town
2016-07-30 23:23:17,883 : INFO : topic diff=75.635263, rho=1.000000
2016-07-30 23:23:25,971 : INFO : PROGRESS: pass 0, at document #20000/78550
2016-07-30 23:24:48,021 : INFO : merging changes from 10000 documents into a model of 78550 documents
2016-07-30 23:24:52,982 : INFO : topic #86 (0.010): 0.003*haiku + 0.002*limburg + 0.002*novgorod + 0.002*diaries + 0.002*avon + 0.002*rock + 0.002*crowley + 0.002*bratislava + 0.002*dharma + 0.002*castle
2016-07-30 23:24:52,992 : INFO : topic #16 (0.010): 0.003*fruit + 0.003*clarksville + 0.003*flags + 0.003*population + 0.002*electrical + 0.002*frankfurt + 0.002*voltage + 0.002*dam + 0.002*solar + 0.002*current
2016-07-30 23:24:53,001 : INFO : topic #90 (0.010): 0.007*bei + 0.007*bern + 0.007*district + 0.005*municipalities + 0.004*canton + 0.004*capital + 0.003*nam + 0.003*han + 0.003*dynasty + 0.003*baden
2016-07-30 23:24:53,013 : INFO : topic #41 (0.010): 0.003*church + 0.003*german + 0.003*burg + 0.003*palatinate + 0.003*switzerland + 0.002*municipalities + 0.002*rhineland + 0.002*hereford + 0.002*district + 0.002*baden
2016-07-30 23:24:53,022 : INFO : topic #7 (0.010): 0.005*jupiter + 0.003*zürich + 0.003*iau + 0.003*saturn + 0.003*hurricane + 0.003*awards + 0.002*division + 0.002*canton + 0.002*league + 0.002*music
2016-07-30 23:24:53,036 : INFO : topic diff=1.608188, rho=0.707107
2016-07-30 23:25:00,163 : INFO : PROGRESS: pass 0, at document #30000/78550
2016-07-30 23:26:15,085 : INFO : merging changes from 10000 documents into a model of 78550 documents
2016-07-30 23:26:19,243 : INFO : topic #16 (0.010): 0.006*dam + 0.004*census + 0.004*flags + 0.003*population + 0.003*circuit + 0.003*voltage + 0.003*privatisation + 0.003*fruit + 0.003*current + 0.003*living
2016-07-30 23:26:19,252 : INFO : topic #75 (0.010): 0.019*district + 0.012*coat + 0.011*arms + 0.008*saxony + 0.007*rural + 0.007*towns + 0.006*districts + 0.005*islam + 0.005*ahl + 0.005*groß
2016-07-30 23:26:19,262 : INFO : topic #56 (0.010): 0.007*sindh + 0.006*indus + 0.004*alvin + 0.004*mozilla + 0.003*texas + 0.003*saxe + 0.003*prototype + 0.003*district + 0.003*civilization + 0.002*weimar
2016-07-30 23:26:19,272 : INFO : topic #91 (0.010): 0.018*league + 0.013*mario + 0.012*flag + 0.011*ligue + 0.009*super + 0.009*nintendo + 0.008*football + 0.007*game + 0.007*bros + 0.006*smash
2016-07-30 23:26:19,283 : INFO : topic #51 (0.010): 0.005*river + 0.004*fork + 0.003*island + 0.003*paraná + 0.003*republican + 0.003*gauge + 0.003*observatory + 0.003*county + 0.003*species + 0.003*linux
2016-07-30 23:26:19,296 : INFO : topic diff=1.824947, rho=0.577350
2016-07-30 23:26:25,333 : INFO : PROGRESS: pass 0, at document #40000/78550
2016-07-30 23:27:36,792 : INFO : merging changes from 10000 documents into a model of 78550 documents
2016-07-30 23:27:41,044 : INFO : topic #44 (0.010): 0.005*hyundai + 0.005*mitsubishi + 0.004*cluster + 0.004*quarterback + 0.004*alekhine + 0.003*tour + 0.003*canton + 0.003*saudi + 0.003*arabia + 0.003*cycling
2016-07-30 23:27:41,054 : INFO : topic #1 (0.010): 0.004*scottish + 0.004*ross + 0.003*halifax + 0.003*qmjhl + 0.003*thistle + 0.003*wars + 0.003*qur + 0.003*pinned + 0.003*china + 0.003*zhejiang
2016-07-30 23:27:41,063 : INFO : topic #14 (0.010): 0.005*canon + 0.004*dibiase + 0.004*interval + 0.004*shorts + 0.004*underwear + 0.004*vasco + 0.004*diego + 0.004*bronx + 0.004*pants + 0.004*berkshire
2016-07-30 23:27:41,075 : INFO : topic #47 (0.010): 0.004*instruction + 0.004*trek + 0.004*linux + 0.003*code + 0.003*ionian + 0.003*register + 0.003*label + 0.003*processor + 0.003*processors + 0.003*carlos
2016-07-30 23:27:41,087 : INFO : topic #89 (0.010): 0.007*crystalline + 0.005*leaves + 0.005*arica + 0.005*flowers + 0.004*plant + 0.004*phoenix + 0.004*image + 0.004*jpg + 0.004*motherwell + 0.004*plants
2016-07-30 23:27:41,100 : INFO : topic diff=1.892730, rho=0.500000
2016-07-30 23:27:49,977 : INFO : PROGRESS: pass 0, at document #50000/78550
2016-07-30 23:28:54,200 : INFO : merging changes from 10000 documents into a model of 78550 documents
2016-07-30 23:28:57,987 : INFO : topic #83 (0.010): 0.043*wwe + 0.011*raw + 0.010*match + 0.009*smackdown + 0.006*ring + 0.006*clown + 0.006*women + 0.005*ichihara + 0.005*concepción + 0.005*sex
2016-07-30 23:28:58,000 : INFO : topic #13 (0.010): 0.015*language + 0.011*languages + 0.009*singapore + 0.008*azerbaijani + 0.007*indo + 0.007*spoken + 0.006*dialect + 0.006*arrondissement + 0.005*speakers + 0.005*pie
2016-07-30 23:28:58,013 : INFO : topic #24 (0.010): 0.019*primera + 0.010*orbit + 0.008*asteroid + 0.006*división + 0.006*asteroids + 0.005*belt + 0.005*serial + 0.004*yun + 0.004*sun + 0.004*earth
2016-07-30 23:28:58,025 : INFO : topic #99 (0.010): 0.013*opera + 0.008*music + 0.007*wagner + 0.004*operas + 0.004*poem + 0.004*rachel + 0.004*sang + 0.003*garland + 0.003*awards + 0.003*irish
2016-07-30 23:28:58,038 : INFO : topic #49 (0.010): 0.020*arsenic + 0.011*drug + 0.010*pool + 0.007*wainwright + 0.006*suv + 0.005*dough + 0.005*crossover + 0.004*uc + 0.004*drugs + 0.004*flour
2016-07-30 23:28:58,056 : INFO : topic diff=1.807750, rho=0.447214
2016-07-30 23:29:01,554 : INFO : PROGRESS: pass 0, at document #60000/78550
2016-07-30 23:30:01,909 : INFO : merging changes from 10000 documents into a model of 78550 documents
2016-07-30 23:30:04,972 : INFO : topic #15 (0.010): 0.015*abbottabad + 0.012*pyramid + 0.008*français + 0.008*kardashian + 0.008*holmes + 0.007*bin + 0.007*wisconsin + 0.007*hancock + 0.006*lavigne + 0.006*salle
2016-07-30 23:30:04,983 : INFO : topic #4 (0.010): 0.038*airport + 0.017*air + 0.015*airlines + 0.015*aviation + 0.014*boeing + 0.014*aircraft + 0.013*skier + 0.012*skiing + 0.011*airline + 0.010*slalom
2016-07-30 23:30:04,997 : INFO : topic #74 (0.010): 0.008*chess + 0.005*japan + 0.005*japanese + 0.003*conduct + 0.003*tokyo + 0.003*jaxa + 0.002*gaga + 0.002*okayama + 0.002*prefectural + 0.002*shizuoka
2016-07-30 23:30:05,008 : INFO : topic #98 (0.010): 0.019*cbs + 0.019*espn + 0.014*usher + 0.008*soap + 0.007*nbc + 0.007*abc + 0.007*kong + 0.006*hong + 0.006*chevrolet + 0.006*golf
2016-07-30 23:30:05,018 : INFO : topic #72 (0.010): 0.014*disability + 0.009*joseon + 0.006*car + 0.006*vehicle + 0.005*game + 0.004*cars + 0.004*wheelchair + 0.004*okinawa + 0.004*horse + 0.003*zelda
2016-07-30 23:30:05,032 : INFO : topic diff=1.645231, rho=0.408248
2016-07-30 23:30:10,463 : INFO : PROGRESS: pass 0, at document #70000/78550
2016-07-30 23:31:09,183 : INFO : merging changes from 10000 documents into a model of 78550 documents
2016-07-30 23:31:11,881 : INFO : topic #23 (0.010): 0.017*station + 0.009*railway + 0.008*line + 0.006*london + 0.006*parish + 0.005*network + 0.005*trains + 0.005*street + 0.005*file + 0.004*train
2016-07-30 23:31:11,889 : INFO : topic #9 (0.010): 0.005*energy + 0.004*light + 0.004*system + 0.003*data + 0.003*if + 0.003*example + 0.003*function + 0.003*will + 0.003*water + 0.002*earth
2016-07-30 23:31:11,902 : INFO : topic #12 (0.010): 0.016*emperor + 0.013*roman + 0.011*pope + 0.006*catholic + 0.006*church + 0.005*month + 0.005*imperial + 0.005*reign + 0.005*rome + 0.005*bishop
2016-07-30 23:31:11,912 : INFO : topic #70 (0.010): 0.012*posse + 0.011*freestyle + 0.009*interchange + 0.009*digital + 0.009*hospice + 0.008*granada + 0.008*fi + 0.008*catalonia + 0.007*snails + 0.007*molluscs
2016-07-30 23:31:11,920 : INFO : topic #3 (0.010): 0.019*israel + 0.018*jewish + 0.015*mosque + 0.013*hebrew + 0.012*tamil + 0.010*jerusalem + 0.010*seamount + 0.010*jesus + 0.008*bible + 0.008*cardiac
2016-07-30 23:31:11,932 : INFO : topic diff=1.479885, rho=0.377964
2016-07-30 23:33:29,898 : INFO : -15.658 per-word bound, 51709.6 perplexity estimate based on a held-out corpus of 8550 documents with 48268 words
2016-07-30 23:33:29,903 : INFO : PROGRESS: pass 0, at document #78550/78550
2016-07-30 23:34:22,515 : INFO : merging changes from 8550 documents into a model of 78550 documents
2016-07-30 23:34:25,455 : INFO : topic #58 (0.010): 0.021*album + 0.016*song + 0.015*band + 0.012*music + 0.012*released + 0.011*albums + 0.010*singer + 0.009*vocals + 0.009*guitar + 0.008*songs
2016-07-30 23:34:25,465 : INFO : topic #10 (0.010): 0.006*race + 0.005*complications + 0.005*cancer + 0.005*prize + 0.004*university + 0.003*raced + 0.003*nobel + 0.003*january + 0.003*racing + 0.003*prix
2016-07-30 23:34:25,474 : INFO : topic #79 (0.010): 0.035*party + 0.023*minister + 0.023*election + 0.022*politician + 0.017*member + 0.014*served + 0.014*president + 0.013*liberal + 0.013*parliament + 0.013*prime
2016-07-30 23:34:25,483 : INFO : topic #89 (0.010): 0.019*chile + 0.015*floorball + 0.013*jönköping + 0.011*chilean + 0.011*holder + 0.008*omaha + 0.007*apartheid + 0.006*allt + 0.006*lethal + 0.006*vill
2016-07-30 23:34:25,491 : INFO : topic #77 (0.010): 0.016*egypt + 0.014*egyptian + 0.012*bc + 0.009*coached + 0.008*mumbai + 0.007*cairo + 0.006*fa + 0.006*premier + 0.006*coaching + 0.005*discogs
2016-07-30 23:34:25,503 : INFO : topic diff=1.302134, rho=0.353553
2016-07-30 23:34:25,595 : INFO : saving LdaState object under wiki_en.lda.state, separately None
2016-07-30 23:34:31,476 : INFO : saving LdaModel object under wiki_en.lda, separately None
2016-07-30 23:34:31,481 : INFO : not storing attribute dispatcher
2016-07-30 23:34:31,485 : INFO : not storing attribute state
# print the most contributing words for 20 randomly selected topics
lda.print_topics(10)
2016-07-30 23:00:31,418 : INFO : topic #11 (0.010): 0.015*reporter + 0.013*congo + 0.012*mumbai + 0.010*maharashtra + 0.009*seamount + 0.008*metre + 0.008*copenhagen + 0.007*botswana + 0.007*lapland + 0.007*journalism
2016-07-30 23:00:31,430 : INFO : topic #41 (0.010): 0.009*mastering + 0.008*ada + 0.008*blunt + 0.007*climbing + 0.007*wizards + 0.006*honduras + 0.006*devi + 0.006*rope + 0.006*alt + 0.006*chiefs
2016-07-30 23:00:31,444 : INFO : topic #93 (0.010): 0.003*if + 0.003*person + 0.002*will + 0.002*usually + 0.002*space + 0.002*do + 0.002*blood + 0.002*game + 0.002*example + 0.002*each
2016-07-30 23:00:31,455 : INFO : topic #57 (0.010): 0.014*japanese + 0.012*japan + 0.010*bk + 0.010*calendar + 0.010*preceded + 0.009*edo + 0.008*period + 0.007*tokyo + 0.007*anjou + 0.007*gd
2016-07-30 23:00:31,468 : INFO : topic #4 (0.010): 0.012*mountain + 0.011*mount + 0.009*volcano + 0.009*park + 0.008*kerala + 0.008*bc + 0.008*india + 0.007*earthquake + 0.007*lahore + 0.007*mughal
2016-07-30 23:00:31,478 : INFO : topic #56 (0.010): 0.023*batman + 0.020*coaster + 0.016*roller + 0.015*gothenburg + 0.015*professionally + 0.014*cincinnati + 0.012*mohamed + 0.010*iranian + 0.010*saskatchewan + 0.009*llc
2016-07-30 23:00:31,490 : INFO : topic #76 (0.010): 0.010*language + 0.009*zealand + 0.008*languages + 0.007*philosophy + 0.005*auckland + 0.005*emporis + 0.004*beetle + 0.004*spoken + 0.004*benin + 0.004*māori
2016-07-30 23:00:31,500 : INFO : topic #5 (0.010): 0.009*singh + 0.008*garcía + 0.008*murdoch + 0.007*belize + 0.007*flores + 0.007*ichinomiya + 0.006*filmfare + 0.006*bollywood + 0.006*qmjhl + 0.006*universidad
2016-07-30 23:00:31,510 : INFO : topic #92 (0.010): 0.007*insects + 0.006*tax + 0.005*sulfur + 0.004*ants + 0.004*crow + 0.004*beer + 0.004*sitcoms + 0.004*wasps + 0.004*nest + 0.003*nests
2016-07-30 23:00:31,519 : INFO : topic #33 (0.010): 0.018*parish + 0.018*lake + 0.010*county + 0.009*hc + 0.009*highness + 0.005*northumbria + 0.005*harbour + 0.004*ship + 0.004*milwaukee + 0.004*municipality
[(11,
  '0.015*reporter + 0.013*congo + 0.012*mumbai + 0.010*maharashtra + 0.009*seamount + 0.008*metre + 0.008*copenhagen + 0.007*botswana + 0.007*lapland + 0.007*journalism'),
 (41,
  '0.009*mastering + 0.008*ada + 0.008*blunt + 0.007*climbing + 0.007*wizards + 0.006*honduras + 0.006*devi + 0.006*rope + 0.006*alt + 0.006*chiefs'),
 (93,
  '0.003*if + 0.003*person + 0.002*will + 0.002*usually + 0.002*space + 0.002*do + 0.002*blood + 0.002*game + 0.002*example + 0.002*each'),
 (57,
  '0.014*japanese + 0.012*japan + 0.010*bk + 0.010*calendar + 0.010*preceded + 0.009*edo + 0.008*period + 0.007*tokyo + 0.007*anjou + 0.007*gd'),
 (4,
  '0.012*mountain + 0.011*mount + 0.009*volcano + 0.009*park + 0.008*kerala + 0.008*bc + 0.008*india + 0.007*earthquake + 0.007*lahore + 0.007*mughal'),
 (56,
  '0.023*batman + 0.020*coaster + 0.016*roller + 0.015*gothenburg + 0.015*professionally + 0.014*cincinnati + 0.012*mohamed + 0.010*iranian + 0.010*saskatchewan + 0.009*llc'),
 (76,
  '0.010*language + 0.009*zealand + 0.008*languages + 0.007*philosophy + 0.005*auckland + 0.005*emporis + 0.004*beetle + 0.004*spoken + 0.004*benin + 0.004*māori'),
 (5,
  '0.009*singh + 0.008*garcía + 0.008*murdoch + 0.007*belize + 0.007*flores + 0.007*ichinomiya + 0.006*filmfare + 0.006*bollywood + 0.006*qmjhl + 0.006*universidad'),
 (92,
  '0.007*insects + 0.006*tax + 0.005*sulfur + 0.004*ants + 0.004*crow + 0.004*beer + 0.004*sitcoms + 0.004*wasps + 0.004*nest + 0.003*nests'),
 (33,
  '0.018*parish + 0.018*lake + 0.010*county + 0.009*hc + 0.009*highness + 0.005*northumbria + 0.005*harbour + 0.004*ship + 0.004*milwaukee + 0.004*municipality')]

Doc2Vec

d2v = gensim.models.doc2vec.Doc2Vec(docvecs_mapfile=mm, size=100, window=8, min_count=5, workers=4)
d2v.print_topics(10)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-33-df92a122c8f5> in <module>()
      1 d2v = gensim.models.doc2vec.Doc2Vec(docvecs_mapfile=mm, size=100, window=8, min_count=5, workers=4)
----> 2 d2v.print_topics(10)

AttributeError: 'Doc2Vec' object has no attribute 'print_topics'

Word2Vec

w2v = gensim.models.wrd2vec.Wrd2Vec(docvecs_mapfile=mm, size=100, window=8, min_count=5, workers=4)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-d6d69333143a> in <module>()
----> 1 w2v = gensim.models.wrd2vec.Wrd2Vec(docvecs_mapfile=mm, size=100, window=8, min_count=5, workers=4)

AttributeError: 'module' object has no attribute 'wrd2vec'
from gensim.summarization import summarize

text = "Thomas A. Anderson is a man living two lives. By day he is an " + \
    "average computer programmer and by night a hacker known as " + \
    "Neo. Neo has always questioned his reality, but the truth is " + \
    "far beyond his imagination. Neo finds himself targeted by the " + \
    "police when he is contacted by Morpheus, a legendary computer " + \
    "hacker branded a terrorist by the government. Morpheus awakens " + \
    "Neo to the real world, a ravaged wasteland where most of " + \
    "humanity have been captured by a race of machines that live " + \
    "off of the humans' body heat and electrochemical energy and " + \
    "who imprison their minds within an artificial reality known as " + \
    "the Matrix. As a rebel against the machines, Neo must return to " + \
    "the Matrix and confront the agents: super-powerful computer " + \
    "programs devoted to snuffing out Neo and the entire human " + \
    "rebellion. "

print('Input text:')
print(text)
    

print('Summary:')
print(summarize(text))

print('Keywords:')
print(keywords(text, ratio=0.1))
2016-07-31 05:42:43,212 : WARNING : Input text is expected to have at least 10 sentences.
2016-07-31 05:42:43,214 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2016-07-31 05:42:43,216 : INFO : built Dictionary(53 unique tokens: ['wasteland', 'rebellion', 'rebel', 'imagin', 'electrochem']...) from 6 documents (total 68 corpus positions)
2016-07-31 05:42:43,219 : WARNING : Input corpus is expected to have at least 10 documents.
Input text:
Thomas A. Anderson is a man living two lives. By day he is an average computer programmer and by night a hacker known as Neo. Neo has always questioned his reality, but the truth is far beyond his imagination. Neo finds himself targeted by the police when he is contacted by Morpheus, a legendary computer hacker branded a terrorist by the government. Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix. As a rebel against the machines, Neo must return to the Matrix and confront the agents: super-powerful computer programs devoted to snuffing out Neo and the entire human rebellion. 
Summary:
Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix.
Keywords:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-18-ca9772919a19> in <module>()
     11 
     12 print('Keywords:')
---> 13 print(keywords(text, ratio=0.1))
     14 

NameError: name 'keywords' is not defined