KeyError in gensim.prepare #5

christofs · 2015-06-15T11:35:20Z

Hi there, I'm using gensim to do LDA on a collection of novels (using just 40 for testing; I have several hundred). Building the corpus and dictionary seems to work fine, as does the modeling process itself. I can also inspect the resulting model (topics in documents and words in topics, for example). However, when attempting to use pyLDAvis, I run into a KeyError.

I'm on Linux (Ubuntu 14.04) and using Python 3.4 and the following versions of relevant modules:
pyLDAvis 1.2.0
numpy 1.9.2
gensim 0.11.1-1

This is my code (loading corpus, dictionary and model from previous step):

def gensim_output(modelfile, corpusfile, dictionaryfile): 
    """Displaying gensim topic models"""
    ## Load files from "gensim_modeling"
    corpus = corpora.MmCorpus(corpusfile)
    dictionary = corpora.Dictionary.load(dictionaryfile) # for pyLDAvis
    myldamodel = models.ldamodel.LdaModel.load(modelfile)    

    ## Interactive visualisation
    import pyLDAvis.gensim
    vis = pyLDAvis.gensim.prepare(myldamodel, corpus, dictionary)
    pyLDAvis.display(vis)

This is the output I get:

Traceback (most recent call last):

  File "<ipython-input-79-940daa51d8a9>", line 1, in <module>
    runfile('/home/[PATH]/an5/mygensim.py', wdir='/home/christof/Dropbox/0-Analysen/2015/rp_Sydney/an5')

  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 586, in runfile
    execfile(filename, namespace)

  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 48, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)

  File "/home/[PATH]/an5/mygensim.py", line 84, in <module>
    main("./5_lemmata/*.txt", "gensim_corpus.dict", "gensim_corpus.mm", "gensim_modelfile.gensim")

  File "/home/[PATH]/an5/mygensim.py", line 82, in main
    gensim_output(modelfile, corpusfile, dictionaryfile)

  File "/home/[PATH]/an5/mygensim.py", line 75, in gensim_output
    vis = pyLDAvis.gensim.prepare(myldamodel, corpus, dictionary)

  File "/usr/local/lib/python3.4/dist-packages/pyLDAvis/gensim.py", line 61, in prepare
    return vis_prepare(**_extract_data(topic_model, corpus, dictionary))

  File "/usr/local/lib/python3.4/dist-packages/pyLDAvis/gensim.py", line 24, in _extract_data
    term_freqs = [term_freqs_dict[id] for id in xrange(N)]

  File "/usr/local/lib/python3.4/dist-packages/pyLDAvis/gensim.py", line 24, in <listcomp>
    term_freqs = [term_freqs_dict[id] for id in xrange(N)]

KeyError: 6

Not sure whether this is a bug or bad usage of the module. Any help would be very much appreciated.


bmabey · 2015-06-15T13:43:00Z

Thanks for reporting this. The problem, it seems, is that pyLDAvis assumes a compacted dictionary with a contiguous range of IDs. That will not hold if you have removed tokens from the dictionary and have not called compactify() on it afterwards.

I'll look into removing this assumption made by pyLDAvis. In the meantime I would suggest that you call dictionary.compactify() before training your model. Give that a try and let me know if that helps.
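To illustrate the contiguity assumption described above, here is a minimal pure-Python sketch. It only mimics the lookup pattern visible in the traceback (`term_freqs_dict[id] for id in xrange(N)`); the data and the two helper functions are hypothetical and are not pyLDAvis code.

```python
# Hypothetical term-frequency mapping whose IDs have a gap, as can happen
# when tokens are removed from a dictionary without compacting it.
term_freqs_dict = {0: 12, 1: 7, 2: 3, 4: 9}  # ID 3 is missing
N = 5  # size implied by the largest ID + 1


def dense_term_freqs(freqs, n):
    """Index every ID in range(n); raises KeyError at the first gap."""
    return [freqs[i] for i in range(n)]


def tolerant_term_freqs(freqs, n):
    """Same lookup, but treat a missing ID as frequency 0."""
    return [freqs.get(i, 0) for i in range(n)]


try:
    dense_term_freqs(term_freqs_dict, N)
    dense_error = None
except KeyError as exc:
    dense_error = exc.args[0]  # the first ID that was not found

tolerant = tolerant_term_freqs(term_freqs_dict, N)
print(dense_error, tolerant)
```

The tolerant variant is just one way such an assumption could be relaxed; whether a missing ID should count as zero or be reported as an error is a design choice.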

christofs · 2015-06-15T16:36:50Z

Hi there, thanks for the quick response. Unfortunately, calling dictionary.compactify() does not help. Removing any preprocessing of the dictionary (I did in fact delete some low-frequency tokens) did not help, either. The error shifts to KeyError: 1 or KeyError: 0 but otherwise the behavior stays the same.

I'll investigate if something else goes wrong further upstream. I'm also somewhat unimpressed with the topics I get, so maybe there is some other problem. I will report back if I come across anything that seems relevant.

huihuifan · 2015-06-15T16:41:36Z

I usually delete low-frequency tokens before constructing the dictionary. I have gotten a similar error before when manually deleting low-frequency tokens from the dictionary, but re-compactifying resolved it. I'm not sure about your bug, though.

Perhaps further text preprocessing could help your topics?

bmabey · 2015-06-15T16:47:13Z

Could you provide me with example code of how you are creating your dictionary?

christofs · 2015-06-15T17:46:25Z

Sure, here you go:

def make_gensim_corpus(inpath,dictionaryfile,corpusfile):
    """Turn collection of text files into corpus for gensim."""
    ## Create list of document texts and list of document idnos.
    all_texts = []
    all_idnos = []
    for file in glob.glob(inpath): 
        with open(file) as infile:
            text = infile.read()
            text = re.split(" ", text)
            all_texts.append(text)
            idno,ext = os.path.splitext(os.path.basename(file))
            all_idnos.append(idno)
    ## Delete tokens which only appear once
    from collections import defaultdict
    frequency = defaultdict(int)
    for text in all_texts:
        for token in text:
            frequency[token] += 1
    all_texts = [[token for token in text if frequency[token] > 1] for text in all_texts]
    ## Build dictionary and corpus
    dictionary = corpora.Dictionary(all_texts)
    dictionary.compactify() # suggested by bmabey
    dictionary.save(dictionaryfile) # stores the dictionary
    dictionary.save_as_text("text_"+dictionaryfile) # stores the dictionary
    corpus = [dictionary.doc2bow(text) for item in all_texts]
    corpora.MmCorpus.serialize(corpusfile, corpus) # stores the corpus
    print("Done building the corpus.")

christofs · 2015-06-15T17:48:59Z

Basically, I have each document in its own file and read them from there into a "one document per line" structure. I keep track of the file identifiers for later use. For modeling, I load the saved dictionary so I don't have to rebuild it every time I change something in the modeling step.

christofs · 2015-06-15T19:46:21Z

There is something else wrong. With 400 novels, 5000 iterations and 50 passes, I still get the following virtually identical topics.

0   monsieur ami marquis marquise madame homme vicomte les suis femme heure nom mari mère jour père temps rue enfant fois 
1   monsieur marquise madame ami marquis vicomte suis les homme femme mari affaire jour nom rue heure père fois temps enfant 
2   monsieur madame marquise marquis les ami homme vicomte femme suis nom heure temps affaire jour coup mari fois mère enfant 
3   monsieur suis vicomte homme madame marquise ami les marquis femme nom heure enfant temps affaire fois mère mari père étudiant 
4   monsieur madame marquise vicomte homme ami femme suis marquis nom mari heure rue les main enfant affaire jour mère père 
5   monsieur marquise ami marquis homme femme madame suis les vicomte mari nom enfant mère jour rue heure temps affaire étudiant 
6   monsieur ami marquise homme madame les vicomte suis femme marquis nom mari affaire fois mère heure rue œil jour enfant 
7   marquise monsieur homme vicomte madame marquis suis ami les femme nom mère temps rue fils enfant jour mari heure mot 
8   monsieur marquise madame homme marquis ami vicomte suis les heure affaire rue mère femme nom enfant père mari quartier fois 
9   monsieur marquise madame ami les suis marquis vicomte homme heure femme nom mère temps enfant père mari affaire fils mot 
10  monsieur marquise homme marquis ami suis madame femme les vicomte affaire mère temps nom mari œil enfant rue jour heure 
11  monsieur madame vicomte les marquis suis femme homme marquise mari enfant nom affaire rue ami temps jour moment coup étudiant 
12  monsieur madame marquise ami marquis vicomte homme suis les femme nom heure mère enfant mari temps rue affaire père jour 
13  monsieur vicomte marquise madame ami femme suis homme marquis les nom affaire temps heure jour enfant rue mère fils mari 
14  monsieur homme marquis vicomte madame marquise suis ami les femme affaire heure mère enfant mari rue fois temps nom étudiant 
15  monsieur marquise ami les suis madame homme vicomte marquis affaire heure rue mari mère enfant nom femme temps fils père 
16  monsieur ami vicomte madame suis homme marquise marquis les nom femme mari temps coup mère main rue œil heure fois 
17  monsieur madame marquise ami vicomte suis homme les marquis femme affaire enfant nom rue heure fils père mère mari jour 
18  monsieur les marquise vicomte homme madame ami suis femme mari nom enfant marquis rue affaire heure mère fils jour temps

Seems like I'm doing something wrong upstream, so this whole thing may be unrelated to pyldavis. Sorry about that.
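One possible upstream cause, based only on a reading of the make_gensim_corpus snippet posted earlier in this thread and not confirmed by anyone in it: the line `corpus = [dictionary.doc2bow(text) for item in all_texts]` iterates with `item` but passes `text`, which is still bound to the last document from the earlier loop. If that reading is right, every document in the corpus would be the same bag of words, which might explain both the near-identical topics and missing term IDs. The sketch below uses a toy `doc2bow` stand-in (an assumption, not gensim's implementation) to show the effect.

```python
def doc2bow(tokens, token2id):
    """Toy stand-in for a bag-of-words conversion: sorted (id, count)
    pairs, with tokens missing from token2id ignored."""
    counts = {}
    for tok in tokens:
        if tok in token2id:
            tok_id = token2id[tok]
            counts[tok_id] = counts.get(tok_id, 0) + 1
    return sorted(counts.items())


all_texts = [["monsieur", "ami"], ["madame", "rue", "rue"], ["enfant", "ami"]]
token2id = {"ami": 0, "enfant": 1, "madame": 2, "monsieur": 3, "rue": 4}

# An earlier loop over the texts leaves `text` bound to the last document.
for text in all_texts:
    pass

# Pattern from the posted snippet: loop variable `item` is never used.
buggy = [doc2bow(text, token2id) for item in all_texts]

# Using the loop variable gives one distinct bag of words per document.
fixed = [doc2bow(item, token2id) for item in all_texts]

# Only the IDs of the last document ever appear in the buggy corpus.
ids_in_buggy = {tok_id for doc in buggy for tok_id, _ in doc}
print(buggy)
print(fixed)
print(ids_in_buggy)
```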

bmabey · 2015-06-15T20:03:05Z

Yeah, I certainly can't speak to that problem. The gensim mailing list would be a better place to ask about that issue.

prateekmehta · 2015-06-23T09:12:28Z

Hey, I am also getting a KeyError when I use an external dictionary to convert docs to bag-of-words.

What I am doing is the following:
id2word = dictionary.load("someDictionary")
corpus = [id2word.doc2bow(doc) for doc in docs]
ldamodel = model.LdaModel(corpus, id2word, num_topics...)

I get a model, which I can explore.

But vis = pyLDAvis.gensim.prepare(ldamodel, corpus, id2word)

produces KeyErrors, and I can't understand why.

I even tried calling compactify() on id2word before transforming the corpus and training the model, but that doesn't help either.

Kindly look into it.

bmabey · 2015-06-23T11:36:31Z

Please provide me with the code and corpus that created the dictionary. Without a way to reproduce this bug locally it will be hard for me to fix this.

bmabey · 2015-07-24T16:01:03Z

I just merged in #17 which may address this bug. If someone running into this error can clone and run setup.py on master I'd appreciate the feedback. Otherwise I'll assume this has fixed it and I'll close this in a week or so.

dpatschke · 2015-09-17T22:11:24Z

I 'pip installed' pyLDAvis today and ran into a key error problem as well.

The key error could be related to a specific Python 3.x issue.

The line where the problem occurred for me is here (in gensim.py):
topics_df = pd.DataFrame([dict((y,x) for x, y in tuples) for tuples in topics])[vocab]

But the real culprit is here:
vocab = dictionary.token2id.keys()

From my understanding, in Python 3 the 'dict_keys' object created by the above call is an iterable view, whereas in Python 2 the object created is a list. A pandas DataFrame appears to be unable to subset using this iterable, so a KeyError occurs.

Changing the vocab assignment to:
vocab = list(dictionary.token2id.keys())
should work for both Python 2 and Python 3. (Worked for me in Python 3 at least)
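The difference described above can be seen without pandas or gensim. This is a small sketch with a hypothetical `token2id` mapping; it only shows how `dict.keys()` behaves on Python 3 and does not reproduce the pandas error itself.

```python
# Hypothetical token-to-ID mapping standing in for dictionary.token2id.
token2id = {"monsieur": 0, "madame": 1, "ami": 2}

keys_view = token2id.keys()      # a dict_keys view on Python 3
vocab = list(token2id.keys())    # a plain list on both Python 2 and 3

# A view is iterable but cannot be indexed by position like a list.
try:
    keys_view[0]
    view_indexable = True
except TypeError:
    view_indexable = False

print(type(keys_view).__name__, type(vocab).__name__, view_indexable)
```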

bmabey · 2015-09-17T22:18:05Z

@dpatschke Thanks. So, did you make this change locally and have it work for you in Python 3 then? If so, want to submit a PR for it?

dpatschke · 2015-09-17T22:36:09Z

@bmabey I did change it locally and got it to work in Python 3 successfully. Let me work on the pull request for it.

dpatschke · 2015-09-17T23:02:47Z

@bmabey I had a problem cloning with the 'lfs' error (similar to another issue previously listed). If you care to make the fix yourself, feel free. Otherwise, give me a little time to try and get everything set up on my end so as not to delete necessary files.

Grinshpun · 2015-10-24T22:30:29Z

dpatschke suggested on Sept 17 that modifying a line in gensim.py would help with the KeyError problem. I'm running a WinPython 2.7 environment on Windows 10 and installed pyLDAvis today. I tried the change: no dice. I really do need help with this, since I'm terribly new to Python and terribly late with my work, so any moral and technical support will be greatly appreciated.

Traceback (most recent call last):
  File "C:\Users\Vadim\Documents\PhD\Software\My_pyLDAvis\vis.py", line 16, in <module>
    run_vis()
  File "C:\Users\Vadim\Documents\PhD\Software\My_pyLDAvis\vis.py", line 13, in run_vis
    pyLDAvis.gensim.prepare(lda, corpus, dictionary)
  File "C:\WinPython27\python-2.7.10.amd64\lib\site-packages\pyLDAvis\gensim.py", line 68, in prepare
    opts = fp.merge(_extract_data(topic_model, corpus, dictionary), kargs)
  File "C:\WinPython27\python-2.7.10.amd64\lib\site-packages\pyLDAvis\gensim.py", line 34, in _extract_data
    topics_df = pd.DataFrame([dict((y,x) for x, y in tuples) for tuples in topics])[vocab]
  File "C:\WinPython27\python-2.7.10.amd64\lib\site-packages\pandas\core\frame.py", line 1791, in __getitem__
    return self._getitem_array(key)
  File "C:\WinPython27\python-2.7.10.amd64\lib\site-packages\pandas\core\frame.py", line 1835, in _getitem_array
    indexer = self.ix._convert_to_indexer(key, axis=1)
  File "C:\WinPython27\python-2.7.10.amd64\lib\site-packages\pandas\core\indexing.py", line 1112, in _convert_to_indexer
    raise KeyError('%s not in index' % objarr[mask])
KeyError: "[u'raining' u'unscientific' u'writings' ..., u'cliff' u'reticence' u'ranks'] not in index"

dpatschke · 2015-10-24T23:35:45Z

The problem I was experiencing was not the same problem you are mentioning. Mine had to do with an object being an "iterable" in Python 3 and a list in Python 2. I honestly have no idea what could be causing your problem, but I would double check that you are, in fact, passing a corpus in as your second parameter to 'prepare'.

From comment above:
id2word = dictionary.load("someDictionary")
corpus = [id2word.doc2bow(doc) for doc in docs]

If you are doing this and still getting the error ... I am sorry ... I will not be of much further help. Good luck :-)!

Grinshpun · 2015-10-25T13:09:57Z

dpatschke, thank you for your reply. The problem you were experiencing was the only reference to something remotely resembling the problem I had, so I tried both compactify() and the change you suggested.
As for the corpus, here is my code, cut and pasted directly from the pyLDAvis sample at http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb#topic=0&lambda=1&term= (sections [19] and [21]). It looks like this:

dictionary = gensim.corpora.Dictionary.load('%s\\my.dict' % directory)
corpus = gensim.corpora.MmCorpus('%s\\myCorpus.mm' % directory)
lda = gensim.models.ldamodel.LdaModel.load('%s\\myModel_def.model' % directory)

pyLDAvis.gensim.prepare(lda, corpus, dictionary)

Grinshpun · 2015-10-25T20:20:33Z

Apologies. I have a bit of a different issue: my corpus is empty to begin with. Granted, this is not the most descriptive error message, but the problem lies with the pyLDAvis user, not the package itself.
Again, thanks everyone for the support.
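Since an empty corpus surfaced here as an opaque KeyError, a quick check before calling prepare may save some time. The helper below is a hypothetical sketch, not part of pyLDAvis or gensim; it assumes the corpus is an iterable of documents, each a list of (token_id, count) pairs as in gensim's bag-of-words format.

```python
def check_corpus_not_empty(corpus):
    """Return (n_docs, n_tokens); raise ValueError if the corpus has
    no documents or its documents contain no tokens at all."""
    n_docs = 0
    n_tokens = 0
    for doc in corpus:
        n_docs += 1
        n_tokens += sum(count for _, count in doc)
    if n_docs == 0:
        raise ValueError("corpus contains no documents")
    if n_tokens == 0:
        raise ValueError("corpus contains documents but no tokens")
    return n_docs, n_tokens


# Usage with a tiny hand-made corpus: two documents, one of them empty.
stats = check_corpus_not_empty([[(0, 1), (2, 3)], []])
print(stats)
```

Note that the check iterates over the whole corpus once, so it assumes the corpus can be iterated more than once (true for a list or a corpus streamed from disk, not for a one-shot generator).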

ibarria0 · 2015-10-28T05:06:21Z

list(dictionary.token2id.keys())

works for me :)

bmabey · 2015-11-02T16:47:53Z

This should be fixed with the 1.3.1 release.

bmabey added the bug label Jun 15, 2015

bmabey closed this as completed Nov 2, 2015

lenhhoxung86 mentioned this issue May 30, 2017

Inconsistency in displaying top terms #93

Open
