KeyError in gensim.prepare #5

Closed
christofs opened this issue Jun 15, 2015 · 21 comments
@christofs

Hi there, I'm using gensim to do LDA on a collection of novels (using just 40 for testing, I have several hundreds). Building the corpus and dictionary seems to work fine, as does the modeling process itself. I can also inspect the resulting model (topics in documents and words in topics, for example). However, when attempting to use pyLDAvis, I run into a KeyError.

I'm on Linux (Ubuntu 14.04) and using Python 3.4 and the following versions of relevant modules:
pyLDAvis 1.2.0
numpy 1.9.2
gensim 0.11.1-1

This is my code (loading corpus, dictionary and model from previous step):

def gensim_output(modelfile, corpusfile, dictionaryfile): 
    """Displaying gensim topic models"""
    ## Load files from "gensim_modeling"
    corpus = corpora.MmCorpus(corpusfile)
    dictionary = corpora.Dictionary.load(dictionaryfile) # for pyLDAvis
    myldamodel = models.ldamodel.LdaModel.load(modelfile)    

    ## Interactive visualisation
    import pyLDAvis.gensim
    vis = pyLDAvis.gensim.prepare(myldamodel, corpus, dictionary)
    pyLDAvis.display(vis)

This is the output I get:

Traceback (most recent call last):

  File "<ipython-input-79-940daa51d8a9>", line 1, in <module>
    runfile('/home/[PATH]/an5/mygensim.py', wdir='/home/christof/Dropbox/0-Analysen/2015/rp_Sydney/an5')

  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 586, in runfile
    execfile(filename, namespace)

  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 48, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)

  File "/home/[PATH]/an5/mygensim.py", line 84, in <module>
    main("./5_lemmata/*.txt", "gensim_corpus.dict", "gensim_corpus.mm", "gensim_modelfile.gensim")

  File "/home/[PATH]/an5/mygensim.py", line 82, in main
    gensim_output(modelfile, corpusfile, dictionaryfile)

  File "/home/[PATH]/an5/mygensim.py", line 75, in gensim_output
    vis = pyLDAvis.gensim.prepare(myldamodel, corpus, dictionary)

  File "/usr/local/lib/python3.4/dist-packages/pyLDAvis/gensim.py", line 61, in prepare
    return vis_prepare(**_extract_data(topic_model, corpus, dictionary))

  File "/usr/local/lib/python3.4/dist-packages/pyLDAvis/gensim.py", line 24, in _extract_data
    term_freqs = [term_freqs_dict[id] for id in xrange(N)]

  File "/usr/local/lib/python3.4/dist-packages/pyLDAvis/gensim.py", line 24, in <listcomp>
    term_freqs = [term_freqs_dict[id] for id in xrange(N)]

KeyError: 6

Not sure whether this is a bug or bad usage of the module. Any help would be very much appreciated.

@bmabey
Owner

bmabey commented Jun 15, 2015

Thanks for reporting this. The problem, it seems, is that pyLDAvis is assuming a compacted dictionary with a contiguous list of IDs. This will not be the case however in some dictionaries if you have removed tokens and have not called compactify() on it afterwards.

I'll look into removing this assumption made by pyLDAvis. In the meantime I would suggest that you call dictionary.compactify() before training your model. Give that a try and let me know if that helps.
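To make the contiguous-ID assumption concrete, here is a minimal sketch that uses a plain dict as a stand-in for the term-frequency table; it is not pyLDAvis or gensim code. The helper `compact_ids` is hypothetical and only imitates the renumbering idea behind `compactify()`.

```python
def dense_lookup(freqs, n):
    """Mimic the failing line in the traceback: index every ID in range(n)."""
    return [freqs[i] for i in range(n)]


def compact_ids(freqs):
    """Hypothetical helper: renumber the IDs contiguously, keeping their order."""
    return {new: freqs[old] for new, old in enumerate(sorted(freqs))}


# Term frequencies keyed by token ID; ID 2 was filtered out, leaving a gap.
term_freqs_dict = {0: 12, 1: 9, 3: 4}
n_terms = len(term_freqs_dict)

try:
    dense_lookup(term_freqs_dict, n_terms)
    raised = False
except KeyError:
    raised = True  # range(3) asks for ID 2, which no longer exists

# After renumbering, every ID in range(n_terms) is present again.
dense = dense_lookup(compact_ids(term_freqs_dict), n_terms)
```

With a real gensim dictionary the renumbering has to happen before the corpus is built, since the corpus stores the token IDs.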

@bmabey bmabey added the bug label Jun 15, 2015
@christofs
Author

Hi there, thanks for the quick response. Unfortunately, calling dictionary.compactify() does not help. Removing any preprocessing of the dictionary (I did in fact delete some low-frequency tokens) did not help, either. The error shifts to KeyError: 1 or KeyError: 0 but otherwise the behavior stays the same.

I'll investigate if something else goes wrong further upstream. I'm also somewhat unimpressed with the topics I get, so maybe there is some other problem. I will report back if I come across anything that seems relevant.

@huihuifan

I usually delete low-frequency tokens before constructing the dictionary. I have gotten a similar error before when deleting low-frequency tokens from the dictionary manually, and re-compactifying resolved the problem. I'm not sure about your bug, though.

Perhaps further text preprocessing could help your topics?

@bmabey
Owner

bmabey commented Jun 15, 2015

Could you provide me with example code of how you are creating your dictionary?

@christofs
Author

Sure, here you go:

def make_gensim_corpus(inpath,dictionaryfile,corpusfile):
    """Turn collection of text files into corpus for gensim."""
    ## Create list of document texts and list of document idnos.
    all_texts = []
    all_idnos = []
    for file in glob.glob(inpath): 
        with open(file) as infile:
            text = infile.read()
            text = re.split(" ", text)
            all_texts.append(text)
            idno,ext = os.path.splitext(os.path.basename(file))
            all_idnos.append(idno)
    ## Delete tokens which only appear once
    from collections import defaultdict
    frequency = defaultdict(int)
    for text in all_texts:
        for token in text:
            frequency[token] += 1
    all_texts = [[token for token in text if frequency[token] > 1] for text in all_texts]
    ## Build dictionary and corpus
    dictionary = corpora.Dictionary(all_texts)
    dictionary.compactify() # suggested by bmabey
    dictionary.save(dictionaryfile) # stores the dictionary
    dictionary.save_as_text("text_"+dictionaryfile) # stores the dictionary
    corpus = [dictionary.doc2bow(text) for item in all_texts]
    corpora.MmCorpus.serialize(corpusfile, corpus) # stores the corpus
    print("Done building the corpus.")

@christofs
Author

Basically, I have each document in one file and read them from there into a "one document per line" structure. I keep track of file identifiers for later use. For modeling, I load the saved dictionary so I don't have to rebuild it every time I change something in the modeling step.

@christofs
Author

There is something else wrong. With 400 novels, 5000 iterations and 50 passes, I still get the following virtually identical topics.

0   monsieur ami marquis marquise madame homme vicomte les suis femme heure nom mari mère jour père temps rue enfant fois 
1   monsieur marquise madame ami marquis vicomte suis les homme femme mari affaire jour nom rue heure père fois temps enfant 
2   monsieur madame marquise marquis les ami homme vicomte femme suis nom heure temps affaire jour coup mari fois mère enfant 
3   monsieur suis vicomte homme madame marquise ami les marquis femme nom heure enfant temps affaire fois mère mari père étudiant 
4   monsieur madame marquise vicomte homme ami femme suis marquis nom mari heure rue les main enfant affaire jour mère père 
5   monsieur marquise ami marquis homme femme madame suis les vicomte mari nom enfant mère jour rue heure temps affaire étudiant 
6   monsieur ami marquise homme madame les vicomte suis femme marquis nom mari affaire fois mère heure rue œil jour enfant 
7   marquise monsieur homme vicomte madame marquis suis ami les femme nom mère temps rue fils enfant jour mari heure mot 
8   monsieur marquise madame homme marquis ami vicomte suis les heure affaire rue mère femme nom enfant père mari quartier fois 
9   monsieur marquise madame ami les suis marquis vicomte homme heure femme nom mère temps enfant père mari affaire fils mot 
10  monsieur marquise homme marquis ami suis madame femme les vicomte affaire mère temps nom mari œil enfant rue jour heure 
11  monsieur madame vicomte les marquis suis femme homme marquise mari enfant nom affaire rue ami temps jour moment coup étudiant 
12  monsieur madame marquise ami marquis vicomte homme suis les femme nom heure mère enfant mari temps rue affaire père jour 
13  monsieur vicomte marquise madame ami femme suis homme marquis les nom affaire temps heure jour enfant rue mère fils mari 
14  monsieur homme marquis vicomte madame marquise suis ami les femme affaire heure mère enfant mari rue fois temps nom étudiant 
15  monsieur marquise ami les suis madame homme vicomte marquis affaire heure rue mari mère enfant nom femme temps fils père 
16  monsieur ami vicomte madame suis homme marquise marquis les nom femme mari temps coup mère main rue œil heure fois 
17  monsieur madame marquise ami vicomte suis homme les marquis femme affaire enfant nom rue heure fils père mère mari jour 
18  monsieur les marquise vicomte homme madame ami suis femme mari nom enfant marquis rue affaire heure mère fils jour temps 

Seems like I'm doing something wrong upstream, so this whole thing may be unrelated to pyldavis. Sorry about that.
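One possible upstream cause, judging only from the `make_gensim_corpus` code posted above: the comprehension `[dictionary.doc2bow(text) for item in all_texts]` loops with `item` but passes `text`, which is left over from the earlier frequency loop. If that reading is right, every corpus entry would be the bag-of-words of the last file, which could produce near-identical topics. The sketch below reproduces the pattern with a simplified stand-in for `doc2bow` (an assumption, not gensim's implementation).

```python
from collections import Counter


def doc2bow(tokens, token2id):
    """Simplified stand-in for Dictionary.doc2bow: sorted (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def build_corpora(all_texts):
    """Return (buggy, fixed) corpora built from the same tokenised texts."""
    token2id = {}
    for text in all_texts:  # leaves `text` bound to the last document
        for token in text:
            token2id.setdefault(token, len(token2id))
    # As posted: iterates with `item` but converts the stale `text` each time.
    buggy = [doc2bow(text, token2id) for item in all_texts]
    # Likely intent: convert the document currently being iterated.
    fixed = [doc2bow(text, token2id) for text in all_texts]
    return buggy, fixed


texts = [["monsieur", "madame"], ["marquis", "ami"], ["vicomte", "ami", "ami"]]
buggy_corpus, fixed_corpus = build_corpora(texts)
ids_in_buggy = {token_id for doc in buggy_corpus for token_id, _ in doc}
```

In this toy run the buggy corpus only ever mentions the IDs of the last document, so a frequency table built from it would lack low IDs such as 0. That may also be why the reported error moved to `KeyError: 0` or `KeyError: 1` after compactifying, though this is a guess.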

@bmabey
Owner

bmabey commented Jun 15, 2015

Yeah, I certainly can't speak to that problem. The gensim mailing list would be a better place to ask about that issue.

@prateekmehta

Hey, I am also getting a KeyError when I use an external dictionary to convert docs to BoW.

What I am doing is the following:
id2word = dictionary.load("someDictionary")
corpus = [id2word.doc2bow(doc) for doc in docs]
ldamodel =model.LdaModel(corpus,id2word,num_topics...)

I get a model, which I can explore.

But vis = pyLDAvis.gensim.prepare(ldamodel,corpus,id2word)

produces KeyErrors, and I can't understand why.

I even tried to compactify() id2word before transforming the corpus and training the model, but that doesn't help either.

Kindly look into it.

@bmabey
Owner

bmabey commented Jun 23, 2015

Please provide me with the code and corpus that created the dictionary. Without a way to reproduce this bug locally it will be hard for me to fix this.


@bmabey
Owner

bmabey commented Jul 24, 2015

I just merged in #17 which may address this bug. If someone running into this error can clone and run setup.py on master I'd appreciate the feedback. Otherwise I'll assume this has fixed it and I'll close this in a week or so.

@dpatschke
Contributor

I 'pip installed' pyLDAvis today and ran into a key error problem as well.

The key error could be related to a specific Python 3.x issue.

The line where the problem occurred for me is here (in gensim.py):
topics_df = pd.DataFrame([dict((y,x) for x, y in tuples) for tuples in topics])[vocab]

But the real culprit is here:
vocab = dictionary.token2id.keys()

From my understanding, in Python 3 the 'dict_keys' object created by the above call is a view (an iterable), whereas in Python 2 the object created is a list. A pandas DataFrame appears to be unable to select columns using such a view, so a key error occurs.

Changing the vocab assignment to:
vocab = list(dictionary.token2id.keys())
should work for both Python 2 and Python 3. (Worked for me in Python 3 at least)
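The difference can be checked without pandas. A minimal sketch, using a plain dict as a stand-in for `dictionary.token2id` (the sample tokens are made up):

```python
token2id = {"monsieur": 0, "madame": 1, "marquis": 2}

keys_view = token2id.keys()    # Python 3: a dict_keys view, not a list
vocab = list(token2id.keys())  # a real list on both Python 2 and Python 3

view_is_list = isinstance(keys_view, list)
vocab_is_list = isinstance(vocab, list)

# A view cannot be indexed by position, one way list-based code can break.
try:
    keys_view[0]
    view_indexable = True
except TypeError:
    view_indexable = False
```

Wrapping the call in `list()` costs one extra copy of the keys and behaves the same on both Python versions.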

@bmabey
Owner

bmabey commented Sep 17, 2015

@dpatschke Thanks. So, did you make this change locally and have it work for you in Python 3? If so, do you want to submit a PR for it?

@dpatschke
Contributor

@bmabey I did change it locally and got it to work in Python 3 successfully. Let me work on the pull request for it.

@dpatschke
Contributor

@bmabey I had a problem cloning with the 'lfs' error (similar to another issue previously listed). If you care to make the fix yourself, feel free. Otherwise, give me a little time to try and get everything set up on my end so as not to delete necessary files

@Grinshpun

dpatschke suggested on Sept 17 that modifying a line in gensim.py would help with the KeyError problem. I'm running a Windows 10, WinPython 2.7 environment and installed pyLDAvis today. I tried the change: no dice. I really do need help with this, since I'm terribly new to Python and terribly late with my work, so any moral and technical support will be greatly appreciated.

Traceback (most recent call last):
  File "C:\Users\Vadim\Documents\PhD\Software\My_pyLDAvis\vis.py", line 16, in <module>
    run_vis()
  File "C:\Users\Vadim\Documents\PhD\Software\My_pyLDAvis\vis.py", line 13, in run_vis
    pyLDAvis.gensim.prepare(lda, corpus, dictionary)
  File "C:\WinPython27\python-2.7.10.amd64\lib\site-packages\pyLDAvis\gensim.py", line 68, in prepare
    opts = fp.merge(_extract_data(topic_model, corpus, dictionary), kargs)
  File "C:\WinPython27\python-2.7.10.amd64\lib\site-packages\pyLDAvis\gensim.py", line 34, in _extract_data
    topics_df = pd.DataFrame([dict((y,x) for x, y in tuples) for tuples in topics])[vocab]
  File "C:\WinPython27\python-2.7.10.amd64\lib\site-packages\pandas\core\frame.py", line 1791, in __getitem__
    return self._getitem_array(key)
  File "C:\WinPython27\python-2.7.10.amd64\lib\site-packages\pandas\core\frame.py", line 1835, in _getitem_array
    indexer = self.ix._convert_to_indexer(key, axis=1)
  File "C:\WinPython27\python-2.7.10.amd64\lib\site-packages\pandas\core\indexing.py", line 1112, in _convert_to_indexer
    raise KeyError('%s not in index' % objarr[mask])
KeyError: "[u'raining' u'unscientific' u'writings' ..., u'cliff' u'reticence' u'ranks'] not in index"

@dpatschke
Contributor

The problem I was experiencing was not the same problem you are mentioning. Mine had to do with an object being an "iterable" in Python 3 and a list in Python 2. I honestly have no idea what could be causing your problem, but I would double check that you are, in fact, passing a corpus in as your second parameter to 'prepare'.

From comment above:
id2word = dictionary.load("someDictionary")
corpus = [id2word.doc2bow(doc) for doc in docs]

If you are doing this and still getting the error ... I am sorry ... I will not be of much further help. Good luck :-)!

@Grinshpun

dpatschke, thank you for your reply. The problem you were experiencing was the only reference to something remotely resembling the problem I had, so I tried both the compactify() call and the change you suggested.
As for the corpus, here is my code, cut and pasted directly from the pyLDAvis sample at http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb#topic=0&lambda=1&term= (sections In [19] and In [21]). It looks like this:

dictionary = gensim.corpora.Dictionary.load('%s\\my.dict' % directory)
corpus = gensim.corpora.MmCorpus('%s\\myCorpus.mm' % directory)
lda = gensim.models.ldamodel.LdaModel.load('%s\\myModel_def.model' % directory)

pyLDAvis.gensim.prepare(lda, corpus, dictionary)

@Grinshpun

Apologies. I have a somewhat different issue: my corpus is empty to begin with. Granted, this is not the most descriptive error message, but the problem appears to lie with the pyLDAvis user, not the package itself.
Again, thanks everyone for the support.
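For anyone hitting the same thing, a cheap sanity check before calling `prepare` might look like the sketch below. The helper `corpus_is_empty` is hypothetical and not part of pyLDAvis or gensim; it only assumes that iterating the corpus yields one list of (token_id, count) pairs per document.

```python
def corpus_is_empty(corpus):
    """Hypothetical helper: True if the corpus has no documents or only empty ones."""
    return not any(len(doc) > 0 for doc in corpus)


# Toy corpora in bag-of-words form: one list of (token_id, count) pairs per document.
empty_corpus = []
blank_docs = [[], []]
real_corpus = [[(0, 2), (3, 1)], [(1, 1)]]

checks = [corpus_is_empty(c) for c in (empty_corpus, blank_docs, real_corpus)]
```

Raising a `ValueError` when the check returns `True` would give a clearer message than the KeyError raised deep inside pandas.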

@ibarria0
Contributor

list(dictionary.token2id.keys())

works for me :)

@bmabey bmabey closed this as completed Nov 2, 2015
@bmabey
Owner

bmabey commented Nov 2, 2015

This should be fixed with the 1.3.1 release.
