Accessing vector model vocabulary broken in Gensim 3.3 when loading from word2vec format #1882

Open
akutuzov opened this issue Feb 7, 2018 · 11 comments · Fixed by #1884
Labels
bug (Issue described a bug) · difficulty easy (Easy issue: required small fix)

Comments

@akutuzov
Contributor

akutuzov commented Feb 7, 2018

After upgrading to 3.3.0, it is no longer possible to get the model's vocabulary through the model.wv.vocab attribute if the model is loaded from a text or binary word2vec file. However, it still works for models saved in the Gensim native format.
I suppose this is related to the redesign of the vector model implementations in #1777. In any case, it is not good to break compatibility this way without even notifying users.

Steps to Reproduce

import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = gensim.models.KeyedVectors.load_word2vec_format('ANY_MODEL.bin.gz', binary=True)
WORD in model.wv.vocab

Expected Results

True or False, as it is in Gensim 3.2

Actual Results

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'wv'

Versions

Linux-4.13.0-32-generic-x86_64-with-LinuxMint-18.2-sonya
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
NumPy 1.14.0
SciPy 1.0.0
gensim 3.3.0
FAST_VERSION 1
@menshikh-iv
Contributor

menshikh-iv commented Feb 8, 2018

@akutuzov thanks for the report! Sorry about this, we did not intend to break anything (but it happens :( ).

CC: @manneshiva

@menshikh-iv added the bug and difficulty easy labels Feb 8, 2018
@manneshiva
Contributor

Hi @akutuzov,
Thanks for reporting this issue. It will be fixed very soon (a couple of hours from now). I tested your code in gensim 3.2.0 and saw that model is model.wv returns True. So, for the time being, you can use model.vocab instead of model.wv.vocab (and likewise for any other property).
I seem to have missed the self-referential property for KeyedVectors -- https://github.com/RaRe-Technologies/gensim/blob/3.2.0/gensim/models/keyedvectors.py#L422. Not sure about the purpose of this property. Will add it back for backward compatibility.
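
For readers unfamiliar with the pattern, here is a minimal sketch of what a self-referential `wv` property looks like. `KeyedVectorsSketch` is a toy stand-in invented for illustration, not Gensim's actual class; the only thing taken from the thread is that `model is model.wv` evaluates to True in 3.2.0. Presumably the point of the property is that code written against a full model (`model.wv.vocab`) keeps working when it is handed a bare vectors object.

```python
class KeyedVectorsSketch:
    """Toy stand-in for a vectors-only container (not Gensim's real class)."""

    def __init__(self, vocab):
        # Maps each word to some per-word record; plain ints are used here.
        self.vocab = dict(vocab)

    @property
    def wv(self):
        # Self-referential: a bare vectors object answers `.wv` with itself,
        # so `obj.wv.vocab` and `obj.vocab` reach the same dictionary.
        return self


kv = KeyedVectorsSketch({"cat": 0, "dog": 1})
assert kv.wv is kv
assert "cat" in kv.wv.vocab
```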

menshikh-iv pushed a commit that referenced this issue Feb 8, 2018
…1882 (#1884)

* adds test for `wv` property

* adds `wv` property to KeyedVectors class
@akutuzov
Contributor Author

akutuzov commented Feb 8, 2018

@manneshiva thanks!
So, model.wv.vocab is deprecated now, and we should use model.vocab instead, right?

@menshikh-iv
Contributor

@akutuzov exactly

@akutuzov
Contributor Author

akutuzov commented Jun 18, 2018

If model.wv.vocab is deprecated and we should always use model.vocab, why does model.vocab not work for word2vec models saved in the Gensim native format?

model = gensim.models.Word2Vec.load(MODELFILE)
print(len(model.vocab))
AttributeError: 'Word2Vec' object has no attribute 'vocab'
print(len(model.wv.vocab))
237255

I use Gensim 3.4.0 both for training and for loading the models.

The funny thing is that if the same model is saved in word2vec format and loaded via gensim.models.KeyedVectors.load_word2vec_format, then both model.vocab and model.wv.vocab work.
So, is there any recommended way to access the model's vocabulary independent of how the model was loaded?
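
Until there is an official answer, one possible workaround is a small helper that falls back to the object itself when it has no `wv` attribute. This is only a sketch: `get_vocab` is a hypothetical name, not part of Gensim, and `FakeWord2Vec` / `FakeKeyedVectors` are stand-ins that mimic the attribute layout reported in this thread (a full model exposes `wv` but no `vocab`; a bare vectors object exposes `vocab`, and in 3.3.0 had no `wv`).

```python
def get_vocab(model):
    """Return the vocabulary of a full model or of a bare vectors object.

    Hypothetical helper, not part of Gensim.
    """
    # Full models keep their vectors under `.wv`; a bare vectors object
    # either has no `.wv` (3.3.0) or returns itself, so fall back to `model`.
    vectors = getattr(model, "wv", model)
    return vectors.vocab


class FakeKeyedVectors:
    """Stand-in for what load_word2vec_format returns (assumed layout)."""

    def __init__(self, vocab):
        self.vocab = vocab


class FakeWord2Vec:
    """Stand-in for what Word2Vec.load returns (assumed layout)."""

    def __init__(self, vocab):
        self.wv = FakeKeyedVectors(vocab)


vocab = {"cat": 0, "dog": 1}
full_model = FakeWord2Vec(vocab)
bare_vectors = FakeKeyedVectors(vocab)

assert get_vocab(full_model) is vocab
assert get_vocab(bare_vectors) is vocab
```

With real models, the call would be `'word' in get_vocab(model)` regardless of which loader produced `model`, assuming the attribute layout above holds for the installed version.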

@rachhitgarg

What if I want to update a model loaded with gensim.models.KeyedVectors.load_word2vec_format using new sentences?
I tried the following, but it raises an error:

model.build_vocab(more_sentences, update=True)
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'build_vocab'

@piskvorky
Owner

piskvorky commented Aug 29, 2018

@akutuzov Sounds like a (nasty) bug to me. Can you replicate this in 3.5.0?

@menshikh-iv if the bug is still there, should we re-open this issue?

@rachhitgarg see the documentation under https://radimrehurek.com/gensim/models/word2vec.html#usage-examples

@akutuzov
Contributor Author

@piskvorky Yes, nothing has changed in 3.5.0 in this respect. The bug still reproduces: for some weird reason, model.vocab does not work for word2vec models saved in the Gensim native format.

@piskvorky
Owner

piskvorky commented Aug 29, 2018

Thanks @akutuzov . @menshikh-iv I'm re-opening this ticket, this sounds serious to critical. Do we have a unit test for testing load-after-save?

@piskvorky reopened this Aug 29, 2018
@menshikh-iv
Contributor

@rachhitgarg please stop posting this in unrelated issues; I answered you in #1994 (comment)

@menshikh-iv
Contributor

@piskvorky yes, many different ones; just Ctrl+F Word2vec.load in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_word2vec.py (but the case mentioned by @akutuzov is not covered)
