Generate Deprecated exception when using Word2Vec.load_word2vec_format #1165
Restored branch to be able to leave line-specific comments.
NOTE: document vectors are not loaded/saved with .load/save_word2vec_format(). Use .save()/.load() instead.
If you're finished training a model (=no more updates, only querying), you can do

>>> model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True):
I believe this will also break inference, so comment should mention that too.
Isn't that what the `keep_inference=True` is for?
Inference is preserved. It is tested in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_doc2vec.py#L319
I see, though this is still kind of odd. As called in this prominent example, this method hardly gets rid of anything, just the relatively tiny `doctag_syn0_lockf`. Someone who just needs that tiny benefit could be coached to execute `del model.docvecs.doctag_syn0_lockf`. (I fear that here, and to some extent on Word2Vec too, this method is attractive to novices but likely to cause headaches for them and then support/maintenance issues down the road.)

.. [1] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf
.. [2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
.. [3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality.
In Proceedings of NIPS, 2013.
.. [blog] Optimizing word2vec in gensim, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/
.. [tutorial] Doc2vec in gensim tutorial, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/
Wrong link.

The word vectors can also be instantiated from an existing file on disk in the word2vec C format as a KeyedVectors instance::

NOTE: It is impossible to continue training the vectors loaded from the C format because the binary tree is missing.
Not just the binary tree (which is only used in `hs` mode), but the hidden-weights and vocabulary-frequency information are missing.
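To make the point above concrete, here is a minimal sketch of a reader for the plain-text word2vec C format, assuming the usual layout of a `vocab_size vector_size` header followed by one `word v1 v2 ...` line per word. The function name `read_word2vec_text` is hypothetical and this is not gensim's loader; it only shows that each entry carries a word and its input vector, with no slot for hidden-layer weights or word counts.

```python
import io


def read_word2vec_text(stream):
    """Parse the text word2vec C format from an open text stream (illustrative only)."""
    # Header line: "<vocab_size> <vector_size>"
    vocab_size, vector_size = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != vector_size:
            raise ValueError("vector for %r has %d values, expected %d"
                             % (word, len(values), vector_size))
        # Only the word and its input vector are available here:
        # no output-layer weights and no frequency counts.
        vectors[word] = values
    return vectors


# Tiny in-memory example standing in for a file on disk.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
```

Since nothing beyond `vectors` can be recovered from such a file, the docstring NOTE could name all three missing pieces rather than only the binary tree.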
If you're finished training a model (=no more updates, only querying), you can do

>>> model.init_sims(replace=True)
>>> model.delete_temporary_training_data(replace_word_vectors_with_normalized=True)
With KeyedVectors now the recommended form for read-only access, perhaps the proper recommendation for "if you're sure you're done training" is to discard the Word2Vec model instance entirely, and just retain the KeyedVectors.
Of course. This is some weird mix, "the worst of both worlds", complicating the API and confusing people.

to trim unneeded model memory = use (much) less RAM.

Note that there is a :mod:`gensim.models.phrases` module which lets you automatically
detect phrases longer than one word. Using phrases, you can learn a word2vec model
where "words" are actually multiword expressions, such as `new_york_times` or `financial_crisis`:

>>> bigram_transformer = gensim.models.Phrases(sentences)
>>> bigram_transformer = gensim.models.Phraser(gensim.models.Phrases(sentences))
Personally I might not recommend all users prefer `Phraser` without understanding the extra steps it requires, because of the extra time of the reduction-pass, and the fact it throws out some info (in `Phrases`) that was expensive to collect and allows experimentation with different count/threshold values.
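A toy illustration of that trade-off, not gensim's implementation: the helper names (`collect_counts`, `score`, `freeze`) are hypothetical, and the score is a simplified Mikolov-style formula assumed for the sketch. Keeping the full counts lets you re-derive the phrase set under new `min_count`/`threshold` values, whereas a frozen set of accepted bigrams cannot be re-thresholded.

```python
from collections import Counter


def collect_counts(sentences):
    """The expensive pass: count every unigram and adjacent bigram."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def score(pair, unigrams, bigrams, min_count):
    """Simplified phrase score (an assumption for this sketch)."""
    a, b = pair
    return (bigrams[pair] - min_count) * len(unigrams) / (unigrams[a] * unigrams[b])


def freeze(unigrams, bigrams, min_count, threshold):
    """The reduction pass: keep only bigrams that clear the threshold."""
    return {pair for pair in bigrams
            if score(pair, unigrams, bigrams, min_count) > threshold}


sentences = [
    ["new", "york", "times"],
    ["new", "york", "city"],
    ["the", "new", "car"],
]
unigrams, bigrams = collect_counts(sentences)

# With the counts still around, trying other settings is cheap.
strict = freeze(unigrams, bigrams, min_count=1, threshold=0.5)
loose = freeze(unigrams, bigrams, min_count=0, threshold=0.5)
```

Once only `strict` is kept, the bigrams that `loose` would have accepted cannot be recovered without recounting the corpus.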
@@ -1272,6 +1299,17 @@ def _load_specials(self, *args, **kwargs):
        self.wv = wv
        super(Word2Vec, self)._load_specials(*args, **kwargs)

    @classmethod
    def load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
This now makes the `call_on_class_only` reference in `__init__()` superfluous/wrong.
Users have been thrown off by the Word2Vec.load_word2vec_format method disappearing without an obvious alternative. An Exception is now thrown directing to KeyedVectors.
Also docstrings and ipynbs updated with KeyedVectors changes.
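For readers skimming the description, the pattern can be sketched roughly as follows. The two classes are toy stand-ins rather than the real gensim classes, and the exception type and message text are assumptions made for the sketch: the old entry point stays on the model class but fails loudly and names its replacement.

```python
class KeyedVectors:
    """Toy stand-in for the read-only vector container (hypothetical)."""

    def __init__(self, vectors):
        self.vectors = vectors

    @classmethod
    def load_word2vec_format(cls, fname):
        # A real loader would parse `fname`; this stub returns an empty container.
        return cls({})


class Word2Vec:
    """Toy stand-in for the trainable model (hypothetical)."""

    @classmethod
    def load_word2vec_format(cls, *args, **kwargs):
        # Keep the old name so callers get a pointer to the new API
        # instead of a bare AttributeError.
        raise DeprecationWarning(
            "Deprecated. Use KeyedVectors.load_word2vec_format instead."
        )


try:
    Word2Vec.load_word2vec_format("vectors.txt")
except DeprecationWarning as err:
    message = str(err)

# The supported path after the change.
kv = KeyedVectors.load_word2vec_format("vectors.txt")
```

`DeprecationWarning` is a subclass of `Exception`, so it can be raised directly as well as issued through the `warnings` module.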