Faster analogies #340
Computing the accuracy of the model on analogies should now be faster by at least an order of magnitude.
Also uses a bigger default batch size.
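For readers following along, here is a minimal sketch of what batched analogy evaluation can look like. It is not the code from this PR: the function name `batched_analogy_accuracy`, the index-based question format, and the use of the common "b - a + c" cosine scoring are all assumptions made for illustration.

```python
import numpy as np


def batched_analogy_accuracy(vectors, questions, batch_size=1000):
    """Count correct/incorrect analogy answers, one matrix product per batch.

    Hypothetical helper, not Gensim's API.
    vectors:   (V, d) float array, one row per vocabulary word.
    questions: (N, 4) integer array of word indices (a, b, c, expected).
    """
    # Unit-normalise rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    questions = np.asarray(questions)
    correct = 0
    for start in range(0, len(questions), batch_size):
        batch = questions[start:start + batch_size]
        a, b, c, expected = batch.T
        # "a is to b as c is to ?" is scored against the target b - a + c.
        targets = unit[b] - unit[a] + unit[c]
        scores = targets @ unit.T  # shape: (len(batch), V)
        rows = np.arange(len(batch))
        # The three question words are not allowed as answers.
        for idx in (a, b, c):
            scores[rows, idx] = -np.inf
        predicted = scores.argmax(axis=1)
        correct += int((predicted == expected).sum())
    return correct, len(questions) - correct
```

The speedup in a scheme like this comes from replacing one similarity lookup per question with a single matrix product per batch; a larger batch size trades memory for fewer Python-level iterations.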
Nice! I got a 13x speedup in my test. But, in Python 3, the line (You should be able to use
Great! Are we good to merge? The logic seems pretty complex -- can you add some basic sanity checks, in unit tests?
I trained a very small model and evaluated the accuracy with the current version of Gensim. Are there any licensing issues if I use the word2vec analogy questions?
The test fails on Python 3.x. Is there a simple way to load a model pickled with Python 2.x?
@sebastien-j this seems to be an encoding problem. You can specify the encoding with pickle.load: https://docs.python.org/3/library/pickle.html#pickle.load
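A minimal sketch of that suggestion, assuming the file was written by Python 2 and contains 8-bit `str` data. The helper name `load_legacy_pickle` is made up for illustration, and `"latin1"` is an assumed default: it is the value the Python documentation says is required for NumPy arrays pickled by Python 2, but another encoding may suit other data.

```python
import pickle


def load_legacy_pickle(fileobj, encoding="latin1"):
    """Unpickle Python 2 data from a binary file object under Python 3.

    Hypothetical helper. Only use it on files you trust, since
    unpickling can execute arbitrary code.
    """
    # `encoding` tells pickle how to decode Python 2 8-bit str objects;
    # the Python 3 default is "ASCII", which fails on non-ASCII bytes.
    return pickle.load(fileobj, encoding=encoding)
```

Typical use would be `with open(path, "rb") as f: model = load_legacy_pickle(f)`, where `path` points at the Python 2 pickle.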
I like the idea of a small loadable model for sanity testing, including across versions, but a pickled version has problems, both because of this sort of 2-to-3 encoding issue, and because it's somewhat opaque to easy review yet might trigger arbitrary code on unpickling. Two ideas: (1) use the C word2vec format; or (2) train up a model on one of the already included corpora. (As long as the corpus gets at least one analogy question right, and the same before/after changes, it seems an OK sanity check.) It looks like the test/test_data/lee_background.cor is enough to get 2 questions right:
@sebastien-j some analogy speedup has been implemented in #458. It doesn't include batching though, and it doesn't include your additions around various similarity functions. Can you revive this PR (rebase on develop, add tests)?
@sebastien-j Should we plan this for the February release?
"""

i2w = [pair[0] for pair in sorted(iteritems(self.vocab),
                                  key=lambda item: -item[1].count)]
Wouldn't it be clearer if you put reverse=True: sorted(iteritems(self.vocab), key=lambda item: item[1].count, reverse=True)?
It's more verbose but the intention is clear at least
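A small sketch comparing the two spellings on toy data. The `Vocab` namedtuple and the dictionary below are stand-ins invented for this example, and plain `dict.items()` replaces `iteritems` to keep it self-contained. Since `sorted` stays stable when `reverse=True` is given, both forms should order tied counts the same way.

```python
from collections import namedtuple

# Hypothetical stand-in for the vocabulary entries; only `count` matters here.
Vocab = namedtuple("Vocab", ["count"])

vocab = {
    "the": Vocab(count=10),
    "cat": Vocab(count=3),
    "sat": Vocab(count=3),
    "on": Vocab(count=7),
}

# Form used in the diff: sort ascending on the negated count.
by_negation = [pair[0] for pair in sorted(vocab.items(),
                                          key=lambda item: -item[1].count)]

# Suggested form: sort on the count itself and ask for descending order.
by_reverse = [pair[0] for pair in sorted(vocab.items(),
                                         key=lambda item: item[1].count,
                                         reverse=True)]
```

The `reverse=True` form also works for keys that cannot be negated, such as strings, which is one more argument for it beyond readability.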
Ping @sebastien-j
Closing because this PR is abandoned.
Speed up analogy-making, generally by an order of magnitude or more.
Some minor issues:
section['correct'] and section['incorrect'] only store the number of good/bad answers instead of the questions.
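To illustrate the consequence of that change, here is a sketch of how accuracy could still be derived when only counts are kept. The keys `'correct'` and `'incorrect'` come from the commit message; the helper name `section_accuracy` and the example values are assumptions.

```python
def section_accuracy(section):
    """Return the fraction of correct answers for one section.

    Hypothetical helper: assumes section['correct'] and
    section['incorrect'] are integer counts, as described above.
    """
    total = section['correct'] + section['incorrect']
    # Guard against sections where no question could be evaluated.
    return section['correct'] / total if total else 0.0
```

With counts only, per-section accuracy is still available, but code that listed the individual right or wrong questions would no longer work.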