Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster analogies #340

Closed
wants to merge 8 commits into from
Closed

Conversation

sebastien-j
Copy link
Contributor

Speed up analogy-making, generally by an order of magnitude or more.

Some minor issues:

  • If you want to use 3CosMul or some other objective, you need to call the function differently.
  • section['correct'] and section['incorrect'] only store the number of good/bad answers instead of the questions.

@gojomo
Copy link
Collaborator

gojomo commented May 21, 2015

Nice! I got a 13x speedup in my try.

But, in Python3, the line num_batches = num_questions/batchsize + bool(num_questions%batchsize) gives a float which breaks the following xrange(num_batches). Changing that line's division operator to // to force integer-division seemed enough to fix, and should work in Python2 as well.

(You should be able to use from __future__ import division if you want to reproduce the issue in Python2.)

@piskvorky
Copy link
Owner

Great!

Are we good to merge? The logic seems pretty complex -- can you add some basic sanity checks, in unit tests?

@sebastien-j
Copy link
Contributor Author

I trained a very small model and evaluated the accuracy with the current version of Gensim.
I made a test verifying that we still obtain the same results.

Are there any licensing issues if I use the word2vec analogy questions?

@sebastien-j
Copy link
Contributor Author

The test fails on python 3.x. Is there a simple way to load a model pickled with python 2.x?

@cscorley
Copy link
Contributor

@sebastien-j seems to be an encoding problem. You can specify the encoding with pickle.load: https://docs.python.org/3/library/pickle.html#pickle.load

@gojomo
Copy link
Collaborator

gojomo commented May 25, 2015

I like the idea of a small loadable model for sanity testing, including across versions, but a pickled version has problems, both for this sort of 2-to-3 encoding issue, and because it's somewhat opaque to easy review yet might trigger arbitrary code on unpickling.

Two ideas: (1) use the C word2vec format; or (2) train up a model on one of the already included corpuses. (As long as the corpus gets at least one analogy question right, and the same before/after changes, it seems an OK sanity check.)

It looks like the test/test_data/lee_background.cor is enough to get 2 questions right:

sentences = LineSentence('tests/test_data/lee_background.cor')
model = Word2Vec(sentences, min_count=1)
model.accuracy('tests/test_data/questions-words.txt')  # gets 2/1258 right for me

@piskvorky
Copy link
Owner

@sebastien-j some analogy speed up has been implemented in #458.

It doesn't including batching though, and it doesn't include your additions around various similarity functions.

Can you revive this PR (rebase on develop, add tests?)

@tmylk
Copy link
Contributor

tmylk commented Jan 24, 2016

@sebastien-j Should we plan this for the February release?

"""

i2w = [pair[0] for pair in sorted(iteritems(self.vocab),
key=lambda item: -item[1].count)]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be clearer if you put reverse=True: sorted(iteritems(self.vocab), key=lambda item: item[1].count, reverse=True) ?

It's more verbose but the intention is clear at least

@tmylk tmylk added the difficulty hard Hard issue: required deep gensim understanding & high python/cython skills label Oct 4, 2016
@menshikh-iv
Copy link
Contributor

Ping @sebastien-j

@menshikh-iv
Copy link
Contributor

Close because it is abandoned

@menshikh-iv menshikh-iv closed this Jun 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty hard Hard issue: required deep gensim understanding & high python/cython skills
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants