Faster analogies #340

sebastien-j · 2015-05-14T06:09:26Z

Speed up analogy-making, generally by an order of magnitude or more.

Some minor issues:

If you want to use 3CosMul or another objective, you need to call the function differently.
section['correct'] and section['incorrect'] store only the counts of right and wrong answers, not the questions themselves.
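
For readers unfamiliar with the two objectives, here is a minimal pure-Python sketch of 3CosAdd and 3CosMul for an analogy a : a* :: b : ?. It is not the code from this PR: the `solve_analogy` helper, the toy vectors, and the `eps` value are all illustrative assumptions, and a real implementation would score the whole vocabulary with matrix operations instead of a Python loop.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, a_star, b, objective="add", eps=1e-3):
    """Return the word that best completes a : a_star :: b : ?.

    Hypothetical helper for illustration only, not gensim API.
    """
    best_word, best_score = None, -math.inf
    for word, vec in vectors.items():
        if word in (a, a_star, b):
            continue  # the question words themselves are not valid answers
        sim_a = cosine(vec, vectors[a])
        sim_a_star = cosine(vec, vectors[a_star])
        sim_b = cosine(vec, vectors[b])
        if objective == "add":
            # 3CosAdd: additive combination of the three similarities
            score = sim_a_star - sim_a + sim_b
        elif objective == "mul":
            # 3CosMul: shift cosines into [0, 1], then multiply and divide
            pos_a, pos_a_star, pos_b = (
                (s + 1) / 2 for s in (sim_a, sim_a_star, sim_b)
            )
            score = pos_a_star * pos_b / (pos_a + eps)
        else:
            raise ValueError("unknown objective: %r" % objective)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Toy 3-dimensional vectors (made up): [male, female, royal]
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [1.0, 1.0, -1.0],
}

answer_add = solve_analogy(toy_vectors, "man", "woman", "king", objective="add")
answer_mul = solve_analogy(toy_vectors, "man", "woman", "king", objective="mul")
print(answer_add, answer_mul)
```

The two objectives differ only in how the three similarities are combined, which is why switching between them can be reduced to a choice of scoring function.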

Computing the accuracy of the model on analogies should be faster by 1+ order of magnitude.

Also bigger default batch size

gojomo · 2015-05-21T04:41:29Z

Nice! I got a 13x speedup in my try.

But, in Python3, the line num_batches = num_questions/batchsize + bool(num_questions%batchsize) gives a float which breaks the following xrange(num_batches). Changing that line's division operator to // to force integer-division seemed enough to fix, and should work in Python2 as well.

(You should be able to use from __future__ import division if you want to reproduce the issue in Python2.)
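
A small sketch of the fix described above, written for Python 3 (so `range` stands in for `xrange`). The function names `count_batches` and `iter_batches` are hypothetical, chosen only for this illustration; the arithmetic is the expression quoted from the PR with `/` replaced by `//`.

```python
def count_batches(num_questions, batchsize):
    """Number of batches needed, rounding up, always as an int."""
    # Floor division keeps the result an int on both Python 2 and 3;
    # bool(...) adds 1 when there is a partial final batch.
    return num_questions // batchsize + bool(num_questions % batchsize)


def iter_batches(questions, batchsize):
    """Yield consecutive slices of at most `batchsize` questions."""
    for i in range(count_batches(len(questions), batchsize)):
        yield questions[i * batchsize:(i + 1) * batchsize]


# With true division the same expression yields a float, which range() rejects.
float_result = 10 / 3 + bool(10 % 3)
int_result = count_batches(10, 3)
batches = list(iter_batches(list(range(10)), 3))
```

Here `int_result` is an integer that can be passed to `range`, while `float_result` is a float and would raise a `TypeError` there.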

piskvorky · 2015-05-21T18:45:55Z

Great!

Are we good to merge? The logic seems pretty complex -- can you add some basic sanity checks, in unit tests?

sebastien-j · 2015-05-25T01:13:51Z

I trained a very small model and evaluated the accuracy with the current version of Gensim.
I made a test verifying that we still obtain the same results.

Are there any licensing issues if I use the word2vec analogy questions?

sebastien-j · 2015-05-25T01:28:46Z

The test fails on python 3.x. Is there a simple way to load a model pickled with python 2.x?

cscorley · 2015-05-25T01:49:41Z

@sebastien-j seems to be an encoding problem. You can specify the encoding with pickle.load: https://docs.python.org/3/library/pickle.html#pickle.load
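
As a rough illustration of that suggestion: in Python 3, `pickle.load` accepts an `encoding` argument that controls how 8-bit strings written by Python 2 are decoded. The helper name `load_py2_pickle` and the choice of `latin1` as default are assumptions made for this sketch; whether `latin1` or `bytes` is right depends on what the pickled model actually contains.

```python
import pickle


def load_py2_pickle(path, encoding="latin1"):
    """Load a pickle written by Python 2 from Python 3.

    Hypothetical helper. `encoding` is forwarded to pickle.load and
    decides how Python 2 `str` objects are decoded ("latin1" maps every
    byte to a character; "bytes" keeps them as bytes objects).
    """
    with open(path, "rb") as handle:
        return pickle.load(handle, encoding=encoding)
```

The default `encoding` in Python 3 is ASCII, so any non-ASCII byte in a Python 2 string raises `UnicodeDecodeError` unless another encoding is given. Only unpickle files from sources you trust.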

gojomo · 2015-05-25T02:12:21Z

I like the idea of a small loadable model for sanity testing, including across versions, but a pickled version has problems, both for this sort of 2-to-3 encoding issue, and because it's somewhat opaque to easy review yet might trigger arbitrary code on unpickling.

Two ideas: (1) use the C word2vec format; or (2) train up a model on one of the already included corpuses. (As long as the corpus gets at least one analogy question right, and the same before/after changes, it seems an OK sanity check.)

It looks like the test/test_data/lee_background.cor is enough to get 2 questions right:

from gensim.models.word2vec import LineSentence, Word2Vec

sentences = LineSentence('tests/test_data/lee_background.cor')
model = Word2Vec(sentences, min_count=1)
model.accuracy('tests/test_data/questions-words.txt')  # gets 2/1258 right for me

piskvorky · 2015-09-24T01:16:47Z

@sebastien-j some analogy speed up has been implemented in #458.

It doesn't include batching though, and it doesn't include your additions around various similarity functions.

Can you revive this PR (rebase on develop, add tests)?

tmylk · 2016-01-24T09:45:39Z

@sebastien-j Should we plan this for the February release?

bloody76 · 2016-05-14T17:56:52Z

gensim/models/word2vec.py

+        """
+
+        i2w = [pair[0] for pair in sorted(iteritems(self.vocab),
+                  key=lambda item: -item[1].count)]


Wouldn't it be clearer if you put reverse=True: sorted(iteritems(self.vocab), key=lambda item: item[1].count, reverse=True) ?

It's more verbose but the intention is clear at least
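
A quick check that the two spellings give the same order, including for words with equal counts. The `VocabEntry` tuple and the toy `vocab` dict are stand-ins invented for this sketch (the real `self.vocab` values only need a `count` attribute here), and `dict.items()` replaces `iteritems` for Python 3.

```python
from collections import namedtuple

# Hypothetical stand-in for the vocabulary entries, which expose `.count`.
VocabEntry = namedtuple("VocabEntry", ["count"])

vocab = {
    "the": VocabEntry(50),
    "cat": VocabEntry(7),
    "sat": VocabEntry(7),  # tie with "cat" on purpose
    "mat": VocabEntry(3),
}

# Spelling used in the diff: negate the count to sort descending.
by_negated = [
    word for word, _ in sorted(vocab.items(), key=lambda item: -item[1].count)
]

# Suggested spelling: sort on the count and ask for reverse order.
by_reverse = [
    word
    for word, _ in sorted(
        vocab.items(), key=lambda item: item[1].count, reverse=True
    )
]
```

Both lists come out identical because `sorted` stays stable with `reverse=True`, so tied entries keep their original relative order in either form.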

menshikh-iv · 2017-05-11T10:13:19Z

Ping @sebastien-j

menshikh-iv · 2017-06-08T07:53:42Z

Closing because this PR has been abandoned.

sebastien-j added 6 commits May 13, 2015 23:56

Faster analogies

9bf15a8

Computing the accuracy of the model on analogies should be faster by 1+ order of magnitude.

Import log, tanh. And verify if syn1, syn1neg exist

f0cc8d8

Fix oversight

28254a6

Keep memory usage low with init_sims(replace=True)

81d12de

Remove possible division by zero error

def7e68

Use 'mul' instead of 'log' (a bit faster)

6e398c6

Also bigger default batch size

Python 3 compatibility fix

3199bb6

Add test for word2vec analogies

1722114

sebastien-j mentioned this pull request May 28, 2015

Sped up most_similar and accuracy #350

Closed

gojomo mentioned this pull request Aug 20, 2015

[doc2vec] train new doc tags with old words vocab #430

Closed

bloody76 reviewed May 14, 2016

tmylk added the "difficulty hard" label (Hard issue: required deep gensim understanding & high python/cython skills) Oct 4, 2016

menshikh-iv closed this Jun 8, 2017
