
online word2vec #435

Closed
wants to merge 15 commits into from
Conversation


@rutum rutum commented Aug 17, 2015

Usage:

```python
model = Word2Vec()  # sg and hs are the default parameters
model.build_vocab(sentences)
model.train(sentences)
model.save("base_model")

model.build_vocab(new_sentences, update=True)
model.train(new_sentences)
model.save("updated_model")
```

Then you can compare the 2 models to see whether the new vocabulary is learning the way it is supposed to.

I tried an experiment with learning a model without "queen", and adding it in the subsequent set of sentences. The updated model learned "queen" as being similar to "king", "duke" etc. So that was a huge success. I would love to hear of any other ideas you might have to test this.
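One way to quantify such a comparison, sketched in plain NumPy (the vectors below are illustrative values, not actual output from either model):

```python
import numpy as np

# Minimal cosine-similarity check, independent of gensim: values near
# 1.0 mean the updated model placed the new word ("queen") close to a
# related word it already knew ("king").
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king = np.array([0.9, 0.1, 0.2])    # illustrative vectors only
queen = np.array([0.85, 0.15, 0.25])
assert cosine(king, queen) > 0.98   # "queen" landed near "king"
```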

Accompanying blogpost: http://rutumulkar.com/blog/2015/word2vec/

@piskvorky piskvorky changed the title from Rm online to online word2vec Aug 19, 2015
@piskvorky piskvorky mentioned this pull request Aug 19, 2015
@rutum
Author

rutum commented Aug 21, 2015

Here is the accompanying blog post about how to use Online Word2Vec:
http://rutumulkar.com/blog/2015/word2vec/

@robhawkins

I see you consolidated update_vocab into build_vocab with another parameter. I think that makes sense to keep code concise but I liked your original description. For people new to gensim like myself, I think it'd be nice to have an update_vocab that wraps build_vocab(update=true). Great work!
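The wrapper suggested here might look something like this (a sketch, not part of the PR; the `update_vocab` name is the hypothetical one proposed above):

```python
# Hypothetical convenience wrapper around build_vocab(update=True),
# exposing the update_vocab name for readers new to the API.
def update_vocab(model, sentences):
    """Extend an existing Word2Vec model's vocabulary in place."""
    model.build_vocab(sentences, update=True)
```

After calling it, `model.train(sentences)` would continue training exactly as in the usage snippet at the top of this PR.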

@rutum
Author

rutum commented Aug 24, 2015

Any ideas why this branch is failing on Python 3.3 only?

@rutum
Author

rutum commented Aug 24, 2015

@robhawkins Sure! But let's wait for the PR process to see what others recommend.

@piskvorky
Owner

The fail is unrelated; @gojomo is working on fixing that.

@piskvorky
Owner

Btw, why six commits for resolving conflicts? That's scary. What's going on there?

Can you at least squash all these commits into one? We don't want such verbose git history.

```diff
@@ -1444,7 +1510,6 @@ def __init__(self, init_fn, job_fn):
     def put(self, job):
         self.job_fn(job, self.inits)
```

Owner

There should be two empty lines between classes; why was one deleted?

piskvorky and others added 2 commits August 23, 2015 22:29

resolving merge conflicts (the same message repeated across six conflict-resolution commits)
@piskvorky
Owner

The failing Travis test is different this time: it looks like the online learning caused a segfault on Python3.4. That is a serious problem.

@rutum
Author

rutum commented Aug 24, 2015

Yeah! Looks like Python 3.4 is doing something differently. Any ideas?

@rutum
Author

rutum commented Aug 24, 2015

All is well on Travis in the online Word2Vec land! Review away @piskvorky, @gojomo and others

@piskvorky
Owner

I re-ran Travis and it still segfaults during the online training test. It looks like there is some subtle bug that causes a non-deterministic segfault.

@rutum
Author

rutum commented Aug 26, 2015

I got some time to debug this, and it is a non-deterministic error. The syn0 array ends up with nan values when training after new vocabulary is added:

```
word2vec.py:174: RuntimeWarning: overflow encountered in exp
  fa = 1.0 / (1.0 + exp(-dot(l1, l2a.T)))  # propagate hidden -> output
```

So clearly, something was going on with the way the numpy arrays are referenced. It looks like I wrote some code that made a shallow copy of the NP arrays instead of a deep copy. I updated and tested it, and have not hit the segfault yet.
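For readers following along, the shallow-versus-deep distinction being debated can be shown with plain NumPy (a minimal illustration, not the PR's actual code):

```python
import numpy as np

# A plain slice of a NumPy array is a *view*: writes through either
# name touch the same buffer. copy() allocates an independent buffer.
syn0 = np.zeros((2, 3))
view = syn0[:]            # shallow: shares memory with syn0
view[0, 0] = 7.0
assert syn0[0, 0] == 7.0  # the write is visible through syn0

indep = syn0.copy()       # deep: separate allocation
indep[0, 0] = -1.0
assert syn0[0, 0] == 7.0  # syn0 is untouched
```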

@gojomo
Collaborator

gojomo commented Aug 26, 2015

In the cython work I did, intermittent nan-related crashes of this type were sometimes due to earlier mis-addressed writes, which corrupted other arrays.

If deepcopy has really prevented the errors, I'm concerned it may be covering them up rather than fixing the real error. (Perhaps it's coercing nan to 0?) The pre-deepcopy property reassignment should have worked fine, and avoids both a large extra allocation and a large memory copy – so I'd revert to that, so the problem keeps triggering until it's definitively fixed.

I would focus on earlier steps, especially the initial expand/update-weights. (I see one potential problem L967 – will add line comment there.)

```python
# randomize the remaining words
for i in xrange(len(self.vocab), len(newsyn0)):
    # construct deterministic seed from word AND seed argument
    self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
```
Collaborator

Seems this should start at the old vocab length, and the newly-seeded vectors should be landing in the newsyn0? (Doubt this could cause segfaults/nan-issues as out-of-bounds access should be caught in this pure Python code, but true cause of those errors might be something similar elsewhere...)
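A sketch of the loop as this comment suggests it should read (the function, its arguments, and `seed_fn` are illustrative; `newsyn0` mirrors the PR's naming):

```python
import numpy as np

# Expand the embedding matrix for new vocabulary: keep the already
# trained rows, and randomize only the rows for newly added words,
# writing into newsyn0 (which later replaces self.syn0).
def expand_syn0(old_syn0, new_vocab_size, seed_fn):
    dim = old_syn0.shape[1]
    newsyn0 = np.empty((new_vocab_size, dim), dtype=old_syn0.dtype)
    newsyn0[: len(old_syn0)] = old_syn0             # preserve trained vectors
    for i in range(len(old_syn0), new_vocab_size):  # start at the OLD length
        newsyn0[i] = seed_fn(i)                     # seed only the new rows
    return newsyn0
```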

@rutum
Author

rutum commented Aug 26, 2015

@gojomo The error is non-deterministic, which is why I think there was some malloc issue that was hard to reproduce. If I do not do a deep copy, then an index in syn0 will still reference an index in newsyn0, but I think Python 3 handles numpy references and numpy copies differently than 2.6 and 2.7. Anyway, this is what I think the issue was. Let me know if you can reproduce this.

@gojomo
Collaborator

gojomo commented Aug 26, 2015

If the issue is corruption from earlier bad writes, whether that corruption is fatal or harmless can be affected by the (effectively) arbitrary locations chosen by earlier mallocs, yes. So that could explain the indeterminism. But if that's the case, I don't yet see any way a deepcopy can fix the real problem.

All indexed-accesses to syn0, eg syn0[i], will always reach into the current value of syn0, so there's no risk of misdirected accesses from doing self.syn0 = newsyn0. I'm pretty confident that's the right assignment to make, for speed and memory efficiency.

If the deepcopy is seemingly resolving the problem (or just making it much less frequent), it may be by simply hiding it: the big reallocation might guarantee the array lands further from other arrays, so out-of-bounds writes no longer damage other live memory.

The tactic I mentioned the other day, to try the cython code with bounds-checking back on (a directive atop the file), might help catch a real corrupting write when it happens, rather than when later calculations 'step' on it again. If you're not set up to modify/recompile the cython parts, I'll have a chance to try that in the next day or so...
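For reference, re-enabling bounds checks in a Cython module is a matter of directives at the very top of the `.pyx` file (the filename here is an assumption about which module is involved, assuming a standard Cython build):

```cython
# At the top of e.g. word2vec_inner.pyx:
# cython: boundscheck=True
# cython: wraparound=True
```

The extension then needs to be rebuilt (e.g. `python setup.py build_ext --inplace`) for the checks to take effect.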

@tmylk
Contributor

tmylk commented Jan 10, 2016

@rutum Are you still working on this? Or should this be closed?
Thanks

@skillachie

This feature looks good. Do you guys plan to merge it into the main dev branch soon?

@danintheory

Hi,

I am very interested in this feature. What would be needed to get this or a new pull request accepted?

Thanks!
Dan

@si74

si74 commented Feb 11, 2016

Hi I also wanted to know if you were planning on merging this soon! :)

@piskvorky
Owner

This PR is not mergeable yet:

  • segfaults
  • contains changes in unrelated files (symlinks)
  • merge conflicts against the current develop branch
  • needs a deeper review to confirm it works -- see code comments by @gojomo in this PR

@danintheory the most important one is fixing the segfault. I think starting from a current develop checkout and re-applying @rutum's relevant changes may be easier than resolving the conflicts, given the commit history.

@zachmayer

FYI: I rebased this PR and fixed merge conflicts here: #615.

I suggest we close this PR, and move the discussion over there.

@piskvorky
Owner

Continued in #615 .

@piskvorky piskvorky closed this Feb 23, 2016
@zachmayer zachmayer mentioned this pull request May 18, 2016
@adarshaj

adarshaj commented Jan 4, 2017

For posterity, this has been merged to develop in #900.
