
online word2vec #435

Closed
wants to merge 15 commits into from
Conversation


@rutum rutum commented Aug 17, 2015

Usage:

```python
model = Word2Vec()  # sg and hs are the default parameters
model.build_vocab(sentences)
model.train(sentences)
model.save("base_model")

model.build_vocab(new_sentences, update=True)
model.train(new_sentences)
model.save("updated_model")
```

Then you can compare the 2 models to see whether the new vocabulary is learning the way it is supposed to.

I tried an experiment with learning a model without "queen", and adding it in the subsequent set of sentences. The updated model learned "queen" as being similar to "king", "duke" etc. So that was a huge success. I would love to hear of any other ideas you might have to test this.
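One way to quantify such a comparison, sketched in plain NumPy (the vectors below are illustrative values, not actual output from either model):

```python
import numpy as np

# Minimal cosine-similarity check, independent of gensim: values near
# 1.0 mean the updated model placed the new word ("queen") close to a
# related word it already knew ("king").
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king = np.array([0.9, 0.1, 0.2])    # illustrative vectors only
queen = np.array([0.85, 0.15, 0.25])
assert cosine(king, queen) > 0.98   # "queen" landed near "king"
```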

Accompanying blogpost: http://rutumulkar.com/blog/2015/word2vec/

@piskvorky piskvorky changed the title from Rm online to online word2vec Aug 19, 2015
@piskvorky piskvorky mentioned this pull request Aug 19, 2015
@rutum
Author

rutum commented Aug 21, 2015

Here is the accompanying blog post about how to use Online Word2Vec:
http://rutumulkar.com/blog/2015/word2vec/

@robhawkins

I see you consolidated update_vocab into build_vocab with another parameter. I think that makes sense to keep code concise but I liked your original description. For people new to gensim like myself, I think it'd be nice to have an update_vocab that wraps build_vocab(update=true). Great work!
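The wrapper suggested here might look something like this (a sketch, not part of the PR; the `update_vocab` name is the hypothetical one proposed above):

```python
# Hypothetical convenience wrapper around build_vocab(update=True),
# exposing the update_vocab name for readers new to the API.
def update_vocab(model, sentences):
    """Extend an existing Word2Vec model's vocabulary in place."""
    model.build_vocab(sentences, update=True)
```

After calling it, `model.train(sentences)` would continue training exactly as in the usage snippet at the top of this PR.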

@rutum
Author

rutum commented Aug 24, 2015

Any ideas why this branch is failing on Python 3.3 only?

@rutum
Author

rutum commented Aug 24, 2015

@robhawkins Sure! But let's wait for the PR process to see what others recommend.

@piskvorky
Owner

The fail is unrelated; @gojomo is working on fixing that.

@piskvorky
Owner

Btw, why six commits for resolving conflicts? That's scary. What's going on there?

Can you at least squash all these commits into one? We don't want such verbose git history.

```diff
@@ -1444,7 +1510,6 @@ def __init__(self, init_fn, job_fn):
     def put(self, job):
         self.job_fn(job, self.inits)
```

Owner

There should be two empty lines between classes; why was one deleted?

piskvorky and others added 2 commits August 23, 2015 22:29

resolving merge conflicts (the same message repeated across six conflict-resolution commits)
@piskvorky
Owner

The failing Travis test is different this time: it looks like the online learning caused a segfault on Python3.4. That is a serious problem.

@rutum
Author

rutum commented Aug 24, 2015

Yeah! Looks like Python 3.4 is doing something differently. Any ideas?

@rutum
Author

rutum commented Aug 24, 2015

All is well on Travis in the online Word2Vec land! Review away @piskvorky, @gojomo and others

@piskvorky
Owner

I re-ran Travis and it still segfaults during the online training test. It looks like there is some subtle bug that causes a non-deterministic segfault.

@rutum
Author

rutum commented Aug 26, 2015

I got some time to debug this, and it is a non-deterministic error. The syn0 array ends up with nan values when training after new vocabulary is added:

```
word2vec.py:174: RuntimeWarning: overflow encountered in exp
  fa = 1.0 / (1.0 + exp(-dot(l1, l2a.T)))  # propagate hidden -> output
```

So clearly, something was going on with the way the numpy arrays are referenced. It looks like I wrote some code that made a shallow copy of the NP arrays instead of a deep copy. I updated and tested it, and have not hit the segfault yet.
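For readers following along, the shallow-versus-deep distinction being debated can be shown with plain NumPy (a minimal illustration, not the PR's actual code):

```python
import numpy as np

# A plain slice of a NumPy array is a *view*: writes through either
# name touch the same buffer. copy() allocates an independent buffer.
syn0 = np.zeros((2, 3))
view = syn0[:]            # shallow: shares memory with syn0
view[0, 0] = 7.0
assert syn0[0, 0] == 7.0  # the write is visible through syn0

indep = syn0.copy()       # deep: separate allocation
indep[0, 0] = -1.0
assert syn0[0, 0] == 7.0  # syn0 is untouched
```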

@gojomo
Collaborator

gojomo commented Aug 26, 2015

In the cython work I did, intermittent nan-related crashes of this type were sometimes due to earlier mis-addressed writes, which corrupted other arrays.

If deepcopy has really prevented the errors, I'm concerned it may be covering them up rather than fixing the real error. (Perhaps it's coercing nan to 0?) The pre-deepcopy property reassignment should have worked fine, and avoids both a large extra allocation and a large memory copy – so I'd revert to that, so the problem keeps triggering until it's definitively fixed.

I would focus on earlier steps, especially the initial expand/update-weights. (I see one potential problem L967 – will add line comment there.)

```python
# randomize the remaining words
for i in xrange(len(self.vocab), len(newsyn0)):
    # construct deterministic seed from word AND seed argument
    self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
```
Collaborator

Seems this should start at the old vocab length, and the newly-seeded vectors should be landing in the newsyn0? (Doubt this could cause segfaults/nan-issues as out-of-bounds access should be caught in this pure Python code, but true cause of those errors might be something similar elsewhere...)
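A sketch of the loop as this comment suggests it should read (the function, its arguments, and `seed_fn` are illustrative; `newsyn0` mirrors the PR's naming):

```python
import numpy as np

# Expand the embedding matrix for new vocabulary: keep the already
# trained rows, and randomize only the rows for newly added words,
# writing into newsyn0 (which later replaces self.syn0).
def expand_syn0(old_syn0, new_vocab_size, seed_fn):
    dim = old_syn0.shape[1]
    newsyn0 = np.empty((new_vocab_size, dim), dtype=old_syn0.dtype)
    newsyn0[: len(old_syn0)] = old_syn0             # preserve trained vectors
    for i in range(len(old_syn0), new_vocab_size):  # start at the OLD length
        newsyn0[i] = seed_fn(i)                     # seed only the new rows
    return newsyn0
```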

@rutum
Author

rutum commented Aug 26, 2015

@gojomo The error is non-deterministic, which is why I think there was some malloc issue that was hard to reproduce. If I do not do a deep copy, then an index in syn0 will still reference an index in newsyn0, but I think Python 3 handles numpy references and numpy copies differently than 2.6 and 2.7. Anyway, this is what I think the issue was. Let me know if you can reproduce this.

@gojomo
Collaborator

gojomo commented Aug 26, 2015

If the issue is corruption from earlier bad writes, whether that corruption is fatal or harmless can be affected by the (effectively) arbitrary locations chosen by earlier mallocs, yes. So that could explain the indeterminism. But if that's the case, I don't yet see any way a deepcopy can fix the real problem.

All indexed-accesses to syn0, eg syn0[i], will always reach into the current value of syn0, so there's no risk of misdirected accesses from doing self.syn0 = newsyn0. I'm pretty confident that's the right assignment to make, for speed and memory efficiency.

If the deepcopy is seemingly resolving the problem (or just making it much less frequent), it may be by simply hiding it: the big reallocation might guarantee the array lands further from other arrays, so out-of-bounds writes no longer damage other live memory.

The tactic I mentioned the other day, to try the cython code with bounds-checking back on (a directive atop the file), might help catch a real corrupting write when it happens, rather than when later calculations 'step' on it again. If you're not set up to modify/recompile the cython parts, I'll have a chance to try that in the next day or so...
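For reference, re-enabling bounds checks in a Cython module is a matter of directives at the very top of the `.pyx` file (the filename here is an assumption about which module is involved, assuming a standard Cython build):

```cython
# At the top of e.g. word2vec_inner.pyx:
# cython: boundscheck=True
# cython: wraparound=True
```

The extension then needs to be rebuilt (e.g. `python setup.py build_ext --inplace`) for the checks to take effect.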

@tmylk
Contributor

tmylk commented Jan 10, 2016

@rutum Are you still working on this? Or should this be closed?
Thanks

@skillachie

This feature looks good. Do you guys plan to merge it into the main dev branch soon?

@danintheory

Hi,

I am very interested in this feature. What would be needed to get this or a new pull request accepted?

Thanks!
Dan

@si74

si74 commented Feb 11, 2016

Hi I also wanted to know if you were planning on merging this soon! :)

@piskvorky
Owner

This PR is not mergeable yet:

  • segfaults
  • contains changes in unrelated files (symlinks)
  • merge conflicts against the current develop branch
  • needs a deeper review to confirm it works -- see code comments by @gojomo in this PR

@danintheory the most important one is fixing the segfault. I think starting from a current develop checkout and re-applying @rutum's relevant changes may be easier than resolving the conflicts, given the commit history.

@zachmayer

FYI: I rebased this PR and fixed merge conflicts here: #615.

I suggest we close this PR, and move the discussion over there.

@piskvorky
Owner

Continued in #615 .

@piskvorky piskvorky closed this Feb 23, 2016
@zachmayer zachmayer mentioned this pull request May 18, 2016
@adarshaj

adarshaj commented Jan 4, 2017

For posterity, this has been merged to develop in #900.
