online word2vec #435
Conversation
Here is the accompanying blog post about how to use Online Word2Vec: http://rutumulkar.com/blog/2015/word2vec/
I see you consolidated update_vocab into build_vocab with another parameter. I think that makes sense to keep the code concise, but I liked your original description. For people new to gensim like myself, it'd be nice to have an update_vocab that wraps build_vocab(update=True). Great work!
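As an illustration, the wrapper being asked for could be as thin as this (a hypothetical sketch, not code from the PR; it assumes the build_vocab(..., update=True) signature introduced here, and update_vocab is not part of the actual API):

def update_vocab(model, sentences):
    # hypothetical convenience wrapper around the PR's build_vocab(update=True)
    model.build_vocab(sentences, update=True)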
Any ideas why this branch is failing on Python 3.3 only?
@robhawkins Sure! But let's wait for the PR process to see what others recommend.
The failure is unrelated; @gojomo is working on fixing that.
Btw, why six commits for resolving conflicts? That's scary. What's going on there? Can you at least squash all these commits into one? We don't want such a verbose git history.
@@ -1444,7 +1510,6 @@ def __init__(self, init_fn, job_fn):
    def put(self, job):
        self.job_fn(job, self.inits)
There should be two empty lines between classes; why did you delete one?
resolving merge conflicts
The failing Travis test is different this time: it looks like the online learning caused a segfault on Python 3.4. That is a serious problem.
Yeah! Looks like Python 3.4 is doing something differently. Any ideas?
All is well on Travis in the online Word2Vec land! Review away, @piskvorky, @gojomo and others.
I re-ran Travis and it still segfaults during the online training test. It looks like there is some subtle bug that causes a non-deterministic segfault.
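For anyone chasing a crash like this, one standard tactic (an aside, not something used in this thread) is Python's built-in faulthandler module, which prints the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV:

import faulthandler
faulthandler.enable()  # dump a traceback to stderr on SIGSEGV and friends

# or run the failing test under it from the command line
# (the exact test module path is illustrative):
#   python -X faulthandler -m unittest gensim.test.test_word2vec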
I got some time to debug this, and it is a non-deterministic error. The …
In the cython work I did, intermittent … I would focus on earlier steps, especially the initial expand/update-weights. (I see one potential problem at L967 – will add a line comment there.)
# randomize the remaining words
for i in xrange(len(self.vocab), len(newsyn0)):
    # construct deterministic seed from word AND seed argument
    self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
Seems this should start at the old vocab length, and the newly-seeded vectors should be landing in newsyn0? (Doubt this could cause segfault/nan issues, as out-of-bounds access should be caught in this pure Python code, but the true cause of those errors might be something similar elsewhere...)
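Read that way, the loop would look something like the following (a sketch of the suggested fix, not code from the PR; old_vocab_len is a hypothetical name for the vocabulary size before the update):

# randomize only the rows appended for newly added words
for i in xrange(old_vocab_len, len(newsyn0)):
    # construct deterministic seed from word AND seed argument
    newsyn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))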
@gojomo The error is non-deterministic, which is why I think there was some …
If the issue is corruption from earlier bad writes, whether that corruption is fatal or harmless can be affected by the (effectively) arbitrary locations chosen by earlier mallocs, yes. So that could explain the indeterminism. But if that's the case, I don't yet see any way a deepcopy can fix the real problem. All indexed accesses to …

If the deepcopy is seemingly resolving the problem (or just making it much less frequent), it may be by simply hiding it. The big reallocation might guarantee the array is further from other arrays, so out-of-bounds writes are no longer damaging other in-use memory. The tactic I mentioned the other day, to try the cython code with bounds-checking back on (a directive atop the file), might help catch a real corrupting write when it happens, rather than when later calculations 'step' on it again. If you're not set up to modify/recompile the cython parts, I'll have a chance to try that in the next day or so...
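For context, the "directive atop the file" is Cython's compiler directive comment; flipping boundscheck back on turns a silent out-of-bounds write into an immediate IndexError at the offending line. A generic illustration (not a quote of gensim's actual .pyx header):

# cython: boundscheck=True   # re-enabled while debugging (normally False for speed)
# cython: wraparound=True
# These directive comments must sit at the very top of the .pyx file.
# After changing them, rebuild the extension so they take effect:
#   python setup.py build_ext --inplace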
@rutum Are you still working on this? Or should this be closed?
This feature looks good. Do you guys plan to merge it into the dev or main branch soon?
Hi, I am very interested in this feature. What would be needed to get this or a new pull request accepted? Thanks!
Hi, I also wanted to know if you were planning on merging this soon! :)
This PR is not mergeable yet: …
@danintheory the most important one is fixing the segfault. I think starting from a current …
FYI: I rebased this PR and fixed merge conflicts here: #615. I suggest we close this PR and move the discussion over there.
Continued in #615.
For posterity, this has been merged to develop in #900.
Usage:
from gensim.models import Word2Vec

model = Word2Vec()  # sg and hs left at their default values
model.build_vocab(sentences)   # sentences: an iterable of tokenized texts
model.train(sentences)
model.save("base_model")

# later, fold new text into the same model
model.build_vocab(new_sentences, update=True)
model.train(new_sentences)
model.save("updated_model")
Then you can compare the two models to see whether the new vocabulary is learned the way it is supposed to be.
I tried an experiment with learning a model without "queen", and adding it in the subsequent set of sentences. The updated model learned "queen" as being similar to "king", "duke" etc. So that was a huge success. I would love to hear of any other ideas you might have to test this.
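A minimal version of that check might look like the following (a sketch against the pre-1.0 gensim API used above; the exact neighbours depend on the training data):

from gensim.models import Word2Vec

base = Word2Vec.load("base_model")
updated = Word2Vec.load("updated_model")

print("queen" in base.vocab)      # expected: False (never seen by the base model)
print("queen" in updated.vocab)   # expected: True after the online update
print(updated.most_similar("queen", topn=5))  # should surface "king", "duke", ...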
Accompanying blog post: http://rutumulkar.com/blog/2015/word2vec/