Online word2vec #700
Conversation
Commits in this PR:
- recovering lost work
- updating bug in sentences iterator
- vector freeze after each training iteration
- cbow update
- clean up code
- update summarization tutorial image
- resolving merge conflicts (×6)
- update_weights function change
- updating the numpy copy code (×2)
Looks clean, but will need more extensive testing & sanity checking, because it's such a tricky feature. CC @gojomo. @zachmayer how could we test this more thoroughly? What results (accuracy, performance) can be expected if we run word2vec "online" on a larger corpus, such as text8/text9, compared to the existing "single batch" version?
```diff
- def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
+ def scan_vocab(self, sentences, update, progress_per=10000, trim_rule=None):
```
I'd prefer to put a default in here, so that the change is backward compatible (some users call these functions manually in their app; we don't want to break that just because of an optional upgrade).
@rutum, want to chime in on this?
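A sketch of the backward-compatible signature being suggested: `update` gets a default of `False` (the default value is an assumption here, not part of the diff above), so existing callers keep working unchanged.

```python
def scan_vocab(self, sentences, update=False, progress_per=10000, trim_rule=None):
    if update:
        # extend the existing vocabulary with counts from `sentences`
        ...
    else:
        # build the vocabulary from scratch, as before
        ...
```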
@piskvorky I'm primarily interested in using the "online" mode for "two corpus" training: train a word2vec model on a very large dataset, then fine-tune the embeddings on a smaller dataset that is more specific to the task at hand. E.g. Wikipedia for the initial embedding and a medical dictionary for the fine-tuning. Let me think about some specific use cases and get back to you.
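A minimal sketch of that two-corpus workflow under the API this PR proposes (file names are placeholders, and the `train()` call style matches the gensim version of this era):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# 1. Initial training on a large, general corpus.
model = Word2Vec(LineSentence('wikipedia.txt'), size=100, min_count=5)

# 2. Fine-tune on a smaller, domain-specific corpus, adding its new words.
medical = LineSentence('medical.txt')
model.build_vocab(medical, update=True)  # the flag this PR introduces
model.train(medical)
```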
@zachmayer: Adding a reference post on testing here: http://rutumulkar.com/blog/2015/word2vec/
A few thoughts on this: I think a comparison between an online model and a regular model is still missing from the test cases. A quick pseudo-code sketch follows below.
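The pseudo-code originally posted here was not preserved; the following is a hedged reconstruction of the comparison described, using the era's gensim API (`corpus` is a placeholder iterable of token lists):

```python
from gensim.models import Word2Vec

sentences = list(corpus)
half = len(sentences) // 2

# Regular "single batch" model trained on the whole corpus at once.
full_model = Word2Vec(sentences, size=100, min_count=5)

# Online model: first half normally, second half via a vocabulary update.
online_model = Word2Vec(sentences[:half], size=100, min_count=5)
online_model.build_vocab(sentences[half:], update=True)
online_model.train(sentences[half:])

# Compare both models on the standard analogy test set.
full_model.accuracy('questions-words.txt')
online_model.accuracy('questions-words.txt')
```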
Hi @rutum, how about randomly splitting the corpus into two halves, training the online model on one half and updating it with the other, then comparing against a model trained on the full corpus?
@zihaolucky if you test on larger corpora, please include test cases where new words are added. You can create two such corpora from Wikipedia, keeping the same test case: one corpus with "queen" and the other without "queen". Our goal is to see whether the semantics of the new words come out correct.
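A hedged sketch of that held-out-word experiment (`wiki_sentences` is a placeholder iterable of token lists):

```python
from gensim.models import Word2Vec

# Split the corpus by whether a sentence mentions the held-out word.
with_queen, without_queen = [], []
for sentence in wiki_sentences:
    (with_queen if 'queen' in sentence else without_queen).append(sentence)

# Train the base model without "queen", add it online, then inspect its
# nearest neighbours to judge whether its semantics look right.
model = Word2Vec(without_queen, size=100, min_count=5)
model.build_vocab(with_queen, update=True)
model.train(with_queen)
print(model.most_similar('queen'))
```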
Hi @rutum, I was wondering whether the current version of gensim supports online word2vec updating with the skip-gram model and negative sampling. On the website it seems to be supported; however, I read in some thread that it is not yet. Thanks
Ping @gojomo
The current version (0.12.4) allows you to continue supplying new examples to train() on an existing model, but it does not expand the vocabulary: words that weren't seen during the original build_vocab() are simply ignored.
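In code, the released 0.12.4 behaviour described above looks roughly like this (variable names are placeholders):

```python
from gensim.models import Word2Vec

model = Word2Vec(initial_sentences, size=100, min_count=5)

# Further training is allowed, but the vocabulary stays fixed: any word
# in `more_sentences` that is not already known is skipped.
model.train(more_sentences)
```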
Thanks for your fast reply @gojomo. Then, if the current 0.12.4 version does not support vocabulary updating, can I download the version from this PR to do that, or is it not available yet?
@neosyon – Everything in GitHub pull requests/branches is "available", so you certainly "can" download it and try it. And that's welcome, especially if you can test/improve/document it, and help resolve the various open questions and tradeoffs I allude to. But note you'll then be working on an unreleased bit of code that might only become part of the released code with a lot of changes (if ever). So you'd need to be confident in your own ability to understand the rough work-in-progress, and adapt it for your custom needs, relying mainly on the source code itself.
I find this gensim feature really interesting. @gojomo, can you tell me how to obtain the current version of the code? I have never used GitHub, sorry. If I download it, will the word2vec online training option work, or does it still have big bugs?
@mirandanfu I am using this version currently: https://github.com/rutum/gensim/tree/rm_online. model.build_vocab(new_sentences, update=True) works fine, but it really needs some tuning to get good results... You will find a clone/download option at that link.
Fantastic, thanks @Soumyajit. Do you know whether, before updating the vocab and training with new words, it is possible to freeze the model for the previous ones? I would like to train a good model with Wikipedia and afterwards include some domain-specific missing words without modifying the original knowledge learned from Wikipedia.
@mirandanfu – the model's array syn0_lockf holds one multiplier per vocabulary word; setting an entry to 0.0 locks that word's vector against further training, so you could freeze the original words while letting the newly added ones learn.
Thanks @gojomo. I downloaded the code from https://github.com/rutum/gensim/tree/rm_online and updated the function update_weights with the lines below, where lockVectors is a boolean that I propagate from build_vocab, and oldVocabSize is created in that function before updating syn0:
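The pasted code itself was not preserved in this thread; a hedged reconstruction of what the commenter describes inside update_weights might look like this (lockVectors and oldVocabSize are the commenter's own names; syn0_lockf is gensim's per-word training-lock array):

```python
# Mirrors gensim's own numpy import style in word2vec.py.
from numpy import ones, zeros, float32 as REAL

if lockVectors:
    # Freeze all pre-existing vectors; only newly added words may train.
    self.syn0_lockf = zeros(len(self.vocab), dtype=REAL)
    self.syn0_lockf[oldVocabSize:] = 1.0
else:
    # Default behaviour: every vector remains trainable.
    self.syn0_lockf = ones(len(self.vocab), dtype=REAL)
```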
Would that do what I want? In addition, I modified another part that seemed wrong or incomplete. Sorry for posting the code here, but I don't know how to use GitHub.
@mirandanfu I wish you luck, but you're now working with an experimental, unreleased feature branch. Though I've offered feedback, I've never run this branch, and (as above) have recommended that most people retrain with full datasets, because of all the unresolved open questions on how this should work. Only you can judge whether this source code, with or without additional patches of your own, does what your project needs.
Nice feature, I think it might be useful. But isn't it missing an …
Possibly, yes
Hi @zachmayer, I used this branch to train several models sequentially; call them model1, model2, etc. However, when I check the vocab size, it seems that the subsequent models don't update it correctly: the reported vocabulary size stays the same even after training on new data.
@michelleowen -- I think this is almost to be expected, as the vocabulary is only built once. Adding new words to the vocabulary once it has been created might be quite tricky, to say the least. But I agree that updating the word counts would be a nice addition.
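A hedged sketch of the count update being asked for, using the vocab structure of the era's gensim (`new_counts` is a placeholder mapping from word to its frequency in the new data):

```python
# Grow the stored frequency of words that reappear in the new corpus,
# instead of leaving their counts at the original values.
for word, count in new_counts.items():
    if word in model.vocab:
        model.vocab[word].count += count
```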
Another problem to report. I used online learning to train on 12 months of data sequentially. The updates of the embedding vectors from the first 6 months seem reasonable. However, starting from month 7, the updates became wild (especially for frequent words). At month 9, all embedding vectors became NaN, though no error is reported in the output log.
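A quick sanity check of the kind that would catch this divergence early, assuming the era's model.syn0 embedding matrix:

```python
import numpy as np

# Run after each monthly update; training has diverged if this fires.
assert not np.isnan(model.syn0).any(), "embedding matrix contains NaN"
```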
@michelleowen @zachmayer There is more testing of this code by @isohyt in #778
Hi, I've completed the online word2vec development and wrote a tutorial using Wikipedia dumps.
Finally merged in #900
Rebase of rebase of #435