
Online word2vec #700

Closed
wants to merge 3 commits into from

Conversation

@zachmayer commented May 17, 2016

Rebase of rebase of #435

Rutu Mulkar-Mehta and others added 3 commits May 17, 2016 10:50

  • recovering lost work
  • updating bug in sentences iterator
  • vector freeze after each training iteration
  • cbow update
  • clean up code
  • update summarization tutorial image
  • resolving merge conflicts (×6)
  • update_weights function change
  • updating the numpy copy code (×2)
@zachmayer changed the title from "Online w2 v" to "Online word2vec" May 17, 2016
@piskvorky (Owner) commented May 18, 2016

Looks clean, but will need more extensive testing & sanity checking, because it's such a tricky feature. CC @gojomo .

@zachmayer how could we test this more thoroughly? What results (accuracy, performance) can be expected if we run word2vec "online" on a larger corpus, such as text8/text9, compared to the existing "single batch" version?

@piskvorky added the "feature" label May 18, 2016

    -    def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
    +    def scan_vocab(self, sentences, update, progress_per=10000, trim_rule=None):

@piskvorky (Owner) commented on this diff:

I'd prefer to put a default in here, so that the change is backward compatible (some users call these functions manually in their apps; we don't want to break that just because of an optional upgrade).
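
For illustration only, that would presumably look like the following signature; the `False` default is my assumption, chosen so existing callers keep the old behaviour:

    def scan_vocab(self, sentences, update=False, progress_per=10000, trim_rule=None):
        # update=False: callers that pass only `sentences` behave exactly as before;
        # update=True: merge the newly scanned words into the existing vocabulary (this PR's online mode).
        ...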

@zachmayer (Author)

@rutum, want to chime in on:

> how could we test this more thoroughly? What results (accuracy, performance) can be expected if we run word2vec "online" on a larger corpus, such as text8/text9, compared to the existing "single batch" version?

@zachmayer (Author)

@piskvorky I'm primarily interested in using the "online" mode for "two-corpus" training: e.g. train a word2vec model on a very large dataset, then fine-tune the embeddings on a smaller dataset that is more specific to the task at hand, say Wikipedia for the initial embeddings and a medical dictionary for the fine-tuning.

Let me think about some specific use cases and get back to you.
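
For illustration, a minimal sketch of that two-corpus workflow, assuming the `build_vocab(..., update=True)` API added by this PR; the corpus paths and parameters are placeholders:

    from gensim.models import word2vec

    # Stage 1: general-purpose embeddings from a large corpus (e.g. a Wikipedia dump).
    base_corpus = word2vec.LineSentence('wikipedia.txt')     # placeholder path
    model = word2vec.Word2Vec(base_corpus, size=200, min_count=5, workers=4)

    # Stage 2: grow the vocabulary with domain text and fine-tune on the smaller corpus.
    domain_corpus = word2vec.LineSentence('medical.txt')     # placeholder path
    model.build_vocab(domain_corpus, update=True)  # the new online-vocab flag from this PR
    model.train(domain_corpus)                     # continue training (0.12.x-style train call)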

@rutum commented May 18, 2016

@zachmayer : Adding reference post on testing here: http://rutumulkar.com/blog/2015/word2vec/

@mohataher commented May 18, 2016

A few thoughts on this: I think a comparison between an online model and a regular (batch) model is still missing from the test cases.

  • The online model could be built like the one in the testOnlineLearning() test case, and the offline model by merging the sentences.

A quick pseudo-code sketch:

    def testOnlineAndOfflineLearning(self):
        """Test that the algorithm is able to create the same model
        as the offline one trained on the same sentences"""
        online_model = word2vec.Word2Vec(sentences, min_count=0, sorted_vocab=0)
        online_model.build_vocab(new_sentences, update=True)
        online_model.train(new_sentences)
        merged_sentences = sentences + new_sentences
        offline_model = word2vec.Word2Vec(merged_sentences, min_count=0, sorted_vocab=0)
        self.assertEqual(online_model, offline_model)
  • Adding more to the same test case, we could check the similarity accuracy for each word in both the online and offline models (see the sketch after this list).
  • Test the model after it's saved to file and loaded again. Would that be helpful?
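
For the similarity-accuracy bullet above, a rough sketch of one way to compare the two models (not part of this PR; the word list and topn are arbitrary choices):

    def compare_neighbors(online_model, offline_model, words, topn=10):
        """Report how much the nearest-neighbour lists of the two models overlap."""
        for word in words:
            online_nn = set(w for w, _ in online_model.most_similar(word, topn=topn))
            offline_nn = set(w for w, _ in offline_model.most_similar(word, topn=topn))
            overlap = len(online_nn & offline_nn) / float(topn)
            print("%-15s neighbour overlap: %.0f%%" % (word, 100 * overlap))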

@zihaolucky

Hi @rutum
According to the post, the accuracy drops when we train the model in an incremental way.

| Corpus                        | Accuracy |
| ----------------------------- | -------- |
| text8-rest                    | 22.6%    |
| text8-rest then text8-queen   | 20.5%    |
| text8-all                     | 23.6%    |
| text8-first                   |          |
| text8-first then text8-second |          |

How about randomly separating text8-all into two parts and testing again? The word distribution in text8-queen is highly skewed, plus there are the update issues you mentioned in the earlier discussion. I believe the model would be more accurate if the corpus were not that small.
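
One way to set up that experiment, as a rough sketch (the 50/50 split and min_count are arbitrary, and the 'questions-words.txt' path for the standard analogy file is an assumption):

    from gensim.models import word2vec

    sentences = list(word2vec.Text8Corpus('text8'))          # text8 as a list of token lists
    half = len(sentences) // 2
    first_half, second_half = sentences[:half], sentences[half:]

    # Online: train on the first half, then grow the vocab and continue on the second half.
    online = word2vec.Word2Vec(first_half, min_count=5)
    online.build_vocab(second_half, update=True)
    online.train(second_half)

    # Offline baseline: a single model over the whole corpus.
    offline = word2vec.Word2Vec(sentences, min_count=5)

    # Compare both on the standard analogy questions.
    online.accuracy('questions-words.txt')
    offline.accuracy('questions-words.txt')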

@rutum commented May 31, 2016

@zihaolucky if you test on larger corpora, please include test cases where new words are added. You can create two such corpora from Wikipedia, keeping the same test set: one corpus with "queen" and the other without it. Our goal is to see whether the semantics of the new words are correct.

@neosyon commented Jun 9, 2016

Hi @rutum

I was wondering whether the current version of gensim supports online word2vec updating with the skip-gram model and negative sampling. On the website it seems to be supported; however, I read in some threads that it is not yet.

Thanks

@tmylk (Contributor) commented Jun 9, 2016

Ping @gojomo

@gojomo (Collaborator) commented Jun 10, 2016

The current version (0.12.4) allows you to continue supplying train() examples – but whether that's meaningful/beneficial is going to depend on a lot of things, including how different the new examples are from earlier training, how you decide to manage the alpha training rate, how much you iterate over the newer examples, etc. There aren't any standing recommendations for these decisions: there's no obviously right/best practice. (The most defensible course of action will always be: retrain with all old and new examples.) The current version doesn't provide any way to expand the vocabulary past the original build_vocab (scan/scale/finalize) step – one of the things this PR covers. But again, there are many open/vague questions about how to balance new examples/words against those learned from the original training.
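
To make the alpha question concrete, a hedged sketch of continuing train() on the 0.12.x API while lowering the learning rate by hand; the halving schedule is purely illustrative, not a recommendation:

    # `model` is assumed to be already trained on the original corpus (no vocab expansion here).
    alpha, floor = 0.01, 0.0001
    for epoch in range(3):
        model.alpha = alpha
        model.min_alpha = alpha          # keep alpha constant within this pass
        model.train(new_sentences)       # 0.12.x-style call: just the sentences
        alpha = max(floor, alpha / 2.0)  # decay between passes; schedule is arbitrary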

@neosyon commented Jun 10, 2016

Thanks for your fast reply @gojomo

Then, if the current 0.12.4 version does not support vocabulary updating, can I download the version from this PR to do that, or is it not available yet?

@gojomo (Collaborator) commented Jun 10, 2016

@neosyon – Everything in github pull-requests/branches is "available", so you certainly "can" download it & try it. And that's welcome, especially if you can test/improve/document it, and help resolve the various open questions and tradeoffs I allude to.

But note you'll then be working on an unreleased bit of code that might only become part of the released code with a lot of changes (if ever). So you'd need to be confident in your own ability to understand the rough work-in-progress, and adapt it for your custom needs, relying mainly on the source code itself.

@ghost commented Jun 11, 2016

I find this gensim feature really interesting. @gojomo, can you tell me how to obtain the current version of the code? I have never used GitHub, sorry. If I download it, will the word2vec online training option work, or does it still have big bugs?

@Soumyajit

@mirandanfu I am using this version currently: https://github.com/rutum/gensim/tree/rm_online

model.build_vocab(new_sentences, update=True) works fine but it really needs some tuning to get good results....

You will find a clone / download option in that link.
You can also follow this tutorial: http://rutumulkar.com/blog/2015/word2vec/

@ghost commented Jun 11, 2016

Fantastic, thanks @Soumyajit

Do you know whether, before updating the vocab and training on the new words, it is possible to freeze the model for the previous ones? I would like to train a good model on Wikipedia and afterwards add some domain-specific missing words without modifying the original knowledge learned from Wikipedia.

@gojomo (Collaborator) commented Jun 11, 2016

@mirandanfu – the model's array syn0_lockf is a 'lock factor' which is applied to any attempted changes to existing word-vectors. The values are usually all 1.0 – so training causes 100% of the usual adjustment to a word-vector. If you set some slots to 0.0, training changes will be completely suppressed (for those slots). It's an experimental feature but may achieve what you want, if you manually set the right slots to 0.0 or 1.0 as appropriate for your goals.
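
A minimal sketch of that idea on this branch, assuming the new words are appended after the original vocabulary and that the branch re-allocates syn0_lockf to the new size during the vocab update:

    old_vocab_size = len(model.syn0)               # vocabulary size before the update

    model.build_vocab(new_sentences, update=True)  # grow the vocab with the new corpus

    # Freeze every pre-existing word (lock factor 0.0) and leave the newly
    # appended slots fully trainable (lock factor 1.0).
    model.syn0_lockf[:old_vocab_size] = 0.0
    model.syn0_lockf[old_vocab_size:] = 1.0

    model.train(new_sentences)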

@ghost commented Jun 12, 2016

Thanks @gojomo

I downloaded the code from https://github.com/rutum/gensim/tree/rm_online and updated the function update_weights with these lines:

    if not lockVectors:
        self.syn0_lockf = ones(len(self.vocab), dtype=REAL)   # do not suppress learning for already learned words
    else:
        self.syn0_lockf = zeros(len(self.vocab), dtype=REAL)  # zeros suppress learning of previous vectors
        for i in xrange(oldVocabSize, len(self.vocab)):
            self.syn0_lockf[i] = 1.0

where lockVectors is a boolean that I propagate from build_vocab, and oldVocabSize is set in that function before syn0 is updated:

    oldVocabSize = len(self.syn0)
    self.syn0 = deepcopy(newsyn0)

Would that do what I want?

In addition, I modified this part, which was wrong or incomplete:

    # randomize the remaining words
    for i in xrange(len(self.syn0), len(newsyn0)):
        # construct deterministic seed from word AND seed argument
        newsyn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
    oldVocabSize = len(self.syn0)
    self.syn0 = deepcopy(newsyn0)

    if self.hs:
        oldsyn1 = deepcopy(self.syn1)
        self.syn1 = zeros((len(self.vocab), self.layer1_size), dtype=REAL)
        for i in xrange(0, len(oldsyn1)):
            self.syn1[i] = deepcopy(oldsyn1[i])

    if self.negative:
        oldneg = deepcopy(self.syn1neg)
        self.syn1neg = zeros((len(self.vocab), self.layer1_size), dtype=REAL)
        for i in xrange(0, len(oldneg)):
            self.syn1neg[i] = deepcopy(oldneg[i])

Sorry for pasting the code here; I don't know how to use GitHub.

@gojomo (Collaborator) commented Jun 12, 2016

@mirandanfu I wish you luck, but you're now working with an experimental, unreleased feature branch. Though I've offered feedback, I've never run this branch, and (as above) have recommended most people should be retraining with full datasets, because of all the unresolved open questions on how this should work. Only you can judge whether this source code, with or without additional patches by yourself, does what your project needs.

@c-martinez

Nice feature, I think it might be useful. But isn't it missing an update=True in the build_vocab line?

@zachmayer (Author)

Possibly, yes

@isomap mentioned this pull request Jul 8, 2016
@michelleowen

Hi @zachmayer, I used this branch to train several models sequentially (call them model1, model2, etc.). However, when I check the vocab size, it seems that the subsequent models don't update the counts correctly. That is, model2.vocab[word].count is always the same as model1.vocab[word].count, though the counts should differ between the two corpora.

@c-martinez

@michelleowen -- I think this is almost to be expected, as the vocabulary is only built once. Adding new words to the vocabulary once it has been created might be quite tricky to say the least. But I agree that updating the word counts would be a nice addition.

@michelleowen

Another problem to report. I used online learning to train on 12 months of data sequentially. The updates to the embedding vectors over the first 6 months seem reasonable. However, starting from month 7 the updates became wild (especially for frequent words), and by month 9 all the embedding vectors become NaN, though no error is reported in the output log.
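
A cheap sanity check between the monthly updates might catch the blow-up earlier; a rough sketch, not part of this branch:

    import numpy as np

    def check_vectors(model, month):
        """Stop the incremental run as soon as the embeddings stop being finite."""
        if not np.isfinite(model.syn0).all():
            raise ValueError("non-finite embedding values after month %d" % month)
        print("month %d: max |component| = %.3f" % (month, np.abs(model.syn0).max()))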

@zachmayer (Author)

I don't really know how to debug these models. All I did was rebase @rutum's PR: #435

I haven't looked closely into what's going on under the hood, but I welcome PRs into this PR.

@tmylk (Contributor) commented Aug 23, 2016

@michelleowen @zachmayer There is more testing of this code by @isohyt in #778

@isomap (Contributor) commented Sep 4, 2016

Hi.

I've completed the online word2vec development and written a tutorial using Wikipedia dumps.
This implementation increases the vocabulary counts after online training.
Please check it out in #778 :)

@tmylk (Contributor) commented Oct 4, 2016

Finally merged in #900
