
Online word2vec #700

Closed
wants to merge 3 commits into from

Conversation

@zachmayer commented May 17, 2016

Rebase of rebase of #435

Rutu Mulkar-Mehta and others added 3 commits May 17, 2016 10:50

  • recovering lost work
  • updating bug in sentences iterator
  • vector freeze after each training iteration
  • cbow update
  • clean up code
  • update summarization tutorial image
  • resolving merge conflicts (×6)
  • update_weights function change
  • updating the numpy copy code (×2)
@zachmayer changed the title from "Online w2 v" to "Online word2vec" May 17, 2016
@piskvorky (Owner) commented May 18, 2016

Looks clean, but will need more extensive testing & sanity checking, because it's such a tricky feature. CC @gojomo .

@zachmayer how could we test this more thoroughly? What results (accuracy, performance) can be expected if we run word2vec "online" on a larger corpus, such as text8/text9, compared to the existing "single batch" version?

@piskvorky added the "feature" label May 18, 2016

    -    def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
    +    def scan_vocab(self, sentences, update, progress_per=10000, trim_rule=None):

@piskvorky (Owner) commented on this diff:

I'd prefer to put a default in here, so that the change is backward compatible (some users call these functions manually in their apps; we don't want to break that just because of an optional upgrade).
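
For illustration only, that would presumably look like the following signature; the `False` default is my assumption, chosen so existing callers keep the old behaviour:

    def scan_vocab(self, sentences, update=False, progress_per=10000, trim_rule=None):
        # update=False: callers that pass only `sentences` behave exactly as before;
        # update=True: merge the newly scanned words into the existing vocabulary (this PR's online mode).
        ...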

@zachmayer (Author)

@rutum, want to chime in on:

> how could we test this more thoroughly? What results (accuracy, performance) can be expected if we run word2vec "online" on a larger corpus, such as text8/text9, compared to the existing "single batch" version?

@zachmayer (Author)

@piskvorky I'm primarily interested in using the "online" mode for "two-corpus" training: e.g. train a word2vec model on a very large dataset, then fine-tune the embeddings on a smaller dataset that is more specific to the task at hand, say Wikipedia for the initial embeddings and a medical dictionary for the fine-tuning.

Let me think about some specific use cases and get back to you.
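
For illustration, a minimal sketch of that two-corpus workflow, assuming the `build_vocab(..., update=True)` API added by this PR; the corpus paths and parameters are placeholders:

    from gensim.models import word2vec

    # Stage 1: general-purpose embeddings from a large corpus (e.g. a Wikipedia dump).
    base_corpus = word2vec.LineSentence('wikipedia.txt')     # placeholder path
    model = word2vec.Word2Vec(base_corpus, size=200, min_count=5, workers=4)

    # Stage 2: grow the vocabulary with domain text and fine-tune on the smaller corpus.
    domain_corpus = word2vec.LineSentence('medical.txt')     # placeholder path
    model.build_vocab(domain_corpus, update=True)  # the new online-vocab flag from this PR
    model.train(domain_corpus)                     # continue training (0.12.x-style train call)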

@rutum commented May 18, 2016

@zachmayer : Adding reference post on testing here: http://rutumulkar.com/blog/2015/word2vec/

@mohataher commented May 18, 2016

A few thoughts on this: I think a comparison between an online model and a regular (batch) model is still missing from the test cases.

  • The online model could be built like the one in the testOnlineLearning() test case, and the offline model by merging the sentences.

A quick pseudo-code sketch:

    def testOnlineAndOfflineLearning(self):
        """Test that the algorithm is able to create the same model
        as the offline one trained on the same sentences"""
        online_model = word2vec.Word2Vec(sentences, min_count=0, sorted_vocab=0)
        online_model.build_vocab(new_sentences, update=True)
        online_model.train(new_sentences)
        merged_sentences = sentences + new_sentences
        offline_model = word2vec.Word2Vec(merged_sentences, min_count=0, sorted_vocab=0)
        self.assertEqual(online_model, offline_model)
  • Adding more to the same test case, we could check the similarity accuracy for each word in both the online and offline models (see the sketch after this list).
  • Test the model after it's saved to file and loaded again. Would that be helpful?
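
For the similarity-accuracy bullet above, a rough sketch of one way to compare the two models (not part of this PR; the word list and topn are arbitrary choices):

    def compare_neighbors(online_model, offline_model, words, topn=10):
        """Report how much the nearest-neighbour lists of the two models overlap."""
        for word in words:
            online_nn = set(w for w, _ in online_model.most_similar(word, topn=topn))
            offline_nn = set(w for w, _ in offline_model.most_similar(word, topn=topn))
            overlap = len(online_nn & offline_nn) / float(topn)
            print("%-15s neighbour overlap: %.0f%%" % (word, 100 * overlap))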

@zihaolucky

Hi @rutum
According to the post, the accuracy drops when we train the model in an incremental way.

| Corpus                        | Accuracy |
| ----------------------------- | -------- |
| text8-rest                    | 22.6%    |
| text8-rest then text8-queen   | 20.5%    |
| text8-all                     | 23.6%    |
| text8-first                   |          |
| text8-first then text8-second |          |

How about randomly separating text8-all into two parts and testing again? The word distribution in text8-queen is highly skewed, plus there are the update issues you mentioned in the earlier discussion. I believe the model would be more accurate if the corpus were not that small.
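
One way to set up that experiment, as a rough sketch (the 50/50 split and min_count are arbitrary, and the 'questions-words.txt' path for the standard analogy file is an assumption):

    from gensim.models import word2vec

    sentences = list(word2vec.Text8Corpus('text8'))          # text8 as a list of token lists
    half = len(sentences) // 2
    first_half, second_half = sentences[:half], sentences[half:]

    # Online: train on the first half, then grow the vocab and continue on the second half.
    online = word2vec.Word2Vec(first_half, min_count=5)
    online.build_vocab(second_half, update=True)
    online.train(second_half)

    # Offline baseline: a single model over the whole corpus.
    offline = word2vec.Word2Vec(sentences, min_count=5)

    # Compare both on the standard analogy questions.
    online.accuracy('questions-words.txt')
    offline.accuracy('questions-words.txt')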

@rutum commented May 31, 2016

@zihaolucky if you test on larger corpora, please include test cases where new words are added. You can create two such corpora from Wikipedia, keeping the same test set: one corpus with "queen" and the other without it. Our goal is to see whether the semantics of the new words are correct.

@neosyon commented Jun 9, 2016

Hi @rutum

I was wondering whether the current version of gensim supports online word2vec updating with the skip-gram model and negative sampling. On the website it seems to be supported; however, I read in some threads that it is not yet.

Thanks

@tmylk (Contributor) commented Jun 9, 2016

Ping @gojomo

@gojomo (Collaborator) commented Jun 10, 2016

The current version (0.12.4) allows you to continue supplying train() examples – but whether that's meaningful/beneficial is going to depend on a lot of things, including how different the new examples are from earlier training, how you decide to manage the alpha training rate, how much you iterate over the newer examples, etc. There aren't any standing recommendations for these decisions: there's no obviously right/best practice. (The most defensible course of action will always be: retrain with all old and new examples.) The current version doesn't provide any way to expand the vocabulary past the original build_vocab (scan/scale/finalize) step – one of the things this PR covers. But again, there are many open/vague questions about how to balance new examples/words against those learned from the original training.
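
To make the alpha question concrete, a hedged sketch of continuing train() on the 0.12.x API while lowering the learning rate by hand; the halving schedule is purely illustrative, not a recommendation:

    # `model` is assumed to be already trained on the original corpus (no vocab expansion here).
    alpha, floor = 0.01, 0.0001
    for epoch in range(3):
        model.alpha = alpha
        model.min_alpha = alpha          # keep alpha constant within this pass
        model.train(new_sentences)       # 0.12.x-style call: just the sentences
        alpha = max(floor, alpha / 2.0)  # decay between passes; schedule is arbitrary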

@neosyon commented Jun 10, 2016

Thanks for your fast reply @gojomo

Then, if the current 0.12.4 version does not support vocabulary updating, can I download the version from this PR to do that, or is it not available yet?

@gojomo (Collaborator) commented Jun 10, 2016

@neosyon – Everything in github pull-requests/branches is "available", so you certainly "can" download it & try it. And that's welcome, especially if you can test/improve/document it, and help resolve the various open questions and tradeoffs I allude to.

But note you'll then be working on an unreleased bit of code that might only become part of the released code with a lot of changes (if ever). So you'd need to be confident in your own ability to understand the rough work-in-progress, and adapt it for your custom needs, relying mainly on the source code itself.

@ghost commented Jun 11, 2016

I find this gensim feature really interesting. @gojomo, can you tell me how to obtain the current version of the code? I have never used GitHub, sorry. If I download it, will the word2vec online training option work, or does it still have big bugs?

@Soumyajit

@mirandanfu I am using this version currently: https://github.com/rutum/gensim/tree/rm_online

model.build_vocab(new_sentences, update=True) works fine but it really needs some tuning to get good results....

You will find a clone / download option in that link.
You can also follow this tutorial: http://rutumulkar.com/blog/2015/word2vec/

@ghost commented Jun 11, 2016

Fantastic, thanks @Soumyajit

Do you know whether, before updating the vocab and training on the new words, it is possible to freeze the model for the previous ones? I would like to train a good model on Wikipedia and afterwards add some domain-specific missing words without modifying the original knowledge learned from Wikipedia.

@gojomo (Collaborator) commented Jun 11, 2016

@mirandanfu – the model's array syn0_lockf is a 'lock factor' which is applied to any attempted changes to existing word-vectors. The values are usually all 1.0 – so training causes 100% of the usual adjustment to a word-vector. If you set some slots to 0.0, training changes will be completely suppressed (for those slots). It's an experimental feature but may achieve what you want, if you manually set the right slots to 0.0 or 1.0 as appropriate for your goals.
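
A minimal sketch of that idea on this branch, assuming the new words are appended after the original vocabulary and that the branch re-allocates syn0_lockf to the new size during the vocab update:

    old_vocab_size = len(model.syn0)               # vocabulary size before the update

    model.build_vocab(new_sentences, update=True)  # grow the vocab with the new corpus

    # Freeze every pre-existing word (lock factor 0.0) and leave the newly
    # appended slots fully trainable (lock factor 1.0).
    model.syn0_lockf[:old_vocab_size] = 0.0
    model.syn0_lockf[old_vocab_size:] = 1.0

    model.train(new_sentences)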

@ghost commented Jun 12, 2016

Thanks @gojomo

I downloaded the code from https://github.com/rutum/gensim/tree/rm_online and updated the function update_weights with these lines:

    if not lockVectors:
        self.syn0_lockf = ones(len(self.vocab), dtype=REAL)   # do not suppress learning for already learned words
    else:
        self.syn0_lockf = zeros(len(self.vocab), dtype=REAL)  # zeros suppress learning of previous vectors
        for i in xrange(oldVocabSize, len(self.vocab)):
            self.syn0_lockf[i] = 1.0

where lockVectors is a boolean that I propagate from build_vocab, and oldVocabSize is set in that function before syn0 is updated:

    oldVocabSize = len(self.syn0)
    self.syn0 = deepcopy(newsyn0)

Would that do what I want?

In addition, I modified this part, which was wrong or incomplete:

    # randomize the remaining words
    for i in xrange(len(self.syn0), len(newsyn0)):
        # construct deterministic seed from word AND seed argument
        newsyn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
    oldVocabSize = len(self.syn0)
    self.syn0 = deepcopy(newsyn0)

    if self.hs:
        oldsyn1 = deepcopy(self.syn1)
        self.syn1 = zeros((len(self.vocab), self.layer1_size), dtype=REAL)
        for i in xrange(0, len(oldsyn1)):
            self.syn1[i] = deepcopy(oldsyn1[i])

    if self.negative:
        oldneg = deepcopy(self.syn1neg)
        self.syn1neg = zeros((len(self.vocab), self.layer1_size), dtype=REAL)
        for i in xrange(0, len(oldneg)):
            self.syn1neg[i] = deepcopy(oldneg[i])

Sorry for pasting the code here; I don't know how to use GitHub.

@gojomo (Collaborator) commented Jun 12, 2016

@mirandanfu I wish you luck, but you're now working with an experimental, unreleased feature branch. Though I've offered feedback, I've never run this branch, and (as above) have recommended most people should be retraining with full datasets, because of all the unresolved open questions on how this should work. Only you can judge whether this source code, with or without additional patches by yourself, does what your project needs.

@c-martinez

Nice feature, I think it might be useful. But isn't it missing an update=True in the build_vocab line?

@zachmayer (Author)

Possibly, yes

@isomap mentioned this pull request Jul 8, 2016
@michelleowen

Hi @zachmayer, I used this branch to train several models sequentially (call them model1, model2, etc.). However, when I check the vocab size, it seems that the subsequent models don't update the counts correctly. That is, model2.vocab[word].count is always the same as model1.vocab[word].count, though the counts should differ between the two corpora.

@c-martinez

@michelleowen -- I think this is almost to be expected, as the vocabulary is only built once. Adding new words to the vocabulary once it has been created might be quite tricky to say the least. But I agree that updating the word counts would be a nice addition.

@michelleowen

Another problem to report. I used online learning to train on 12 months of data sequentially. The updates to the embedding vectors over the first 6 months seem reasonable. However, starting from month 7 the updates became wild (especially for frequent words), and by month 9 all the embedding vectors become NaN, though no error is reported in the output log.
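
A cheap sanity check between the monthly updates might catch the blow-up earlier; a rough sketch, not part of this branch:

    import numpy as np

    def check_vectors(model, month):
        """Stop the incremental run as soon as the embeddings stop being finite."""
        if not np.isfinite(model.syn0).all():
            raise ValueError("non-finite embedding values after month %d" % month)
        print("month %d: max |component| = %.3f" % (month, np.abs(model.syn0).max()))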

@zachmayer (Author)

I don't really know how to debug these models. All I did was rebase @rutum's PR: #435

I haven't looked closely into what's going on under the hood, but I welcome PRs into this PR.

@tmylk (Contributor) commented Aug 23, 2016

@michelleowen @zachmayer There is more testing of this code by @isohyt in #778

@isomap (Contributor) commented Sep 4, 2016

Hi.

I've completed the online word2vec development and written a tutorial using Wikipedia dumps.
This implementation increases the vocabulary counts after online training.
Please check it out in #778 :)

@tmylk (Contributor) commented Oct 4, 2016

Finally merged in #900
