
Online Word2Vec #365

Closed
wants to merge 16 commits into from

Conversation

rutum commented Jun 22, 2015

Adding functions:

  • update_vocab: updates the vocabulary with new words
  • update_weights: reuses the weights of the old vocabulary and resets the weights of the new vocabulary

Usage:

model = Word2Vec() # sg and hs are the default parameters
model.build_vocab(sentences)
model.train(sentences)
model.save("base_model")

model.update_vocab(new_sentences)
model.train(new_sentences)
model.save("updated_model")

Then you can compare the two models to see whether the new vocabulary is being learned the way it is supposed to.

I tried an experiment: training a model without "queen", then adding it in a subsequent set of sentences. The updated model learned "queen" as being similar to "king", "duke", etc., so that was a huge success. I would love to hear any other ideas you might have for testing this.
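A minimal sketch of how such a comparison can be eyeballed, assuming the 2015-era gensim API (where vocab and most_similar live directly on the model; the file names follow the usage above):

from gensim.models import word2vec

base = word2vec.Word2Vec.load("base_model")        # trained without "queen"
updated = word2vec.Word2Vec.load("updated_model")  # after update_vocab + train

assert "queen" not in base.vocab                   # the base model never saw the word
print(updated.most_similar("queen", topn=5))       # expect "king", "duke", ... near the top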

model.syn1[word.point] += outer(ga, l1) # learn hidden -> output
neu1e += dot(ga, l2a) # save error
piskvorky (Owner):

At least two spaces before inline comments (PEP8).

piskvorky (Owner) commented:

@rutum I had a look at the code -- if I understand correctly, after calling update_vocab, training can only continue on new words exclusively (not old+new), right?

What is the reason for this? Why not continue training on all vocabulary?

@@ -185,23 +184,21 @@ def train_sg_pair(model, word, word2, alpha, labels, train_w1=True, train_w2=True
fb = 1. / (1. + exp(-dot(l1, l2b.T))) # propagate hidden -> output
gb = (labels - fb) * alpha # vector of error gradients multiplied by the learning rate
if train_w1:
model.syn1neg[word_indices] += outer(gb, l1) # learn hidden -> output
piskvorky (Owner):

Where did the update to syn1neg go? I don't understand this refactor. Is it really equivalent to the original?

rutum (Author):

It is a bug, and I just fixed it.
Just FYI, I have made changes only to the skip-gram model with hierarchical softmax.

rutum (Author) commented Jun 23, 2015

@piskvorky I have removed the word-freeze feature for now. However, the idea is that if we have a good enough model to start with, we don't want to change it too much because of the introduction of a small amount of new data. Each new learning iteration starts with the default alpha of 0.025, which would make learning very aggressive on the new words.

piskvorky (Owner) commented:

Ah, I see what you mean. It is an interesting question what should happen to "old" words. Maybe freezing makes sense, or perhaps we could do some per-word learning rate? Just an idea :)

@@ -186,7 +185,7 @@ def train_sg_pair(model, word, word2, alpha, labels, train_w1=True, train_w2=True
gb = (labels - fb) * alpha # vector of error gradients multiplied by the learning rate
if train_w1:
model.syn1neg[word_indices] += outer(gb, l1) # learn hidden -> output
neu1e += dot(gb, l2b) # save error
neu1e += dot(gb, l2b) # save error
piskvorky (Owner):

Is this change also a bug? Careful with the whitespace -- Python is picky :)

rutum (Author) commented Jun 23, 2015

Yup! That is the next piece I am working on: a per-word learning rate using AdaGrad.
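For the record, a minimal sketch of what a per-word AdaGrad rule could look like here (plain numpy, not this PR's code; grad_sq is a hypothetical accumulator kept alongside syn0):

import numpy as np

# hypothetical per-word accumulator of squared gradients, same shape as syn0
grad_sq = np.zeros_like(model.syn0)

def adagrad_update(syn0, grad_sq, word_index, grad, base_alpha=0.025, eps=1e-8):
    """AdaGrad: scale the step for one word by its own gradient history,
    so heavily-updated (old) words move less than freshly-added ones."""
    grad_sq[word_index] += grad ** 2
    syn0[word_index] += base_alpha * grad / (np.sqrt(grad_sq[word_index]) + eps)

The appeal for online training is exactly the per-word dampening discussed above: words with lots of accumulated updates get small steps, while new words get nearly the full alpha.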

for line in utils.smart_open(self.filename):
    line = utils.to_unicode(line)
    line = line.strip()
    words = [token.lower() for token in line.split(" ")]
piskvorky (Owner):

Why not split on all whitespace (not just " ")?

But then this class looks like a duplicate of the existing word2vec.LineSentence class -- what is the difference?
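(For illustration: split() with no argument splits on any run of whitespace and drops empty tokens, which is usually what is wanted here.)

>>> "Hello   world\t!\n".split(" ")
['Hello', '', '', 'world\t!\n']
>>> "Hello   world\t!\n".split()
['Hello', 'world', '!']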

rutum (Author):

The only difference is the removal of newlines etc. before and after a sentence. Also, it expects one sentence per line.

piskvorky (Owner):

That is no different from LineSentence. It removes newlines (all whitespace) as well, and expects one sentence per line.

rutum (Author) commented Jun 23, 2015

@piskvorky You might be interested in checking out this post: http://rutumulkar.com/blog/2015/word2vec/
I have done some evaluation of the difference between bulk training and online training; performance does drop with online training. Your thoughts would be awesome!

gojomo (Collaborator) commented Jun 24, 2015

It looks like this would work (perhaps already does?) for skip-gram negative-sampling, and could work for CBOW too.

There's some collision with changes in my pending doc2vec PR – but nothing major, and some changes there make this easier. The syn0_lockf ('lock-factor') arrays in the big doc2vec PR serve as exactly the word-specific training-dampener (that was also your removed syn0lock) – though only the doc2vec training paths fully respect them. A 1.0 value (the default) means full backpropagated errors are applied; a 0.0 means no error-correction (locked). Whether ADAGRAD or similar would need that full parallel array of factors, or could use some other more local/temporary value, I don't yet understand ADAGRAD well enough to say.
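A short sketch of how such a lock-factor gates updates, following the syn0_lockf naming described above (old_word_indices, word2_index and the placement inside the update step are illustrative, not this PR's code):

import numpy as np

# per-word multiplier on the backpropagated error:
# 1.0 (the default) = fully trainable, 0.0 = frozen
model.syn0_lockf = np.ones(len(model.vocab), dtype=np.float32)
model.syn0_lockf[old_word_indices] = 0.0  # hypothetical: freeze words from the base model

# inside the pair-training step, the error applied to the input vector is scaled:
model.syn0[word2_index] += model.syn0_lockf[word2_index] * neu1e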

The cost of reallocating-and-copying syn0/syn1/syn1neg on each vocab expansion may be an issue in volume use – and could be avoided by going to a segmented representation. That is, syn0 would be a list of ndarrays rather than one, and a word's coordinate would be 2-d rather than 1-d. (The segments could be equal-sized – a true 2-d ndarray – but a list of ragged-sized segments is probably just as efficient and more flexible.)
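A rough sketch of that segmented layout (SegmentedVectors and the (segment, row) addressing are illustrative only; nothing like this exists in gensim):

import numpy as np

class SegmentedVectors(object):
    """syn0 as a list of ragged-sized segments; a word's address is (segment, row)."""
    def __init__(self, vector_size):
        self.vector_size = vector_size
        self.segments = []

    def grow(self, n_new_words):
        # appending a new segment avoids reallocating-and-copying earlier vectors
        self.segments.append(np.zeros((n_new_words, self.vector_size), dtype=np.float32))

    def __getitem__(self, addr):
        seg, row = addr
        return self.segments[seg][row]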

Balancing the influence of new examples and prior training may be a big factor in the quality of incremental changes. Locking all old vectors in place is one simple, defensible approach – and if the old vectors have already been ported off to downstream applications, where they can't 'drift' to new values without other re-deployment costs, it may be the dominant approach. But letting the old vectors improve a little, in proportion to how much info about each word the new examples bring, might be optimal...

rutum (Author) commented Jul 16, 2015

@gojomo Good points. You are correct that the code already works with negative sampling and CBOW. In fact, you can definitely see improvement in accuracy after adding more vocabulary when using negative sampling, as opposed to hierarchical softmax. Hierarchical softmax intuitively doesn't feel right to me here, because the binary Huffman tree will change with the new frequencies, changing the syn0-to-syn1 matrix mappings.

Anyway, let me know what the next steps are for this!

phdowling (Contributor) commented:

@rutum @piskvorky Awesome work; I'd be really interested in this feature. What is the status here? Will this be included in a release anytime soon?

rutum (Author) commented Jul 29, 2015

@phdowling We are working on some tests, and then a Cython implementation, before doing the merge.

rutum (Author) commented Aug 5, 2015

I have resolved the merge conflicts with the develop branch, but there still seem to be some issues. Any thoughts? @piskvorky

rutum (Author) commented Aug 5, 2015

Nevermind! Figured it out.

model.train(new_sentences)
self.assertEqual(len(model.vocab), 14)
self.models_equal(model, word2vec.Word2Vec.load(datapath("gensim_word2vec_update.tst")))
piskvorky (Owner):

Where does this file come from? Also, a more direct test of success (things become locked?) may be preferable here, rather than comparing against a pre-generated model.

rutum (Author):

@piskvorky Do you have any recommendations on what to test? I am testing the vocabulary length to make sure it has increased.

piskvorky (Owner):

Well, you can get inspiration from the other surrounding tests :)

But here I meant: where does the file gensim_word2vec_update.tst come from? I don't see it in the repository.

And this test should probably be finer anyway -- relying on bit-for-bit equality against a pregenerated model seems too brittle. Any future change to the RNG, or any other parameter, will result in a failure. My guess is this won't work even now across different Python versions -- do the tests pass on Python 2.6, 2.7 and 3.4? Once the conflicts are resolved, GitHub will launch the Travis tests and we will be able to see the results.
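As an illustration of a finer-grained test (a sketch only; "king" and "queen" stand in for words actually present in the 14-word fixture, and the frozen-vector assertion applies only if old words are meant to be locked):

import numpy as np

model.build_vocab(sentences)
model.train(sentences)
old_vec = model.syn0[model.vocab["king"].index].copy()  # snapshot an old word's vector

model.update_vocab(new_sentences)
model.train(new_sentences)

self.assertEqual(len(model.vocab), 14)   # vocabulary grew to the expected size
self.assertIn("queen", model.vocab)      # a genuinely new word was added
# old vectors must survive the update unchanged if they are frozen:
self.assertTrue(np.allclose(old_vec, model.syn0[model.vocab["king"].index]))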

rutum (Author):

Agreed. Will change the tests in the new PR with merged changes from develop.

piskvorky (Owner) commented:

@rutum what is the progress here?

I'm thinking what we need now is:

  • resolve conflicts, so this PR is mergeable
  • I'll try to find someone to help out with cythonization of relevant parts
  • merge
  • release as part of 0.12.2 (~end of August)
  • fix and improve later, depending on reception :)

rutum (Author) commented Aug 11, 2015

@piskvorky I am working on the merge conflicts right now. The last release has significant changes, and I plan to open up a new PR with this update by next week. I think we should be able to meet the end of August deadline.

piskvorky (Owner) commented:

Perfect :)

piskvorky (Owner) commented:

Superseded by #435. See there for continued discussion.

piskvorky closed this Aug 19, 2015
tmylk pushed a commit that referenced this pull request Oct 3, 2016
sasikum commented Sep 6, 2017

AttributeError: 'Word2Vec' object has no attribute 'update_vocab'

chez8990 commented:
@sasikum I think update_vocab(sentences) has been replaced by build_vocab(sentences, update=True); this requires the model to be pre-trained.
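In recent gensim releases (4.x) the equivalent flow looks roughly like this; note that train() now requires explicit total_examples and epochs:

from gensim.models import Word2Vec

model = Word2Vec(sentences)  # initial vocabulary build + training
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)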

Zayme249Shaw commented:
@rutum So how do you use hierarchical softmax in online word2vec right now? I'm interested in ideas about that.


penelope24 commented:

That's lovely, but how can I use it in my gensim implementation? Thanks!
