
Online Word2Vec #365

Closed
wants to merge 16 commits into from

Conversation

rutum commented Jun 22, 2015

Adding functions:

  • update_vocab: updates the vocabulary with new words
  • update_weights: reuses the weights of the old vocabulary and resets the weights of the new vocabulary

Usage:

model = Word2Vec() # sg and hs are the default parameters
model.build_vocab(sentences)
model.train(sentences)
model.save("base_model")

model.update_vocab(new_sentences)
model.train(new_sentences)
model.save("updated_model")

Then you can compare the two models to see whether the new vocabulary is being learned the way it is supposed to.

I tried an experiment: training a model without "queen", then adding it in a subsequent set of sentences. The updated model learned "queen" as being similar to "king", "duke", etc., so that was a huge success. I would love to hear any other ideas you might have for testing this.
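A minimal sketch of how such a comparison can be eyeballed, assuming the 2015-era gensim API (where vocab and most_similar live directly on the model; the file names follow the usage above):

from gensim.models import word2vec

base = word2vec.Word2Vec.load("base_model")        # trained without "queen"
updated = word2vec.Word2Vec.load("updated_model")  # after update_vocab + train

assert "queen" not in base.vocab                   # the base model never saw the word
print(updated.most_similar("queen", topn=5))       # expect "king", "duke", ... near the top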

model.syn1[word.point] += outer(ga, l1) # learn hidden -> output
neu1e += dot(ga, l2a) # save error
piskvorky (Owner):

At least two spaces before inline comments (PEP8).

piskvorky (Owner) commented:

@rutum I had a look at the code -- if I understand correctly, after calling update_vocab, training can only continue on new words exclusively (not old+new), right?

What is the reason for this? Why not continue training on all vocabulary?

@@ -185,23 +184,21 @@ def train_sg_pair(model, word, word2, alpha, labels, train_w1=True, train_w2=True
fb = 1. / (1. + exp(-dot(l1, l2b.T))) # propagate hidden -> output
gb = (labels - fb) * alpha # vector of error gradients multiplied by the learning rate
if train_w1:
model.syn1neg[word_indices] += outer(gb, l1) # learn hidden -> output
piskvorky (Owner):

Where did the update to syn1neg go? I don't understand this refactor. Is it really equivalent to the original?

rutum (Author):

It is a bug, and I just fixed it.
Just FYI, I have made changes only to the skip-gram model with hierarchical softmax.

rutum (Author) commented Jun 23, 2015

@piskvorky I have removed the word-freeze feature for now. However, the idea is that if we have a good enough model to start with, we don't want to change it too much because of the introduction of a small amount of new data. Each new learning iteration starts with the default alpha of 0.025, which would make learning very aggressive on the new words.

piskvorky (Owner) commented:

Ah, I see what you mean. It is an interesting question what should happen to "old" words. Maybe freezing makes sense, or perhaps we could do some per-word learning rate? Just an idea :)

@@ -186,7 +185,7 @@ def train_sg_pair(model, word, word2, alpha, labels, train_w1=True, train_w2=True
gb = (labels - fb) * alpha # vector of error gradients multiplied by the learning rate
if train_w1:
model.syn1neg[word_indices] += outer(gb, l1) # learn hidden -> output
neu1e += dot(gb, l2b) # save error
neu1e += dot(gb, l2b) # save error
piskvorky (Owner):

Is this change also a bug? Careful with the whitespace -- Python is picky :)

rutum (Author) commented Jun 23, 2015

Yup! That is the next piece I am working on: a per-word learning rate using AdaGrad.
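For the record, a minimal sketch of what a per-word AdaGrad rule could look like here (plain numpy, not this PR's code; grad_sq is a hypothetical accumulator kept alongside syn0):

import numpy as np

# hypothetical per-word accumulator of squared gradients, same shape as syn0
grad_sq = np.zeros_like(model.syn0)

def adagrad_update(syn0, grad_sq, word_index, grad, base_alpha=0.025, eps=1e-8):
    """AdaGrad: scale the step for one word by its own gradient history,
    so heavily-updated (old) words move less than freshly-added ones."""
    grad_sq[word_index] += grad ** 2
    syn0[word_index] += base_alpha * grad / (np.sqrt(grad_sq[word_index]) + eps)

The appeal for online training is exactly the per-word dampening discussed above: words with lots of accumulated updates get small steps, while new words get nearly the full alpha.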

for line in utils.smart_open(self.filename):
    line = utils.to_unicode(line)
    line = line.strip()
    words = [token.lower() for token in line.split(" ")]
piskvorky (Owner):

Why not split on all whitespace (not just " ")?

But then this class looks like a duplicate of the existing word2vec.LineSentence class -- what is the difference?
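(For illustration: split() with no argument splits on any run of whitespace and drops empty tokens, which is usually what is wanted here.)

>>> "Hello   world\t!\n".split(" ")
['Hello', '', '', 'world\t!\n']
>>> "Hello   world\t!\n".split()
['Hello', 'world', '!']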

rutum (Author):

The only difference is the removal of newlines etc. before and after a sentence. Also, it expects one sentence per line.

piskvorky (Owner):

That is no different from LineSentence. It removes newlines (all whitespace) as well, and expects one sentence per line.

rutum (Author) commented Jun 23, 2015

@piskvorky You might be interested in checking out this post: http://rutumulkar.com/blog/2015/word2vec/
I have done some evaluation of the difference between bulk training and online training; performance does drop with online training. Your thoughts would be awesome!

gojomo (Collaborator) commented Jun 24, 2015

It looks like this would work (perhaps already does?) for skip-gram negative-sampling, and could work for CBOW too.

There's some collision with changes in my pending doc2vec PR – but nothing major, and some changes there make this easier. The syn0_lockf ('lock-factor') arrays in the big doc2vec PR serve as exactly the word-specific training-dampener (that was also your removed syn0lock) – though only the doc2vec training paths fully respect them. A 1.0 value (the default) means full backpropagated errors are applied; a 0.0 means no error-correction (locked). Whether ADAGRAD or similar would need that full parallel array of factors, or could use some other more local/temporary value, I don't yet understand ADAGRAD well enough to say.
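A short sketch of how such a lock-factor gates updates, following the syn0_lockf naming described above (old_word_indices, word2_index and the placement inside the update step are illustrative, not this PR's code):

import numpy as np

# per-word multiplier on the backpropagated error:
# 1.0 (the default) = fully trainable, 0.0 = frozen
model.syn0_lockf = np.ones(len(model.vocab), dtype=np.float32)
model.syn0_lockf[old_word_indices] = 0.0  # hypothetical: freeze words from the base model

# inside the pair-training step, the error applied to the input vector is scaled:
model.syn0[word2_index] += model.syn0_lockf[word2_index] * neu1e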

The cost of reallocating-and-copying syn0/syn1/syn1neg on each vocab expansion may be an issue in volume use – and could be avoided by going to a segmented representation. That is, syn0 would be a list of ndarrays rather than one, and a word's coordinate would be 2-d rather than 1-d. (The segments could be equal-sized – a true 2-d ndarray – but a list of ragged-sized segments is probably just as efficient and more flexible.)
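A rough sketch of that segmented layout (SegmentedVectors and the (segment, row) addressing are illustrative only; nothing like this exists in gensim):

import numpy as np

class SegmentedVectors(object):
    """syn0 as a list of ragged-sized segments; a word's address is (segment, row)."""
    def __init__(self, vector_size):
        self.vector_size = vector_size
        self.segments = []

    def grow(self, n_new_words):
        # appending a new segment avoids reallocating-and-copying earlier vectors
        self.segments.append(np.zeros((n_new_words, self.vector_size), dtype=np.float32))

    def __getitem__(self, addr):
        seg, row = addr
        return self.segments[seg][row]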

Balancing the influence of new examples and prior training may be a big factor in the quality of incremental changes. Locking all old vectors in place is one simple, defensible approach – and if the old vectors have already been ported off to downstream applications, where they can't 'drift' to new values without other re-deployment costs, it may be the dominant approach. But letting the old vectors improve a little, in proportion to how much info about each word the new examples bring, might be optimal...

rutum (Author) commented Jul 16, 2015

@gojomo Good points. You are correct that the code already works with negative sampling and CBOW. In fact, you can definitely see improvement in accuracy after adding more vocabulary when using negative sampling, as opposed to hierarchical softmax. Hierarchical softmax intuitively doesn't feel right to me here, because the binary Huffman tree will change with the new frequencies, changing the syn0-to-syn1 matrix mappings.

Anyway, let me know what the next steps are for this!

phdowling (Contributor) commented:

@rutum @piskvorky Awesome work; I'd be really interested in this feature. What is the status here? Will this be included in a release anytime soon?

rutum (Author) commented Jul 29, 2015

@phdowling We are working on some tests, and then a Cython implementation, before doing the merge.

rutum (Author) commented Aug 5, 2015

I have resolved the merge conflicts with the develop branch, but there still seem to be some issues. Any thoughts? @piskvorky

rutum (Author) commented Aug 5, 2015

Nevermind! Figured it out.

model.train(new_sentences)
self.assertEqual(len(model.vocab), 14)
self.models_equal(model, word2vec.Word2Vec.load(datapath("gensim_word2vec_update.tst")))
piskvorky (Owner):

Where does this file come from? Also, a more direct test of success (things become locked?) may be preferable here, rather than comparing against a pre-generated model.

rutum (Author):

@piskvorky Do you have any recommendations on what to test? I am testing the vocabulary length to make sure it has increased.

piskvorky (Owner):

Well, you can get inspiration from the other surrounding tests :)

But here I meant: where does the file gensim_word2vec_update.tst come from? I don't see it in the repository.

And this test should probably be finer anyway -- relying on bit-for-bit equality against a pregenerated model seems too brittle. Any future change to the RNG, or any other parameter, will result in a failure. My guess is this won't work even now across different Python versions -- do the tests pass on Python 2.6, 2.7 and 3.4? Once the conflicts are resolved, GitHub will launch the Travis tests and we will be able to see the results.
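As an illustration of a finer-grained test (a sketch only; "king" and "queen" stand in for words actually present in the 14-word fixture, and the frozen-vector assertion applies only if old words are meant to be locked):

import numpy as np

model.build_vocab(sentences)
model.train(sentences)
old_vec = model.syn0[model.vocab["king"].index].copy()  # snapshot an old word's vector

model.update_vocab(new_sentences)
model.train(new_sentences)

self.assertEqual(len(model.vocab), 14)   # vocabulary grew to the expected size
self.assertIn("queen", model.vocab)      # a genuinely new word was added
# old vectors must survive the update unchanged if they are frozen:
self.assertTrue(np.allclose(old_vec, model.syn0[model.vocab["king"].index]))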

rutum (Author):

Agreed. Will change the tests in the new PR with merged changes from develop.

piskvorky (Owner) commented:

@rutum what is the progress here?

I'm thinking what we need now is:

  • resolve conflicts, so this PR is mergeable
  • I'll try to find someone to help out with cythonization of relevant parts
  • merge
  • release as part of 0.12.2 (~end of August)
  • fix and improve later, depending on reception :)

rutum (Author) commented Aug 11, 2015

@piskvorky I am working on the merge conflicts right now. The last release has significant changes, and I plan to open up a new PR with this update by next week. I think we should be able to meet the end of August deadline.

piskvorky (Owner) commented:

Perfect :)

piskvorky (Owner) commented:

Superseded by #435. See there for continued discussion.

piskvorky closed this Aug 19, 2015
tmylk pushed a commit that referenced this pull request Oct 3, 2016
sasikum commented Sep 6, 2017

AttributeError: 'Word2Vec' object has no attribute 'update_vocab'

chez8990 commented:
@sasikum I think update_vocab(sentences) has been replaced by build_vocab(sentences, update=True); this requires the model to be pre-trained.
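In recent gensim releases (4.x) the equivalent flow looks roughly like this; note that train() now requires explicit total_examples and epochs:

from gensim.models import Word2Vec

model = Word2Vec(sentences)  # initial vocabulary build + training
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)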

Zayme249Shaw commented:
@rutum So how do you use hierarchical softmax in online word2vec right now? I'm interested in ideas about that.


penelope24 commented:

That's lovely, but how can I use it in my gensim implementation? Thanks!
