Online Word2Vec #365
Conversation
model.syn1[word.point] += outer(ga, l1) # learn hidden -> output
neu1e += dot(ga, l2a) # save error
At least two spaces before inline comments (PEP8).
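For reference, PEP 8 asks for at least two spaces between the code and an inline comment:

x = x + 1 # too tight: only one space before the comment
x = x + 1  # OK: at least two spaces before the comment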
@rutum I had a look at the code -- if I understand correctly, after calling update_vocab() only the newly added words get trained, and the old vocabulary is frozen. What is the reason for this? Why not continue training on all vocabulary?
@@ -185,23 +184,21 @@ def train_sg_pair(model, word, word2, alpha, labels, train_w1=True, train_w2=True):
fb = 1. / (1. + exp(-dot(l1, l2b.T))) # propagate hidden -> output
gb = (labels - fb) * alpha # vector of error gradients multiplied by the learning rate
if train_w1:
    model.syn1neg[word_indices] += outer(gb, l1) # learn hidden -> output
Where did the update to syn1neg go? I don't understand this refactor. Is it really equivalent to the original?
It is a bug, and I just fixed it.
Just FYI, I have made changes only to the skip-gram model with hierarchical softmax.
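For reference, the restored negative-sampling update in train_sg_pair should read roughly as follows (a reconstruction from the diff hunks above; the l2b line is assumed from the surrounding source, not shown in this PR view):

l2b = model.syn1neg[word_indices]  # 2d matrix, (k+1) x layer1_size -- assumed context
fb = 1. / (1. + exp(-dot(l1, l2b.T)))  # propagate hidden -> output
gb = (labels - fb) * alpha  # vector of error gradients multiplied by the learning rate
if train_w1:
    model.syn1neg[word_indices] += outer(gb, l1)  # learn hidden -> output (the line the refactor dropped)
neu1e += dot(gb, l2b)  # save error, applied to the input vector later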
@piskvorky I have removed the word-freeze feature for now. However, the idea is that if we have a good enough model to start with, we don't want to change it too much because of the introduction of a small amount of new data. Each new learning iteration starts with a default alpha of 0.025, which would make learning very aggressive with new words.
Ah, I see what you mean. It is an interesting question what should happen to "old" words. Maybe freezing makes sense. Or perhaps we could do some per-word learning rate? Just an idea :)
@@ -186,7 +185,7 @@ def train_sg_pair(model, word, word2, alpha, labels, train_w1=True, train_w2=True):
gb = (labels - fb) * alpha # vector of error gradients multiplied by the learning rate
if train_w1:
    model.syn1neg[word_indices] += outer(gb, l1) # learn hidden -> output
-   neu1e += dot(gb, l2b) # save error
+   neu1e += dot(gb, l2b)  # save error
Is this change also a bug? Careful with the whitespace -- Python is picky :)
Yup! That is the next piece I am working on -- a per-word learning rate using ADAGRAD.
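A minimal sketch of what that per-word rate could look like (all names here are hypothetical, not gensim API; plain ADAGRAD over each word's accumulated squared gradients):

import numpy as np

class PerWordAdagrad(object):
    """Scale the base alpha per word by its ADAGRAD accumulator (hypothetical helper)."""
    def __init__(self, vocab_size, base_alpha=0.025, eps=1e-8):
        self.base_alpha = base_alpha
        self.eps = eps
        self.grad_sq = np.zeros(vocab_size)  # one accumulator per word index

    def alpha_for(self, word_index, grad):
        # accumulate this word's squared gradient norm
        self.grad_sq[word_index] += np.dot(grad, grad)
        # heavily-trained (old) words get a small effective rate;
        # brand-new words start near the full base_alpha
        return self.base_alpha / (np.sqrt(self.grad_sq[word_index]) + self.eps)

When the vocabulary grows, grad_sq would be extended with zeros for the new words, which is exactly what gives them their aggressive initial rate while old words stay damped.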
for line in utils.smart_open(self.filename):
    line = utils.to_unicode(line)
    line = line.strip()
    words = [token.lower() for token in line.split(" ")]
Why not split on all whitespace (not just " ")?
But then this class looks like a duplicate of the existing word2vec.LineSentence class -- what is the difference?
The only difference is the removal of newlines etc. before and after a sentence. Also, it expects one sentence per line.
That is no different from LineSentence. It removes newlines (all whitespace) as well, and expects one sentence per line.
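For comparison, a LineSentence-style iterator that also lowercases and splits on all whitespace is only a few lines (a sketch; the class name is hypothetical, utils is gensim.utils):

from gensim import utils

class LowercasedLineSentence(object):
    """Yield one lowercased, whitespace-split sentence per line."""
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        for line in utils.smart_open(self.filename):
            # split() with no argument handles spaces, tabs and the
            # trailing newline in one go, so no separate strip() is needed
            yield [token.lower() for token in utils.to_unicode(line).split()]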
@piskvorky You might be interested in checking out this site: http://rutumulkar.com/blog/2015/word2vec/
It looks like this would work (perhaps already does?) for skip-gram negative-sampling, and could work for CBOW too. There's some collision with changes in my pending doc2vec PR – but nothing major, and some changes there make this easier.

The syn0_lockf ('lock-factor') arrays in the big doc2vec PR serve as exactly the word-specific training-dampener (that was also your removed syn0lock) – though only the doc2vec training paths fully respect them. A 1.0 value (the default) means full backpropagated errors are applied; a 0.0 means no error correction (locked). Whether ADAGRAD or similar would need that full parallel array of factors, or can use some other more local/temporary value, I don't yet understand ADAGRAD enough to say.

The cost of reallocating-and-copying syn0/syn1/syn1neg on each vocab-expansion may be an issue in volume use – and could be avoided by going to a segmented representation. That is, syn0 would be a list of ndarrays rather than one, and a word's coordinate would be 2-d rather than 1-d. (The segments could be equal-sized – a true 2d ndarray – but a list of ragged-sized segments is probably just as efficient and more flexible.)

Balancing the influence of new examples and prior training may be a big factor in the quality of incremental changes. Locking all old vectors in place is one simple, defensible approach – and if the old vectors have already been ported off to downstream applications, where they can't 'drift' to new values without other re-deployment costs, it may be the dominant approach. But letting the old vectors improve a little, in proportion to how much info about each word the new examples bring, might be optimal...
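To make the lock-factor idea concrete: at the point where the backpropagated error is applied to a word's input vector, it is scaled by that word's entry in syn0_lockf (a sketch, simplified from the doc2vec training path):

# i is the index of the input word; neu1e is its accumulated error.
# syn0_lockf[i] == 1.0 applies the full update (the default);
# syn0_lockf[i] == 0.0 leaves the vector frozen in place.
model.syn0[i] += model.syn0_lockf[i] * neu1e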
@gojomo Good points. You are correct about the code already working with negative sampling and CBOW. In fact, you can definitely see improvement in accuracy after adding more vocabulary when you are using negative sampling, as opposed to hierarchical softmax. Using hierarchical softmax intuitively doesn't feel right to me, because the binary Huffman tree will change with the new frequencies, changing the syn0-to-syn1 matrix mappings. Anyway, let me know what the next steps for this are!
@rutum @piskvorky Awesome work, I'd be really interested in this feature. What is the status here; will this be included in a release anytime soon?
@phdowling We are working on some tests, and then a Cython implementation, before doing the merge.
I have resolved the merge conflicts with the develop branch.
Nevermind! Figured it out.
model.train(new_sentences)
self.assertEqual(len(model.vocab), 14)
self.models_equal(model, word2vec.Word2Vec.load(datapath("gensim_word2vec_update.tst")))
Where does this file come from? Also, a more direct test of success (things become locked?) may be preferable here, rather than comparing against a pre-generated model.
@piskvorky Do you have any recommendations on what to test? I am testing the vocabulary length to make sure it has increased.
Well, you can get inspiration from the other surrounding tests :)
But here I meant: where does the file gensim_word2vec_update.tst come from? I don't see it in the repository.
And this test should probably be finer anyway -- relying on bit-for-bit equality against a pregenerated model seems too brittle. Any future change to the RNG, or to any other parameter, will result in a failure. My guess is this won't work even now across different Python versions -- do the tests pass on Python 2.6, 2.7 and 3.4? Once the conflicts are resolved, GitHub will launch the Travis tests and we will be able to see the results.
Agreed. Will change the tests in the new PR with merged changes from develop.
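Something along these lines would assert the interesting properties directly (a sketch only; update_vocab, the corpora and the probe words are assumptions taken from this PR, and numpy is assumed imported in the test module):

def testOnlineUpdate(self):
    model = word2vec.Word2Vec(sentences, min_count=1)
    old_vocab_size = len(model.vocab)
    old_vec = model['king'].copy()  # assumes 'king' occurs in the base corpus

    model.update_vocab(new_sentences)
    model.train(new_sentences)

    self.assertTrue(len(model.vocab) > old_vocab_size)  # vocabulary actually grew
    self.assertTrue('queen' in model.vocab)  # assumes 'queen' appears only in new_sentences
    # if old vectors are meant to be locked, they must come out unchanged
    self.assertTrue(numpy.allclose(old_vec, model['king']))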
@rutum what is the progress here? I'm thinking what we need now is:
@piskvorky I am working on the merge conflicts right now. The last release has significant changes, and I plan to open a new PR with this update by next week. I think we should be able to meet the end-of-August deadline.
Perfect :)
Superseded by #435. See there for continued discussion.
AttributeError: 'Word2Vec' object has no attribute 'update_vocab'
@sasikum I think update_vocab(sentences) has been replaced by build_vocab(sentences, update=True); note that this requires the model to have been trained already.
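With a recent gensim, the online flow then looks like this (build_vocab with update=True is released gensim API; the total_examples/epochs keyword arguments are required by newer gensim versions -- adjust to yours):

model = Word2Vec(sentences, min_count=1)  # initial vocabulary + training
model.build_vocab(new_sentences, update=True)  # grow the existing vocabulary in place
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)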
@rutum So how do you use hierarchical softmax in online word2vec right now? I'm interested in your ideas about that.
That's lovely, but how can I use it in my gensim implementation? Thanks!
Adding functions:
update_vocab(sentences) -- grow a trained model's vocabulary in place from new sentences
Usage:
model = Word2Vec() # sg and hs are the default parameters
model.build_vocab(sentences)
model.train(sentences)
model.save("base_model")
model.update_vocab(new_sentences)
model.train(new_sentences)
model.save("updated_model")
Then you can compare the 2 models to see whether the new vocabulary is learning the way it is supposed to.
I tried an experiment with learning a model without "queen", and adding it in the subsequent set of sentences. The updated model learned "queen" as being similar to "king", "duke" etc. So that was a huge success. I would love to hear of any other ideas you might have to test this.
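A quick way to reproduce that check (most_similar is standard gensim API; the model file name comes from the usage above):

model = Word2Vec.load("updated_model")
# "queen" appeared only in new_sentences, yet should now rank near the old royalty terms
print(model.most_similar("queen", topn=5))  # expect words like "king" and "duke" near the top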