[WIP] unsupervised fasttext #1482
Conversation
@piskvorky In the word2vec skip-gram model, the target is the context word, which we predict from the input word (the exact opposite of the CBOW model). But in the gensim word2vec code (here in word2vec.py), I see that, just like in the CBOW model, the context is taken as input for skip-gram too. It doesn't affect performance, but could you please clarify a little for my own understanding? Also, I find the variable naming like …
Skip-gram & CBOW are not 'exact opposites' - one uses single words to predict a target, the other uses averages of multiple words. There has occasionally been confusion because there are two ways to loop over the text & window in skip-gram: you can loop over each main word, then over each window word within that word's window, and make the predictions either (main-word -> window-word) or vice versa (window-word -> main-word). The original word2vec paper described it one way, but the first-released word2vec.c code did it the alternate way, and there was a comment from one of the authors that the way the code does it is slightly more cache-efficient. In both cases, the exact same (word -> word) pairs are eventually trained, just in a slightly different iteration order. The naming …
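To make the equivalence concrete, here is a minimal, self-contained sketch (plain Python, not gensim code; a fixed symmetric window is assumed, i.e. no reduced-window sampling) of the two iteration orders:

```python
def sg_pairs_predict_window_from_center(tokens, window):
    """Loop over each center word and predict each word in its window from it."""
    pairs = []
    for pos, center in enumerate(tokens):
        start = max(0, pos - window)
        for pos2, context in enumerate(tokens[start:pos + window + 1], start):
            if pos2 != pos:
                pairs.append((center, context))
    return pairs


def sg_pairs_predict_center_from_window(tokens, window):
    """Loop over each center word but predict it from each window word (word2vec.c-style)."""
    pairs = []
    for pos, center in enumerate(tokens):
        start = max(0, pos - window)
        for pos2, context in enumerate(tokens[start:pos + window + 1], start):
            if pos2 != pos:
                pairs.append((context, center))
    return pairs


tokens = "the quick brown fox jumps over the lazy dog".split()
# Over a full sweep with a symmetric, fixed-size window, both orders produce the
# same multiset of (input, predicted) pairs, just in a different sequence.
assert sorted(sg_pairs_predict_window_from_center(tokens, 2)) == \
    sorted(sg_pairs_predict_center_from_window(tokens, 2))
```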
gensim/models/fasttext.py
Outdated
import logging

from gensim.models.word2vec import Word2Vec
from gensim.models.ft_keyedvectors import FastTextKeyedVectors
Why not simply reuse FastTextKeyedVectors from the wrapper?
gensim/models/fasttext.py
Outdated
subwords_indices.append(len(self.wv.vocab) + subword_hash % self.bucket)  # self ?? classmethod or pass model ... discuss ??
return subwords_indices

def compute_subwords(word, min_n, max_n):
There's something wrong with indentation. Also, why aren't we reusing compute_ngrams from the wrapper? Similarly for ft_hash.
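For illustration, a sketch of the reuse being suggested; it assumes the module-level helpers compute_ngrams(word, min_n, max_n) and ft_hash(string) from gensim.models.wrappers.fasttext (as they existed around the time of this PR), and the subword_indices name here is hypothetical:

```python
from gensim.models.wrappers.fasttext import compute_ngrams, ft_hash  # wrapper helpers


def subword_indices(word, vocab_size, bucket, min_n=3, max_n=6):
    """Map a word's character ngrams to rows of the (vocab + hashed-ngram) matrix."""
    indices = []
    for ngram in compute_ngrams(word, min_n, max_n):
        # hashed ngram rows come after the in-vocab word rows, hence the offset
        indices.append(vocab_size + ft_hash(ngram) % bucket)
    return indices


# e.g. subword_indices("night", vocab_size=1000, bucket=2000000)
```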
gensim/models/fasttext.py
Outdated
# don't train on the `word` itself

subword2_indices = []
subword2_indices += get_subwords(word2)
Simply subword2_indices = get_subwords(word2)? Also, there's something wrong with indentation here.
gensim/models/fasttext.py
Outdated
# now go over all words from the (reduced) window, predicting each one in turn
start = max(0, pos - model.window + reduced_window)
for pos2, word2 in enumerate(word_vocabs[start:(pos + model.window + 1 - reduced_window)], start):
    # don't train on the `word` itself
This comment now sits next to a different line than the one it relates to; please take care when changing existing code. Things like these don't take up any extra time, and they make reviewing (and self-reviewing) much easier.
I didn't change the comment here. I think it's correct: "don't train on the word itself" means that word is the target and the rest of the words in the window are the context (taken as input in the word2vec code).
I meant that it is placed before a different line than the one it talks about.
Yes, I got it. Thanks.
gensim/models/fasttext.py
Outdated
subword2_indices += get_subwords(word2)

if pos2 != pos:
    train_sg_pair(model, model.wv.index2word[word.index], subword2_indices, alpha)
Looking at how get_subwords is defined, it looks like subword_indices contain the actual ngram strings, not the indices.
gensim/models/fasttext.py
Outdated
# output_ = std::make_shared<Matrix>(dict_->nwords(), args_->dim);

output_ = np.matrix(len(self.wv.vocab), self.vector_size)
This looks strange. What exactly are we trying to do here?
gensim/models/fasttext.py
Outdated
word2_subwords_indices = []

for indices in word2_indices:
    word2_subwords_indices += get_subwords(model.wv.syn0[indices])  # subwords for each word in window except target word
How is this intended to work? You're passing vector weights to get_subwords?
gensim/models/fasttext.py
Outdated
# probably, we need to put subwords in model.wv.syn0 too

l1 = np_sum(model.wv.syn0[word2_subwords_indices], axis=0)  # 1 x vector_size
I don't think storing ngram weights in syn0 is a good idea, because it makes it much harder to use methods from KeyedVectors. We should have a separate matrix for ngram weights (maybe syn0_all, like in the FastTextKeyedVectors from the wrapper, to be able to reuse that class for methods like most_similar etc.).
Once training is complete, syn0 should be populated with the computed in-vocab word vectors, so that we don't have to recompute word vectors for in-vocab words on every lookup.
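To make the suggested layout concrete, a minimal sketch with toy shapes (the names and the averaging rule follow the discussion above and the fastText reference behaviour; nothing here is code from this PR):

```python
import numpy as np

vocab_size, num_buckets, dim = 1000, 10000, 100   # toy sizes, for illustration only

# Trainable rows: one per in-vocab word, followed by one per hashed ngram bucket.
syn0_all = np.random.uniform(-0.5 / dim, 0.5 / dim,
                             (vocab_size + num_buckets, dim)).astype(np.float32)

# Lookup matrix exposed through KeyedVectors-style methods (most_similar, etc.).
syn0 = np.zeros((vocab_size, dim), dtype=np.float32)


def populate_syn0(syn0, syn0_all, word_to_subword_rows):
    """After training, precompute each in-vocab word's vector as the average of
    its own row and its hashed-ngram rows, so lookups need no recomputation."""
    for word_idx, subword_rows in word_to_subword_rows.items():
        rows = [word_idx] + list(subword_rows)    # word row plus its ngram rows
        syn0[word_idx] = syn0_all[rows].mean(axis=0)
    return syn0
```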
ngram_indices.append(len(self.wv.vocab) + ngram_hash % self.bucket)
self.wv.ngrams[ngram] = i

self.wv.syn0_all = self.wv.syn0_all.take(ngram_indices, axis=0)
This is not going to work if we want to resume training. Extraneous vectors should be discarded only when we are sure no further training is going to take place (probably analogous to init_sims(replace=True) in Word2Vec).
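A small sketch of the deferral being asked for (class and attribute names are hypothetical): the destructive trim happens only in an explicit finalization step, after which resuming training is refused, mirroring init_sims(replace=True):

```python
class TrimAfterTrainingExample:
    """Illustration only: keep the full ngram matrix while training may still
    resume, and shrink it only in an explicit, irreversible finalization step."""

    def __init__(self, syn0_all, used_ngram_indices):
        self.syn0_all = syn0_all                      # full (vocab + bucket) x dim matrix
        self.used_ngram_indices = used_ngram_indices  # rows actually referenced by the vocab
        self.finalized = False

    def finalize(self):
        # Analogous to init_sims(replace=True): saves memory, but training
        # cannot continue afterwards because the unused rows are gone.
        self.syn0_all = self.syn0_all.take(self.used_ngram_indices, axis=0)
        self.finalized = True

    def train(self, sentences):
        if self.finalized:
            raise RuntimeError("cannot resume training after finalize()")
        raise NotImplementedError("training loop omitted from this sketch")
```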
gensim/test/test_fasttext.py
Outdated
""" Test training with subwords for both skipgram and cbow model""" | ||
sentences = LeeCorpus() | ||
self.assertTrue(FastText(sentences, min_count=100, size=100, workers=3)) | ||
self.assertTrue(FastText(sentences, sg=1, min_count=100, size=100, workers=3)) |
Model sanity tests similar to word2vec are required here too to ensure the "correct" models are learnt, as well as a comparison to the original FastText vectors.
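As a rough illustration of what such a sanity test could look like (LeeCorpus and FastText are the names from the quoted test; the word pairs, thresholds, and parameters are illustrative assumptions following the Word2Vec API this class builds on):

```python
import unittest

from gensim.models.fasttext import FastText  # the class under review in this PR


class TestFastTextSanity(unittest.TestCase):

    def model_sanity(self, model):
        # Related word forms should outrank unrelated words, and similarities
        # should not all collapse towards 1.0 (cf. the most_similar output below).
        self.assertGreater(model.similarity('night', 'nights'),
                           model.similarity('night', 'trade'))
        self.assertLess(model.similarity('night', 'trade'), 0.95)
        # A further check could compare these vectors against ones produced by
        # Facebook's fastText binary on the same corpus (not sketched here).

    def test_training_sanity(self):
        sentences = LeeCorpus()  # same corpus iterator as the quoted test
        for sg in (0, 1):        # CBOW and skip-gram
            model = FastText(sentences, sg=sg, size=50, min_count=5, workers=1, seed=42)
            self.model_sanity(model)
```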
"""Even tiny models trained on LeeCorpus should pass these sanity checks""" | ||
# run extra before/after training tests if train=True | ||
""" | ||
if train: |
Is there a reason that there is no test here?
Yes, the training code in the test is commented out because it took a very long time on Travis for pure Python code. I'd look at reducing the number of iterations, or maybe using a smaller dataset.
For context, here is a model trained using this code on the Lee corpus:
model.similarity("woman", "woman") : 1.0
[('night', 0.999998152256012), …
Overall, we're going for identical results with the original, to start with, to ensure implementation correctness (not just "somewhat good") before we start optimizing.
@piskvorky Yes, what I meant was that this PR is very close to getting identical results (perhaps with a little debugging); the difference, IMO, is mainly due to the way we have initialized the n-gram weights.
@jayantj could you look into it and review what needs to be done to get identical results? I think much effort has gone into this so far, and there is no point abandoning this PR. cc @piskvorky
I agree we don't want to abandon this PR; a lot of effort has gone into it. The fact that the results look like this:
model.most_similar("nights") -> [('night', 0.999998152256012),
('lights', 0.9999980926513672),
('night.', 0.9999977350234985),
("night's", 0.999997615814209),
('fighting,"', 0.9999971985816956),
('sanctioned', 0.9999971985816956),
('night,', 0.9999971389770508),
('flights', 0.9999971389770508),
('airliner', 0.9999971389770508),
('Heights', 0.999997079372406)]
In case they do differ significantly, the next steps in my opinion would be to check …
Also keep in mind that most modes of word2vec/fastText use internal randomness. Even with deterministic seeding, if multiple worker threads are in play, the training texts will be consumed in slightly different orders, and thus also be paired with slightly different draws from the PRNG stream. So the standard for comparison against a reference implementation will usually be "very close in observable quality" rather than "identical numerical results".
Never mind multiple workers for now -- let's aim for identical results in the simplest possible mode to start with: single-threaded, same RNG.
While single-thread/same-PRNG could make 'identicality' thinkable, differences in idiomatic constructs and typical data structures (especially dicts/sets) could do things like reorder operations whose results affect each other. These changes would be orthogonal to the core of the algorithm, and orthogonal to final vector quality, but would still change the exact numbers. So again, I would suggest "very close in observable quality" rather than "identical results" as the right target.
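For reference, a sketch of that "simplest possible mode" (the FastText class is the one under review; parameter names follow gensim's Word2Vec API, and the PYTHONHASHSEED note only matters where string hash randomization affects vocab/dict ordering):

```python
import os

from gensim.models.fasttext import FastText  # class under review

# For run-to-run reproducibility, also pin the string hash seed before starting
# Python (e.g. PYTHONHASHSEED=0), so that vocab/dict iteration order is stable.
assert 'PYTHONHASHSEED' in os.environ, "set PYTHONHASHSEED for reproducible runs"

# Tiny toy corpus, standing in for whatever corpus the reference implementation sees.
sentences = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system", "response", "time"]] * 50

model = FastText(
    sentences,
    sg=1,        # skip-gram, matching the reference fastText invocation being compared
    workers=1,   # single thread: no nondeterministic interleaving of training batches
    seed=42,     # seeds the PRNG used for window reduction / negative sampling
    size=100,
    min_count=5,
)
```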
Continued in #1525 |
#1471 Unsupervised version of Facebook's fastText