
[WIP] unsupervised fasttext #1482

Closed · wants to merge 29 commits

Conversation

@prakhar2b (Contributor) commented Jul 13, 2017

#1471
Unsupervised version of Facebook's fastText.

@prakhar2b (Contributor, Author) commented Jul 20, 2017

@piskvorky In the word2vec skip-gram model, the target is the context word, which we predict based on the input word (the exact opposite of the CBOW model). But in the gensim word2vec code, I see that, just as in the CBOW model, the context is taken as the input for skip-gram too, here in word2vec.py.

It doesn't affect performance, but could you please clarify a little, for my own understanding?

Also, I find variable names like train_cbow_pair and word2 very confusing.

@gojomo (Collaborator) commented Jul 20, 2017

Skip-gram & CBOW are not 'exact opposites': one uses single words to predict the target, the other uses averages of multiple words.

There has occasionally been confusion because there are two ways to loop over the text & window in skip-gram: you can loop over each main word, then each window word within that word's window, and make the predictions either (main word -> window word) or vice versa (window word -> main word). The original word2vec paper described it one way, but the first-released word2vec.c code did it the alternate way, and there was a comment from one of the authors that the way the code does it is slightly more cache-efficient. In both cases, the exact same (word -> word) pairs are eventually trained, just in a slightly different iteration order.
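A minimal sketch of the two loop orders (hypothetical helper functions, not gensim code); for a symmetric window, both enumerate exactly the same (input, target) pairs, only in a different order:

def pairs_predict_window_word(words, window):
    # loop over each main word, predict each window word from it
    return [(words[i], words[j])
            for i in range(len(words))
            for j in range(max(0, i - window), min(len(words), i + window + 1))
            if j != i]

def pairs_predict_main_word(words, window):
    # loop over each main word, predict it from each window word
    return [(words[j], words[i])
            for i in range(len(words))
            for j in range(max(0, i - window), min(len(words), i + window + 1))
            if j != i]

words = "the quick brown fox jumps".split()
# the same multiset of (input, target) pairs comes out either way
assert sorted(pairs_predict_window_word(words, 2)) == sorted(pairs_predict_main_word(words, 2))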

The name train_cbow_pair() is somewhat unfortunate, as it isn't really a neat 'pair' but 'window-to-word'. And it'd certainly be better for variables like word2 to have more functionally descriptive names, like target_word or context_word.

@prakhar2b changed the title from "unsupervised fasttext" to "[WIP] unsupervised fasttext" on Jul 21, 2017
import logging

from gensim.models.word2vec import Word2Vec
from gensim.models.ft_keyedvectors import FastTextKeyedVectors
Contributor:
Why not simply reuse FastTextKeyedVectors from the wrapper?

subwords_indices.append(len(self.wv.vocab) + subword_hash % self.bucket) # self ?? classmethod or pass model ... discuss ??
return subwords_indices

def compute_subwords(word, min_n, max_n):
Contributor:
There's something wrong with the indentation. Also, why aren't we reusing compute_ngrams from the wrapper? Similarly for ft_hash.
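For reference, a rough sketch of what the character n-gram extraction looks like (illustrative only, not the exact compute_ngrams from the wrapper): fastText pads the word with boundary symbols and takes all character ngrams of length min_n to max_n.

def compute_ngrams(word, min_n, max_n):
    extended = '<' + word + '>'  # fastText adds boundary symbols
    return [extended[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(extended) - n + 1)]

compute_ngrams('night', 3, 4)
# ['<ni', 'nig', 'igh', 'ght', 'ht>', '<nig', 'nigh', 'ight', 'ght>']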

# don't train on the `word` itself

subword2_indices = []
subword2_indices += get_subwords(word2)
Contributor:
Simply subword2_indices = get_subwords(word2)? Also, there's something wrong with the indentation here.

# now go over all words from the (reduced) window, predicting each one in turn
start = max(0, pos - model.window + reduced_window)
for pos2, word2 in enumerate(word_vocabs[start:(pos + model.window + 1 - reduced_window)], start):
# don't train on the `word` itself
Contributor:
This comment relates to a different line; please take care while changing existing code. Things like these don't take any extra time, and they make reviewing (and self-reviewing) much easier.

@prakhar2b (Author), Jul 26, 2017:

I didn't change the comment here. I think it's correct: "don't train on the word itself" means that word is the target, and the rest of the words in the window are the context (taken as input in the word2vec code).

Contributor:
I meant that it appears before a different line than the one it talks about.

@prakhar2b (Author):
Yes, I got it. Thanks.

subword2_indices += get_subwords(word2)

if pos2 != pos:
train_sg_pair(model, model.wv.index2word[word.index], subword2_indices, alpha)
Contributor:
Looking at how get_subwords is defined, it looks like subword_indices contains the actual ngram strings, not the indices.
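For clarity, a hedged sketch of the mapping the reviewer has in mind (names are illustrative; Python's built-in hash is only a stand-in for the wrapper's ft_hash): the ngram strings still need to be hashed into rows of the weight matrix, offset past the vocab rows.

def subword_indices(ngrams, vocab_size, bucket, ft_hash=hash):
    # each ngram string maps to one of `bucket` rows after the vocab rows
    return [vocab_size + ft_hash(ngram) % bucket for ngram in ngrams]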


# output_ = std::make_shared<Matrix>(dict_->nwords(), args_->dim);

output_ = np.matrix(len(self.wv.vocab), self.vector_size)
Contributor:
This looks strange. What exactly are we trying to do here?

word2_subwords_indices = []

for indices in word2_indices:
word2_subwords_indices += get_subwords(model.wv.syn0[indices]) # subwords for each word in window except target word
Contributor:
How is this intended to work? You're passing vector weights to get_subwords?


# probably, we need to put subwords in model.wv.syn0 too

l1 = np_sum(model.wv.syn0[word2_subwords_indices], axis=0) # 1 x vector_size
Contributor:
I don't think storing ngram weights in syn0 is a good idea, because it makes it much harder to use methods from KeyedVectors. We should have a separate matrix for the ngram weights (maybe syn0_all, like in the FastTextKeyedVectors from the wrapper, so that we can reuse that class for methods like most_similar etc.).
Once training is complete, syn0 should be populated with the computed in-vocab word vectors, so that word vectors for in-vocab words don't have to be recomputed on every lookup.
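A sketch of that layout (assumed attribute names, mirroring the wrapper's FastTextKeyedVectors, and reusing compute_ngrams and a ft_hash stand-in as sketched earlier): ngram rows live in a separate syn0_all, and syn0 is filled in once after training.

import numpy as np

def finalize_word_vectors(wv, min_n, max_n, bucket, ft_hash=hash):
    # fill wv.syn0 once, so in-vocab lookups need no per-query recomputation
    vocab_size = len(wv.vocab)
    for word, v in wv.vocab.items():
        rows = [v.index]  # the word's own input row in syn0_all
        rows += [vocab_size + ft_hash(ng) % bucket
                 for ng in compute_ngrams(word, min_n, max_n)]
        # summing here; whether the reference implementation sums or
        # averages these rows is a detail to verify against the C code
        wv.syn0[v.index] = np.sum(wv.syn0_all[rows], axis=0)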

ngram_indices.append(len(self.wv.vocab) + ngram_hash % self.bucket)
self.wv.ngrams[ngram] = i

self.wv.syn0_all = self.wv.syn0_all.take(ngram_indices, axis=0)
Contributor:
This is not going to work if we want to resume training. Extraneous vectors should be discarded only when we are sure no further training is going to take place (probably analogous to init_sims(replace=True) in Word2Vec).
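A hedged sketch of that idea (hypothetical method and attribute names, analogous in spirit to init_sims(replace=True)): trim the ngram matrix only in an explicit, irreversible finalization step.

def finalize_training(self, used_ngram_rows):
    # keep only the ngram rows actually used; this is irreversible, so
    # any later train() call should refuse to run
    self.wv.syn0_all = self.wv.syn0_all.take(sorted(used_ngram_rows), axis=0)
    # note: the ngram -> row mapping (wv.ngrams) must be rebuilt to match
    self.training_finalized = True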

""" Test training with subwords for both skipgram and cbow model"""
sentences = LeeCorpus()
self.assertTrue(FastText(sentences, min_count=100, size=100, workers=3))
self.assertTrue(FastText(sentences, sg=1, min_count=100, size=100, workers=3))
Contributor:
Model sanity tests similar to word2vec's are required here too, to ensure the "correct" models are learnt, as well as a comparison to the original FastText vectors.
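For example, something along these lines (hypothetical test helper, modeled on the model_sanity checks in gensim's word2vec tests; the word pair is illustrative):

def model_sanity(self, model):
    # a semantically related word should rank highly, not merely words
    # that happen to share character ngrams with the query
    neighbours = [word for word, _ in model.most_similar('war', topn=50)]
    self.assertIn('conflict', neighbours)
    # self-similarity must be ~1.0
    self.assertAlmostEqual(model.similarity('war', 'war'), 1.0, places=6)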

"""Even tiny models trained on LeeCorpus should pass these sanity checks"""
# run extra before/after training tests if train=True
"""
if train:
Contributor:
Is there a reason that there is no test here?

@prakhar2b (Author):

Yes, the training code in the test is commented out because it took a very long time on Travis for pure Python code. I'll look into reducing the number of iterations, or maybe using a smaller dataset.

@prakhar2b (Author) commented Jul 28, 2017

For context: a model trained using this code on the Lee corpus.

model = FastText(sentences, min_count=1, size=50)

Total number of ngrams in the vocab is 87485
training model with 3 workers on 10781 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5

model.similarity("woman", "woman") : 1.0
model.similarity("night", "nights") : 0.99999806698759131
model.doesnt_match("breakfast cereal dinner lunch".split()) : cereal (as expected)

model.most_similar("nights")
(result looks somewhat good with syntactic pov)

[('night', 0.999998152256012),
('lights', 0.9999980926513672),
('night.', 0.9999977350234985),
("night's", 0.999997615814209),
('fighting,"', 0.9999971985816956),
('sanctioned', 0.9999971985816956),
('night,', 0.9999971389770508),
('flights', 0.9999971389770508),
('airliner', 0.9999971389770508),
('Heights', 0.999997079372406)]

cc @jayantj @piskvorky

@prakhar2b closed this on Jul 28, 2017
@piskvorky (Owner) commented Jul 29, 2017

Overall, we're going for identical results with the original, to start with, to ensure implementation correctness (not just "somewhat good") before we start optimizing.

@prakhar2b (Author):

@piskvorky Yes, what I meant was that this PR is very close to getting identical results (perhaps with a little debugging); the difference, IMO, is mainly due to the way we have initialized the n-gram weights.

@prakhar2b reopened this on Jul 29, 2017
@prakhar2b (Author):

@jayantj Could you look into it and review what needs to be done to get identical results? A lot of effort has gone into this so far, and there is no point abandoning this PR.

cc @piskvorky

@jayantj (Contributor) commented Jul 31, 2017

I agree we don't want to abandon this PR; a lot of effort has gone into it.
As I said before, I think the next step in checking implementation correctness is comparing the most_similar results for FastText models trained using the C code vs. Python.

The fact that the results look like this:

model.most_similar("nights")
-> [('night', 0.999998152256012),
('lights', 0.9999980926513672),
('night.', 0.9999977350234985),
("night's", 0.999997615814209),
('fighting,"', 0.9999971985816956),
('sanctioned', 0.9999971985816956),
('night,', 0.9999971389770508),
('flights', 0.9999971389770508),
('airliner', 0.9999971389770508),
('Heights', 0.999997079372406)]
is not very conclusive. The way word vectors are defined in FastText (as the sum of all ngram vectors) means that even without any training (i.e. with randomly initialized ngram vectors), the word vector for nights is likely to be very close to those of words with similar ngrams (night, lights, etc.); a quick sketch demonstrating this follows after this comment. So it is hard to draw any conclusions from the results so far. A more detailed analysis is necessary.

In case they do differ significantly, the next steps in my opinion would be to check:

  1. How the initialization differs in C FastText vs Gensim FastText
  2. Whether the gradients are computed in the same way or not
  3. If there are any subtle differences caused by the other hyperparameters
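To illustrate the point above about untrained ngram vectors (a self-contained toy sketch, not gensim code): with purely random ngram vectors and no training at all, words sharing many ngrams already look similar, because each word vector is just the sum of its ngram vectors.

import numpy as np

rs = np.random.RandomState(0)
ngram_vecs = {}

def ngrams(word, min_n=3, max_n=6):
    w = '<' + word + '>'
    return [w[i:i + n] for n in range(min_n, max_n + 1) for i in range(len(w) - n + 1)]

def word_vec(word, dim=50):
    # word vector = sum of randomly initialized, never-trained ngram vectors
    total = np.zeros(dim)
    for ng in ngrams(word):
        if ng not in ngram_vecs:
            ngram_vecs[ng] = rs.uniform(-0.5, 0.5, dim)
        total += ngram_vecs[ng]
    return total

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(word_vec('nights'), word_vec('night')))      # high: many shared ngrams
print(cosine(word_vec('nights'), word_vec('breakfast')))  # near zero: no shared ngrams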

@gojomo (Collaborator) commented Aug 3, 2017

Also keep in mind that most modes of word2vec/fastText use internal randomness. Even with deterministic seeding, if multiple worker threads are in play, the training texts will be consumed in slightly different orders, and thus also be paired with slightly different draws from the PRNG stream. So the standard for comparison against a reference implementation will usually be "very close in observable quality" rather than "identical numerical results".

@piskvorky (Owner) commented Aug 4, 2017

Never mind multiple workers for now -- let's aim for identical results in the simplest possible mode to start with: single-threaded, same RNG.
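A plausible setup for that comparison (illustrative parameters only; assumes the new class inherits Word2Vec's seeding/threading options, and the module path is that of this PR):

from gensim.models.word2vec import LineSentence
from gensim.models.fasttext import FastText  # module path assumed for this PR

sentences = LineSentence('lee_corpus.txt')  # same corpus, same order, for both runs
model = FastText(sentences, size=50, window=5, min_count=5,
                 workers=1,  # single worker thread, as suggested
                 seed=42)    # fixed RNG seed, matching the reference run
# also pin PYTHONHASHSEED in the environment, since Python randomizes
# string hashing per process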

@gojomo (Collaborator) commented Aug 4, 2017

While single-thread/same-PRNG could make 'identicality' thinkable, differences in idiomatic constructs and typical data structures (especially dicts/sets) could do things like reorder operations whose results affect each other. These changes would be orthogonal to the core of the algorithm, and orthogonal to final vector quality, but would still change the exact numbers. So again, I would suggest "very close in observable quality", rather than "identical results", as the right target.

@menshikh-iv (Contributor):
Continued in #1525
