
[WIP] unsupervised fasttext #1482

Closed · wants to merge 29 commits

Conversation

@prakhar2b (Contributor) commented Jul 13, 2017

#1471
Unsupervised version of Facebook's fastText.

@prakhar2b (Contributor, Author) commented Jul 20, 2017

@piskvorky In the word2vec skip-gram model, the target is the context word, which we predict based on the input word (the exact opposite of the CBOW model). But in the gensim word2vec code, I see that, just as in the CBOW model, the context is taken as the input for skip-gram too, here in word2vec.py.

It doesn't affect performance, but could you please clarify a little, for my own understanding?

Also, I find variable names like train_cbow_pair and word2 very confusing.

@gojomo (Collaborator) commented Jul 20, 2017

Skip-gram & CBOW are not 'exact opposites': one uses single words to predict the target, the other uses averages of multiple words.

There has occasionally been confusion because there are two ways to loop over the text & window in skip-gram: you can loop over each main word, then each window word within that word's window, and make the predictions either (main word -> window word) or vice versa (window word -> main word). The original word2vec paper described it one way, but the first-released word2vec.c code did it the alternate way, and there was a comment from one of the authors that the way the code does it is slightly more cache-efficient. In both cases, the exact same (word -> word) pairs are eventually trained, just in a slightly different iteration order.
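A minimal sketch of the two loop orders (hypothetical helper functions, not gensim code); for a symmetric window, both enumerate exactly the same (input, target) pairs, only in a different order:

def pairs_predict_window_word(words, window):
    # loop over each main word, predict each window word from it
    return [(words[i], words[j])
            for i in range(len(words))
            for j in range(max(0, i - window), min(len(words), i + window + 1))
            if j != i]

def pairs_predict_main_word(words, window):
    # loop over each main word, predict it from each window word
    return [(words[j], words[i])
            for i in range(len(words))
            for j in range(max(0, i - window), min(len(words), i + window + 1))
            if j != i]

words = "the quick brown fox jumps".split()
# the same multiset of (input, target) pairs comes out either way
assert sorted(pairs_predict_window_word(words, 2)) == sorted(pairs_predict_main_word(words, 2))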

The name train_cbow_pair() is somewhat unfortunate, as it isn't really a neat 'pair' but 'window-to-word'. And it'd certainly be better for variables like word2 to have more functionally descriptive names, like target_word or context_word.

@prakhar2b changed the title from "unsupervised fasttext" to "[WIP] unsupervised fasttext" on Jul 21, 2017
import logging

from gensim.models.word2vec import Word2Vec
from gensim.models.ft_keyedvectors import FastTextKeyedVectors
Contributor:
Why not simply reuse FastTextKeyedVectors from the wrapper?

subwords_indices.append(len(self.wv.vocab) + subword_hash % self.bucket) # self ?? classmethod or pass model ... discuss ??
return subwords_indices

def compute_subwords(word, min_n, max_n):
Contributor:
There's something wrong with the indentation. Also, why aren't we reusing compute_ngrams from the wrapper? Similarly for ft_hash.
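For reference, a rough sketch of what the character n-gram extraction looks like (illustrative only, not the exact compute_ngrams from the wrapper): fastText pads the word with boundary symbols and takes all character ngrams of length min_n to max_n.

def compute_ngrams(word, min_n, max_n):
    extended = '<' + word + '>'  # fastText adds boundary symbols
    return [extended[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(extended) - n + 1)]

compute_ngrams('night', 3, 4)
# ['<ni', 'nig', 'igh', 'ght', 'ht>', '<nig', 'nigh', 'ight', 'ght>']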

# don't train on the `word` itself

subword2_indices = []
subword2_indices += get_subwords(word2)
Contributor:
Simply subword2_indices = get_subwords(word2)? Also, there's something wrong with the indentation here.

# now go over all words from the (reduced) window, predicting each one in turn
start = max(0, pos - model.window + reduced_window)
for pos2, word2 in enumerate(word_vocabs[start:(pos + model.window + 1 - reduced_window)], start):
# don't train on the `word` itself
Contributor:
This comment relates to a different line; please take care while changing existing code. Things like these don't take any extra time, and they make reviewing (and self-reviewing) much easier.

@prakhar2b (Author), Jul 26, 2017:

I didn't change the comment here. I think it's correct: "don't train on the word itself" means that word is the target, and the rest of the words in the window are the context (taken as input in the word2vec code).

Contributor:
I meant that it appears before a different line than the one it talks about.

@prakhar2b (Author):
Yes, I got it. Thanks.

subword2_indices += get_subwords(word2)

if pos2 != pos:
train_sg_pair(model, model.wv.index2word[word.index], subword2_indices, alpha)
Contributor:
Looking at how get_subwords is defined, it looks like subword_indices contains the actual ngram strings, not the indices.
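For clarity, a hedged sketch of the mapping the reviewer has in mind (names are illustrative; Python's built-in hash is only a stand-in for the wrapper's ft_hash): the ngram strings still need to be hashed into rows of the weight matrix, offset past the vocab rows.

def subword_indices(ngrams, vocab_size, bucket, ft_hash=hash):
    # each ngram string maps to one of `bucket` rows after the vocab rows
    return [vocab_size + ft_hash(ngram) % bucket for ngram in ngrams]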


# output_ = std::make_shared<Matrix>(dict_->nwords(), args_->dim);

output_ = np.matrix(len(self.wv.vocab), self.vector_size)
Contributor:
This looks strange. What exactly are we trying to do here?

word2_subwords_indices = []

for indices in word2_indices:
word2_subwords_indices += get_subwords(model.wv.syn0[indices]) # subwords for each word in window except target word
Contributor:
How is this intended to work? You're passing vector weights to get_subwords?


# probably, we need to put subwords in model.wv.syn0 too

l1 = np_sum(model.wv.syn0[word2_subwords_indices], axis=0) # 1 x vector_size
Contributor:
I don't think storing ngram weights in syn0 is a good idea, because it makes it much harder to use methods from KeyedVectors. We should have a separate matrix for the ngram weights (maybe syn0_all, like in the FastTextKeyedVectors from the wrapper, so that we can reuse that class for methods like most_similar etc.).
Once training is complete, syn0 should be populated with the computed in-vocab word vectors, so that word vectors for in-vocab words don't have to be recomputed on every lookup.
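A sketch of that layout (assumed attribute names, mirroring the wrapper's FastTextKeyedVectors, and reusing compute_ngrams and a ft_hash stand-in as sketched earlier): ngram rows live in a separate syn0_all, and syn0 is filled in once after training.

import numpy as np

def finalize_word_vectors(wv, min_n, max_n, bucket, ft_hash=hash):
    # fill wv.syn0 once, so in-vocab lookups need no per-query recomputation
    vocab_size = len(wv.vocab)
    for word, v in wv.vocab.items():
        rows = [v.index]  # the word's own input row in syn0_all
        rows += [vocab_size + ft_hash(ng) % bucket
                 for ng in compute_ngrams(word, min_n, max_n)]
        # summing here; whether the reference implementation sums or
        # averages these rows is a detail to verify against the C code
        wv.syn0[v.index] = np.sum(wv.syn0_all[rows], axis=0)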

ngram_indices.append(len(self.wv.vocab) + ngram_hash % self.bucket)
self.wv.ngrams[ngram] = i

self.wv.syn0_all = self.wv.syn0_all.take(ngram_indices, axis=0)
Contributor:
This is not going to work if we want to resume training. Extraneous vectors should be discarded only when we are sure no further training is going to take place (probably analogous to init_sims(replace=True) in Word2Vec).
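A hedged sketch of that idea (hypothetical method and attribute names, analogous in spirit to init_sims(replace=True)): trim the ngram matrix only in an explicit, irreversible finalization step.

def finalize_training(self, used_ngram_rows):
    # keep only the ngram rows actually used; this is irreversible, so
    # any later train() call should refuse to run
    self.wv.syn0_all = self.wv.syn0_all.take(sorted(used_ngram_rows), axis=0)
    # note: the ngram -> row mapping (wv.ngrams) must be rebuilt to match
    self.training_finalized = True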

""" Test training with subwords for both skipgram and cbow model"""
sentences = LeeCorpus()
self.assertTrue(FastText(sentences, min_count=100, size=100, workers=3))
self.assertTrue(FastText(sentences, sg=1, min_count=100, size=100, workers=3))
Contributor:
Model sanity tests similar to word2vec's are required here too, to ensure the "correct" models are learnt, as well as a comparison to the original FastText vectors.
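For example, something along these lines (hypothetical test helper, modeled on the model_sanity checks in gensim's word2vec tests; the word pair is illustrative):

def model_sanity(self, model):
    # a semantically related word should rank highly, not merely words
    # that happen to share character ngrams with the query
    neighbours = [word for word, _ in model.most_similar('war', topn=50)]
    self.assertIn('conflict', neighbours)
    # self-similarity must be ~1.0
    self.assertAlmostEqual(model.similarity('war', 'war'), 1.0, places=6)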

"""Even tiny models trained on LeeCorpus should pass these sanity checks"""
# run extra before/after training tests if train=True
"""
if train:
Contributor:
Is there a reason that there is no test here?

@prakhar2b (Author):

Yes, the training code in the test is commented out because it took a very long time on Travis for pure Python code. I'll look into reducing the number of iterations, or maybe using a smaller dataset.

@prakhar2b (Author) commented Jul 28, 2017

For context: a model trained using this code on the Lee corpus.

model = FastText(sentences, min_count=1, size=50)

Total number of ngrams in the vocab is 87485
training model with 3 workers on 10781 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5

model.similarity("woman", "woman") : 1.0
model.similarity("night", "nights") : 0.99999806698759131
model.doesnt_match("breakfast cereal dinner lunch".split()) : cereal (as expected)

model.most_similar("nights")
(result looks somewhat good with syntactic pov)

[('night', 0.999998152256012),
('lights', 0.9999980926513672),
('night.', 0.9999977350234985),
("night's", 0.999997615814209),
('fighting,"', 0.9999971985816956),
('sanctioned', 0.9999971985816956),
('night,', 0.9999971389770508),
('flights', 0.9999971389770508),
('airliner', 0.9999971389770508),
('Heights', 0.999997079372406)]

cc @jayantj @piskvorky

@prakhar2b closed this on Jul 28, 2017
@piskvorky (Owner) commented Jul 29, 2017

Overall, we're going for identical results with the original, to start with, to ensure implementation correctness (not just "somewhat good") before we start optimizing.

@prakhar2b (Author):

@piskvorky Yes, what I meant was that this PR is very close to getting identical results (perhaps with a little debugging); the difference, IMO, is mainly due to the way we have initialized the n-gram weights.

@prakhar2b reopened this on Jul 29, 2017
@prakhar2b (Author):

@jayantj Could you look into it and review what needs to be done to get identical results? A lot of effort has gone into this so far, and there is no point abandoning this PR.

cc @piskvorky

@jayantj (Contributor) commented Jul 31, 2017

I agree we don't want to abandon this PR; a lot of effort has gone into it.
As I said before, I think the next step in checking implementation correctness is comparing the most_similar results for FastText models trained using the C code vs. Python.

The fact that the results look like this:

model.most_similar("nights")
-> [('night', 0.999998152256012),
('lights', 0.9999980926513672),
('night.', 0.9999977350234985),
("night's", 0.999997615814209),
('fighting,"', 0.9999971985816956),
('sanctioned', 0.9999971985816956),
('night,', 0.9999971389770508),
('flights', 0.9999971389770508),
('airliner', 0.9999971389770508),
('Heights', 0.999997079372406)]
is not very conclusive. The way word vectors are defined in FastText (as the sum of all ngram vectors) means that even without any training (i.e. with randomly initialized ngram vectors), the word vector for nights is likely to be very close to those of words with similar ngrams (night, lights, etc.); a quick sketch demonstrating this follows after this comment. So it is hard to draw any conclusions from the results so far. A more detailed analysis is necessary.

In case they do differ significantly, the next steps in my opinion would be to check:

  1. How the initialization differs in C FastText vs Gensim FastText
  2. Whether the gradients are computed in the same way or not
  3. If there are any subtle differences caused by the other hyperparameters
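To illustrate the point above about untrained ngram vectors (a self-contained toy sketch, not gensim code): with purely random ngram vectors and no training at all, words sharing many ngrams already look similar, because each word vector is just the sum of its ngram vectors.

import numpy as np

rs = np.random.RandomState(0)
ngram_vecs = {}

def ngrams(word, min_n=3, max_n=6):
    w = '<' + word + '>'
    return [w[i:i + n] for n in range(min_n, max_n + 1) for i in range(len(w) - n + 1)]

def word_vec(word, dim=50):
    # word vector = sum of randomly initialized, never-trained ngram vectors
    total = np.zeros(dim)
    for ng in ngrams(word):
        if ng not in ngram_vecs:
            ngram_vecs[ng] = rs.uniform(-0.5, 0.5, dim)
        total += ngram_vecs[ng]
    return total

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(word_vec('nights'), word_vec('night')))      # high: many shared ngrams
print(cosine(word_vec('nights'), word_vec('breakfast')))  # near zero: no shared ngrams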

@gojomo (Collaborator) commented Aug 3, 2017

Also keep in mind that most modes of word2vec/fastText use internal randomness. Even with deterministic seeding, if multiple worker threads are in play, the training texts will be consumed in slightly different orders, and thus also be paired with slightly different draws from the PRNG stream. So the standard for comparison against a reference implementation will usually be "very close in observable quality" rather than "identical numerical results".

@piskvorky (Owner) commented Aug 4, 2017

Never mind multiple workers for now -- let's aim for identical results in the simplest possible mode to start with: single-threaded, same RNG.
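A plausible setup for that comparison (illustrative parameters only; assumes the new class inherits Word2Vec's seeding/threading options, and the module path is that of this PR):

from gensim.models.word2vec import LineSentence
from gensim.models.fasttext import FastText  # module path assumed for this PR

sentences = LineSentence('lee_corpus.txt')  # same corpus, same order, for both runs
model = FastText(sentences, size=50, window=5, min_count=5,
                 workers=1,  # single worker thread, as suggested
                 seed=42)    # fixed RNG seed, matching the reference run
# also pin PYTHONHASHSEED in the environment, since Python randomizes
# string hashing per process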

@gojomo (Collaborator) commented Aug 4, 2017

While single-thread/same-PRNG could make 'identicality' thinkable, differences in idiomatic constructs and typical data structures (especially dicts/sets) could do things like reorder operations whose results affect each other. These changes would be orthogonal to the core of the algorithm, and orthogonal to final vector quality, but would still change the exact numbers. So again, I would suggest "very close in observable quality", rather than "identical results", as the right target.

@menshikh-iv (Contributor):
Continued in #1525
