Poor reproducibility of out-of-vocab word vectors after loading native model #2315

mpenkov · 2019-01-03T07:55:06Z

See this unit test for reproduction:

https://github.com/mpenkov/gensim/blob/0d30caeb8c6d165d63c050de4bf32a0eab241d48/gensim/test/test_fasttext.py#L891

The test passes only if the tolerance is very high (0.1). For lower tolerance values (e.g. 0.01 and below), the test fails.
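To make the tolerance point concrete, here is a minimal sketch of an absolute-tolerance comparison. The vectors are made-up illustrative numbers, not values from the linked test, and `within_tolerance` is a hypothetical helper rather than whatever the test itself calls.

```python
# Illustrative only: the numbers below are invented, not taken from the real test.

def max_abs_diff(expected, actual):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(e - a) for e, a in zip(expected, actual))

def within_tolerance(expected, actual, atol):
    """True if every component differs by at most atol."""
    return max_abs_diff(expected, actual) <= atol

expected = [0.120, -0.305, 0.410]  # hypothetical reference vector
actual = [0.150, -0.260, 0.395]    # hypothetical vector from the loaded model

print(within_tolerance(expected, actual, atol=0.1))   # True
print(within_tolerance(expected, actual, atol=0.01))  # False
```

A check that passes only at 0.1 means the vectors agree to roughly one decimal place, which is far looser than floating-point noise would explain.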

menshikh-iv · 2019-01-04T09:26:57Z

Partial fix - 07f34e2

gojomo · 2019-01-06T18:26:25Z

I've previously wondered if our code is in fact calculating FastText vectors properly in accordance with the original implementation. To copy my comments from another chat:

Do we have a test in place verifying that when we use gensim FastText.load_fasttext_format() on an outside FastText file and then ask for an in-vocabulary word, we get identical results? Because when I compare FastTextKeyedVectors.word_vec() and the original native FastText::getWordVector(), they appear to use different calculations before returning a word's vector. Gensim's word_vec only reports the full-word vector, if present: https://github.com/RaRe-Technologies/gensim/blob/2ccc82bf50bcfbee44932c160db076a873cf893e/gensim/models/keyedvectors.py#L1991 – whereas FastText always sums the vector from all subwords (which also include the full word): https://github.com/facebookresearch/fastText/blob/501b9b1e4543fd2de55e4a621a9924ce7d2b5b17/src/fasttext.cc#L66

The gensim line I've linked to definitively exits if there's a full-word vector; the only way I can currently imagine that giving the same result as the FT code is if gensim's full-word vectors are already tallied from all subwords, which seems to me a risky/unnecessary deviation-in-approach.

Meanwhile the FT routine I've linked only looks at subwords (which if I understand other code elsewhere, include the full-word with special beginning-of-word, end-of-word bumpers) - there's never even a lookup, much less an early exit, of the full-word alone.
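The difference between the two lookup strategies described above can be sketched with a toy model. Everything here is hypothetical: the tables, the helper names, and the n-gram range are invented for illustration and are not gensim's or fastText's real internals. Averaging is used to combine vectors purely as a simplifying assumption.

```python
# Toy sketch: hypothetical tables and helpers, not the real library internals.

def char_ngrams(word, min_n=3, max_n=4):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def full_word_first(word, word_vecs, ngram_vecs):
    """Return the stored full-word vector if present; else average known n-grams."""
    if word in word_vecs:
        return word_vecs[word]  # early exit: subwords are never consulted
    known = [ngram_vecs[g] for g in char_ngrams(word) if g in ngram_vecs]
    return mean(known)

def always_subwords(word, word_vecs, ngram_vecs):
    """Average the full-word vector (when present) with all known n-gram vectors."""
    parts = [word_vecs[word]] if word in word_vecs else []
    parts += [ngram_vecs[g] for g in char_ngrams(word) if g in ngram_vecs]
    return mean(parts)

word_vecs = {"cat": [1.0, 0.0]}
ngram_vecs = {"<ca": [0.0, 1.0], "cat": [0.0, 1.0], "at>": [0.0, 1.0]}

print(full_word_first("cat", word_vecs, ngram_vecs))   # [1.0, 0.0]
print(always_subwords("cat", word_vecs, ngram_vecs))   # [0.25, 0.75]
```

In this toy setup the two strategies disagree for the in-vocabulary word "cat" but agree for an out-of-vocabulary word such as "cab", where both fall back to the known n-grams. They would only agree for in-vocabulary words if the stored full-word vector had already been combined with its subword vectors.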

menshikh-iv · 2019-01-07T04:04:10Z

Thanks @gojomo

> Do we have a test in place that when we use gensim FastText.load_fasttext_format() on an outside FastText file, then ask for an in-vocabulary word, we get identical results?

Yes, @mpenkov is already working on it; see #2313.

gojomo · 2019-01-07T05:11:49Z

From the descriptions of #2313/#2160, they seem focused on other related functionality, so it's not clear they'd necessarily include tests that verify identical-word-vectors from a loaded Facebook-FT-trained model. Though of course, it'd be great to have such tests, because it's unclear gensim really supports FT unless it matches Facebook's library's output from loaded models.

mpenkov mentioned this issue Jan 3, 2019

Fix critical issues in FastText #2313

Merged

mpenkov self-assigned this Jan 3, 2019

mpenkov added the fasttext Issues related to the FastText model label Jan 3, 2019

mpenkov changed the title ~~Poor reproducibility of out-of-vocab word vectors when native model~~ Poor reproducibility of out-of-vocab word vectors after loading native model Jan 3, 2019

menshikh-iv closed this as completed in #2313 Jan 11, 2019
