Poor reproducibility of out-of-vocab word vectors after loading native model #2315

Closed
mpenkov opened this issue Jan 3, 2019 · 4 comments · Fixed by #2313

@mpenkov
Collaborator

mpenkov commented Jan 3, 2019

See this unit test for reproduction:

https://github.com/mpenkov/gensim/blob/0d30caeb8c6d165d63c050de4bf32a0eab241d48/gensim/test/test_fasttext.py#L891

The test passes only if the tolerance is very high (0.1). For lower tolerance values (e.g. 0.01 and below), the test fails.
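To make the tolerance figures concrete, here is a minimal sketch with made-up numbers, not the actual vectors from the linked test. It assumes the comparison is an absolute-tolerance check in the style of `numpy.allclose`; the arrays and the name `max_abs_diff` are hypothetical.

```python
import numpy as np

# Hypothetical vectors: `expected` stands in for the output of the native
# fastText tool, `actual` for what gensim returns after loading the model.
expected = np.array([0.120, -0.340, 0.560, 0.010])
actual = np.array([0.150, -0.310, 0.520, 0.060])

# Largest per-component deviation between the two vectors.
max_abs_diff = float(np.max(np.abs(expected - actual)))

# rtol=0 so that only the absolute tolerance decides the outcome.
passes_loose = np.allclose(expected, actual, rtol=0.0, atol=0.1)
passes_tight = np.allclose(expected, actual, rtol=0.0, atol=0.01)

print(max_abs_diff, passes_loose, passes_tight)
```

With deviations of this size the check passes at a tolerance of 0.1 and fails at 0.01, which is the pattern described above.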

@mpenkov mpenkov self-assigned this Jan 3, 2019
@mpenkov mpenkov added the fasttext Issues related to the FastText model label Jan 3, 2019
@mpenkov mpenkov changed the title from "Poor reproducibility of out-of-vocab word vectors when native model" to "Poor reproducibility of out-of-vocab word vectors after loading native model" Jan 3, 2019
@menshikh-iv
Contributor

Partial fix - 07f34e2

@gojomo
Collaborator

gojomo commented Jan 6, 2019

I've previously wondered if our code is in fact calculating FastText vectors properly in accordance with the original implementation. To copy my comments from another chat:

Do we have a test in place verifying that when we use gensim's FastText.load_fasttext_format() on an outside FastText file and then ask for an in-vocabulary word, we get identical results? When I compare FastTextKeyedVectors.word_vec() with the original native FastText::getWordVector(), they appear to use different calculations before returning a word's vector. Gensim's word_vec only reports the full-word vector, if present: https://github.com/RaRe-Technologies/gensim/blob/2ccc82bf50bcfbee44932c160db076a873cf893e/gensim/models/keyedvectors.py#L1991. FastText, by contrast, always sums the vectors of all subwords (which also include the full word): https://github.com/facebookresearch/fastText/blob/501b9b1e4543fd2de55e4a621a9924ce7d2b5b17/src/fasttext.cc#L66

The gensim line I've linked to definitively exits if there's a full-word vector. The only way I can currently imagine that giving the same result as the FT code is if gensim's full-word vectors are already tallied from all subwords, which seems to me a risky and unnecessary deviation in approach.

Meanwhile the FT routine I've linked only looks at subwords (which if I understand other code elsewhere, include the full-word with special beginning-of-word, end-of-word bumpers) - there's never even a lookup, much less an early exit, of the full-word alone.
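
The difference between the two lookup strategies can be sketched with toy data. This is an illustration only, not gensim's or fastText's actual code: the helper names (`char_ngrams`, `subwords_always`, `full_word_first`), the dictionary-based storage, the n-gram lengths, and the choice to average rather than only sum the subword vectors are all assumptions made for the example.

```python
import numpy as np


def char_ngrams(word, min_n=3, max_n=4):
    """Character n-grams of the word wrapped in '<' and '>' bumpers,
    plus the wrapped full word itself as one more entry."""
    wrapped = "<" + word + ">"
    grams = [wrapped[i:i + n]
             for n in range(min_n, max_n + 1)
             for i in range(len(wrapped) - n + 1)]
    if wrapped not in grams:
        grams.append(wrapped)
    return grams


def subwords_always(word, ngram_vectors):
    """Strategy B: combine every known subword vector; never look up the word alone."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        raise KeyError(word)
    return np.mean(known, axis=0)


def full_word_first(word, word_vectors, ngram_vectors):
    """Strategy A: exit early on a stored full-word vector, else fall back to subwords."""
    if word in word_vectors:
        return word_vectors[word]
    return subwords_always(word, ngram_vectors)


# Toy data: entry i gets the constant vector [i, i, i].
ngram_vectors = {g: np.full(3, float(i)) for i, g in enumerate(char_ngrams("cat"))}

from_subwords = subwords_always("cat", ngram_vectors)

# Case 1: the stored full-word vector is the raw "<cat>" entry.
raw_store = {"cat": ngram_vectors["<cat>"]}
from_raw = full_word_first("cat", raw_store, ngram_vectors)

# Case 2: the stored full-word vector was precomputed from all subwords.
precomputed_store = {"cat": from_subwords.copy()}
from_precomputed = full_word_first("cat", precomputed_store, ngram_vectors)

print(np.allclose(from_raw, from_subwords))          # strategies disagree
print(np.allclose(from_precomputed, from_subwords))  # strategies agree
```

In this toy setup the early-exit strategy matches the subword-based one only when the stored full-word vector has already been computed from all subwords, which is the condition raised in the comment above.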

@menshikh-iv
Contributor

Thanks @gojomo

> Do we have a test in place that when we use gensim FastText.load_fasttext_format() on an outside FastText file, then ask for an in-vocabulary word, we get identical results?

Yes, @mpenkov is already working on it; see #2313.

@gojomo
Collaborator

gojomo commented Jan 7, 2019

From the descriptions of #2313/#2160, they seem focused on other related functionality, so it's not clear they'd necessarily include tests that verify identical word vectors from a loaded Facebook-FT-trained model. Of course, it'd be great to have such tests, because it's unclear whether gensim really supports FT unless its output from loaded models matches that of Facebook's library.
