
Roll back optimization that trims empty buckets #2329

Closed
mpenkov opened this issue Jan 11, 2019 · 0 comments
Assignees: mpenkov
Labels: bug (Issue described a bug), difficulty medium (Medium issue: required good gensim understanding & python skills), fasttext (Issues related to the FastText model)

Comments


mpenkov commented Jan 11, 2019

Our implementation diverges from Facebook's because we trim buckets that have no ngrams assigned to them, saving memory and training time. We should roll back this optimization because it introduces divergent behavior and makes the code more complex.

See #2313 (comment) and #2313 (comment) for more details.
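For context, FastText addresses ngrams by hashing them into a fixed-size bucket table, so trimming buckets changes which rows exist. A minimal sketch of that style of lookup, using the standard 32-bit FNV-1a constants (Facebook's hash follows this scheme; exact byte handling may differ slightly, and `num_buckets` here is illustrative):

```python
def fnv1a_hash(ngram: str) -> int:
    """32-bit FNV-1a hash over the ngram's UTF-8 bytes (a sketch of the
    FastText-style ngram hash)."""
    h = 2166136261  # FNV-1a 32-bit offset basis
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) % 2**32  # xor, then multiply by FNV prime
    return h


def bucket_for(ngram: str, num_buckets: int = 2_000_000) -> int:
    """Map an ngram to one of num_buckets rows of the ngram matrix."""
    return fnv1a_hash(ngram) % num_buckets
```

With the full bucket table allocated, every hash addresses a valid row; trimming empty buckets breaks that invariant.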

@mpenkov mpenkov self-assigned this Jan 11, 2019
@menshikh-iv menshikh-iv added bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills fasttext Issues related to the FastText model labels Jan 11, 2019
mpenkov added a commit that referenced this issue Mar 7, 2019
This optimization reduced the number of ngram buckets to include only ngrams that we have seen during training.

This seemed like a good idea at the time because it saved CPU cycles and RAM, but it turned out to be a bad one: it introduced behavior that diverges from the reference implementation. For example:

We were unable to calculate vectors for terms that were entirely out of the vocab (i.e. the term and all of its ngrams were unseen during training). This is bad because the original FB implementation always returns a vector in this case. That vector may seem useless because the underlying bucket weights are randomly initialized, but that's not entirely true: the weights are random only at initialization time. By the time we query an ngram's vector, the weights are fixed, so the same ngrams always produce the same deterministic vector, which is useful.
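To illustrate why such a vector is still meaningful, here is a hedged sketch of how an out-of-vocab vector can be assembled from ngram buckets (function and parameter names are illustrative, not gensim's actual API):

```python
import numpy as np


def char_ngrams(word, minn=3, maxn=6):
    """Extract character ngrams FastText-style: the word is wrapped in
    '<'/'>' boundary markers before slicing."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def oov_vector(word, ngram_vectors, hash_fn, num_buckets):
    """Sketch: an out-of-vocab word's vector is the mean of its ngram
    bucket vectors.  Because the full bucket table is allocated up front,
    the result is deterministic even if no ngram was seen in training."""
    rows = [ngram_vectors[hash_fn(g) % num_buckets] for g in char_ngrams(word)]
    return np.mean(rows, axis=0)
```

Querying the same word twice yields the same vector, because the bucket weights, however they were initialized, no longer change at query time.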

Another problem is that the optimization complicated the implementation: it required an additional layer of indirection that mapped ngram hashes to bucket indices. Without the optimization, this mapping is essentially the identity function: hash N always maps to the Nth bucket.
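The two addressing schemes can be contrasted in a short sketch (names are illustrative):

```python
def bucket_row_identity(ngram_hash: int, num_buckets: int) -> int:
    """Without the optimization: hash N addresses row N directly."""
    return ngram_hash % num_buckets


def bucket_row_compacted(ngram_hash, num_buckets, hash_to_row):
    """With the optimization: an extra dict maps hashes of *seen* ngrams
    to rows of a trimmed matrix; unseen ngrams have no row at all."""
    return hash_to_row.get(ngram_hash % num_buckets)  # None for unseen ngrams
```

The compacted scheme needs the `hash_to_row` dict to be built, stored, and consulted on every lookup, and it has no answer for unseen ngrams, which is exactly the divergence described above.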

This pull request removes the optimization, resolving the problems that it introduced.

Fixes #2329