
Roll back optimization that trims empty buckets #2329

Closed
mpenkov opened this issue Jan 11, 2019 · 0 comments
Assignees: mpenkov
Labels: bug (Issue described a bug), difficulty medium (Medium issue: required good gensim understanding & python skills), fasttext (Issues related to the FastText model)

Comments


mpenkov commented Jan 11, 2019

Our implementation diverges from Facebook's because we trim buckets that have no ngrams assigned to them, saving memory and training time. We should roll back this optimization because it introduces divergent behavior and makes the code more complex.

See #2313 (comment) and #2313 (comment) for more details.
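For context, FastText addresses ngrams by hashing them into a fixed-size bucket table, so trimming buckets changes which rows exist. A minimal sketch of that style of lookup, using the standard 32-bit FNV-1a constants (Facebook's hash follows this scheme; exact byte handling may differ slightly, and `num_buckets` here is illustrative):

```python
def fnv1a_hash(ngram: str) -> int:
    """32-bit FNV-1a hash over the ngram's UTF-8 bytes (a sketch of the
    FastText-style ngram hash)."""
    h = 2166136261  # FNV-1a 32-bit offset basis
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) % 2**32  # xor, then multiply by FNV prime
    return h


def bucket_for(ngram: str, num_buckets: int = 2_000_000) -> int:
    """Map an ngram to one of num_buckets rows of the ngram matrix."""
    return fnv1a_hash(ngram) % num_buckets
```

With the full bucket table allocated, every hash addresses a valid row; trimming empty buckets breaks that invariant.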

@mpenkov mpenkov self-assigned this Jan 11, 2019
@menshikh-iv menshikh-iv added bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills fasttext Issues related to the FastText model labels Jan 11, 2019
mpenkov added a commit that referenced this issue Mar 7, 2019
This optimization reduced the number of ngram buckets to include only ngrams that we have seen during training.

This seemed like a good idea at the time because it saved CPU cycles and RAM, but it turned out to be a bad one: it introduced behavior that diverges from the reference implementation. For example:

We were unable to calculate vectors for terms that were entirely out of the vocab (i.e. the term and all of its ngrams were unseen during training). This is bad because the original FB implementation always returns a vector in this case. That vector may seem useless because the underlying bucket weights are randomly initialized, but that's not entirely true: the weights are random only at initialization time. By the time we query an ngram's vector, the weights are fixed, so the same ngrams always produce the same deterministic vector, which is useful.
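To illustrate why such a vector is still meaningful, here is a hedged sketch of how an out-of-vocab vector can be assembled from ngram buckets (function and parameter names are illustrative, not gensim's actual API):

```python
import numpy as np


def char_ngrams(word, minn=3, maxn=6):
    """Extract character ngrams FastText-style: the word is wrapped in
    '<'/'>' boundary markers before slicing."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def oov_vector(word, ngram_vectors, hash_fn, num_buckets):
    """Sketch: an out-of-vocab word's vector is the mean of its ngram
    bucket vectors.  Because the full bucket table is allocated up front,
    the result is deterministic even if no ngram was seen in training."""
    rows = [ngram_vectors[hash_fn(g) % num_buckets] for g in char_ngrams(word)]
    return np.mean(rows, axis=0)
```

Querying the same word twice yields the same vector, because the bucket weights, however they were initialized, no longer change at query time.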

Another problem is that the optimization complicated the implementation: it required an additional layer of indirection that mapped ngram hashes to bucket indices. Without the optimization, this mapping is essentially the identity function: hash N always maps to the Nth bucket.
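The two addressing schemes can be contrasted in a short sketch (names are illustrative):

```python
def bucket_row_identity(ngram_hash: int, num_buckets: int) -> int:
    """Without the optimization: hash N addresses row N directly."""
    return ngram_hash % num_buckets


def bucket_row_compacted(ngram_hash, num_buckets, hash_to_row):
    """With the optimization: an extra dict maps hashes of *seen* ngrams
    to rows of a trimmed matrix; unseen ngrams have no row at all."""
    return hash_to_row.get(ngram_hash % num_buckets)  # None for unseen ngrams
```

The compacted scheme needs the `hash_to_row` dict to be built, stored, and consulted on every lookup, and it has no answer for unseen ngrams, which is exactly the divergence described above.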

This pull request removes the optimization, resolving the problems that it introduced.

Fixes #2329