Fix critical issues in FastText
#2313
Conversation
```
$ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100
Read 0M words
Number of words:  22
Number of labels: 0
Progress: 100.0%  words/sec/thread: 209  lr: 0.000000  loss: 4.100698  eta: 0h0m
```
This will make it easier to debug manually:

```
$ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100 -dim 5
Read 0M words
Number of words:  22
Number of labels: 0
Progress: 100.0%  words/sec/thread: 199  lr: 0.000000  loss: 0.000000  eta: 0h0m
```
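For manual debugging of a toy model like the one above, the `.vec` text output that fastText writes alongside the binary model is easy to inspect by hand. Here is a minimal, hedged sketch of a parser for that format (first line is `N dim`, then one `word v1 ... vdim` row per word); the sample string below is illustrative, not output from the actual toy run:

```python
def parse_vec(text):
    """Parse fastText's .vec text format: header 'N dim', then word vectors."""
    lines = text.strip().splitlines()
    n, dim = map(int, lines[0].split())
    vecs = {}
    for line in lines[1:]:
        parts = line.split()
        # first token is the word, the rest are the vector components
        vecs[parts[0]] = [float(x) for x in parts[1:]]
    # sanity-check the header against what we actually read
    assert len(vecs) == n and all(len(v) == dim for v in vecs.values())
    return vecs

# illustrative sample in the same shape as toy-model.vec with -dim 5
sample = "2 5\ncat 0.1 0.2 0.3 0.4 0.5\ndog 0.5 0.4 0.3 0.2 0.1\n"
model = parse_vec(sample)
```

With only 5 dimensions per word, diffing such files between two runs or two implementations is feasible by eye.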
This test cannot pass by design: training is non-deterministic, so conditions must be tightly controlled to guarantee reproducibility, and that is too much effort for a unit test.
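The reproducibility problem can be illustrated with a toy stand-in for a training run (the `train_run` function below is hypothetical, not part of gensim or fastText): unseeded runs diverge, while pinning the seed makes them repeat. Real training additionally depends on thread scheduling, which a seed alone does not control.

```python
import random

def train_run(seed=None):
    # toy stand-in for a training run: random initialization drives the outcome
    rng = random.Random(seed)
    return [rng.random() for _ in range(3)]

# uncontrolled runs diverge...
assert train_run() != train_run()
# ...while a pinned seed makes runs reproducible
assert train_run(seed=42) == train_run(seed=42)
```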
Great work @mpenkov 🚀 What's missing before the merge?
This is an internal method masquerading as a public one. There is no reason for anyone to call it. Removing it will have no effect on pickling/unpickling, as methods do not get serialized. Therefore, removing it is safe.
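The claim that removing a method cannot affect pickling can be checked directly: pickle serializes an instance's state (its `__dict__`), while methods are resolved from the class at load time. A minimal sketch (the class and its `_internal_helper` method are hypothetical):

```python
import pickle

class Model:
    def __init__(self, weights):
        self.weights = weights

    def _internal_helper(self):  # hypothetical internal method slated for removal
        return sum(self.weights)

m = Model([1, 2, 3])
blob = pickle.dumps(m)
# Only instance state travels in the pickle; code always comes from the class
# definition present at unpickling time.
restored = pickle.loads(blob)
assert restored.weights == [1, 2, 3]
```

Because the method body lives on the class and not in the serialized bytes, deleting (or renaming) it breaks nothing for previously pickled instances, provided nothing else calls it.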
I'd prefer to have the same implementation as FastText. Reasons:
Great, that wraps everything up. Nice job @mpenkov 🔥
This PR contains fixes for all critical bugs in our fastText implementation. Among other things, it is now possible to call train() multiple times in a row without any issues. In conclusion, this makes FastText in Gensim more reliable, and directly compatible with FB's FT implementation for OOV words and model persistence.
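The "train() multiple times in a row" requirement boils down to every call extending model state rather than corrupting it. A toy stand-in (the `TinyTrainer` class below is hypothetical, not gensim's API) shows the contract being tested:

```python
from collections import Counter

class TinyTrainer:
    """Toy stand-in for a model whose train() may be called repeatedly."""

    def __init__(self):
        self.vocab = Counter()
        self.epochs_done = 0

    def train(self, sentences, epochs=1):
        # each call must extend existing state, never reset or corrupt it
        for s in sentences:
            self.vocab.update(s.split())
        self.epochs_done += epochs

t = TinyTrainer()
t.train(["the cat sat"], epochs=2)
t.train(["the dog ran"], epochs=3)
assert t.epochs_done == 5      # epochs accumulate across calls
assert t.vocab["the"] == 2     # counts from both calls are merged
```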
We also identified behavior that diverges from the Facebook implementation. It is caused by an optimization that uses fewer buckets than are available. The manifestation is that if we compare two models:
then 1) will have fewer vectors than 2). As a consequence, vectors for OOV terms will differ between the two models. This behavior is captured in our unit tests as test_out_of_vocab_gensim.
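The mechanism behind the divergence: fastText represents an OOV word as the sum of vectors for its character n-grams, where each n-gram is mapped to a row index by hashing modulo the bucket count. If two models use different effective bucket counts, the same n-grams collide differently, so OOV vectors differ. A rough sketch of that mapping, using an FNV-1a-style 32-bit hash in the spirit of fastText's (treat the exact hash as an assumption, not a byte-for-byte reimplementation):

```python
def ft_hash(s):
    # FNV-1a-style 32-bit hash, in the spirit of fastText's n-gram hash
    h = 2166136261
    for c in s.encode("utf-8"):
        h = (h ^ c) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def ngram_buckets(word, bucket, minn=3, maxn=6):
    # map each char n-gram of '<word>' (with boundary markers) to a bucket index
    w = f"<{word}>"
    grams = {w[i:i + n] for n in range(minn, maxn + 1)
             for i in range(len(w) - n + 1)}
    return {g: ft_hash(g) % bucket for g in grams}

# a smaller bucket count collapses more n-grams into shared slots, so OOV
# vectors computed from the two tables will generally differ
small = ngram_buckets("example", bucket=100)
large = ngram_buckets("example", bucket=2_000_000)
```

Both tables cover the same n-grams, but the assignment of n-grams to rows differs, which is exactly why the two models in the comparison above disagree on OOV terms.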