Fix critical issues in FastText
#2313
Conversation
```
$ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100
Read 0M words
Number of words:  22
Number of labels: 0
Progress: 100.0%  words/sec/thread: 209  lr: 0.000000  loss: 4.100698  eta: 0h0m
```
This will make it easier to debug manually:

```
$ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100 -dim 5
Read 0M words
Number of words:  22
Number of labels: 0
Progress: 100.0%  words/sec/thread: 199  lr: 0.000000  loss: 0.000000  eta: 0h0m
```
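For manual debugging of a toy model like the one above, the `.vec` text output that fastText writes alongside the binary model is easy to inspect by hand. Here is a minimal, hedged sketch of a parser for that format (first line is `N dim`, then one `word v1 ... vdim` row per word); the sample string below is illustrative, not output from the actual toy run:

```python
def parse_vec(text):
    """Parse fastText's .vec text format: header 'N dim', then word vectors."""
    lines = text.strip().splitlines()
    n, dim = map(int, lines[0].split())
    vecs = {}
    for line in lines[1:]:
        parts = line.split()
        # first token is the word, the rest are the vector components
        vecs[parts[0]] = [float(x) for x in parts[1:]]
    # sanity-check the header against what we actually read
    assert len(vecs) == n and all(len(v) == dim for v in vecs.values())
    return vecs

# illustrative sample in the same shape as toy-model.vec with -dim 5
sample = "2 5\ncat 0.1 0.2 0.3 0.4 0.5\ndog 0.5 0.4 0.3 0.2 0.1\n"
model = parse_vec(sample)
```

With only 5 dimensions per word, diffing such files between two runs or two implementations is feasible by eye.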
This test cannot pass by design: training is non-deterministic, so conditions must be tightly controlled to guarantee reproducibility, and that is too much effort for a unit test.
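The reproducibility problem can be illustrated with a toy stand-in for a training run (the `train_run` function below is hypothetical, not part of gensim or fastText): unseeded runs diverge, while pinning the seed makes them repeat. Real training additionally depends on thread scheduling, which a seed alone does not control.

```python
import random

def train_run(seed=None):
    # toy stand-in for a training run: random initialization drives the outcome
    rng = random.Random(seed)
    return [rng.random() for _ in range(3)]

# uncontrolled runs diverge...
assert train_run() != train_run()
# ...while a pinned seed makes runs reproducible
assert train_run(seed=42) == train_run(seed=42)
```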
Great work @mpenkov 🚀 What's missing before the merge?
This is an internal method masquerading as a public one. There is no reason for anyone to call it. Removing it will have no effect on pickling/unpickling, as methods do not get serialized. Therefore, removing it is safe.
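The claim that removing a method cannot affect pickling can be checked directly: pickle serializes an instance's state (its `__dict__`), while methods are resolved from the class at load time. A minimal sketch (the class and its `_internal_helper` method are hypothetical):

```python
import pickle

class Model:
    def __init__(self, weights):
        self.weights = weights

    def _internal_helper(self):  # hypothetical internal method slated for removal
        return sum(self.weights)

m = Model([1, 2, 3])
blob = pickle.dumps(m)
# Only instance state travels in the pickle; code always comes from the class
# definition present at unpickling time.
restored = pickle.loads(blob)
assert restored.weights == [1, 2, 3]
```

Because the method body lives on the class and not in the serialized bytes, deleting (or renaming) it breaks nothing for previously pickled instances, provided nothing else calls it.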
I'd prefer to have the same implementation as FastText. Reasons:
Great, that wraps everything up. Nice job @mpenkov 🔥
This PR contains fixes for all critical bugs in our fastText implementation. Among other things, it is now possible to call train() multiple times in a row without any issues. In conclusion, this makes FastText in Gensim more reliable, and directly compatible with FB's FT implementation for OOV words and model persistence.
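The "train() multiple times in a row" requirement boils down to every call extending model state rather than corrupting it. A toy stand-in (the `TinyTrainer` class below is hypothetical, not gensim's API) shows the contract being tested:

```python
from collections import Counter

class TinyTrainer:
    """Toy stand-in for a model whose train() may be called repeatedly."""

    def __init__(self):
        self.vocab = Counter()
        self.epochs_done = 0

    def train(self, sentences, epochs=1):
        # each call must extend existing state, never reset or corrupt it
        for s in sentences:
            self.vocab.update(s.split())
        self.epochs_done += epochs

t = TinyTrainer()
t.train(["the cat sat"], epochs=2)
t.train(["the dog ran"], epochs=3)
assert t.epochs_done == 5      # epochs accumulate across calls
assert t.vocab["the"] == 2     # counts from both calls are merged
```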
We also identified behavior that diverges from the Facebook implementation. It is caused by an optimization that uses fewer buckets than are available. The manifestation is that if we compare two models:
then 1) will have fewer vectors than 2). As a consequence, vectors for OOV terms will differ between the two models. This behavior is captured in our unit tests as test_out_of_vocab_gensim.
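The mechanism behind the divergence: fastText represents an OOV word as the sum of vectors for its character n-grams, where each n-gram is mapped to a row index by hashing modulo the bucket count. If two models use different effective bucket counts, the same n-grams collide differently, so OOV vectors differ. A rough sketch of that mapping, using an FNV-1a-style 32-bit hash in the spirit of fastText's (treat the exact hash as an assumption, not a byte-for-byte reimplementation):

```python
def ft_hash(s):
    # FNV-1a-style 32-bit hash, in the spirit of fastText's n-gram hash
    h = 2166136261
    for c in s.encode("utf-8"):
        h = (h ^ c) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def ngram_buckets(word, bucket, minn=3, maxn=6):
    # map each char n-gram of '<word>' (with boundary markers) to a bucket index
    w = f"<{word}>"
    grams = {w[i:i + n] for n in range(minn, maxn + 1)
             for i in range(len(w) - n + 1)}
    return {g: ft_hash(g) % bucket for g in grams}

# a smaller bucket count collapses more n-grams into shared slots, so OOV
# vectors computed from the two tables will generally differ
small = ngram_buckets("example", bucket=100)
large = ngram_buckets("example", bucket=2_000_000)
```

Both tables cover the same n-grams, but the assignment of n-grams to rows differs, which is exactly why the two models in the comparison above disagree on OOV terms.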