Inference issue using FB pretrained model if word have no ngrams #2415
Thanks for the report @menshikh-iv! Please don't assign people to tickets; we may have different priorities. If you feel something is urgent, feel free to open a PR with a fix for review.
Ok, no problem.
Clean up + tightening + docs + web. Nothing major, very little capacity. We're still discussing priorities and concrete objectives (also for grants).
With develop, I get this:
So the bug @menshikh-iv described is fixed, but the fix uncovered a divide-by-zero case, because there can be zero ngrams extracted from a single space character. We have several ways to handle this:
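As a minimal sketch of where the division by zero arises (the function name and the dict-based n-gram lookup are illustrative assumptions, not gensim's real internals, which hash n-grams into buckets):

```python
import numpy as np

def oov_vector(token, ngram_vectors, min_n=3, max_n=6):
    # Hypothetical averaging of char n-gram vectors for an OOV token.
    # `ngram_vectors` maps n-gram string -> np.ndarray; real FastText
    # hashes n-grams into buckets instead of a dict lookup.
    padded = '<' + token + '>'
    ngrams = [padded[i:i + n]
              for n in range(min_n, max_n + 1)
              for i in range(len(padded) - n + 1)]
    vecs = [ngram_vectors[g] for g in ngrams if g in ngram_vectors]
    # If no n-gram can be matched (e.g. the token is '' or a lone space
    # whose n-grams are absent), len(vecs) == 0 and this division
    # raises ZeroDivisionError -- the uncovered bug.
    return sum(vecs) / len(vecs)
```

For `token=''` the padded form `'<>'` is too short to yield any n-gram at all, so the call fails with `ZeroDivisionError`.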
@piskvorky Which one do you think is best?
No idea. What does FB's FT do?
The command-line utility ignores the request for a vector for a blank space. So if you say "give me the vector for a blank space", the utility just stares back at you. If you give it a term with spaces, it first splits the term into subterms by spaces, and returns the vectors for each subterm.
OK, thanks. And what does their API (as opposed to CLI) do? |
So what does FT CLI do for single-character OOV words with no n-grams to look up? We should probably do similar for OOV tokens with zero relevant n-grams, like '' (empty-string), ' ' (single space), 'j' (single-character). (An OOV token like 'qz' would be padded to '<qz>', which would yield one 4-char n-gram '<qz>' capable of being looked up, and get whatever n-gram vector happens to be at that bucket, trained or not.)
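The counts quoted in the comment above are consistent with n-gram extraction using min_n=4; note that many FB models default to min_n=3, under which a single character would still yield one 3-gram, so treat that parameter as an assumption here. A sketch:

```python
def char_ngrams(token, min_n=4, max_n=6):
    # FastText-style extraction: pad with '<'/'>' and take all char
    # n-grams of length min_n..max_n. min_n=4 is an assumption chosen
    # to reproduce the counts quoted above; FB's common default is 3.
    padded = '<' + token + '>'
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

# '', ' ' and 'j' pad to at most 3 chars, shorter than min_n, so they
# produce no n-grams; 'qz' pads to '<qz>' and produces exactly one 4-gram.
```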
Regarding roadmap, I found this: https://github.com/RaRe-Technologies/gensim/wiki/Roadmap-(2018) (our planned roadmap for 2018). Everything in there still stands (not much progress in 2018). Especially the clean up and discoverability. Let me just rename it to "Roadmap / priorities for 2019".
Fixed by #2411 |
Problem:
FastText in gensim and the official version still produce different output on an FB pretrained model (issue with an OOV word that has no n-grams).

Prepare data:

Code:
The exception is correct, but the behaviour is wrong: it should return a zero vector, as the FB implementation does, instead of raising an exception.
BTW, when we load & use an FB model, we shouldn't raise an exception at all.
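A sketch of the requested behaviour, falling back to a zero vector (as the FB implementation does) when an OOV token has no usable n-grams; the names and the dict-based lookup are illustrative, not gensim's actual API:

```python
import numpy as np

def safe_oov_vector(token, ngram_vectors, dim, min_n=3, max_n=6):
    # Hypothetical OOV lookup: average the vectors of the token's char
    # n-grams, but never raise when nothing matches.
    padded = '<' + token + '>'
    ngrams = [padded[i:i + n]
              for n in range(min_n, max_n + 1)
              for i in range(len(padded) - n + 1)]
    vecs = [ngram_vectors[g] for g in ngrams if g in ngram_vectors]
    if not vecs:
        # Match FB behaviour: a zero vector instead of an exception.
        return np.zeros(dim)
    return np.sum(vecs, axis=0) / len(vecs)
```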