Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inference issue using FB pretrained model if word have no ngrams #2415

Closed
menshikh-iv opened this issue Mar 13, 2019 · 11 comments
Closed

Inference issue using FB pretrained model if word have no ngrams #2415

menshikh-iv opened this issue Mar 13, 2019 · 11 comments
Assignees
Labels
bug Issue described a bug fasttext Issues related to the FastText model

Comments

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Mar 13, 2019

Problem: FastText in gensim and official version still produce different output on FB pretrained model (issue with oov word without ngrams).

Prepare data:

curl https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz --output cc.en.300.bin.gz
gunzip cc.en.300.bin.gz

Code:

from gensim.models import FastText
from fastText import load_model
import numpy as np

fname = "cc.en.300.bin"

fb_model = load_model(fname)
gensim_model = FastText.load_fasttext_format(fname)

word = "`"  # you can't generate subwords from this word

assert len(fb_model.get_subwords(word)[0]) == 0  # really, no subwords

fb_vector = fb_model.get_word_vector("`")  # fb_vector exist, though no subwords and not in vocab
np.testing.assert_allclose(fb_vector, np.zeros(fb_vector.shape[0]))  # and fb vector is zero vector

gensim_vector = gensim_model["`"]  # raise an exception
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-243ea594a1c4> in <module>
      8 np.testing.assert_allclose(fb_vector, np.zeros(fb_vector.shape[0]))
      9 
---> 10 gensim_vector = gensim_model["`"]

~/.local/lib/python3.6/site-packages/gensim/utils.py in new_func1(*args, **kwargs)
   1445                     stacklevel=2
   1446                 )
-> 1447                 return func(*args, **kwargs)
   1448 
   1449             return new_func1

~/.local/lib/python3.6/site-packages/gensim/models/fasttext.py in __getitem__(self, words)
    926 
    927         """
--> 928         return self.wv.__getitem__(words)
    929 
    930     @deprecated("Method will be removed in 4.0.0, use self.wv.__contains__() instead")

~/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in __getitem__(self, entities)
    345         if isinstance(entities, string_types):
    346             # allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
--> 347             return self.get_vector(entities)
    348 
    349         return vstack([self.get_vector(entity) for entity in entities])

~/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in get_vector(self, word)
    463 
    464     def get_vector(self, word):
--> 465         return self.word_vec(word)
    466 
    467     def words_closer_than(self, w1, w2):

~/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
   2096                 return word_vec / max(1, ngrams_found)
   2097             else:  # No ngrams of the word are present in self.ngrams
-> 2098                 raise KeyError('all ngrams for word %s absent from model' % word)
   2099 
   2100     def init_sims(self, replace=False):

KeyError: 'all ngrams for word ` absent from model'

Exception is correct, but behaviour is wrong (should return zero vector as FB implementation. instead of raising an exception).
BTW - when we load & use FB model - we shouldn't raise an exception at all.

@menshikh-iv menshikh-iv added the bug Issue described a bug label Mar 13, 2019
@piskvorky
Copy link
Owner

Thanks for the report @menshikh-iv !

Please don't assign people to tickets, we may have different priorities. If you feel something is urgent, feel free to open a PR with a fix for review.

@menshikh-iv
Copy link
Contributor Author

@piskvorky

Please don't assign people to tickets, we may have different priorities.

Ok, no problem.
BTW, what're open-source priorities right now? Any plan/roadmap for this year (maybe I miss something?)

@piskvorky
Copy link
Owner

Clean up + tightening + docs + web. Nothing major, very little capacity. We're still discussing priorities and concrete objectives (also for grants).

@gojomo
Copy link
Collaborator

gojomo commented Mar 14, 2019

Was this seen in develop or latest-release? Because I think @mpenkov's latest fixes (including in #2370) may have finally corrected this behavior to match FB FT.

@mpenkov
Copy link
Collaborator

mpenkov commented Mar 14, 2019

With develop, I get this:

(devel.env) mpenkov@hetrad2:~/data/2415$ python bug.py 
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from cc.en.300.bin
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 6996 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 390315457935 word corpus (70.7% of prior 552001338161)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from cc.en.300.bin
/home/mpenkov/git/gensim/gensim/models/keyedvectors.py:2103: RuntimeWarning: invalid value encountered in true_divide
  return word_vec / len(ngram_hashes)

So the bug @menshikh-iv described is fixed, but the fix uncovered a divide-by-zero case, because there can be zero ngrams extracted from a single space character. We have several ways to handle this:

  1. Do nothing
  2. Avoid the division, return the origin vector
  3. Raise an exception

@piskvorky Which one do you think is best?

@piskvorky
Copy link
Owner

No idea. What does FB's FT do?

@mpenkov
Copy link
Collaborator

mpenkov commented Mar 14, 2019

word
word 0.039659 0.094314 0.057308 0.060724 0.026905 
 
word space
word 0.039659 0.094314 0.057308 0.060724 0.026905 
space -0.039249 0.032824 0.047025 0.065525 0.055325 
^C

The command-line utility ignores the request for a vector for a blank space. So if you say "give me the vector for a blank space", the utility just stares back at you. If you give it a term with spaces, they first split the term into subterms by spaces, and return the vectors for each subterm.

@piskvorky
Copy link
Owner

OK, thanks. And what does their API (as opposed to CLI) do?

@gojomo
Copy link
Collaborator

gojomo commented Mar 14, 2019

Also: with the default min_n=4, any single-character string, even with the '<' and '>' end-bumpers added to distinguish leading/ending n-grams, will decompose into zero n-grams to look up.

So what does FT CLI do for single-character OOV words with no n-grams to look up?

We should probably do similar for OOV tokens with zero relevant n-grams, like '' (empty-string), ' ' (single space), 'j' (single-character). (An OOV token like 'qz' would be padded to '<qz>', which would yield one 4-char n-gram '<qz>' capable of being looked up, and get whatever n-gram vector happens to be at that bucket, trained or not.)

@piskvorky
Copy link
Owner

piskvorky commented Mar 15, 2019

Regarding roadmap, I found this: https://github.com/RaRe-Technologies/gensim/wiki/Roadmap-(2018) (our planned roadmap for 2018).

Everything in there still stands (not much progress in 2018). Especially the clean up and discoverability.

Let me just rename it to "Roadmap / priorities for 2019".

@mpenkov mpenkov self-assigned this Apr 20, 2019
@mpenkov mpenkov added the fasttext Issues related to the FastText model label Apr 20, 2019
@mpenkov
Copy link
Collaborator

mpenkov commented Apr 20, 2019

Fixed by #2411

@mpenkov mpenkov closed this as completed Apr 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug fasttext Issues related to the FastText model
Projects
None yet
Development

No branches or pull requests

4 participants