Inference issue using FB pretrained model if word have no ngrams #2415

menshikh-iv · 2019-03-13T15:58:39Z

Problem: FastText in gensim and official version still produce different output on FB pretrained model (issue with oov word without ngrams).

Prepare data:

curl https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz --output cc.en.300.bin.gz
gunzip cc.en.300.bin.gz

Code:

from gensim.models import FastText
from fastText import load_model
import numpy as np

fname = "cc.en.300.bin"

fb_model = load_model(fname)
gensim_model = FastText.load_fasttext_format(fname)

word = "`"  # you can't generate subwords from this word

assert len(fb_model.get_subwords(word)[0]) == 0  # really, no subwords

fb_vector = fb_model.get_word_vector("`")  # fb_vector exist, though no subwords and not in vocab
np.testing.assert_allclose(fb_vector, np.zeros(fb_vector.shape[0]))  # and fb vector is zero vector

gensim_vector = gensim_model["`"]  # raise an exception

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-243ea594a1c4> in <module>
      8 np.testing.assert_allclose(fb_vector, np.zeros(fb_vector.shape[0]))
      9 
---> 10 gensim_vector = gensim_model["`"]

~/.local/lib/python3.6/site-packages/gensim/utils.py in new_func1(*args, **kwargs)
   1445                     stacklevel=2
   1446                 )
-> 1447                 return func(*args, **kwargs)
   1448 
   1449             return new_func1

~/.local/lib/python3.6/site-packages/gensim/models/fasttext.py in __getitem__(self, words)
    926 
    927         """
--> 928         return self.wv.__getitem__(words)
    929 
    930     @deprecated("Method will be removed in 4.0.0, use self.wv.__contains__() instead")

~/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in __getitem__(self, entities)
    345         if isinstance(entities, string_types):
    346             # allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
--> 347             return self.get_vector(entities)
    348 
    349         return vstack([self.get_vector(entity) for entity in entities])

~/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in get_vector(self, word)
    463 
    464     def get_vector(self, word):
--> 465         return self.word_vec(word)
    466 
    467     def words_closer_than(self, w1, w2):

~/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
   2096                 return word_vec / max(1, ngrams_found)
   2097             else:  # No ngrams of the word are present in self.ngrams
-> 2098                 raise KeyError('all ngrams for word %s absent from model' % word)
   2099 
   2100     def init_sims(self, replace=False):

KeyError: 'all ngrams for word ` absent from model'

Exception is correct, but behaviour is wrong (should return zero vector as FB implementation. instead of raising an exception).
BTW - when we load & use FB model - we shouldn't raise an exception at all.

The text was updated successfully, but these errors were encountered:

piskvorky · 2019-03-13T16:41:48Z

Thanks for the report @menshikh-iv !

Please don't assign people to tickets, we may have different priorities. If you feel something is urgent, feel free to open a PR with a fix for review.

menshikh-iv · 2019-03-13T16:47:50Z

@piskvorky

Please don't assign people to tickets, we may have different priorities.

Ok, no problem.
BTW, what're open-source priorities right now? Any plan/roadmap for this year (maybe I miss something?)

piskvorky · 2019-03-13T17:14:48Z

Clean up + tightening + docs + web. Nothing major, very little capacity. We're still discussing priorities and concrete objectives (also for grants).

gojomo · 2019-03-14T00:12:07Z

Was this seen in develop or latest-release? Because I think @mpenkov's latest fixes (including in #2370) may have finally corrected this behavior to match FB FT.

mpenkov · 2019-03-14T02:31:38Z

With develop, I get this:

(devel.env) mpenkov@hetrad2:~/data/2415$ python bug.py 
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from cc.en.300.bin
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 6996 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 390315457935 word corpus (70.7% of prior 552001338161)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from cc.en.300.bin
/home/mpenkov/git/gensim/gensim/models/keyedvectors.py:2103: RuntimeWarning: invalid value encountered in true_divide
  return word_vec / len(ngram_hashes)

So the bug @menshikh-iv described is fixed, but the fix uncovered a divide-by-zero case, because there can be zero ngrams extracted from a single space character. We have several ways to handle this:

Do nothing
Avoid the division, return the origin vector
Raise an exception

@piskvorky Which one do you think is best?

piskvorky · 2019-03-14T11:48:48Z

No idea. What does FB's FT do?

mpenkov · 2019-03-14T14:56:44Z

word
word 0.039659 0.094314 0.057308 0.060724 0.026905 
 
word space
word 0.039659 0.094314 0.057308 0.060724 0.026905 
space -0.039249 0.032824 0.047025 0.065525 0.055325 
^C

The command-line utility ignores the request for a vector for a blank space. So if you say "give me the vector for a blank space", the utility just stares back at you. If you give it a term with spaces, they first split the term into subterms by spaces, and return the vectors for each subterm.

piskvorky · 2019-03-14T15:48:23Z

OK, thanks. And what does their API (as opposed to CLI) do?

gojomo · 2019-03-14T19:15:41Z

Also: with the default min_n=4, any single-character string, even with the '<' and '>' end-bumpers added to distinguish leading/ending n-grams, will decompose into zero n-grams to look up.

So what does FT CLI do for single-character OOV words with no n-grams to look up?

We should probably do similar for OOV tokens with zero relevant n-grams, like '' (empty-string), ' ' (single space), 'j' (single-character). (An OOV token like 'qz' would be padded to '<qz>', which would yield one 4-char n-gram '<qz>' capable of being looked up, and get whatever n-gram vector happens to be at that bucket, trained or not.)

piskvorky · 2019-03-15T10:27:38Z

Regarding roadmap, I found this: https://github.com/RaRe-Technologies/gensim/wiki/Roadmap-(2018) (our planned roadmap for 2018).

Everything in there still stands (not much progress in 2018). Especially the clean up and discoverability.

Let me just rename it to "Roadmap / priorities for 2019".

mpenkov · 2019-04-20T12:48:57Z

Fixed by #2411

menshikh-iv added the bug label Mar 13, 2019

menshikh-iv assigned mpenkov Mar 13, 2019

piskvorky unassigned mpenkov Mar 13, 2019

mpenkov self-assigned this Apr 20, 2019

mpenkov added the fasttext label Apr 20, 2019

mpenkov closed this as completed Apr 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inference issue using FB pretrained model if word have no ngrams #2415

Inference issue using FB pretrained model if word have no ngrams #2415

menshikh-iv commented Mar 13, 2019 •

edited

Loading

piskvorky commented Mar 13, 2019

menshikh-iv commented Mar 13, 2019

piskvorky commented Mar 13, 2019

gojomo commented Mar 14, 2019

mpenkov commented Mar 14, 2019

piskvorky commented Mar 14, 2019

mpenkov commented Mar 14, 2019 •

edited

Loading

piskvorky commented Mar 14, 2019

gojomo commented Mar 14, 2019 •

edited

Loading

piskvorky commented Mar 15, 2019 •

edited

Loading

mpenkov commented Apr 20, 2019

Inference issue using FB pretrained model if word have no ngrams #2415

Inference issue using FB pretrained model if word have no ngrams #2415

Comments

menshikh-iv commented Mar 13, 2019 • edited Loading

piskvorky commented Mar 13, 2019

menshikh-iv commented Mar 13, 2019

piskvorky commented Mar 13, 2019

gojomo commented Mar 14, 2019

mpenkov commented Mar 14, 2019

piskvorky commented Mar 14, 2019

mpenkov commented Mar 14, 2019 • edited Loading

piskvorky commented Mar 14, 2019

gojomo commented Mar 14, 2019 • edited Loading

piskvorky commented Mar 15, 2019 • edited Loading

mpenkov commented Apr 20, 2019

menshikh-iv commented Mar 13, 2019 •

edited

Loading

mpenkov commented Mar 14, 2019 •

edited

Loading

gojomo commented Mar 14, 2019 •

edited

Loading

piskvorky commented Mar 15, 2019 •

edited

Loading