fastText fixes in 3.7 break compatibility with old models #2341
'Full' fastText models (not KeyedVectors objects) trained in older Gensim versions can be loaded and worked with. There is even a warning message in the logs about the hash function being buggy. Unfortunately, this message itself is buggy and fails to show properly:
Hello @akutuzov, thanks for the fast report 👍 As for the "full" model message, a fix is already available in #2339.
Update: @akutuzov, I reproduced the KeyedVectors problem (no additional information is needed from you).
Reproducing the backward compatibility bug
Full trace:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-1-340e13f11fe0> in <module>()
2
3 m = KeyedVectors.load("ft_kv.model")
----> 4 m.most_similar("human") # exception "AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'"
/home/ivan/.virtualenvs/abc_g37/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
541 mean.append(weight * word)
542 else:
--> 543 mean.append(weight * self.word_vec(word, use_norm=True))
544 if word in self.vocab:
545 all_words.add(self.vocab[word].index)
/home/ivan/.virtualenvs/abc_g37/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in word_vec(self, word, use_norm)
2057
2058 """
-> 2059 hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
2060
2061 if word in self.vocab:
AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'
Preliminary variant of the fix:
diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index d9dad1cc..881aaf18 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -1974,6 +1974,14 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
self.num_ngram_vectors = 0
self.compatible_hash = compatible_hash
+ @classmethod
+ def load(cls, fname_or_handle, **kwargs):
+ model = super(WordEmbeddingsKeyedVectors, cls).load(fname_or_handle, **kwargs)
+ if not hasattr(model, 'compatible_hash'):
+ model.compatible_hash = False
+
+ return model
+
@property
@deprecated("Attribute will be removed in 4.0.0, use self.vectors_vocab instead")
def syn0_vocab(self):
@@ -2012,7 +2020,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
if word in self.vocab:
return True
else:
- hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+ hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
char_ngrams = _compute_ngrams(word, self.min_n, self.max_n)
return any(hash_fn(ng) % self.bucket in self.hash2index for ng in char_ngrams)
@@ -2056,7 +2064,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
If word and all ngrams not in vocabulary.
"""
- hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+ hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
if word in self.vocab:
return super(FastTextKeyedVectors, self).word_vec(word, use_norm)
@@ -2237,7 +2245,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
if self.bucket == 0:
return
- hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+ hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
for w, v in self.vocab.items():
word_vec = np.copy(self.vectors_vocab[v.index])
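The diff above combines two defenses: patching the attribute once at load time, and falling back to a default with getattr at every call site. The sketch below illustrates both on a toy class. It is not gensim code: SketchVectors, from_saved_state, and hash_name are hypothetical names, and restoring state via __new__ plus __dict__.update only approximates what unpickling does (attributes are restored without calling __init__, so attributes introduced in a newer version are absent on old objects).

```python
class SketchVectors(object):
    """Hypothetical stand-in for a vectors class; not part of gensim."""

    def __init__(self, compatible_hash=True):
        # Newer versions set this attribute in __init__.
        self.compatible_hash = compatible_hash

    @classmethod
    def from_saved_state(cls, state):
        # Roughly what unpickling does: restore attributes, skip __init__.
        model = cls.__new__(cls)
        model.__dict__.update(state)
        # Defense 1: patch the missing attribute once, at load time.
        if not hasattr(model, "compatible_hash"):
            model.compatible_hash = False
        return model

    def hash_name(self):
        # Defense 2: tolerate a missing attribute at the call site.
        return "fixed" if getattr(self, "compatible_hash", False) else "legacy"


# State saved by an "old version" that did not know about compatible_hash.
old_state = {"bucket": 2000000}
loaded = SketchVectors.from_saved_state(old_state)
print(loaded.compatible_hash, loaded.hash_name())

# An object restored without the load-time patch still works via getattr.
unpatched = SketchVectors.__new__(SketchVectors)
print(unpatched.hash_name())
```

Either defense alone would avoid the AttributeError in this sketch; using both means objects restored through some other path still behave like old models.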
Partially fixed in #2339 (see #2341 (comment)).
Recent fixes to Gensim's fastText implementation introduced in #2313 are great. Unfortunately, they also break compatibility with fastText models trained by older Gensim versions if the models are stored as a KeyedVectors object. Such a model can be loaded, but any useful operation (such as most_similar()) fails because the compatible_hash attribute is missing.
If this attribute is added manually after loading, everything works fine.
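A minimal sketch of that manual workaround follows. The helper name ensure_compatible_hash and the LegacyVectors stand-in class are hypothetical and used only for illustration. The default of False is an assumption based on this thread, where old models are expected to keep using the old hash function.

```python
def ensure_compatible_hash(model, default=False):
    """Set model.compatible_hash to `default` only if the attribute is missing."""
    if not hasattr(model, "compatible_hash"):
        model.compatible_hash = default
    return model


class LegacyVectors(object):
    """Hypothetical stand-in for a KeyedVectors object saved by an older Gensim."""


legacy = ensure_compatible_hash(LegacyVectors())
print(legacy.compatible_hash)

# An attribute that is already present is left untouched.
modern = LegacyVectors()
modern.compatible_hash = True
ensure_compatible_hash(modern)
print(modern.compatible_hash)
```

With a real model, the same helper would be applied to the object returned by KeyedVectors.load("ft_kv.model") before calling most_similar().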
Steps/Code/Corpus to Reproduce
Expected Results
The compatible_hash attribute is automatically set to False on load, and the model works as before.
Actual Results
Versions
Linux-4.15.0-43-generic-x86_64-with-LinuxMint-18.3-sylvia
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.7.0
FAST_VERSION 1