fastText fixes in 3.7 break compatibility with old models #2341

Closed
akutuzov opened this issue Jan 19, 2019 · 6 comments · Fixed by #2339 or #2349

akutuzov commented Jan 19, 2019

The recent fixes to Gensim's fastText implementation introduced in #2313 are great. Unfortunately, they also break compatibility with fastText models trained by older Gensim versions when those models are stored as KeyedVectors objects. Such a model loads fine, but as soon as you try to do anything useful with it (most_similar(), etc.), it fails because the compatible_hash attribute is missing.
If this attribute is added manually after loading, everything works.
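
For anyone hitting this before a fix lands, the manual workaround could look roughly like the sketch below. The helper name ensure_compatible_hash is hypothetical (it is not part of Gensim), the default of False follows the expected behaviour described in this issue, and the SimpleNamespace object is only a stand-in so the snippet runs without a model file.

```python
from types import SimpleNamespace


def ensure_compatible_hash(model, default=False):
    """Give a loaded model a compatible_hash attribute if it lacks one.

    Hypothetical helper, not part of Gensim. An existing value is left untouched.
    """
    if not hasattr(model, "compatible_hash"):
        model.compatible_hash = default
    return model


# Stand-in for a KeyedVectors object saved before the attribute existed.
old_model = ensure_compatible_hash(SimpleNamespace(bucket=2000000))
print(old_model.compatible_hash)
```

With a real model, the call would go right after KeyedVectors.load(...) and before the first most_similar() call.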

Steps/Code/Corpus to Reproduce

import gensim

model = gensim.models.KeyedVectors.load(ANY_KEYED_VECTORS_FASTTEXT_MODEL)
model.most_similar(positive=ANY_WORD)

Expected Results

The compatible_hash attribute is automatically assigned the False value on load, and the model works as before.

Actual Results

/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
   2057 
   2058         """
-> 2059         hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
   2060 
   2061         if word in self.vocab:

AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'

Versions

Linux-4.15.0-43-generic-x86_64-with-LinuxMint-18.3-sylvia
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.7.0
FAST_VERSION 1


@mpenkov
@menshikh-iv


akutuzov commented Jan 19, 2019

'Full' fastText models (not KeyedVectors objects) trained in older Gensim versions can be loaded and used. There is even a warning in the logs about the hash function being buggy. Unfortunately, this message is itself buggy and fails to display properly:

2019-01-19 20:36:08,303 : INFO : loaded test_fasttext.model
--- Logging error ---
Traceback (most recent call last):
  File "/projects/ltg/python3/lib/python3.5/logging/__init__.py", line 986, in emit
    msg = self.format(record)
  File "/projects/ltg/python3/lib/python3.5/logging/__init__.py", line 836, in format
    return fmt.format(record)
  File "/projects/ltg/python3/lib/python3.5/logging/__init__.py", line 573, in format
    record.message = record.getMessage()
  File "/projects/ltg/python3/lib/python3.5/logging/__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "test.py", line 6, in <module>
    model = gensim.models.fasttext.FastText.load('test_fasttext.model')
  File "/projects/ltg/python3/lib/python3.5/site-packages/gensim/models/fasttext.py", line 845, in load
    "The model will continue to work, but consider training it "
Message: 'This older model was trained with a buggy hash function.  '
Arguments: ('The model will continue to work, but consider training it from scratch.',)
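
The Message/Arguments lines above are consistent with the second half of the warning being passed as a %-format argument instead of being part of the message string (for example, through a stray comma between two string literals). That is an assumption about the cause, not a quote of the Gensim source. A minimal sketch of the failure mode with the standard logging module, using a hypothetical render helper:

```python
import logging


def render(msg, *args):
    """Format a log message the way logging does it: msg % args."""
    record = logging.LogRecord("demo", logging.WARNING, "demo.py", 1, msg, args, None)
    return record.getMessage()


PART_1 = "This older model was trained with a buggy hash function.  "
PART_2 = "The model will continue to work, but consider training it from scratch."

# Two separate arguments: PART_2 becomes a format argument,
# but PART_1 contains no placeholder to consume it.
try:
    render(PART_1, PART_2)
    broken_error = None
except TypeError as err:
    broken_error = str(err)

# One message string and no arguments: nothing to format, nothing to fail.
fixed_message = render(PART_1 + PART_2)

print(broken_error)
print(fixed_message)
```

In real code the handler catches this exception and prints the "--- Logging error ---" block seen above instead of the warning.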


menshikh-iv commented Jan 20, 2019

Hello @akutuzov,

thanks for the fast report 👍

The "full" model message is already fixed in #2339.
As for KeyedVectors: can you share a model (any ANY_KEYED_VECTORS_FASTTEXT_MODEL) that reproduces the problem described in #2341 (comment)? If so, we'll fix it ASAP. I'm not sure, but 3.7.1 may appear in the next 2 weeks.

menshikh-iv added the bug and fasttext labels Jan 20, 2019

Update: @akutuzov, I reproduced the KV problem (no additional info needed from you).

Reproduce backward compatibility bug

  1. Train FT & save KV in gensim==3.6.0

    from gensim.test.utils import common_texts
    from gensim.models import FastText
    
    m = FastText(common_texts, min_count=0)
    m.wv.save("ft_kv.model")

    Produced file (gzipped afterwards for uploading to GitHub): ft_kv.model.gz

  2. Load KV in gensim==3.7.0 and use it

    from gensim.models.keyedvectors import FastTextKeyedVectors, KeyedVectors
    
    m = KeyedVectors.load("ft_kv.model")
    m.most_similar("human")  # exception "AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'"
    
    m = FastTextKeyedVectors.load("ft_kv.model")
    m.most_similar("human")  # exception "AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'"

Full trace

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-340e13f11fe0> in <module>()
      2 
      3 m = KeyedVectors.load("ft_kv.model")
----> 4 m.most_similar("human")  # exception "AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'"

/home/ivan/.virtualenvs/abc_g37/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
    541                 mean.append(weight * word)
    542             else:
--> 543                 mean.append(weight * self.word_vec(word, use_norm=True))
    544                 if word in self.vocab:
    545                     all_words.add(self.vocab[word].index)

/home/ivan/.virtualenvs/abc_g37/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in word_vec(self, word, use_norm)
   2057 
   2058         """
-> 2059         hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
   2060 
   2061         if word in self.vocab:

AttributeError: 'FastTextKeyedVectors' object has no attribute 'compatible_hash'


Preliminary variant of the fix
CC @mpenkov

diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index d9dad1cc..881aaf18 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -1974,6 +1974,14 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         self.num_ngram_vectors = 0
         self.compatible_hash = compatible_hash
 
+    @classmethod
+    def load(cls, fname_or_handle, **kwargs):
+        model = super(WordEmbeddingsKeyedVectors, cls).load(fname_or_handle, **kwargs)
+        if not hasattr(model, 'compatible_hash'):
+            model.compatible_hash = False
+
+        return model
+
     @property
     @deprecated("Attribute will be removed in 4.0.0, use self.vectors_vocab instead")
     def syn0_vocab(self):
@@ -2012,7 +2020,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         if word in self.vocab:
             return True
         else:
-            hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+            hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
             char_ngrams = _compute_ngrams(word, self.min_n, self.max_n)
             return any(hash_fn(ng) % self.bucket in self.hash2index for ng in char_ngrams)
 
@@ -2056,7 +2064,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
             If word and all ngrams not in vocabulary.
 
         """
-        hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+        hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
 
         if word in self.vocab:
             return super(FastTextKeyedVectors, self).word_vec(word, use_norm)
@@ -2237,7 +2245,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         if self.bucket == 0:
             return
 
-        hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+        hash_fn = _ft_hash if getattr(self, "compatible_hash", False) else _ft_hash_broken
 
         for w, v in self.vocab.items():
             word_vec = np.copy(self.vectors_vocab[v.index])
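
The diff combines two safeguards: a default set at load time and a getattr fallback at each use site. The sketch below shows why both help, using a toy class rather than Gensim's. The names Vectors, from_state, old_hash and new_hash are hypothetical; from_state mimics what unpickling does (it restores the saved attributes without calling __init__), which is why objects saved by an older version never receive attributes that a newer __init__ introduces.

```python
def old_hash(ngram):
    """Placeholder for the legacy hash function (hypothetical)."""
    return "old"


def new_hash(ngram):
    """Placeholder for the corrected hash function (hypothetical)."""
    return "new"


class Vectors:
    def __init__(self, compatible_hash=True):
        self.bucket = 100
        self.compatible_hash = compatible_hash

    @classmethod
    def from_state(cls, state):
        """Rebuild an instance from saved attributes, as unpickling would."""
        model = cls.__new__(cls)      # __init__ is deliberately not called
        model.__dict__.update(state)
        # Load-time safeguard: objects saved by older versions lack the flag.
        if not hasattr(model, "compatible_hash"):
            model.compatible_hash = False
        return model

    def hash_fn(self):
        # Use-site safeguard: still works if the object bypassed from_state.
        return new_hash if getattr(self, "compatible_hash", False) else old_hash


legacy = Vectors.from_state({"bucket": 100})  # state saved without the flag
fresh = Vectors()                             # created by the current code

print(legacy.hash_fn()("abc"), fresh.hash_fn()("abc"))
```

Defaulting to False keeps old models on the hash function they were trained with, so their n-gram buckets still line up.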


Partially fixed (#2341 (comment)) in #2339.
Waiting on #2340 for the full fix.
