impossible to load into gensim the fastText model trained with pretrained_vectors #2350
Comments
Thank you @lynochka, awesome description 🌟

```python
import numpy as np
import gensim.downloader as api
from fastText import train_unsupervised
from gensim.models import FastText as FT

PRETRAINED_VECTOR_DIM = 50
TRAINING_TEXT = "corpus.txt"
PRETRAINED_FILE = "pretrained.txt"
FT_MODELL_W_PRETRAINED = "ft_with_pretrained.bin"
FT_MODEL = "ft.bin"
WORD = 'additional'

vectors = api.load('glove-wiki-gigaword-50')
vectors.save_word2vec_format(PRETRAINED_FILE)

corpus = api.load("text8")
with open(TRAINING_TEXT, 'w') as outfile:
    for idx, doc in enumerate(corpus):
        if idx == 100:
            break
        outfile.write(" ".join(doc) + "\n")

# No 'pretrainedVectors' passed to FB (works as expected)
fb_model = train_unsupervised(TRAINING_TEXT, model='skipgram', dim=PRETRAINED_VECTOR_DIM)
fb_model.save_model(FT_MODEL)
gs_model = FT.load_fasttext_format(FT_MODEL)
assert np.allclose(gs_model.wv[WORD], fb_model.get_word_vector(WORD))  # works as expected

# Use 'pretrainedVectors=PRETRAINED_FILE' (error on loading into gensim)
fb_model_pre = train_unsupervised(TRAINING_TEXT, model='skipgram', dim=PRETRAINED_VECTOR_DIM, pretrainedVectors=PRETRAINED_FILE)
fb_model_pre.save_model(FT_MODELL_W_PRETRAINED)
gs_model = FT.load_fasttext_format(FT_MODELL_W_PRETRAINED)  # raises an exception (vector shape mismatch)
assert np.allclose(gs_model.wv[WORD], fb_model_pre.get_word_vector(WORD))
```

Stack trace:

```
AssertionError                            Traceback (most recent call last)
<ipython-input-1-acb5138754e9> in <module>()
     42
     43
---> 44 gs_model = FT.load_fasttext_format(FT_MODELL_W_PRETRAINED)  # raises an exception (vector shape mismatch)
     45 assert np.allclose(gs_model.wv[WORD], fb_model_pre.get_word_vector(WORD))

/home/ivan/.virtualenvs/ft36/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in load_fasttext_format(cls, model_file, encoding)
    778
    779         """
--> 780         return _load_fasttext_format(model_file, encoding=encoding)
    781
    782     def load_binary_data(self, encoding='utf8'):

/home/ivan/.virtualenvs/ft36/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in _load_fasttext_format(model_file, encoding)
   1005     model.num_original_vectors = m.vectors_ngrams.shape[0]
   1006
-> 1007     model.wv.init_post_load(m.vectors_ngrams)
   1008     model.trainables.init_post_load(model, m.hidden_output)
   1009

/home/ivan/.virtualenvs/ft36/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in init_post_load(self, vectors, match_gensim)
   2189         """
   2190         vocab_words = len(self.vocab)
-> 2191         assert vectors.shape[0] == vocab_words + self.bucket, 'unexpected number of vectors'
   2192         assert vectors.shape[1] == self.vector_size, 'unexpected vector dimensionality'
   2193

AssertionError: unexpected number of vectors
```

Definitely something changed if the FB model is initialized with pre-trained word vectors.
This is yet another regression after the fastText code refactoring in Gensim 3.7 (another one was fixed in #2341).
However, Gensim 3.7 is doing weird things here (re-training the model instead of loading it?):
After it went on like this for an hour, I killed the process. Gensim 3.7.1 does the same; nothing changed. I'm sorry, but it seems the fastText refactoring in 3.7 was extremely badly tested, with so many things broken :-(
Interestingly, even the code proudly presented in the 3.7 changelog fails in exactly the same way, when using exactly the same model mentioned in that changelog. 'Massive improvement FastText compatibilities', indeed :-)
OK, I think we may be dealing with two separate issues here.
@akutuzov I understand your frustration, and as the author of the refactoring, I apologize for causing you discomfort. I've opened a separate ticket to cover the issue you reported. Let's continue the discussion about your issue there.
@lynochka I've investigated your issue and found the cause.
As you can see, adding pre-trained vectors modifies the model:
Unfortunately, our loading code always respects the min_count parameter and incorrectly trims the vocabulary when loading the model, causing an inconsistency between the vocab size and the number of vectors. This inconsistency is what trips the assert.
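The failing check can be sketched in isolation. This is a minimal illustration of the shape invariant from the traceback's assert, not gensim's actual code; the function name and the concrete sizes below are made up for the example:

```python
import numpy as np

def check_post_load(vectors, vocab_words, bucket):
    """Mimic the invariant gensim's init_post_load asserts: the loaded
    matrix must hold one row per vocab word plus one row per ngram bucket."""
    return vectors.shape[0] == vocab_words + bucket

# Consistent model: 100 vocab words + 2000 buckets = 2100 rows -> passes.
ok = check_post_load(np.zeros((2100, 50)), vocab_words=100, bucket=2000)

# If loading trims the vocab (e.g. min_count drops 10 rare words) while
# the vector matrix still has all 2100 rows, the invariant is violated.
bad = check_post_load(np.zeros((2100, 50)), vocab_words=90, bucket=2000)

print(ok, bad)  # True False
```

This is exactly the mismatch described above: the vocabulary shrinks during loading, the vector matrix does not, and the assert fires.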
I seem to be getting the same sort of errors, and I'm not sure what the fix is, given the above discussion.
produce the error:
@dpalbrecht That looks like a different issue. Can you please open a new ticket and provide the full stack trace, a reproducible sample, version numbers, etc.?
@mpenkov Sure, will do.
How do I load a pre-trained fastText model for any language?
@kusumlata123 Please use the mailing list for questions. GitHub tickets are for feature requests and bug reports only.
I use the code to load a fastText model, fasttext.5.bin, on my laptop and on the CHPC server. It works well on my PC, but fails with memory errors on the CHPC. Both servers use Python 3.7.3 and Gensim 3.8.0. The following code is run under Python 3.7.1; I did try the same under Python 3.7.3, but it also failed.
Python 3.7.1 (default, Oct 23 2018, 19:19:42)
Description
A fastText model trained with pre-trained vectors cannot be loaded with gensim.models.FastText.load_fasttext_format.
Steps/Code/Corpus to Reproduce
First, we convert the GloVe vectors into word2vec format with gensim.
Keep "glove.6B.50d.txt" locally.
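The conversion itself is simple: the word2vec text format is the GloVe text format plus a header line giving the vocabulary size and dimensionality (gensim's `glove2word2vec` script does the same thing). A stand-alone sketch, with example filenames:

```python
def glove_to_word2vec(glove_path, out_path):
    """Prepend the '<num_words> <dim>' header that the word2vec text
    format requires; the vector lines themselves are identical."""
    with open(glove_path, encoding='utf8') as f:
        lines = f.readlines()
    dim = len(lines[0].split()) - 1  # first token on each line is the word
    with open(out_path, 'w', encoding='utf8') as out:
        out.write("%d %d\n" % (len(lines), dim))
        out.writelines(lines)
    return len(lines), dim

# Tiny demo with two 3-dimensional vectors:
with open("glove_demo.txt", "w", encoding="utf8") as f:
    f.write("king 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
print(glove_to_word2vec("glove_demo.txt", "w2v_demo.txt"))  # (2, 3)
```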
Then use any sample text as TEXT_FOR_WE_FILENAME, e.g.,
https://raw.githubusercontent.com/bbejeck/hadoop-algorithms/master/src/shakespeare.txt
(keep "shakespeare.txt" locally)
and train with the pre-trained vectors from "w2v_from_glove.6B.50d.txt" on the text:
The error comes when trying to load this new fastText model into gensim (while it is possible to load it into the original fastText).
Output:
Versions
Linux-4.15.0-1036-azure-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.15.3
SciPy 0.18.1
gensim 3.7.0, FAST_VERSION 0
fasttext 0.8.22