
impossible to load into gensim the fastText model trained with pretrained_vectors #2350

Closed
lynochka opened this issue Jan 24, 2019 · 12 comments
Labels: bug (Issue described a bug), difficulty medium (Medium issue: required good gensim understanding & python skills), fasttext (Issues related to the FastText model)

Comments

@lynochka

Description

When a fastText model is trained natively with pretrained vectors, it is impossible to load the resulting model with gensim.models.FastText.load_fasttext_format.

Steps/Code/Corpus to Reproduce

First, convert the GloVe vectors into word2vec format with gensim (keep "glove.6B.50d.txt" in the local directory):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
_ = glove2word2vec("glove.6B.50d.txt", "w2v_from_glove.6B.50d.txt")

Then take any sample text as TEXT_FOR_WE_FILENAME, e.g.,
https://raw.githubusercontent.com/bbejeck/hadoop-algorithms/master/src/shakespeare.txt
(keep "shakespeare.txt" in the local directory), and train on it with the pretrained vectors from "w2v_from_glove.6B.50d.txt":

TEXT_FOR_WE_FILENAME = "shakespeare.txt"
PRETRAINED_VECTOR_DIM = 50
PRETRAINED_FILE = "w2v_from_glove.6B.50d.txt"
import fastText
model_pre = fastText.train_unsupervised(TEXT_FOR_WE_FILENAME, model='skipgram', dim=PRETRAINED_VECTOR_DIM, pretrainedVectors=PRETRAINED_FILE)
model_pre.save_model("fasttext_model.bin")

The error comes when trying to load this new fastText model into gensim (while it is possible to load it into the original fastText):

import fastText
from gensim.models import FastText as ge_ft
FASTTEXT_MODEL_BIN = "fasttext_model.bin"
# this works
ft_model = fastText.load_model(FASTTEXT_MODEL_BIN)
ft_model.get_word_vector("additional")

# this one does not:
ge_model = ge_ft.load_fasttext_format(FASTTEXT_MODEL_BIN)

Output:

AssertionError: unexpected number of vectors
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<command-3269280551404242> in <module>()
      1 #gensim FastText (having  some different features)
----> 2 ge_model = ge_ft.load_fasttext_format(FASTTEXT_MODEL_BIN)

/databricks/python/lib/python3.5/site-packages/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding)
    778 
    779         """
--> 780         return _load_fasttext_format(model_file, encoding=encoding)
    781 
    782     def load_binary_data(self, encoding='utf8'):

/databricks/python/lib/python3.5/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding)
   1005     model.num_original_vectors = m.vectors_ngrams.shape[0]
   1006 
-> 1007     model.wv.init_post_load(m.vectors_ngrams)
   1008     model.trainables.init_post_load(model, m.hidden_output)
   1009 

/databricks/python/lib/python3.5/site-packages/gensim/models/keyedvectors.py in init_post_load(self, vectors, match_gensim)
   2189         """
   2190         vocab_words = len(self.vocab)
-> 2191         assert vectors.shape[0] == vocab_words + self.bucket, 'unexpected number of vectors'
   2192         assert vectors.shape[1] == self.vector_size, 'unexpected vector dimensionality'
   2193 

AssertionError: unexpected number of vectors

Versions

Linux-4.15.0-1036-azure-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.15.3
SciPy 0.18.1
gensim 3.7.0, FAST_VERSION 0
fasttext 0.8.22

@lynochka changed the title from "impossible to load the model with fastText model" to "impossible to load the fastText trained with pretrained_vectors" Jan 24, 2019
@lynochka changed the title from "impossible to load the fastText trained with pretrained_vectors" to "impossible to load into gensim the fastText model trained with pretrained_vectors" Jan 24, 2019
@menshikh-iv
Contributor

Thank you @lynochka, awesome description 🌟
Issue reproduced (I just rewrote the example code a bit to get a "copy-pastable" example):

import numpy as np
import gensim.downloader as api
from fastText import train_unsupervised
from gensim.models import FastText as FT


PRETRAINED_VECTOR_DIM = 50
TRAINING_TEXT = "corpus.txt"
PRETRAINED_FILE = "pretrained.txt"

FT_MODELL_W_PRETRAINED = "ft_with_pretrained.bin"
FT_MODEL = "ft.bin"

WORD = 'additional'

vectors = api.load('glove-wiki-gigaword-50')
vectors.save_word2vec_format(PRETRAINED_FILE)

corpus = api.load("text8")

with open(TRAINING_TEXT, 'w') as outfile:
    for idx, doc in enumerate(corpus):
        if idx == 100:
            break

        outfile.write(" ".join(doc) + "\n")


# No 'pretrainedVectors' passed to FB (works as expected)

fb_model = train_unsupervised(TRAINING_TEXT, model='skipgram', dim=PRETRAINED_VECTOR_DIM)
fb_model.save_model(FT_MODEL)

gs_model = FT.load_fasttext_format(FT_MODEL)
assert np.allclose(gs_model.wv[WORD], fb_model.get_word_vector(WORD))  # works as expected


# Use 'pretrainedVectors=PRETRAINED_FILE' (error on loading to gensim)

fb_model_pre = train_unsupervised(TRAINING_TEXT, model='skipgram', dim=PRETRAINED_VECTOR_DIM, pretrainedVectors=PRETRAINED_FILE)
fb_model_pre.save_model(FT_MODELL_W_PRETRAINED)


gs_model = FT.load_fasttext_format(FT_MODELL_W_PRETRAINED)  # raised an exception (vector shape mismatch)
assert np.allclose(gs_model.wv[WORD], fb_model_pre.get_word_vector(WORD))

Stacktrace:

AssertionError                            Traceback (most recent call last)
<ipython-input-1-acb5138754e9> in <module>()
     42 
     43 
---> 44 gs_model = FT.load_fasttext_format(FT_MODELL_W_PRETRAINED)  # raised an exception (vector shape mismatch)
     45 assert np.allclose(gs_model.wv[WORD], fb_model_pre.get_word_vector(WORD))

/home/ivan/.virtualenvs/ft36/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in load_fasttext_format(cls, model_file, encoding)
    778 
    779         """
--> 780         return _load_fasttext_format(model_file, encoding=encoding)
    781 
    782     def load_binary_data(self, encoding='utf8'):

/home/ivan/.virtualenvs/ft36/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in _load_fasttext_format(model_file, encoding)
   1005     model.num_original_vectors = m.vectors_ngrams.shape[0]
   1006 
-> 1007     model.wv.init_post_load(m.vectors_ngrams)
   1008     model.trainables.init_post_load(model, m.hidden_output)
   1009 

/home/ivan/.virtualenvs/ft36/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in init_post_load(self, vectors, match_gensim)
   2189         """
   2190         vocab_words = len(self.vocab)
-> 2191         assert vectors.shape[0] == vocab_words + self.bucket, 'unexpected number of vectors'
   2192         assert vectors.shape[1] == self.vector_size, 'unexpected vector dimensionality'
   2193 

AssertionError: unexpected number of vectors

Definitely something changes when the FB model is initialized with pretrained word vectors.
CC: @mpenkov

@menshikh-iv added the bug, difficulty medium and fasttext labels Jan 25, 2019
@akutuzov
Contributor

akutuzov commented Feb 4, 2019

This is yet another regression after the fastText code refactoring in Gensim 3.7 (another one was fixed in #2341).
Indeed, Gensim 3.6 loads pre-trained fastText models without any trouble. Below are examples with the Wikipedia model from https://fasttext.cc/, but the same happens with any model trained using native fastText.

>>> import gensim
>>> gensim.__version__
'3.6.0'
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> model = gensim.models.fasttext.FastText.load_fasttext_format('wiki.en')
2019-01-24 16:23:47,740 : INFO : loading 2519370 words for fastText model from wiki.en.bin
2019-01-24 16:29:54,820 : INFO : loading weights for 2519370 words for fastText model from wiki.en.bin
2019-01-24 16:37:43,068 : INFO : loaded (2519370, 300) weight matrix for fastText model from wiki.en.bin
>>> model
<gensim.models.fasttext.FastText at 0x7f8e98e2c320>

However, Gensim 3.7 is doing weird things here (retraining the model instead of loading it?):

>>> import gensim
>>> gensim.__version__
'3.7.0'
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> model = gensim.models.fasttext.FastText.load_fasttext_format('wiki.en')
2019-01-24 16:25:50,816 : INFO : loading 2519370 words for fastText model from wiki.en.bin
2019-01-24 16:30:14,701 : INFO : resetting layer weights
2019-01-24 16:30:14,702 : INFO : Total number of ngrams is 0
2019-01-24 16:30:14,702 : INFO : Updating model with new vocabulary
2019-01-24 16:30:40,839 : INFO : New added 2519370 unique words (50% of original 5038740) and increased the count of 2519370 pre-existing words (50% of original 5038740)
2019-01-24 16:31:02,325 : INFO : deleting the raw counts dictionary of 2519370 items
2019-01-24 16:31:02,325 : INFO : sample=0.0001 downsamples 650 most-common words
2019-01-24 16:31:02,326 : INFO : downsampling leaves estimated 4076481917 word corpus (103.2% of prior 3949186974)

After it went like this for an hour, I killed the process.

Gensim 3.7.1 does the same, nothing changed. I'm sorry, but it seems that the fastText refactoring in 3.7 was very poorly tested, with so many things broken :-(

@piskvorky
Owner

piskvorky commented Feb 4, 2019

Sorry to hear that, @akutuzov. Clearer and more predictable model loading was actually one of the main objectives of that refactoring, so this is surprising.

Can you have a look, @mpenkov?

@akutuzov
Contributor

akutuzov commented Feb 4, 2019

Interestingly, even the code proudly presented in the 3.7 Changelog fails in exactly the same way when using exactly the same model mentioned in that changelog.

'Massive improvement FastText compatibilities', indeed :-)

@mpenkov
Collaborator

mpenkov commented Feb 5, 2019

OK, I think we may be dealing with two separate issues here.

  1. @lynochka Loading a native fastText model trained with pretrained_vectors triggers an assertion. We didn't have a test case to cover this use case. I'll add one based on the example provided by @lynochka and @menshikh-iv and resolve the problem (a sketch of such a test follows below).
  2. @akutuzov Gensim doing weird things on loading the native model trained from Wikipedia. I don't think saying that it breaks on any model is fair, given that the example from @lynochka correctly loads a native FB model (no pretrained vectors) without a problem (as do our unit tests targeting the same functionality). That said, your example does illustrate a real problem, and I will investigate.

@akutuzov I understand your frustration, and as the author of the refactoring, I apologize for causing you discomfort. I've opened a separate ticket to cover the issue you reported. Let's continue the discussion about your issue there.
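
For reference, a minimal sketch of what the regression test mentioned in item 1 might look like, reusing the copy-pastable repro from @menshikh-iv above (the file names, dimensionality and probe word come from that example; the test function name is hypothetical):

import numpy as np
from fastText import train_unsupervised
from gensim.models import FastText


def test_load_native_model_trained_with_pretrained_vectors():
    # Train a native FB model, seeding it with pretrained vectors.
    fb_model = train_unsupervised(
        "corpus.txt", model='skipgram', dim=50,
        pretrainedVectors="pretrained.txt",
    )
    fb_model.save_model("ft_with_pretrained.bin")

    # Loading into gensim should succeed and agree with the native model.
    gs_model = FastText.load_fasttext_format("ft_with_pretrained.bin")
    word = 'additional'
    assert np.allclose(gs_model.wv[word], fb_model.get_word_vector(word))

As written, this test fails on gensim 3.7.0 with the AssertionError reported above, which is exactly what a regression test for this issue should capture.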

@mpenkov
Collaborator

mpenkov commented Feb 5, 2019

@lynochka I've investigated your issue and found the cause:

In [8]: from gensim.models._fasttext_bin import load

In [9]: ft = load('ft.bin')

In [10]: ft.min_count
Out[10]: 5

In [11]: len(ft.raw_vocab)
Out[11]: 13967

In [12]: ft.bucket
Out[12]: 2000000

In [13]: ft.vectors_ngrams.shape
Out[13]: (2013967, 50)

In [15]: pre = load('ft_with_pretrained.bin')

In [16]: pre.min_count
Out[16]: 5

In [17]: len(pre.raw_vocab)
Out[17]: 400142

In [18]: len([word for (word, count) in pre.raw_vocab.items() if count >= pre.min_count])
Out[18]: 13967

In [19]: pre.vectors_ngrams.shape
Out[19]: (2400142, 50)

As you can see, adding pre-trained vectors modifies the model:

  1. Without pre-trained vectors, the model contains vectors only for words that occurred at least min_count times
  2. With pre-trained vectors, the model contains vectors for all words

Unfortunately, our loading code always respects the min_count parameter and incorrectly trims the vocabulary when loading the model, causing an inconsistency between the vocab size and the number of vectors.

This inconsistency is what trips the assert.
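
To make the mismatch concrete, here is the arithmetic behind that assert, as a small sketch using the numbers from the IPython session above (the bucket size is inferred from the shapes; the other values are printed above):

# Numbers from the 'ft_with_pretrained.bin' session above.
bucket = 2000000         # same bucket size as ft.bucket
full_vocab = 400142      # len(pre.raw_vocab): all words kept, due to pretrained vectors
trimmed_vocab = 13967    # words with count >= min_count; this is what gensim keeps
num_vectors = 2400142    # pre.vectors_ngrams.shape[0]

assert num_vectors == full_vocab + bucket      # what the native file actually contains
assert num_vectors != trimmed_vocab + bucket   # gensim expects trimmed_vocab + bucket -> AssertionError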

@dpalbrecht

dpalbrecht commented May 2, 2019

I seem to be getting the same sort of errors, and I'm not sure what the fix is given the above discussion.

model_fast = gensim.models.fasttext.load_facebook_vectors('cc.en.300.bin.gz')
or
model_fast = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')

produce the error:
ValueError: cannot reshape array of size 1116604308 into shape (4000000,300)

@mpenkov
Collaborator

mpenkov commented May 3, 2019

@dpalbrecht That looks like a different issue. Can you please open a new ticket, and provide the full stack trace, reproducible sample, version numbers, etc?

@dpalbrecht

@mpenkov Sure, will do.

@kusumlata123

How to load a pre-trained fastText model for any language?

@mpenkov
Collaborator

mpenkov commented Jun 26, 2019

@kusumlata123 Please use the mailing list for questions. Github tickets are for feature requests and bug reports only.

@HongyiLiu90

I use the code below to load a fastText model, fasttext.5.bin, on my laptop and on the CHPC server. It works well on my PC, but raises memory errors on the CHPC. Both servers use Python 3.7.3 and Gensim 3.8.0. The following code was run under Python 3.7.1; I did try the same under Python 3.7.3, but it failed as well.

Python 3.7.1 (default, Oct 23 2018, 19:19:42)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from gensim.models.fasttext import load_facebook_model
>>> model = load_facebook_model('fasttext.5.bin')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1250, in load_facebook_model
    return _load_fasttext_format(path, encoding=encoding, full_model=True)
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1343, in _load_fasttext_format
    max_n=m.maxn,
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/fasttext.py", line 595, in __init__
    self.trainables.prepare_weights(hs, negative, self.wv, update=False, vocabulary=self.vocabulary)
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1130, in prepare_weights
    self.init_ngrams_weights(wv, update=update, vocabulary=vocabulary)
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1150, in init_ngrams_weights
    wv.init_ngrams_weights(self.seed)
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 2219, in init_ngrams_weights
    self.vectors_ngrams = rand_obj.uniform(lo, hi, ngrams_shape).astype(REAL)
  File "mtrand.pyx", line 1312, in mtrand.RandomState.uniform
  File "mtrand.pyx", line 242, in mtrand.cont2_array_sc
MemoryError
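
For context: the traceback above shows the allocation failing while gensim initializes vectors_ngrams, which (per the assert quoted earlier in this thread) holds (vocab_size + bucket) rows of dim float32 values. A back-of-the-envelope sketch of the memory that single matrix needs; all three numbers are illustrative assumptions, since they depend on how fasttext.5.bin was trained:

# Hypothetical sizes; substitute the real model's parameters.
bucket = 2000000      # fastText's default number of ngram buckets
vocab_size = 2000000  # depends on the training corpus
dim = 300             # vector dimensionality

gib = (bucket + vocab_size) * dim * 4 / 1024.0 ** 3  # float32 = 4 bytes
print("vectors_ngrams alone needs ~%.1f GiB" % gib)  # ~4.5 GiB for these numbers

If the CHPC job has a lower memory limit than the PC, an allocation of this size failing there but not locally would be consistent with the MemoryError.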

Repository owner locked as resolved and limited conversation to collaborators Jul 21, 2019