
impossible to load into gensim the fastText model trained with pretrained_vectors #2350

Closed
lynochka opened this issue Jan 24, 2019 · 12 comments
Labels: bug (Issue described a bug), difficulty medium (Medium issue: required good gensim understanding & python skills), fasttext (Issues related to the FastText model)

Comments

@lynochka

Description

When a fastText model is trained natively with pretrained vectors, it is impossible to load the resulting model with gensim.models.FastText.load_fasttext_format.

Steps/Code/Corpus to Reproduce

First, convert the GloVe vectors into word2vec format with gensim (keep "glove.6B.50d.txt" in the local directory):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
_ = glove2word2vec("glove.6B.50d.txt", "w2v_from_glove.6B.50d.txt")

Then take any sample text as TEXT_FOR_WE_FILENAME, e.g.,
https://raw.githubusercontent.com/bbejeck/hadoop-algorithms/master/src/shakespeare.txt
(keep "shakespeare.txt" in the local directory), and train on it with the pretrained vectors from "w2v_from_glove.6B.50d.txt":

TEXT_FOR_WE_FILENAME = "shakespeare.txt"
PRETRAINED_VECTOR_DIM = 50
PRETRAINED_FILE = "w2v_from_glove.6B.50d.txt"
import fastText
model_pre = fastText.train_unsupervised(TEXT_FOR_WE_FILENAME, model='skipgram', dim=PRETRAINED_VECTOR_DIM, pretrainedVectors=PRETRAINED_FILE)
model_pre.save_model("fasttext_model.bin")

The error comes when trying to load this new fastText model into gensim (while it is possible to load it into the original fastText):

import fastText
from gensim.models import FastText as ge_ft
FASTTEXT_MODEL_BIN = "fasttext_model.bin"
# this works
ft_model = fastText.load_model(FASTTEXT_MODEL_BIN)
ft_model.get_word_vector("additional")

# this one does not:
ge_model = ge_ft.load_fasttext_format(FASTTEXT_MODEL_BIN)

Output:

AssertionError: unexpected number of vectors
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<command-3269280551404242> in <module>()
      1 #gensim FastText (having  some different features)
----> 2 ge_model = ge_ft.load_fasttext_format(FASTTEXT_MODEL_BIN)

/databricks/python/lib/python3.5/site-packages/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding)
    778 
    779         """
--> 780         return _load_fasttext_format(model_file, encoding=encoding)
    781 
    782     def load_binary_data(self, encoding='utf8'):

/databricks/python/lib/python3.5/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding)
   1005     model.num_original_vectors = m.vectors_ngrams.shape[0]
   1006 
-> 1007     model.wv.init_post_load(m.vectors_ngrams)
   1008     model.trainables.init_post_load(model, m.hidden_output)
   1009 

/databricks/python/lib/python3.5/site-packages/gensim/models/keyedvectors.py in init_post_load(self, vectors, match_gensim)
   2189         """
   2190         vocab_words = len(self.vocab)
-> 2191         assert vectors.shape[0] == vocab_words + self.bucket, 'unexpected number of vectors'
   2192         assert vectors.shape[1] == self.vector_size, 'unexpected vector dimensionality'
   2193 

AssertionError: unexpected number of vectors

Versions

Linux-4.15.0-1036-azure-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.15.3
SciPy 0.18.1
gensim 3.7.0, FAST_VERSION 0
fasttext 0.8.22

@lynochka changed the title from "impossible to load the model with fastText model" to "impossible to load the fastText trained with pretrained_vectors" Jan 24, 2019
@lynochka changed the title from "impossible to load the fastText trained with pretrained_vectors" to "impossible to load into gensim the fastText model trained with pretrained_vectors" Jan 24, 2019
@menshikh-iv
Contributor

Thank you @lynochka, awesome description 🌟
Issue reproduced (I just rewrote the example code a bit to get a "copy-pastable" example):

import numpy as np
import gensim.downloader as api
from fastText import train_unsupervised
from gensim.models import FastText as FT


PRETRAINED_VECTOR_DIM = 50
TRAINING_TEXT = "corpus.txt"
PRETRAINED_FILE = "pretrained.txt"

FT_MODELL_W_PRETRAINED = "ft_with_pretrained.bin"
FT_MODEL = "ft.bin"

WORD = 'additional'

vectors = api.load('glove-wiki-gigaword-50')
vectors.save_word2vec_format(PRETRAINED_FILE)

corpus = api.load("text8")

with open(TRAINING_TEXT, 'w') as outfile:
    for idx, doc in enumerate(corpus):
        if idx == 100:
            break

        outfile.write(" ".join(doc) + "\n")


# No 'pretrainedVectors' passed to FB (works as expected)

fb_model = train_unsupervised(TRAINING_TEXT, model='skipgram', dim=PRETRAINED_VECTOR_DIM)
fb_model.save_model(FT_MODEL)

gs_model = FT.load_fasttext_format(FT_MODEL)
assert np.allclose(gs_model.wv[WORD], fb_model.get_word_vector(WORD))  # works as expected


# Use 'pretrainedVectors=PRETRAINED_FILE' (error on loading to gensim)

fb_model_pre = train_unsupervised(TRAINING_TEXT, model='skipgram', dim=PRETRAINED_VECTOR_DIM, pretrainedVectors=PRETRAINED_FILE)
fb_model_pre.save_model(FT_MODELL_W_PRETRAINED)


gs_model = FT.load_fasttext_format(FT_MODELL_W_PRETRAINED)  # raised an exception (vector shape mismatch)
assert np.allclose(gs_model.wv[WORD], fb_model_pre.get_word_vector(WORD))

Stacktrace:

AssertionError                            Traceback (most recent call last)
<ipython-input-1-acb5138754e9> in <module>()
     42 
     43 
---> 44 gs_model = FT.load_fasttext_format(FT_MODELL_W_PRETRAINED)  # raised an exception (vector shape mismatch)
     45 assert np.allclose(gs_model.wv[WORD], fb_model_pre.get_word_vector(WORD))

/home/ivan/.virtualenvs/ft36/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in load_fasttext_format(cls, model_file, encoding)
    778 
    779         """
--> 780         return _load_fasttext_format(model_file, encoding=encoding)
    781 
    782     def load_binary_data(self, encoding='utf8'):

/home/ivan/.virtualenvs/ft36/local/lib/python2.7/site-packages/gensim/models/fasttext.pyc in _load_fasttext_format(model_file, encoding)
   1005     model.num_original_vectors = m.vectors_ngrams.shape[0]
   1006 
-> 1007     model.wv.init_post_load(m.vectors_ngrams)
   1008     model.trainables.init_post_load(model, m.hidden_output)
   1009 

/home/ivan/.virtualenvs/ft36/local/lib/python2.7/site-packages/gensim/models/keyedvectors.pyc in init_post_load(self, vectors, match_gensim)
   2189         """
   2190         vocab_words = len(self.vocab)
-> 2191         assert vectors.shape[0] == vocab_words + self.bucket, 'unexpected number of vectors'
   2192         assert vectors.shape[1] == self.vector_size, 'unexpected vector dimensionality'
   2193 

AssertionError: unexpected number of vectors

Definitely something changes when the FB model is initialized with pretrained word vectors.
CC: @mpenkov

@menshikh-iv added the bug, difficulty medium and fasttext labels Jan 25, 2019
@akutuzov
Contributor

akutuzov commented Feb 4, 2019

This is yet another regression after the fastText code refactoring in Gensim 3.7 (another one was fixed in #2341).
Indeed, Gensim 3.6 loads pre-trained fastText models without any trouble. Below are examples with the Wikipedia model from https://fasttext.cc/, but the same happens with any model trained using native fastText.

>>> import gensim
>>> gensim.__version__
'3.6.0'
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> model = gensim.models.fasttext.FastText.load_fasttext_format('wiki.en')
2019-01-24 16:23:47,740 : INFO : loading 2519370 words for fastText model from wiki.en.bin
2019-01-24 16:29:54,820 : INFO : loading weights for 2519370 words for fastText model from wiki.en.bin
2019-01-24 16:37:43,068 : INFO : loaded (2519370, 300) weight matrix for fastText model from wiki.en.bin
>>> model
<gensim.models.fasttext.FastText at 0x7f8e98e2c320>

However, Gensim 3.7 is doing weird things here (retraining the model instead of loading it?):

>>> import gensim
>>> gensim.__version__
'3.7.0'
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> model = gensim.models.fasttext.FastText.load_fasttext_format('wiki.en')
2019-01-24 16:25:50,816 : INFO : loading 2519370 words for fastText model from wiki.en.bin
2019-01-24 16:30:14,701 : INFO : resetting layer weights
2019-01-24 16:30:14,702 : INFO : Total number of ngrams is 0
2019-01-24 16:30:14,702 : INFO : Updating model with new vocabulary
2019-01-24 16:30:40,839 : INFO : New added 2519370 unique words (50% of original 5038740) and increased the count of 2519370 pre-existing words (50% of original 5038740)
2019-01-24 16:31:02,325 : INFO : deleting the raw counts dictionary of 2519370 items
2019-01-24 16:31:02,325 : INFO : sample=0.0001 downsamples 650 most-common words
2019-01-24 16:31:02,326 : INFO : downsampling leaves estimated 4076481917 word corpus (103.2% of prior 3949186974)

After it went like this for an hour, I killed the process.

Gensim 3.7.1 does the same, nothing changed. I'm sorry, but it seems that the fastText refactoring in 3.7 was very poorly tested, with so many things broken :-(

@piskvorky
Owner

piskvorky commented Feb 4, 2019

Sorry to hear that, @akutuzov. Clearer and more predictable model loading was actually one of the main objectives of that refactoring, so this is surprising.

Can you have a look, @mpenkov?

@akutuzov
Contributor

akutuzov commented Feb 4, 2019

Interestingly, even the code proudly presented in the 3.7 Changelog fails in exactly the same way when using exactly the same model mentioned in that changelog.

'Massive improvement FastText compatibilities', indeed :-)

@mpenkov
Collaborator

mpenkov commented Feb 5, 2019

OK, I think we may be dealing with two separate issues here.

  1. @lynochka Loading a native fastText model trained with pretrained_vectors triggers an assertion. We didn't have a test case to cover this use case. I'll add one based on the example provided by @lynochka and @menshikh-iv and resolve the problem (a sketch of such a test follows below).
  2. @akutuzov Gensim doing weird things on loading the native model trained from Wikipedia. I don't think saying that it breaks on any model is fair, given that the example from @lynochka correctly loads a native FB model (no pretrained vectors) without a problem (as do our unit tests targeting the same functionality). That said, your example does illustrate a real problem, and I will investigate.

@akutuzov I understand your frustration, and as the author of the refactoring, I apologize for causing you discomfort. I've opened a separate ticket to cover the issue you reported. Let's continue the discussion about your issue there.
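
For reference, a minimal sketch of what the regression test mentioned in item 1 might look like, reusing the copy-pastable repro from @menshikh-iv above (the file names, dimensionality and probe word come from that example; the test function name is hypothetical):

import numpy as np
from fastText import train_unsupervised
from gensim.models import FastText


def test_load_native_model_trained_with_pretrained_vectors():
    # Train a native FB model, seeding it with pretrained vectors.
    fb_model = train_unsupervised(
        "corpus.txt", model='skipgram', dim=50,
        pretrainedVectors="pretrained.txt",
    )
    fb_model.save_model("ft_with_pretrained.bin")

    # Loading into gensim should succeed and agree with the native model.
    gs_model = FastText.load_fasttext_format("ft_with_pretrained.bin")
    word = 'additional'
    assert np.allclose(gs_model.wv[word], fb_model.get_word_vector(word))

As written, this test fails on gensim 3.7.0 with the AssertionError reported above, which is exactly what a regression test for this issue should capture.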

@mpenkov
Collaborator

mpenkov commented Feb 5, 2019

@lynochka I've investigated your issue and found the cause:

In [8]: from gensim.models._fasttext_bin import load

In [9]: ft = load('ft.bin')

In [10]: ft.min_count
Out[10]: 5

In [11]: len(ft.raw_vocab)
Out[11]: 13967

In [12]: ft.bucket
Out[12]: 2000000

In [13]: ft.vectors_ngrams.shape
Out[13]: (2013967, 50)

In [15]: pre = load('ft_with_pretrained.bin')

In [16]: pre.min_count
Out[16]: 5

In [17]: len(pre.raw_vocab)
Out[17]: 400142

In [18]: len([word for (word, count) in pre.raw_vocab.items() if count >= pre.min_count])
Out[18]: 13967

In [19]: pre.vectors_ngrams.shape
Out[19]: (2400142, 50)

As you can see, adding pre-trained vectors modifies the model:

  1. Without pre-trained vectors, the model contains vectors only for words that occurred at least min_count times
  2. With pre-trained vectors, the model contains vectors for all words

Unfortunately, our loading code always respects the min_count parameter and incorrectly trims the vocabulary when loading the model, causing an inconsistency between the vocab size and the number of vectors.

This inconsistency is what trips the assert.
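
To make the mismatch concrete, here is the arithmetic behind that assert, as a small sketch using the numbers from the IPython session above (the bucket size is inferred from the shapes; the other values are printed above):

# Numbers from the 'ft_with_pretrained.bin' session above.
bucket = 2000000         # same bucket size as ft.bucket
full_vocab = 400142      # len(pre.raw_vocab): all words kept, due to pretrained vectors
trimmed_vocab = 13967    # words with count >= min_count; this is what gensim keeps
num_vectors = 2400142    # pre.vectors_ngrams.shape[0]

assert num_vectors == full_vocab + bucket      # what the native file actually contains
assert num_vectors != trimmed_vocab + bucket   # gensim expects trimmed_vocab + bucket -> AssertionError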

@dpalbrecht

dpalbrecht commented May 2, 2019

I seem to be getting the same sort of errors, and I'm not sure what the fix is given the above discussion.

model_fast = gensim.models.fasttext.load_facebook_vectors('cc.en.300.bin.gz')
or
model_fast = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')

produce the error:
ValueError: cannot reshape array of size 1116604308 into shape (4000000,300)

@mpenkov
Collaborator

mpenkov commented May 3, 2019

@dpalbrecht That looks like a different issue. Can you please open a new ticket, and provide the full stack trace, reproducible sample, version numbers, etc?

@dpalbrecht

@mpenkov Sure, will do.

@kusumlata123

How to load a pre-trained fastText model for any language?

@mpenkov
Collaborator

mpenkov commented Jun 26, 2019

@kusumlata123 Please use the mailing list for questions. Github tickets are for feature requests and bug reports only.

@HongyiLiu90

I use the code below to load a fastText model, fasttext.5.bin, on my laptop and on the CHPC server. It works well on my PC, but raises memory errors on the CHPC. Both servers use Python 3.7.3 and Gensim 3.8.0. The following code was run under Python 3.7.1; I did try the same under Python 3.7.3, but it failed as well.

Python 3.7.1 (default, Oct 23 2018, 19:19:42)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from gensim.models.fasttext import load_facebook_model
>>> model = load_facebook_model('fasttext.5.bin')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1250, in load_facebook_model
    return _load_fasttext_format(path, encoding=encoding, full_model=True)
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1343, in _load_fasttext_format
    max_n=m.maxn,
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/fasttext.py", line 595, in __init__
    self.trainables.prepare_weights(hs, negative, self.wv, update=False, vocabulary=self.vocabulary)
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1130, in prepare_weights
    self.init_ngrams_weights(wv, update=update, vocabulary=vocabulary)
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1150, in init_ngrams_weights
    wv.init_ngrams_weights(self.seed)
  File "/home/USER/.conda/envs/Anaconda3-5.2.0/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 2219, in init_ngrams_weights
    self.vectors_ngrams = rand_obj.uniform(lo, hi, ngrams_shape).astype(REAL)
  File "mtrand.pyx", line 1312, in mtrand.RandomState.uniform
  File "mtrand.pyx", line 242, in mtrand.cont2_array_sc
MemoryError
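
For context: the traceback above shows the allocation failing while gensim initializes vectors_ngrams, which (per the assert quoted earlier in this thread) holds (vocab_size + bucket) rows of dim float32 values. A back-of-the-envelope sketch of the memory that single matrix needs; all three numbers are illustrative assumptions, since they depend on how fasttext.5.bin was trained:

# Hypothetical sizes; substitute the real model's parameters.
bucket = 2000000      # fastText's default number of ngram buckets
vocab_size = 2000000  # depends on the training corpus
dim = 300             # vector dimensionality

gib = (bucket + vocab_size) * dim * 4 / 1024.0 ** 3  # float32 = 4 bytes
print("vectors_ngrams alone needs ~%.1f GiB" % gib)  # ~4.5 GiB for these numbers

If the CHPC job has a lower memory limit than the PC, an allocation of this size failing there but not locally would be consistent with the MemoryError.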

Repository owner locked as resolved and limited conversation to collaborators Jul 21, 2019