loading fastText model trained with pretrained_vectors still fails (see: #2350) #2378

cbjrobertson · 2019-02-09T12:49:09Z

Description

Loading the pretrained fastText model with gensim.models.fasttext.FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin') fails with AssertionError: unexpected number of vectors, despite the fix for #2350.

Steps/Code/Corpus to Reproduce

First install the develop branch with: pip install --upgrade git+git://github.com/RaRe-Technologies/gensim@develop, then:

#dependencies 
import requests, zipfile, io
from gensim.models.fasttext import FastText

#download model
ft_url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.bin.zip'
r = requests.get(ft_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

#attempt load
mod = FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin')

Expected Results

Loaded model.

Actual Results

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-29-a054256d6f88> in <module>
      1 #load model
      2 from gensim.models.fasttext import FastText
----> 3 mod = FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin')
      4 # from gensim.models import KeyedVectors
      5 # wv = KeyedVectors.load_word2vec_format('wiki-news-300d-1M-subword.vec')

/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding, full_model)
   1012 
   1013         """
-> 1014         return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)
   1015 
   1016     def load_binary_data(self, encoding='utf8'):

/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
   1270     #
   1271     # We explicitly set min_count=1 regardless of the model's parameters to
-> 1272     # ignore the trim rule when building the vocabulary.  We do this in order
   1273     # to support loading native models that were trained with pretrained vectors.
   1274     # Such models will contain vectors for _all_ encountered words, not only

/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/keyedvectors.py in init_post_load(self, vectors, match_gensim)
   2205         """
   2206         vocab_words = len(self.vocab)
-> 2207         assert vectors.shape[0] == vocab_words + self.bucket, 'unexpected number of vectors'
   2208         assert vectors.shape[1] == self.vector_size, 'unexpected vector dimensionality'
   2209 

AssertionError: unexpected number of vectors

Versions

Darwin-18.2.0-x86_64-i386-64bit
Python 3.7.2 (default, Dec 29 2018, 00:00:04)
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.16.1
SciPy 1.2.0
gensim 3.7.1
FAST_VERSION 1

Thanks for your work!

cbjrobertson · 2019-02-12T10:17:34Z

Is there anything I can do to help this get looked at?

mpenkov · 2019-02-12T13:34:41Z

@cbjrobertson Thank you for reporting this.

Is there anything I can do to help this get looked at?

One thing that would help is a reproducible example with a smaller model. The model you've linked to is several GB; if you could find a smaller model, it'd be easier (quicker) to reproduce the bug.

If not, then no big deal, and I'll try to have a look at it during the week.

cbjrobertson · 2019-02-12T15:39:46Z

Cool... so... here goes:

Reproducible code:

#dependencies 
import requests, zipfile, io, os
from gensim.models.fasttext import FastText


#download model
ft_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip'
zpath = 'crawl-300d-2M-subword.zip'
mpath = 'crawl-300d-2M-subword.bin'

if os.path.isfile(mpath):
    #attempt load
    mod = FastText.load_fasttext_format(mpath,encoding='utf-8')
elif not os.path.isfile(zpath):
    r = requests.get(ft_url)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()
else:
    z = zipfile.ZipFile(zpath)
    z.extractall()
    mod = FastText.load_fasttext_format(mpath,encoding='utf-8')

This at least downloads from a faster source. However, it fails with:

Traceback (most recent call last):

  File "<ipython-input-1-5ff598c5927b>", line 1, in <module>
    runfile('/Users/cole/Desktop/fast_text_test.py', wdir='/Users/cole/Desktop')

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 704, in runfile
    execfile(filename, namespace)

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/Users/cole/Desktop/fast_text_test.py", line 21, in <module>
    mod = FastText.load_fasttext_format(mpath,encoding='utf-8')

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1014, in load_fasttext_format
    return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1248, in _load_fasttext_format
    m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/_fasttext_bin.py", line 257, in load
    raw_vocab, vocab_size, nwords = _load_vocab(fin, new_format, encoding=encoding)

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/_fasttext_bin.py", line 177, in _load_vocab
    word = word_bytes.decode(encoding)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 57: unexpected end of data

I patched/hacked my way around that by changing the following two lines of code in gensim/models/_fasttext_bin.py:

+ Line 35         from bs4 import UnicodeDammit
- Line 178        #word = word_bytes.decode(encoding)
+ Line 178        word = UnicodeDammit(word_bytes).unicode_markup

This actually runs for a while, but then fails with:

Traceback (most recent call last):

  File "<ipython-input-2-5ff598c5927b>", line 1, in <module>
    runfile('/Users/cole/Desktop/fast_text_test.py', wdir='/Users/cole/Desktop')

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 704, in runfile
    execfile(filename, namespace)

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/Users/cole/Desktop/fast_text_test.py", line 21, in <module>
    mod = FastText.load_fasttext_format(mpath,encoding='utf-8')

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1014, in load_fasttext_format
    return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1248, in _load_fasttext_format
    m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/_fasttext_bin.py", line 262, in load
    vectors_ngrams = _load_matrix(fin, new_format=new_format)

  File "/anaconda3/envs/tensor_env/lib/python3.7/site-packages/gensim/models/_fasttext_bin.py", line 225, in _load_matrix
    matrix = matrix.reshape((num_vectors, dim))

ValueError: cannot reshape array of size 179573500 into shape (4000000,300)

You may want to simply download the original link by hand. When I did it directly it was a lot faster...
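Note that the download script earlier in this comment never calls the loader on the branch that downloads the archive, and it never writes the archive to disk, so the zip check only passes after a manual download. A minimal sketch of the same flow with the steps separated is shown here. The helper name ensure_model_file and the fetch_zip callable are hypothetical and stand in for whatever download method is used:

```python
import os
import zipfile


def ensure_model_file(model_path, zip_path, fetch_zip):
    """Return model_path, fetching and extracting the archive first if needed.

    fetch_zip is a hypothetical callable that writes the archive to zip_path,
    for example a wrapper around requests.get or a manual download step.
    """
    if not os.path.isfile(model_path):
        if not os.path.isfile(zip_path):
            fetch_zip(zip_path)
        # Extract next to the expected model file.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(os.path.dirname(model_path) or ".")
    return model_path


# Usage sketch (not executed here; needs gensim and the multi-GB archive):
# mpath = ensure_model_file('crawl-300d-2M-subword.bin',
#                           'crawl-300d-2M-subword.zip', fetch_zip)
# mod = FastText.load_fasttext_format(mpath, encoding='utf-8')
```

With this shape the load call runs exactly once regardless of which files already exist.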

mpenkov · 2019-02-13T04:31:42Z

@cbjrobertson I'm having trouble reproducing the original problem (assertion error when loading model).

First, I downloaded the model from the URL in your original message and unpacked it using the command-line gunzip utility. Next, I used the code below (based on your example) to load the model:

import logging
logging.basicConfig(level=logging.INFO)
from gensim.models.fasttext import FastText

mod = FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin')

and got the UnicodeError that you described:

(devel.env) mpenkov@hetrad2:~/2378$ python bug.py 
INFO:gensim.summarization.textcleaner:'pattern' package not found; tag filters are not available for English
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from wiki-news-300d-1M-subword.bin
Traceback (most recent call last):
  File "bug.py", line 5, in <module>
    mod = FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin')
  File "/home/mpenkov/git/gensim/gensim/models/fasttext.py", line 1014, in load_fasttext_format
    return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)
  File "/home/mpenkov/git/gensim/gensim/models/fasttext.py", line 1248, in _load_fasttext_format
    m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
  File "/home/mpenkov/git/gensim/gensim/models/_fasttext_bin.py", line 257, in load
    raw_vocab, vocab_size, nwords = _load_vocab(fin, new_format, encoding=encoding)
  File "/home/mpenkov/git/gensim/gensim/models/_fasttext_bin.py", line 177, in _load_vocab
    word = word_bytes.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 57: unexpected end of data

I stepped through the code with a debugger and had a closer look at the data. Some of the vocab terms contained bad Unicode (unable to decode cleanly). I think it's a problem with the model, but the native fastText utility loads this model without any problem, so I added code to handle the UnicodeError. Once that was done, I was able to load the model without any problems.

In summary, I was unable to reproduce the failed assertion you initially demonstrated. Could you please try reproducing the problem with this branch? It contains the UnicodeError fix.

Here's that code running on your model:

(devel.env) mpenkov@hetrad2:~/2378$ python bug.py
INFO:gensim.summarization.textcleaner:'pattern' package not found; tag filters are not available for English
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from wiki-news-300d-1M-subword.bin
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xe8\x8b\xb1\xe8\xaa\x9e\xe7\x89\x88\xe3\x82\xa6\xe3\x82\xa3\xe3\x82\xad\xe3\x83\x9a\xe3\x83\x87\xe3\x82\xa3\xe3\x82\xa2\xe3\x81\xb8\xe3\x81\xae\xe6\x8a\x95\xe7\xa8\xbf\xe3\x81\xaf\xe3\x81\x84\xe3\x81\xa4\xe3\x81\xa7\xe3\x82\x82\xe6' to word '英語版ウィキペディアへの投稿はいつでも'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xe4\xbd\x86\xe6\x98\xaf\xe4\xbd\xa0\xe9\x80\x99\xe6\xac\xa1\xe5\x9f\xb7\xe7\xad\x86\xe6\x89\x80\xe4\xbd\xbf\xe7\x94\xa8\xe7\x9a\x84\xe8\x8b\xb1\xe6\x96\x87\xe4\xb8\xa6\xe6\x9c\xaa\xe9\x81\x94\xe5\x88\xb0\xe8\x8b\xb1\xe6\x96\x87\xe7' to word '但是你這次執筆所使用的英文並未達到英文'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xd0\xb0\xd0\xb4\xd0\xbc\xd0\xb8\xd0\xbd\xd0\xb8\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd1\x82\xd0\xb8\xd0\xb2\xd0\xbd\xd0\xbe-\xd1\x82\xd0\xb5\xd1\x80\xd1\x80\xd0\xb8\xd1\x82\xd0\xbe\xd1\x80\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1' to word 'административно-территориальн'
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.keyedvectors:Total number of ngrams is 0
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 5698 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 5635237139 word corpus (61.2% of prior 9203539378)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from wiki-news-300d-1M-subword.bin
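
The "unable to cleanly decode" entries above are consistent with vocabulary entries whose final multi-byte UTF-8 character was cut off. In the third entry the stray byte sits at index 57, the same position the UnicodeDecodeError messages report. A minimal sketch of a tolerant decode follows, assuming the fallback is simply to drop the incomplete trailing bytes. The helper name decode_vocab_word is hypothetical, and this is not necessarily how the branch implements the fix:

```python
def decode_vocab_word(word_bytes, encoding="utf-8"):
    """Decode a vocabulary entry; return (word, decoded_cleanly)."""
    try:
        return word_bytes.decode(encoding), True
    except UnicodeDecodeError:
        # errors="ignore" drops the bytes that cannot be decoded,
        # here the lone lead byte left at the end of the entry.
        return word_bytes.decode(encoding, errors="ignore"), False


# Bytes copied from the third ERROR line in the log above.
raw = (
    b"\xd0\xb0\xd0\xb4\xd0\xbc\xd0\xb8\xd0\xbd\xd0\xb8\xd1\x81\xd1\x82"
    b"\xd1\x80\xd0\xb0\xd1\x82\xd0\xb8\xd0\xb2\xd0\xbd\xd0\xbe-"
    b"\xd1\x82\xd0\xb5\xd1\x80\xd1\x80\xd0\xb8\xd1\x82\xd0\xbe\xd1\x80"
    b"\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1"
)
word, clean = decode_vocab_word(raw)
print(clean, word)
```

Dropping bytes changes the vocabulary key, so logging each affected entry, as the branch does, keeps the substitution visible.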

cbjrobertson · 2019-02-13T08:57:45Z

Very weird! Yes, I will. Thanks. I'll try later today and let you know.

mpenkov · 2019-02-17T00:28:24Z

@cbjrobertson How’s it going?

cbjrobertson · 2019-02-18T17:11:51Z

Sorry to be slow. I am doing it now, though. It's taking a long time to load. I will update you ASAP.

cbjrobertson · 2019-02-18T17:37:03Z

@mpenkov It is now failing to replicate for me as well. On your unicode branch it loads fine. Maybe my downloaded file was corrupt. Though the ValueError still remains for the other model. Would you recommend opening another issue for that? I don't need to load it specifically, but it is one of the FT released large models. I imagine gensim should load those without error.

mpenkov · 2019-02-18T22:57:40Z

It is now failing to replicate for me as well. On your unicode branch it loads fine. Maybe my downloaded file was corrupt.

OK, glad to hear it worked. We'll include the unicode fix in the next bugfix release.

Though the ValueError still remains for the other model. Would you recommend opening another issue for that? I don't need to load it specifically, but it is one of the FT released large models. I imagine gensim should load those without error.

Yes, we should load these without error. Could you please clarify which model you are unable to load?

cbjrobertson · 2019-02-19T12:38:11Z

Hey--it's the model referenced in the reproducible code in my first reply. Download url is https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip. I get the same ValueError that I report above when I load it with your unicode branch.

mpenkov · 2019-02-20T05:38:19Z

@cbjrobertson I cannot reproduce the problem with that model file, either.

Source:

import logging
logging.basicConfig(level=logging.INFO)
from gensim.models.fasttext import FastText
mod = FastText.load_fasttext_format('crawl-300d-2M-subword.bin')

Output:

INFO:gensim.summarization.textcleaner:'pattern' package not found; tag filters are not available for English
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from crawl-300d-2M-subword.bin
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'DeutschHrvatskiEnglishDanskNederlandssuomiFran\xc3\xa7ais\xce\x95\xce\xbb\xce\xbb\xce' to word 'DeutschHrvatskiEnglishDanskNederlandssuomiFrançaisΕλλ'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xe3\x81\x99\xe3\x81\xb9\xe3\x81\xa6\xe3\x81\xae\xe5\x9b\x9e\xe7\xad\x94\xe3\x82\x92\xe9\x9d\x9e\xe8\xa1\xa8\xe7\xa4\xba\xe3\x81\xab\xe3\x81\x99\xe3\x82\x8b\xe8\xb3\xaa\xe5\x95\x8f\xe3\x82\x92\xe5\x89\x8a\xe9\x99\xa4\xe3\x81\x97\xe3' to word 'すべての回答を非表示にする質問を削除し'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'00Z\xe9\x83\xa8\xe5\xb1\x8b\xe3\x82\xbf\xe3\x82\xa4\xe3\x83\x97\xe3\x81\xbe\xe3\x82\x8b\xe3\x81\xbe\xe3\x82\x8b\xe8\xb2\xb8\xe5\x88\x87\xe5\xbb\xba\xe7\x89\xa9\xe3\x82\xbf\xe3\x82\xa4\xe3\x83\x97\xe4\xb8\x80\xe8\xbb\x92\xe5' to word '00Z部屋タイプまるまる貸切建物タイプ一軒'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'2017\xe6\x88\xbf\xe9\x97\xb4\xe7\xb1\xbb\xe5\x9e\x8b\xe7\x8b\xac\xe7\xab\x8b\xe6\x88\xbf\xe9\x97\xb4\xe6\x88\xbf\xe6\xba\x90\xe7\xb1\xbb\xe5\x9e\x8b\xe7\x8b\xac\xe7\xab\x8b\xe5\xb1\x8b\xe5\x8f\xaf\xe4\xbd\x8f2\xe5\x8d' to word '2017房间类型独立房间房源类型独立屋可住2'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'2016\xe6\x88\xbf\xe9\x97\xb4\xe7\xb1\xbb\xe5\x9e\x8b\xe7\x8b\xac\xe7\xab\x8b\xe6\x88\xbf\xe9\x97\xb4\xe6\x88\xbf\xe6\xba\x90\xe7\xb1\xbb\xe5\x9e\x8b\xe7\x8b\xac\xe7\xab\x8b\xe5\xb1\x8b\xe5\x8f\xaf\xe4\xbd\x8f2\xe5\x8d' to word '2016房间类型独立房间房源类型独立屋可住2'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'00Z\xe9\x83\xa8\xe5\xb1\x8b\xe3\x82\xbf\xe3\x82\xa4\xe3\x83\x97\xe3\x81\xbe\xe3\x82\x8b\xe3\x81\xbe\xe3\x82\x8b\xe8\xb2\xb8\xe5\x88\x87\xe5\xbb\xba\xe7\x89\xa9\xe3\x82\xbf\xe3\x82\xa4\xe3\x83\x97\xe5\x88\xa5\xe8\x8d\x98\xe5' to word '00Z部屋タイプまるまる貸切建物タイプ別荘'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xe6\xb6\x88\xe8\xb2\xbb\xe8\x80\x85\xe7\x9f\xa5\xe9\x81\x93\xe4\xb8\x80\xe5\x80\x8b\xe8\xa3\xbd\xe9\x80\xa0\xe5\x95\x86\xe5\x85\xb6\xe4\xb8\xad\xe4\xb8\x80\xe5\x80\x8b\xe7\x94\xa2\xe5\x93\x81\xe7\x9a\x84\xe4\xb8\x80\xe8\x88\xac\xe5' to word '消費者知道一個製造商其中一個產品的一般'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xce\xb9\xce\xb4\xce\xb9\xce\xbf\xce\xba\xcf\x84\xce\xb7\xcf\x83\xce\xaf\xce\xb1\xcf\x82\xce\x94\xce\xb9\xce\xb1\xce\xbc\xce\xad\xcf\x81\xce\xb9\xcf\x83\xce\xbc\xce\xb1\xce\x86\xcf\x84\xce\xbf\xce\xbc\xce\xb12\xce\xa5\xcf\x80\xce' to word 'ιδιοκτησίαςΔιαμέρισμαΆτομα2Υπ'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xe6\x88\x96\xe5\x85\xb6\xe4\xbb\x96\xe5\xae\x98\xe6\x96\xb9\xe7\x82\xb9\xe8\xaf\x84\xe6\x94\xb6\xe9\x9b\x86\xe5\x90\x88\xe4\xbd\x9c\xe4\xbc\x99\xe4\xbc\xb4\xe6\x8f\x90\xe4\xbe\x9b\xe7\x9a\x84\xe5\xb7\xa5\xe5\x85\xb7\xe9\xbc\x93\xe5' to word '或其他官方点评收集合作伙伴提供的工具鼓'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'00Z\xe6\x88\xbf\xe9\x96\x93\xe9\xa1\x9e\xe5\x9e\x8b\xe7\xa7\x81\xe4\xba\xba\xe6\x88\xbf\xe9\x96\x93\xe6\x88\xbf\xe6\xba\x90\xe9\xa1\x9e\xe5\x9e\x8b\xe5\xae\xb6\xe5\xba\xad\xe5\xbc\x8f\xe6\x97\x85\xe9\xa4\xa8\xe5\x8f\xaf\xe4' to word '00Z房間類型私人房間房源類型家庭式旅館可'
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.keyedvectors:Total number of ngrams is 0
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 5738 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 306654075079 word corpus (60.7% of prior 505076145972)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from crawl-300d-2M-subword.bin

I'm using the commit 6dc4aef to test this.

Can you please confirm?

cbjrobertson · 2019-02-20T12:34:48Z

Ag. Neither can I. I am still using your unicode branch, as I couldn't get an install of 6dc4aef to work. If you send me the specific pip install ... command, I'm happy to try. Again, perhaps it has something to do with the models, or this commit... which has been added to your develop branch since I last tried to use it.

mpenkov · 2019-02-20T13:27:19Z

That commit is irrelevant to the current issue.

So, you can load the model correctly using the unicode branch?

cbjrobertson · 2019-02-20T13:50:05Z

Yes. Cole

cbjrobertson · 2019-03-29T11:09:21Z

I don’t manage that link. I found it somewhere else. Sorry, can’t help.

On 29 Mar 2019, at 03:29, eadka wrote: The link provided in "...Could you please try reproducing the problem with this branch?" doesn't exist. Please redirect.

rianrajagede · 2019-04-13T18:27:09Z

I just got this error too. I ran this in Colab:

!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.id.300.bin.gz
!pip install --upgrade gensim

from gensim.models.fasttext import FastText, load_facebook_vectors
model = load_facebook_vectors("cc.id.300.bin.gz")

It returns an error:

ValueError                                Traceback (most recent call last)

<ipython-input-3-ea4d9b419339> in <module>()
      1 from gensim.models.fasttext import FastText, load_facebook_vectors, load_facebook_model
      2 
----> 3 model = load_facebook_vectors("cc.id.300.bin.gz")

/usr/local/lib/python3.6/dist-packages/gensim/models/fasttext.py in load_facebook_vectors(path, encoding)
   1297 
   1298     """
-> 1299     model_wrapper = _load_fasttext_format(path, encoding=encoding, full_model=False)
   1300     return model_wrapper.wv
   1301 

/usr/local/lib/python3.6/dist-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
   1321     """
   1322     with smart_open(model_file, 'rb') as fin:
-> 1323         m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
   1324 
   1325     model = FastText(

/usr/local/lib/python3.6/dist-packages/gensim/models/_fasttext_bin.py in load(fin, encoding, full_model)
    272     model.update(raw_vocab=raw_vocab, vocab_size=vocab_size, nwords=nwords)
    273 
--> 274     vectors_ngrams = _load_matrix(fin, new_format=new_format)
    275 
    276     if not full_model:

/usr/local/lib/python3.6/dist-packages/gensim/models/_fasttext_bin.py in _load_matrix(fin, new_format)
    235 
    236     matrix = np.fromfile(fin, dtype=dtype, count=num_vectors * dim)
--> 237     matrix = matrix.reshape((num_vectors, dim))
    238     return matrix
    239 

ValueError: cannot reshape array of size 1117560053 into shape (4000000,300)

Is there any solution for this error? I've also seen this comment; is there any update?
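
For reference, both ValueErrors in this thread report an element count that is smaller than 4000000 × 300 = 1200000000 and is not a multiple of 300. Since np.fromfile returns only as many elements as it could read, this suggests the read stopped partway through the matrix. The thread does not establish why (a truncated file and an interrupted read are both possible). A small sketch of the arithmetic, with a hypothetical helper name:

```python
def describe_short_read(elements_read, num_vectors, dim):
    """Compare the number of elements read with the shape the header promises."""
    expected = num_vectors * dim
    full_rows, leftover = divmod(elements_read, dim)
    return {
        "expected": expected,
        "missing": expected - elements_read,
        "full_rows": full_rows,   # complete vectors that were read
        "leftover": leftover,     # elements of a partially read vector
    }


# Element counts taken from the two ValueError messages in this thread.
for n in (179573500, 1117560053):
    print(n, describe_short_read(n, 4000000, 300))
```

A non-zero leftover means the data ends in the middle of a vector, which a correctly sized matrix would never do.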

mpenkov · 2019-04-13T22:57:48Z

@rianrajagede What version of gensim are you using?

rianrajagede · 2019-04-14T00:26:15Z

@mpenkov I use Gensim 3.7.2

mpenkov · 2019-04-14T02:29:32Z

OK, I will investigate and get back to you. Thank you for reporting this.

mpenkov · 2019-04-20T14:10:55Z

@rianrajagede I could not reproduce the problem with the latest version of gensim (3.7.2). Could you please try and let me know?

cbjrobertson · 2019-04-20T14:18:22Z

@rianrajagede when I was unable to reproduce my original error even though I was running the same code, I had re-downloaded the model. My conjecture was that the model file itself had been corrupted. Have you tried simply downloading the model again?

rianrajagede · 2019-04-21T07:05:51Z

I run it on Google Colab. I've tried another language and re-downloaded the model, but the result is still the same:

https://colab.research.google.com/drive/1noO8IwoQyKn_60XAk1sJJRjVZA0xeWfQ

mpenkov · 2019-04-21T14:28:33Z

Can you please confirm that the version installed in your Google Colab is 3.7.2?

rianrajagede · 2019-04-21T16:10:56Z

Google Colab provides Gensim 3.6.0 by default. On the first run, if I call !pip show gensim it returns:

Name: gensim
Version: 3.6.0
Summary: Python framework for fast Vector Space Modelling
Home-page: http://radimrehurek.com/gensim
Author: Radim Rehurek
Author-email: [email protected]
License: LGPLv2.1
Location: /usr/local/lib/python3.6/dist-packages
Requires: scipy, smart-open, numpy, six
Required-by:

But as in my code above, I always start my notebook with !pip install --upgrade gensim. After the upgrade, !pip show gensim returns:

Name: gensim
Version: 3.7.2
Summary: Python framework for fast Vector Space Modelling
Home-page: http://radimrehurek.com/gensim
Author: Radim Rehurek
Author-email: [email protected]
License: LGPLv2.1
Location: /usr/local/lib/python3.6/dist-packages
Requires: six, smart-open, scipy, numpy
Required-by:
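
pip show reports what is installed on disk, which can differ from what a running notebook kernel imported before an upgrade. One way to check both from inside the notebook is sketched below. The helper name version_report is hypothetical, and importlib.metadata assumes Python 3.8 or newer (newer than the Python 3.6 runtime shown in the tracebacks above):

```python
import sys
from importlib import metadata


def version_report(package):
    """Return (installed_version, loaded_version); either may be None."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None
    # Only report a loaded version if the package was already imported.
    module = sys.modules.get(package)
    loaded = getattr(module, "__version__", None) if module else None
    return installed, loaded


print(version_report("gensim"))
```

If the two values differ, restarting the runtime after the upgrade would make the kernel pick up the newly installed version.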

SenthilVikram · 2021-03-06T17:49:06Z

I have a better alternative: instead of using gensim, you can install the fasttext package and use fasttext.load_model().

Installing:
pip install fasttext

Usage:
import fasttext
# fastText .bin embedding file saved in the working directory (downloaded from
# https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip)
file = "wiki-news-300d-1M-subword.bin"
model = fasttext.load_model(file)

Testing:
print(model["hello"].shape)  # prints (300,)

piskvorky assigned mpenkov Feb 12, 2019

mpenkov added the bug, need info, and fasttext labels Feb 13, 2019

mpenkov mentioned this issue Feb 20, 2019

handle UnicodeDecodeError when loading vocabulary #2390

Merged

mpenkov closed this as completed in #2390 Feb 21, 2019

mpenkov mentioned this issue Mar 7, 2019

AssertionError: unexpected number of vectors when loading Korean FB model #2402

Closed

mpenkov reopened this Apr 14, 2019

mpenkov removed the need info label May 4, 2019

rianrajagede mentioned this issue May 28, 2019

High RAM usage when loading FastText Model on Google Colab #2502

Closed

piskvorky closed this as completed Oct 8, 2019
