loading fastText model trained with pretrained_vectors still fails (see: #2350) #2378
Comments
Is there anything I can do to help this get looked at? |
@cbjrobertson Thank you for reporting this.
One thing that would help is a reproducible example with a smaller model. The model you've linked to is several GB; if you could find a smaller one, it would be quicker to reproduce the bug. If not, then no big deal, and I'll try to have a look at it during the week. |
Cool... so... here goes. Reproducible code:
# dependencies
import requests, zipfile, io, os
from gensim.models.fasttext import FastText

# download model
ft_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip'
zpath = 'crawl-300d-2M-subword.zip'
mpath = 'crawl-300d-2M-subword.bin'
if os.path.isfile(mpath):
    # attempt load
    mod = FastText.load_fasttext_format(mpath, encoding='utf-8')
elif not os.path.isfile(zpath):
    r = requests.get(ft_url)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()
else:
    z = zipfile.ZipFile(zpath)
    z.extractall()
    mod = FastText.load_fasttext_format(mpath, encoding='utf-8')
This at least downloads from a faster source. However, it fails with:
I patched/hacked my way around that by adding the following two lines of code
Which actually runs for a while, but then fails with:
You may want to simply download the original link by hand. When I did it directly it was a lot faster... |
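The snippet above folds extraction and loading into one if/elif/else chain, so the model is only loaded in some branches. A minimal sketch of one way to separate the two steps, using only the standard library. `ensure_extracted` is a hypothetical helper name (not part of gensim or fastText), and it assumes the `.bin` file sits at the top level of the archive, as the snippet above also assumes.

```python
import os
import zipfile


def ensure_extracted(zip_path, member, dest_dir="."):
    """Return the path to `member`, extracting it from `zip_path` if needed.

    Hypothetical helper for illustration; not part of gensim or fastText.
    """
    target = os.path.join(dest_dir, member)
    # Only open the archive when the extracted file is not already present.
    if not os.path.isfile(target):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extract(member, path=dest_dir)
    return target
```

With the archive already on disk, `mpath = ensure_extracted(zpath, 'crawl-300d-2M-subword.bin')` could then be followed by a single, unconditional `FastText.load_fasttext_format(mpath)` call.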
@cbjrobertson I'm having trouble reproducing the original problem (assertion error when loading model). First, I downloaded the model from the URL in your original message and unpacked it using the command-line gunzip utility. Next, I used the below code (based on your example) to load the model:
import logging
logging.basicConfig(level=logging.INFO)
from gensim.models.fasttext import FastText
mod = FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin')
and got the UnicodeError that you described:
I stepped through the code with a debugger and had a closer look at the data. Some of the vocab terms contained bad Unicode (unable to decode cleanly). I think it's a problem with the model, but the native fastText utility loads this model without any problem, so I added code to handle the UnicodeError. Once that was done, I was able to load the model without any problems. In summary, I was unable to reproduce the failed assertion you initially demonstrated. Could you please try reproducing the problem with this branch? It contains the UnicodeError fix. Here's that code running on your model:
|
Very weird! Yes, I will. Thanks. I'll try later today and let you know. |
@cbjrobertson How’s it going? |
Sorry to be slow. I am doing it now, though. It's taking a long time to load. I will update you ASAP. |
@mpenkov It is now failing to replicate for me as well. On your unicode branch it loads fine. Maybe my downloaded file was corrupt. Though the |
OK, glad to hear it worked. We'll include the unicode fix in the next bugfix release.
Yes, we should load these without error. Could you please clarify which model you are unable to load? |
Hey--it's the model referenced in the reproducible code in my first reply. Download url is |
@cbjrobertson I cannot reproduce the problem with that model file, either. Source:
import logging
logging.basicConfig(level=logging.INFO)
from gensim.models.fasttext import FastText
mod = FastText.load_fasttext_format('crawl-300d-2M-subword.bin')
Output:
I'm using the commit 6dc4aef to test this. Can you please confirm? |
Ag. Neither can I. I am still using your unicode branch, as I couldn't get an install to work of |
That commit is irrelevant to the current issue. So, you can load the model correctly using the unicode branch? |
Yes.
Cole
|
I don’t manage that link. I found it somewhere else. Sorry, can’t help.
… On 29 Mar 2019, at 03:29, eadka ***@***.***> wrote:
@cbjrobertson I'm having trouble reproducing the original problem (assertion error when loading model).
First, I downloaded the model from the URL in your original message and unpacked it using the command-line gunzip utility. Next, I used the below code (based on your example) to load the model:
import logging
logging.basicConfig(level=logging.INFO)
from gensim.models.fasttext import FastText
mod = FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin')
and got the UnicodeError that you described:
(devel.env) ***@***.***:~/2378$ python bug.py
INFO:gensim.summarization.textcleaner:'pattern' package not found; tag filters are not available for English
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from wiki-news-300d-1M-subword.bin
Traceback (most recent call last):
File "bug.py", line 5, in <module>
mod = FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin')
File "/home/mpenkov/git/gensim/gensim/models/fasttext.py", line 1014, in load_fasttext_format
return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)
File "/home/mpenkov/git/gensim/gensim/models/fasttext.py", line 1248, in _load_fasttext_format
m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
File "/home/mpenkov/git/gensim/gensim/models/_fasttext_bin.py", line 257, in load
raw_vocab, vocab_size, nwords = _load_vocab(fin, new_format, encoding=encoding)
File "/home/mpenkov/git/gensim/gensim/models/_fasttext_bin.py", line 177, in _load_vocab
word = word_bytes.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 57: unexpected end of data
I stepped through the code with a debugger and had a closer look at the data. Some of the vocab terms contained bad Unicode (unable to decode cleanly). I think it's a problem with the model, but the native fastText utility loads this model without any problem, so I added code to handle the UnicodeError. Once that was done, I was able to load the model without any problems.
In summary, I was unable to reproduce the failed assertion you initially demonstrated. Could you please try reproducing the problem with this branch? It contains the UnicodeError fix.
Here's that code running on your model:
(devel.env) ***@***.***:~/2378$ python bug.py
INFO:gensim.summarization.textcleaner:'pattern' package not found; tag filters are not available for English
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from wiki-news-300d-1M-subword.bin
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xe8\x8b\xb1\xe8\xaa\x9e\xe7\x89\x88\xe3\x82\xa6\xe3\x82\xa3\xe3\x82\xad\xe3\x83\x9a\xe3\x83\x87\xe3\x82\xa3\xe3\x82\xa2\xe3\x81\xb8\xe3\x81\xae\xe6\x8a\x95\xe7\xa8\xbf\xe3\x81\xaf\xe3\x81\x84\xe3\x81\xa4\xe3\x81\xa7\xe3\x82\x82\xe6' to word '英語版ウィキペディアへの投稿はいつでも'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xe4\xbd\x86\xe6\x98\xaf\xe4\xbd\xa0\xe9\x80\x99\xe6\xac\xa1\xe5\x9f\xb7\xe7\xad\x86\xe6\x89\x80\xe4\xbd\xbf\xe7\x94\xa8\xe7\x9a\x84\xe8\x8b\xb1\xe6\x96\x87\xe4\xb8\xa6\xe6\x9c\xaa\xe9\x81\x94\xe5\x88\xb0\xe8\x8b\xb1\xe6\x96\x87\xe7' to word '但是你這次執筆所使用的英文並未達到英文'
ERROR:gensim.models._fasttext_bin:unable to cleanly decode bytes b'\xd0\xb0\xd0\xb4\xd0\xbc\xd0\xb8\xd0\xbd\xd0\xb8\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd1\x82\xd0\xb8\xd0\xb2\xd0\xbd\xd0\xbe-\xd1\x82\xd0\xb5\xd1\x80\xd1\x80\xd0\xb8\xd1\x82\xd0\xbe\xd1\x80\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1' to word 'административно-территориальн'
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.keyedvectors:Total number of ngrams is 0
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 5698 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 5635237139 word corpus (61.2% of prior 9203539378)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from wiki-news-300d-1M-subword.bin
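The byte strings in the ERROR lines above each end in a lone lead byte (`\xe6`, `\xe7`, `\xd1`), which is consistent with vocab entries cut off in the middle of a multi-byte UTF-8 character. A minimal sketch of one way to tolerate that, assuming the undecodable bytes can simply be dropped. `decode_vocab_word` is a hypothetical name, and this is not necessarily how the branch mentioned above implements the fix.

```python
def decode_vocab_word(word_bytes, encoding="utf-8"):
    """Decode a vocab entry, tolerating a truncated trailing character.

    Illustrative only; the name and behaviour are assumptions, not gensim's code.
    """
    try:
        return word_bytes.decode(encoding)
    except UnicodeDecodeError:
        # Drop the undecodable bytes instead of aborting the whole load.
        return word_bytes.decode(encoding, errors="ignore")


# Two complete three-byte characters followed by a stray lead byte.
truncated = b"\xe8\x8b\xb1\xe8\xaa\x9e\xe6"
word = decode_vocab_word(truncated)  # strict decoding would raise UnicodeDecodeError here
```

Dropping bytes keeps the load going, at the cost that the stored word no longer round-trips to the original bytes.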
The link provided in "...Could you please try reproducing the problem with this branch?" doesn't exist. Please redirect.
|
I just got this error too, I run this in Colab:
it returns an error:
is there any solution for this error? I've also seen this comment, is there any update? |
@rianrajagede What version of gensim are you using? |
@mpenkov I use Gensim 3.7.2 |
OK, I will investigate and get back to you. Thank you for reporting this. |
@rianrajagede I could not reproduce the problem with the latest version of gensim (3.7.2). Could you please try and let me know? |
@rianrajagede When I was unable to reproduce my original error even though I was running the same code, I had re-downloaded the model. My conjecture was that the model file itself had been corrupted. Have you tried simply downloading the model again? |
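One way to test the corrupt-download conjecture is to hash the file after each download and compare the digests. A minimal sketch using only the standard library. `sha256_of_file` is a hypothetical helper name, and this thread does not mention any published checksum for these models, so the comparison is between your own downloads.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fin:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If two downloads of the same URL give different digests, at least one of them was damaged in transit or on disk.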
I run it on Google Colab. I've tried other languages and re-downloaded the model, but the result is still the same: https://colab.research.google.com/drive/1noO8IwoQyKn_60XAk1sJJRjVZA0xeWfQ |
Can you please confirm that the version installed in your Google Colab is 3.7.2? |
Google Colab by default provides Gensim 3.6.0, in the first run if I call
but as my code above, I always start my notebook with
|
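When an environment ships one version of a package and the notebook installs another, the version on disk and the version already imported into the running process can differ. A minimal sketch for reporting both, assuming Python 3.8+ for `importlib.metadata`. `version_report` is a hypothetical helper name, and it relies on the package exposing a `__version__` attribute, as gensim does.

```python
import sys
from importlib import metadata


def version_report(package):
    """Return (loaded, installed) version strings for `package`; either may be None."""
    # Version of the module object already imported into this process, if any.
    module = sys.modules.get(package)
    loaded = getattr(module, "__version__", None) if module is not None else None
    # Version recorded in the installed distribution's metadata on disk.
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None
    return loaded, installed
```

Calling `version_report("gensim")` after `import gensim` would show whether the two values agree. If they differ, restarting the runtime is the usual way to pick up the installed version.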
I have a better alternative: instead of using gensim, you can try importing fasttext and using fasttext.load_model(). Usage: Testing: |
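A minimal sketch of what loading with the official `fasttext` Python bindings might look like, assuming the package is installed (`pip install fasttext`) and the `.bin` file is already on disk. `load_native_model` is a hypothetical wrapper name, and this is not the commenter's original snippet.

```python
import os


def load_native_model(model_path):
    """Load a fastText .bin model with the official Python bindings.

    Hypothetical wrapper for illustration; fails early if the file is missing.
    """
    if not os.path.isfile(model_path):
        raise FileNotFoundError(model_path)
    # Imported lazily so the check above works without the package installed.
    import fasttext

    return fasttext.load_model(model_path)
```

A call such as `model = load_native_model('crawl-300d-2M-subword.bin')` followed by `model.get_word_vector('hello')` would return the vector for one word. Note that this bypasses gensim entirely, so gensim-specific methods are not available on the returned object.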
Description
Loading pretrained fastext_model.bin with gensim.models.fasttext.FastText.load_fasttext_format('wiki-news-300d-1M-subword.bin') fails with AssertionError: unexpected number of vectors, despite the fix for #2350.
Steps/Code/Corpus to Reproduce
First install the develop branch with: pip install --upgrade git+git://github.com/RaRe-Technologies/gensim@develop, then:
Expected Results
Loaded model.
Actual Results
Versions
Darwin-18.2.0-x86_64-i386-64bit
Python 3.7.2 (default, Dec 29 2018, 00:00:04)
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.16.1
SciPy 1.2.0
gensim 3.7.1
FAST_VERSION 1
thanks for your work!