UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte #66
Hey @srolskyi and @bheinzerling, I debugged that issue and debug-printed the path for
$ ls -hl /home/stefan/.cache/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin
-rw-rw-r-- 1 stefan stefan 3,7M Mär 15 16:34 /home/stefan/.cache/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin
And it was downloaded from
However, when I download the archive manually and extract it, it has the following size:
$ ls -hl ~/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin
-rw-r--r-- 1 stefan stefan 3,9M Mär 19 2018 /home/stefan/Downloads/en.wiki.bpe.vs10000.d100.w2v.bin
With this file I can load the vectors without any problem:
So I strongly suspect that the unpacking routine is currently not working, and that a "broken" word embeddings file is then being loaded, which causes the error.
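One way to test this suspicion: every gzip stream starts with the two bytes 0x1f 0x8b, and the error complains about byte 0x8b at position 1, which is what you would expect if the cached file were still compressed. A minimal sketch of that check follows; the helper name `looks_like_gzip` is made up for illustration and is not part of bpemb.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number.

    Hypothetical helper for diagnosis only; not a bpemb function.
    """
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

Calling it on the cached `en.wiki.bpe.vs10000.d100.w2v.bin` from the listing above should return True if the archive was saved without being unpacked, and False for the manually extracted file.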
After some more debugging and reading the code:
stefan@ae-13412:~$ curl -LI https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz
HTTP/1.1 301 Moved Permanently
Date: Fri, 15 Mar 2024 15:43:13 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/7.2.34
Location: https://bpemb.h-its.org/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz
Content-Type: text/html; charset=iso-8859-1
HTTP/2 200
server: nginx
date: Fri, 15 Mar 2024 15:43:14 GMT
content-type: application/gzip
content-length: 3784656
last-modified: Mon, 09 Apr 2018 22:27:16 GMT
etag: "39bfd0-56971e878b900"
accept-ranges: bytes
strict-transport-security: max-age=15768000
At the end, you can see that the redirected request has an
However, the current code is expecting: Line 54 in 1c63035
an
This is the reason why the archive is not properly extracted. @bheinzerling I think the best option here is to check if
Then the archive is properly downloaded, extracted and loaded :)
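To illustrate the general idea of not depending on one exact header value, here is a rough sketch. It is not the fix from the PR and not bpemb's code. It treats a download as gzip if the payload starts with the gzip magic number, and otherwise falls back to the Content-Type header, accepting both `application/gzip` (the value in the response above) and the older `application/x-gzip` spelling. The function names and the set of accepted types are assumptions made for this example.

```python
import shutil
import tarfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # gzip magic number (RFC 1952)
# Assumed set of media types to accept; not taken from the bpemb source.
GZIP_CONTENT_TYPES = {"application/gzip", "application/x-gzip"}


def is_gzip_download(content_type, first_bytes):
    """Decide whether a downloaded payload is gzip data.

    Checks the payload's magic number first, then falls back to the
    Content-Type header (parameters such as charset are ignored).
    """
    if first_bytes[:2] == GZIP_MAGIC:
        return True
    media_type = (content_type or "").split(";")[0].strip().lower()
    return media_type in GZIP_CONTENT_TYPES


def extract_tar_gz(archive_path, out_dir):
    """Extract the regular files of a .tar.gz into out_dir, ignoring subdirectories."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            # Keep only the base name so entries cannot escape out_dir.
            target = out_dir / Path(member.name).name
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

Checking the bytes themselves means a server migration that changes the reported media type would not silently skip extraction.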
Thank you @stefan-it for your investigation!
@bheinzerling @stefan-it, thanks for the investigation. Right now our production is not working because we depend on this package.
I created a PR for a fix. In the meantime you should be able to use this fixed version with:
git+https://github.com/stefan-it/bpemb.git@52ceabf4ca8bde1030be43f71f1f3cb292f4beca
in a
pip3 install --upgrade git+https://github.com/stefan-it/bpemb.git@52ceabf4ca8bde1030be43f71f1f3cb292f4beca
When the fix is accepted/merged into upstream here, @bheinzerling only needs to release a new version.
@srolskyi Thanks for reporting this issue! My guess is that the admins of the server on which BPEmb is hosted updated or migrated something. In any case, thanks to Stefan's fix everything seems to be working again. I released a new version on PyPI that includes the fix and should resolve this issue:
Leaving this issue open a bit for visibility.
What version is the fix in? 0.3.5?
Fresh installation, set up a new environment (Python 3.9.18 or 3.12):
serg: ~ : python3 -m venv new_env
serg: ~ : source new_env/bin/activate
(new_env) serg: ~ : pip install bpemb gensim
Collecting bpemb
Downloading bpemb-0.3.4-py3-none-any.whl.metadata (19 kB)
Collecting gensim
Using cached gensim-4.3.2-cp312-cp312-macosx_10_9_universal2.whl
Collecting numpy (from bpemb)
Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.1/61.1 kB 949.1 kB/s eta 0:00:00
Collecting requests (from bpemb)
Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting sentencepiece (from bpemb)
Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting tqdm (from bpemb)
Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 2.6 MB/s eta 0:00:00
Collecting scipy>=1.7.0 (from gensim)
Downloading scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (217 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 217.9/217.9 kB 3.3 MB/s eta 0:00:00
Collecting smart-open>=1.8.1 (from gensim)
Downloading smart_open-7.0.1-py3-none-any.whl.metadata (23 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
Downloading wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (6.6 kB)
Collecting charset-normalizer<4,>=2 (from requests->bpemb)
Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests->bpemb)
Downloading idna-3.6-py3-none-any.whl.metadata (9.9 kB)
Collecting urllib3<3,>=1.21.1 (from requests->bpemb)
Downloading urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)
Collecting certifi>=2017.4.17 (from requests->bpemb)
Downloading certifi-2024.2.2-py3-none-any.whl.metadata (2.2 kB)
Downloading bpemb-0.3.4-py3-none-any.whl (19 kB)
Downloading numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.7/13.7 MB 67.8 MB/s eta 0:00:00
Downloading scipy-1.12.0-cp312-cp312-macosx_12_0_arm64.whl (31.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.4/31.4 MB 59.3 MB/s eta 0:00:00
Downloading smart_open-7.0.1-py3-none-any.whl (60 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.8/60.8 kB 3.6 MB/s eta 0:00:00
Downloading requests-2.31.0-py3-none-any.whl (62 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.6/62.6 kB 4.4 MB/s eta 0:00:00
Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl (1.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 42.6 MB/s eta 0:00:00
Downloading tqdm-4.66.2-py3-none-any.whl (78 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB 7.3 MB/s eta 0:00:00
Downloading certifi-2024.2.2-py3-none-any.whl (163 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 163.8/163.8 kB 12.8 MB/s eta 0:00:00
Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl (119 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 119.4/119.4 kB 10.6 MB/s eta 0:00:00
Downloading idna-3.6-py3-none-any.whl (61 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.6/61.6 kB 3.9 MB/s eta 0:00:00
Downloading urllib3-2.2.1-py3-none-any.whl (121 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.1/121.1 kB 10.1 MB/s eta 0:00:00
Downloading wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl (38 kB)
Installing collected packages: sentencepiece, wrapt, urllib3, tqdm, numpy, idna, charset-normalizer, certifi, smart-open, scipy, requests, gensim, bpemb
Successfully installed bpemb-0.3.4 certifi-2024.2.2 charset-normalizer-3.3.2 gensim-4.3.2 idna-3.6 numpy-1.26.4 requests-2.31.0 scipy-1.12.0 sentencepiece-0.2.0 smart-open-7.0.1 tqdm-4.66.2 urllib3-2.2.1 wrapt-1.16.0
then run
python3 -c "from bpemb import BPEmb; bpemb_en = BPEmb(lang='en', dim=100)"
and got this error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/serg/new_env/lib/python3.12/site-packages/bpemb/bpemb.py", line 191, in __init__
self.emb = load_word2vec_file(self.emb_file, add_pad=add_pad_emb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/bpemb/util.py", line 78, in load_word2vec_file
vecs = KeyedVectors.load_word2vec_format(word2vec_file, binary=binary)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/models/keyedvectors.py", line 1719, in load_word2vec_format
return _load_word2vec_format(
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/models/keyedvectors.py", line 2058, in _load_word2vec_format
header = utils.to_unicode(fin.readline(), encoding=encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/serg/new_env/lib/python3.12/site-packages/gensim/utils.py", line 365, in any2unicode
return str(text, encoding, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Any ideas where I am making a mistake?