While weight conversion of llama-13b getting this error: RuntimeError: Internal: unk is not defined. #22873

Ahtesham00 · 2023-04-19T19:18:30Z

System Info

OS : Ubunto

Virtual Env :

accelerate==0.18.0
certifi==2022.12.7
charset-normalizer==3.1.0
cmake==3.26.3
filelock==3.12.0
huggingface-hub==0.13.4
idna==3.4
Jinja2==3.1.2
lit==16.0.1
MarkupSafe==2.1.2
mpmath==1.3.0
networkx==3.1
numpy==1.24.2
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
packaging==23.1
psutil==5.9.5
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
sentencepiece==0.1.98
sympy==1.11.1
tokenizers==0.13.3
torch==2.0.0
tqdm==4.65.0
transformers==4.28.1
triton==2.0.0
typing_extensions==4.5.0
urllib3==1.26.15

Who can help?

@ArthurZucker
@younesbelkada

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Used following command to convert llama-13 weights into hf.

python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /home/unconveretd-weights --model_size 13B --output_dir /home/test-converted

Expected behavior

It should generated the converted weights. But instead it is generating this error

Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:17<00:00, 2.35it/s]
Saving in the Transformers format.
Saving a LlamaTokenizerFast to /home/test-converted.
Traceback (most recent call last):
File "/home/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 278, in
main()
File "/home/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 274, in main
write_tokenizer(args.output_dir, spm_path)
File "/home/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 248, in write_tokenizer
tokenizer = tokenizer_class(input_tokenizer_path)
File "/home/myenv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 89, in init
super().init(
File "/home/myenv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 117, in init
slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
File "/home/myenv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 96, in init
self.sp_model.Load(vocab_file)
File "/home/myenv/lib/python3.10/site-packages/sentencepiece/init.py", line 905, in Load
return self.LoadFromFile(model_file)
File "/home/myenv/lib/python3.10/site-packages/sentencepiece/init.py", line 310, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: unk is not defined.

The text was updated successfully, but these errors were encountered:

Rachneet · 2023-04-19T22:32:01Z

facing the same issue.

ArthurZucker · 2023-04-24T13:45:57Z

Hey! Thanks for reporting I'll investigate this!

ChongWu-Biostat · 2023-05-18T22:08:08Z

I have the same issue when I use the latest version of torch.

Ahtesham00 · 2023-05-18T22:20:01Z

I did not find the solution. but if someone wants to download the weights.
following link has all the versions.

https://huggingface.co/elinas

ArthurZucker · 2023-05-25T07:15:28Z

Okay, We update the conversion script, which should have fixed most issues. I downloaded the tokenizer model, and re-tried the conversion, and I did not have any issue. Make sure you are using the latest transformers version.

dittops · 2023-05-27T19:12:35Z

I tried with the latest code from the main branch, but still getting the same issue

egoetz · 2023-06-06T15:19:24Z

I am getting the same error message when running the conversion for the 7B model. Tried installing the latest version (4.29.2) but the error persists. Same traceback as @dittops but mine has a nicer formatting.

ArthurZucker · 2023-06-06T15:30:04Z

Again, the issue is most probably with the tokenizer file that you are using, which is outdated. Yes you need to upgrade to the latest transformers version, but you also need to use the original sentencepiece model in order for the conversion to properly work!

egoetz · 2023-06-06T15:46:57Z

Thanks for following up. I have the llama weights/tokenizer that were updated on 3/26/23. Isn't that the latest version of the tokenizer?

Also I'm not sure what you mean by the original sentencepiece model (unless you mean the model from prior to the 3/26 update).

ArthurZucker · 2023-06-06T15:54:36Z

When you say:

I have the llama weights/tokenizer that were updated on 3/26/23

do you mean the META weights and tokenizer?
Otherwise can you share a notebook with a reproducer? The issue with llama is that a PR was made too early and thus lots of checkpoints and previous tokenizers (meaning hf tokenizers json) are incorrect.

dittops · 2023-06-06T15:57:46Z

@ArthurZucker I have the META weights and tokenizer. The issue share is with that. For sentencepiece, is there a specific version to be used?

egoetz · 2023-06-06T17:29:32Z

I have the llama weights/tokenizer that were updated on 3/26/23

do you mean the META weights and tokenizer? Otherwise can you share a notebook with a reproducer? The issue with llama is that a PR was made too early and thus lots of checkpoints and previous tokenizers (meaning hf tokenizers json) are incorrect.

Ah I see. The llama weights I have come from Meta's torrent PR. I did not get them from HuggingFace, if you are referring to this PR.

ArthurZucker · 2023-06-06T18:34:38Z

Ok 👍🏻 I'll give it another go, but I remember trying with those exact weights and getting a correct conversion.
Will get back to you soon!

ArthurZucker · 2023-06-07T20:02:48Z

Would you mind sending me the file via google drive? The torrent link seems down

egoetz · 2023-06-07T20:34:33Z

The torrent is showing as up for me right now, but if it isn't working for you I am happy to send you a copy of the 7B folder I am using. The entire folder for the 7B model is ~13-14GB. I'm trying to compress it right now but it will take a little bit to finish.

ArthurZucker · 2023-06-07T20:46:30Z

Just the tokenizer files are enough!

egoetz · 2023-06-08T15:10:59Z

Email sent!

dittops · 2023-06-16T15:03:14Z

@egoetz where you able to solve this issue?

ArthurZucker · 2023-06-22T12:02:04Z

@egoetz told me that installing GIT LFS + using the tokenizer at huggyllama/llama-7b worked.
I received the email but could not access files as they were not shared using drive but a private mail provider 😅
If you are trying to convert the original model (by that I mean going from the spm model to transformers) make sure you have the latest version of transformers

dittops · 2023-06-23T15:09:27Z

I was able to resolve it by replacing tokenizer.model with one from hugging face. Thank you/

ArthurZucker · 2023-06-26T03:59:03Z

I'm not sure I understand. If you are trying to convert a checkpoint/tokenizer, then you don't need to use an already converted one. The script is to go from the original tokenizer to the HF format.

github-actions · 2023-07-20T15:03:22Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker closed this as completed Jul 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

While weight conversion of llama-13b getting this error: RuntimeError: Internal: unk is not defined. #22873

While weight conversion of llama-13b getting this error: RuntimeError: Internal: unk is not defined. #22873

Ahtesham00 commented Apr 19, 2023 •

edited

Loading

Rachneet commented Apr 19, 2023

ArthurZucker commented Apr 24, 2023

ChongWu-Biostat commented May 18, 2023

Ahtesham00 commented May 18, 2023

ArthurZucker commented May 25, 2023

dittops commented May 27, 2023

egoetz commented Jun 6, 2023

ArthurZucker commented Jun 6, 2023

egoetz commented Jun 6, 2023

ArthurZucker commented Jun 6, 2023

dittops commented Jun 6, 2023

egoetz commented Jun 6, 2023 •

edited

Loading

ArthurZucker commented Jun 6, 2023

ArthurZucker commented Jun 7, 2023

egoetz commented Jun 7, 2023

ArthurZucker commented Jun 7, 2023

egoetz commented Jun 8, 2023

dittops commented Jun 16, 2023

ArthurZucker commented Jun 22, 2023

dittops commented Jun 23, 2023

ArthurZucker commented Jun 26, 2023

github-actions bot commented Jul 20, 2023

While weight conversion of llama-13b getting this error: RuntimeError: Internal: unk is not defined. #22873

While weight conversion of llama-13b getting this error: RuntimeError: Internal: unk is not defined. #22873

Comments

Ahtesham00 commented Apr 19, 2023 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Rachneet commented Apr 19, 2023

ArthurZucker commented Apr 24, 2023

ChongWu-Biostat commented May 18, 2023

Ahtesham00 commented May 18, 2023

ArthurZucker commented May 25, 2023

dittops commented May 27, 2023

egoetz commented Jun 6, 2023

ArthurZucker commented Jun 6, 2023

egoetz commented Jun 6, 2023

ArthurZucker commented Jun 6, 2023

dittops commented Jun 6, 2023

egoetz commented Jun 6, 2023 • edited Loading

ArthurZucker commented Jun 6, 2023

ArthurZucker commented Jun 7, 2023

egoetz commented Jun 7, 2023

ArthurZucker commented Jun 7, 2023

egoetz commented Jun 8, 2023

dittops commented Jun 16, 2023

ArthurZucker commented Jun 22, 2023

dittops commented Jun 23, 2023

ArthurZucker commented Jun 26, 2023

github-actions bot commented Jul 20, 2023

Ahtesham00 commented Apr 19, 2023 •

edited

Loading

egoetz commented Jun 6, 2023 •

edited

Loading