
While converting llama-13b weights, getting this error: RuntimeError: Internal: unk is not defined. #22873

Closed
Ahtesham00 opened this issue Apr 19, 2023 · 22 comments



Ahtesham00 commented Apr 19, 2023

System Info

OS: Ubuntu

Virtual env:

accelerate==0.18.0
certifi==2022.12.7
charset-normalizer==3.1.0
cmake==3.26.3
filelock==3.12.0
huggingface-hub==0.13.4
idna==3.4
Jinja2==3.1.2
lit==16.0.1
MarkupSafe==2.1.2
mpmath==1.3.0
networkx==3.1
numpy==1.24.2
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
packaging==23.1
psutil==5.9.5
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
sentencepiece==0.1.98
sympy==1.11.1
tokenizers==0.13.3
torch==2.0.0
tqdm==4.65.0
transformers==4.28.1
triton==2.0.0
typing_extensions==4.5.0
urllib3==1.26.15

Who can help?

@ArthurZucker
@younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Used the following command to convert the llama-13B weights into HF format.

python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /home/unconveretd-weights --model_size 13B --output_dir /home/test-converted

Expected behavior

It should generate the converted weights, but instead it produces this error:

Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:17<00:00, 2.35it/s]
Saving in the Transformers format.
Saving a LlamaTokenizerFast to /home/test-converted.
Traceback (most recent call last):
  File "/home/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 278, in <module>
    main()
  File "/home/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 274, in main
    write_tokenizer(args.output_dir, spm_path)
  File "/home/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 248, in write_tokenizer
    tokenizer = tokenizer_class(input_tokenizer_path)
  File "/home/myenv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 89, in __init__
    super().__init__(
  File "/home/myenv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 117, in __init__
    slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
  File "/home/myenv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 96, in __init__
    self.sp_model.Load(vocab_file)
  File "/home/myenv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/myenv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: unk is not defined.
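
The failure happens inside sentencepiece itself, so a quick way to check whether the tokenizer.model file alone reproduces it is to load it directly, outside the conversion script. A minimal sketch; the path is the one from the command above:

import sentencepiece as spm

# Load the suspect file directly; a broken or outdated file raises
# the same "Internal: unk is not defined." error here.
sp = spm.SentencePieceProcessor()
sp.Load("/home/unconveretd-weights/tokenizer.model")
print(sp.GetPieceSize())  # prints the vocab size if the file loads cleanly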

@Rachneet

Facing the same issue.

@ArthurZucker
Collaborator

Hey! Thanks for reporting, I'll investigate this!

@ChongWu-Biostat

I have the same issue when I use the latest version of torch.

@Ahtesham00
Author

I did not find a solution, but if someone wants to download the weights, the following link has all the versions.

https://huggingface.co/elinas

@ArthurZucker
Collaborator

Okay, we updated the conversion script, which should have fixed most issues. I downloaded the tokenizer model, re-tried the conversion, and did not have any issue. Make sure you are using the latest transformers version.

@dittops

dittops commented May 27, 2023

I tried with the latest code from the main branch, but still getting the same issue

[screenshot: the same RuntimeError traceback]

@egoetz

egoetz commented Jun 6, 2023

I am getting the same error message when running the conversion for the 7B model. I tried installing the latest version (4.29.2), but the error persists. Same traceback as @dittops, but mine has nicer formatting.

@ArthurZucker
Collaborator

Again, the issue is most probably with the tokenizer file that you are using, which is outdated. Yes, you need to upgrade to the latest transformers version, but you also need to use the original sentencepiece model in order for the conversion to work properly!
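
One way to sanity-check that a tokenizer.model is the original sentencepiece model is to inspect it after loading. A sketch, assuming the original LLaMA tokenizer, which has a 32000-piece vocab and defines <unk> at id 0:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")  # the file shipped alongside the Meta weights
print(sp.GetPieceSize())    # expected: 32000 for the original LLaMA tokenizer
print(sp.IdToPiece(0))      # expected: '<unk>'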

@egoetz

egoetz commented Jun 6, 2023

Thanks for following up. I have the llama weights/tokenizer that were updated on 3/26/23. Isn't that the latest version of the tokenizer?

Also, I'm not sure what you mean by the original sentencepiece model (unless you mean the model from before the 3/26 update).

@ArthurZucker
Collaborator

When you say:

> I have the llama weights/tokenizer that were updated on 3/26/23

do you mean the META weights and tokenizer?
Otherwise, can you share a notebook with a reproducer? The issue with llama is that a PR was made too early, and thus lots of checkpoints and previous tokenizers (meaning HF tokenizers JSON) are incorrect.

@dittops

dittops commented Jun 6, 2023

@ArthurZucker I have the META weights and tokenizer. The issue I shared is with that. For sentencepiece, is there a specific version to be used?

@egoetz

egoetz commented Jun 6, 2023

> I have the llama weights/tokenizer that were updated on 3/26/23

> do you mean the META weights and tokenizer? Otherwise, can you share a notebook with a reproducer? The issue with llama is that a PR was made too early, and thus lots of checkpoints and previous tokenizers (meaning HF tokenizers JSON) are incorrect.

Ah, I see. The llama weights I have come from Meta's torrent PR. I did not get them from HuggingFace, if you are referring to this PR.

@ArthurZucker
Collaborator

Ok 👍🏻 I'll give it another go, but I remember trying with those exact weights and getting a correct conversion.
Will get back to you soon!

@ArthurZucker
Collaborator

Would you mind sending me the file via Google Drive? The torrent link seems down.

@egoetz

egoetz commented Jun 7, 2023

The torrent is showing as up for me right now, but if it isn't working for you, I am happy to send you a copy of the 7B folder I am using. The entire folder for the 7B model is ~13-14GB. I'm compressing it right now, but it will take a little while to finish.

@ArthurZucker
Collaborator

Just the tokenizer files are enough!

@egoetz

egoetz commented Jun 8, 2023

Email sent!

@dittops

dittops commented Jun 16, 2023

@egoetz were you able to solve this issue?

@ArthurZucker
Collaborator

@egoetz told me that installing Git LFS and using the tokenizer at huggyllama/llama-7b worked.
I received the email but could not access the files, as they were shared not via Drive but through a private mail provider 😅
If you are trying to convert the original model (by that I mean going from the spm model to transformers), make sure you have the latest version of transformers.
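
For anyone who just needs a working tokenizer rather than the conversion itself, pulling the already-converted one straight from the Hub should be enough. A sketch, assuming the huggyllama/llama-7b repo mentioned above:

from transformers import AutoTokenizer

# Loads the already-converted HF tokenizer; no local conversion step needed.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
print(tokenizer.tokenize("Hello world"))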

@dittops

dittops commented Jun 23, 2023

I was able to resolve it by replacing tokenizer.model with one from Hugging Face. Thank you!
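
For anyone replicating this fix, the replacement can be scripted: fetch the known-good sentencepiece file and overwrite the local copy before re-running the conversion. A sketch; the local path is a placeholder, and huggyllama/llama-7b is assumed to host the original tokenizer.model:

import shutil
from huggingface_hub import hf_hub_download

# Download a known-good sentencepiece model from the Hub...
good = hf_hub_download(repo_id="huggyllama/llama-7b", filename="tokenizer.model")
# ...and overwrite the local copy before re-running the conversion script.
shutil.copy(good, "/home/unconveretd-weights/tokenizer.model")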

@ArthurZucker
Collaborator

I'm not sure I understand. If you are trying to convert a checkpoint/tokenizer, then you don't need to use an already converted one. The script is to go from the original tokenizer to the HF format.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
