
Error making gguf: KeyError: '<|user|>' #14

arch-btw opened this issue Aug 20, 2024 · 7 comments

arch-btw commented Aug 20, 2024

System Info

transformers: 4.44.0
llama.cpp: latest

Hi, when I try to make a GGUF I get this error:

Traceback (most recent call last):
  File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 4074, in <module>
    main()
  File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 4068, in main
    model_instance.write()
  File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 388, in write
    self.prepare_metadata(vocab_only=False)
  File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 381, in prepare_metadata
    self.set_vocab()
  File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 3713, in set_vocab
    special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
                                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: '<|user|>'

Do you know how to fix this?

On Hugging Face someone else has the same problem:

https://huggingface.co/THUDM/LongWriter-glm4-9b/discussions/1#66bc33eccd16fda66e7caa1f

But I don't know how to apply this solution:

Hi! You can get the token id by tokenizer.get_command("<|user|>").

Is the EOT even needed?

Thank you!

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

With llama.cpp:

python convert_hf_to_gguf.py /home/david/llm/LongWriter-glm4-9b --outtype f32

Here is the code:

    special_vocab = gguf.SpecialVocab(dir_model, load_merges=False)
    special_vocab.merges = merges
    # only add special tokens when they were not already loaded from config.json
    special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
    special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
    # this one is usually not in config.json anyway
    special_vocab._set_special_token("unk", tokenizer.get_added_vocab()["<|endoftext|>"])
    special_vocab.add_to_gguf(self.gguf_writer)
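The crash comes from the hard `["<|user|>"]` lookup: for this model's tokenizer, `get_added_vocab()` does not contain `"<|user|>"`. A minimal sketch of a defensive lookup (using a hypothetical stub in place of the real transformers tokenizer, with the ids reported later in this thread):

```python
# Hypothetical stub standing in for the real LongWriter-glm4-9b tokenizer;
# its added vocab contains "<|endoftext|>" but not "<|user|>".
class StubTokenizer:
    def get_added_vocab(self):
        return {"<|endoftext|>": 151329}

tokenizer = StubTokenizer()
added = tokenizer.get_added_vocab()

# dict.get() returns None instead of raising KeyError on a missing token
eot_id = added.get("<|user|>")
if eot_id is None:
    print("'<|user|>' not in added vocab; need another way to get its id")
```

This only avoids the crash; the actual fix discussed below is to ask the tokenizer for the command-token id instead.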

Expected behavior

For the conversion to complete so I can quantize the model.

bys0318 (Member) commented Aug 20, 2024

Hi! You can get the token id by tokenizer.get_command("<|user|>").

echnio commented Aug 20, 2024

Hi! You can get the token id by tokenizer.get_command("<|user|>").

Hi, how do I fix it? Thanks!

bys0318 (Member) commented Aug 20, 2024

Have you updated to our most recent model files? Also, please use transformers>=4.43.0.

arch-btw (Author) commented

@bys0318 thank you, it appears that the token id is:

151336

Is this correct?

in llama.cpp:

llm_load_print_meta: general.name     = LongWriter Glm4 9b
llm_load_print_meta: EOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151329 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 151336 '[PAD151336]'
llm_load_print_meta: max token length = 1024

@echnio

I did this (found the lines starting at line 3711 and replaced them):

    # only add special tokens when they were not already loaded from config.json
    special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
    token_id = tokenizer.get_command("<|user|>")
    print(token_id)
    special_vocab._set_special_token("eot", token_id)
    # this one is usually not in config.json anyway
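Putting the suggestion together, a sketch of the patched lookup: prefer `get_command()` when the tokenizer provides it (as the ChatGLM tokenizer does), and fall back to the added vocab otherwise. The stub below is hypothetical and hard-codes the id 151336 reported above:

```python
# Hypothetical stub mimicking the ChatGLM-style tokenizer interface.
class StubGlmTokenizer:
    def get_added_vocab(self):
        return {"<|endoftext|>": 151329}

    def get_command(self, token):
        # The real method maps special command tokens to their ids.
        return {"<|user|>": 151336}[token]

def resolve_eot_id(tokenizer):
    # Prefer get_command() when available; otherwise fall back to added vocab.
    if hasattr(tokenizer, "get_command"):
        return tokenizer.get_command("<|user|>")
    return tokenizer.get_added_vocab().get("<|user|>")

eot_id = resolve_eot_id(StubGlmTokenizer())
print(eot_id)  # 151336, matching the EOT id llama.cpp prints
```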

bys0318 (Member) commented Aug 20, 2024

This is correct. Thanks for sharing!

echnio commented Aug 21, 2024


Thank you very much, the format conversion was successful.

aashish-1904 commented

Thanks for the detailed steps!
I was able to convert the model.
Please find the quants at QuantFactory/LongWriter-glm4-9b-GGUF
