
Error making gguf: KeyError: '<|user|>' #14

arch-btw opened this issue Aug 20, 2024 · 7 comments

arch-btw commented Aug 20, 2024

System Info

transformers: 4.44.0
llama.cpp: latest

Hi, when I try to make a GGUF I get this error:

Traceback (most recent call last):
  File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 4074, in <module>
    main()
  File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 4068, in main
    model_instance.write()
  File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 388, in write
    self.prepare_metadata(vocab_only=False)
  File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 381, in prepare_metadata
    self.set_vocab()
  File "/home/david/llm/llama.cpp/convert_hf_to_gguf.py", line 3713, in set_vocab
    special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
                                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: '<|user|>'

Do you know how to fix this?

On Hugging Face someone else has the same problem:

https://huggingface.co/THUDM/LongWriter-glm4-9b/discussions/1#66bc33eccd16fda66e7caa1f

But I don't know how to apply this solution:

Hi! You can get the token id by tokenizer.get_command("<|user|>").

Is the EOT even needed?

Thank you!

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

With llama.cpp:

python convert_hf_to_gguf.py /home/david/llm/LongWriter-glm4-9b --outtype f32

Here is the code:

    special_vocab = gguf.SpecialVocab(dir_model, load_merges=False)
    special_vocab.merges = merges
    # only add special tokens when they were not already loaded from config.json
    special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
    special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
    # this one is usually not in config.json anyway
    special_vocab._set_special_token("unk", tokenizer.get_added_vocab()["<|endoftext|>"])
    special_vocab.add_to_gguf(self.gguf_writer)
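The crash comes from the hard `["<|user|>"]` lookup: for this model's tokenizer, `get_added_vocab()` does not contain `"<|user|>"`. A minimal sketch of a defensive lookup (using a hypothetical stub in place of the real transformers tokenizer, with the ids reported later in this thread):

```python
# Hypothetical stub standing in for the real LongWriter-glm4-9b tokenizer;
# its added vocab contains "<|endoftext|>" but not "<|user|>".
class StubTokenizer:
    def get_added_vocab(self):
        return {"<|endoftext|>": 151329}

tokenizer = StubTokenizer()
added = tokenizer.get_added_vocab()

# dict.get() returns None instead of raising KeyError on a missing token
eot_id = added.get("<|user|>")
if eot_id is None:
    print("'<|user|>' not in added vocab; need another way to get its id")
```

This only avoids the crash; the actual fix discussed below is to ask the tokenizer for the command-token id instead.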

Expected behavior

For the conversion to complete so I can quantize the model.

bys0318 (Member) commented Aug 20, 2024

Hi! You can get the token id by tokenizer.get_command("<|user|>").

echnio commented Aug 20, 2024

Hi! You can get the token id by tokenizer.get_command("<|user|>").

Hi, how do I fix it? Thanks!

bys0318 (Member) commented Aug 20, 2024

Have you updated to our most recent model files? Also, please use transformers>=4.43.0.

arch-btw (Author) commented

@bys0318 thank you, it appears that the token id is:

151336

Is this correct?

in llama.cpp:

llm_load_print_meta: general.name     = LongWriter Glm4 9b
llm_load_print_meta: EOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151329 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 151336 '[PAD151336]'
llm_load_print_meta: max token length = 1024

@echnio

I did this (found the lines starting at line 3711 and replaced them):

    # only add special tokens when they were not already loaded from config.json
    special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
    token_id = tokenizer.get_command("<|user|>")
    print(token_id)
    special_vocab._set_special_token("eot", token_id)
    # this one is usually not in config.json anyway
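Putting the suggestion together, a sketch of the patched lookup: prefer `get_command()` when the tokenizer provides it (as the ChatGLM tokenizer does), and fall back to the added vocab otherwise. The stub below is hypothetical and hard-codes the id 151336 reported above:

```python
# Hypothetical stub mimicking the ChatGLM-style tokenizer interface.
class StubGlmTokenizer:
    def get_added_vocab(self):
        return {"<|endoftext|>": 151329}

    def get_command(self, token):
        # The real method maps special command tokens to their ids.
        return {"<|user|>": 151336}[token]

def resolve_eot_id(tokenizer):
    # Prefer get_command() when available; otherwise fall back to added vocab.
    if hasattr(tokenizer, "get_command"):
        return tokenizer.get_command("<|user|>")
    return tokenizer.get_added_vocab().get("<|user|>")

eot_id = resolve_eot_id(StubGlmTokenizer())
print(eot_id)  # 151336, matching the EOT id llama.cpp prints
```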

bys0318 (Member) commented Aug 20, 2024

This is correct. Thanks for sharing!

echnio commented Aug 21, 2024


Thank you very much, the format conversion was successful.

aashish-1904 commented

Thanks for the detailed steps!
I was able to convert the model.
Please find the quants at QuantFactory/LongWriter-glm4-9b-GGUF
