gguf : add special tokens metadata for FIM/Infill #6689
Conversation
This commit adds special token metadata for Fill-In-the-Middle (FIM)/Infill to the GGUF model. The motivation for this is that there is currently support for CodeLlama, but other models such as CodeGemma now exist, and the different models use different token ids for the special tokens; this commit makes it possible to support multiple models. Signed-off-by: Daniel Bevenius <[email protected]>
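Roughly, the idea is that each model's FIM token ids are read from GGUF key/value metadata instead of being hard-coded to CodeLlama's values. Below is a minimal, self-contained C++ sketch of that lookup pattern; the key names and token ids are illustrative assumptions, and a `std::map` stands in for the real GGUF KV store rather than llama.cpp's actual loader API.

```cpp
// Sketch: per-model FIM special-token ids from metadata, with hard-coded
// CodeLlama-style ids only as a fallback. Key names/ids are assumptions.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

using llama_token = int32_t;

static llama_token get_token_id(const std::map<std::string, llama_token> & kv,
                                const std::string & key, llama_token fallback) {
    auto it = kv.find(key);
    return it != kv.end() ? it->second : fallback;
}

int main() {
    // pretend this was loaded from a CodeGemma GGUF file (ids are made up)
    const std::map<std::string, llama_token> kv = {
        { "tokenizer.ggml.prefix_token_id", 67 },
        { "tokenizer.ggml.suffix_token_id", 69 },
        { "tokenizer.ggml.middle_token_id", 68 },
    };
    // CodeLlama's ids act as defaults when a model omits the keys
    const llama_token fim_pre = get_token_id(kv, "tokenizer.ggml.prefix_token_id", 32007);
    const llama_token fim_suf = get_token_id(kv, "tokenizer.ggml.suffix_token_id", 32008);
    const llama_token fim_mid = get_token_id(kv, "tokenizer.ggml.middle_token_id", 32009);
    std::printf("FIM ids: pre=%d suf=%d mid=%d\n", fim_pre, fim_suf, fim_mid);
    return 0;
}
```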
This commit breaks model compatibility. I've been experimenting with a custom model and narrowed it down with `git log --pretty --oneline 132f5579..HEAD`:
dbceec87 (HEAD -> master, origin/master, origin/HEAD) llama : add StableLM2 12B (#6635)
f4dea7da llama : add qwen2moe (#6074)
8a56075b gritlm : add --outdir option to hf.sh script (#6699)
58227ffd perplexity : require positive --ctx-size arg (#6695)
4fbd8098 (infill-metadata) gguf : add special tokens metadata for FIM/Infill (#6689)
7593639c (stable) `main`: add --json-schema / -j flag (#6659)
./main -m models/shakespeare/ggml-shakespeare-256x16-f32-LATEST.gguf --color -e -s 1337 -c 4096 -n 256 --n-gpu-layers 16 -p "When forty winters shall besiege thy brow,"
Log start
main: build = 2680 (4fbd8098)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed = 1337
llama_model_loader: loaded meta data with 20 key-value pairs and 147 tensors from models/shakespeare/ggml-shakespeare-256x16-f32-LATEST.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.file_type u32 = 0
llama_model_loader: - kv 2: llama.context_length u32 = 64
llama_model_loader: - kv 3: llama.embedding_length u32 = 256
llama_model_loader: - kv 4: llama.feed_forward_length u32 = 768
llama_model_loader: - kv 5: llama.attention.head_count u32 = 8
llama_model_loader: - kv 6: llama.block_count u32 = 16
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 32
llama_model_loader: - kv 8: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 9: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 10: llama.rope.scale_linear f32 = 1.000000
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 13: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: tokenizer.ggml.seperator_token_id u32 = 4294967295
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 4294967295
llama_model_loader: - type f32: 147 tensors
llama_model_load: error loading model: error loading model vocabulary: key not found in model: general.name
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/shakespeare/ggml-shakespeare-256x16-f32-LATEST.gguf'
main: error: unable to load model

I think this is due to the way the vocabulary loading was modified, which has always supported the llama architecture. I'm inspecting the vocab with the Python dump script:

python gguf-py/scripts/gguf-dump.py models/ggml-vocab-mistral.gguf
* Loading: models/ggml-vocab-mistral.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 25 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
2: UINT64 | 1 | GGUF.tensor_count = 0
3: UINT64 | 1 | GGUF.kv_count = 22
4: STRING | 1 | general.architecture = 'llama'
5: STRING | 1 | general.name = 'mistralai'
6: UINT32 | 1 | llama.vocab_size = 32000
7: UINT32 | 1 | llama.context_length = 32768
8: UINT32 | 1 | llama.embedding_length = 4096
9: UINT32 | 1 | llama.block_count = 32
10: UINT32 | 1 | llama.feed_forward_length = 14336
11: UINT32 | 1 | llama.rope.dimension_count = 128
12: UINT32 | 1 | llama.attention.head_count = 32
13: UINT32 | 1 | llama.attention.head_count_kv = 8
14: FLOAT32 | 1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
15: FLOAT32 | 1 | llama.rope.freq_base = 1000000.0
16: STRING | 1 | tokenizer.ggml.model = 'llama'
17: [STRING] | 32000 | tokenizer.ggml.tokens
18: [FLOAT32] | 32000 | tokenizer.ggml.scores
19: [INT32] | 32000 | tokenizer.ggml.token_type
20: UINT32 | 1 | tokenizer.ggml.bos_token_id = 1
21: UINT32 | 1 | tokenizer.ggml.eos_token_id = 2
22: UINT32 | 1 | tokenizer.ggml.unknown_token_id = 0
23: BOOL | 1 | tokenizer.ggml.add_bos_token = True
24: BOOL | 1 | tokenizer.ggml.add_eos_token = False
25: STRING | 1 | tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['"
* Dumping 0 tensor(s)

This commit changed the special vocabulary ids. I haven't dug in too deep yet; still looking into it.
// CodeGemma (LLM_ARCH_GEMMA). This can potentially be removed once
// new versions of these models have been published.
std::string gen_name;
ml.get_key(LLM_KV_GENERAL_NAME, gen_name);
Yeah, it's lines 4083 - 4106 that are causing the issue.
examples/train-text-from-scratch/train-text-from-scratch.cpp doesn't rely on or use LLM_KV_GENERAL_NAME, which is why I'm able to train but not run inference. This most likely has other unintended side effects due to the implementation.
Does #6709 fix the issue?
Yeah, I think so.
ml.get_key(LLM_KV_GENERAL_NAME, gen_name, false);
It seems like setting the required parameter to false did the trick.
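For context, the pattern here is a metadata lookup whose last parameter controls whether a missing key is fatal. A minimal sketch of that behavior, with illustrative names and a plain map standing in for llama.cpp's actual model loader:

```cpp
// Sketch: an optional-vs-required KV lookup. When required == true a missing
// key throws (matching the "key not found in model" failure above); when
// false it simply returns and the caller keeps its default value.
#include <map>
#include <stdexcept>
#include <string>

static bool get_key(const std::map<std::string, std::string> & kv,
                    const std::string & key, std::string & out,
                    bool required = true) {
    auto it = kv.find(key);
    if (it == kv.end()) {
        if (required) {
            throw std::runtime_error("key not found in model: " + key);
        }
        return false; // optional key is absent: not an error
    }
    out = it->second;
    return true;
}

int main() {
    std::map<std::string, std::string> kv; // no general.name, like the Shakespeare model
    std::string gen_name = "unknown";
    get_key(kv, "general.name", gen_name, /*required=*/false); // does not throw
    return 0;
}
```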
@teleprint-me Sorry about causing this and wasting your time. And thanks @ggerganov for fixing my mistake!
PR #6709 fixed it. I'm able to run the latest code with this change. I tested another custom model I've been tinkering with and it's working again. Might be a good idea to add "general.name" to …
> This commit adds special token metadata for Fill-In-the-Middle (FIM)/Infill to the GGUF model. The motivation for this is that there is currently support for CodeLlama, but other models such as CodeGemma now exist, and the different models use different token ids for the special tokens; this commit makes it possible to support multiple models.
How does llama.cpp know the FIM prompt template for each model? Does it just assume the template …
It does seem like a hack not to define a prompt template for FIM so that it can instead be defined in a modelfile. There is a PR that does this: #5207.
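For reference, CodeLlama-style infill arranges the prompt as `<PRE> {prefix} <SUF>{suffix} <MID>`, with the model generating the missing middle after the `<MID>` token; this layout is what the metadata's token ids feed into. Below is a minimal sketch of assembling such a prompt from special-token ids; the ids and the one-token-per-byte "tokenizer" are stand-ins, not llama.cpp's API.

```cpp
// Sketch: building a prefix-suffix-middle (PSM) infill prompt from FIM
// special-token ids. A real tokenizer is replaced by one token per byte.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using llama_token = int32_t;

int main() {
    const llama_token fim_pre = 32007, fim_suf = 32008, fim_mid = 32009; // illustrative ids

    const std::string prefix = "int add(int a, int b) {\n    return ";
    const std::string suffix = ";\n}";

    std::vector<llama_token> tokens;
    tokens.push_back(fim_pre);
    for (unsigned char c : prefix) tokens.push_back((llama_token) c); // stand-in tokenizer
    tokens.push_back(fim_suf);
    for (unsigned char c : suffix) tokens.push_back((llama_token) c);
    tokens.push_back(fim_mid); // generation of the missing middle starts here

    std::cout << "infill prompt has " << tokens.size() << " tokens\n";
    return 0;
}
```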