Bug: tokenizer is missing merges section when converting using convert_hf_to_gguf.py #9309
Comments
Most likely the merges are not present because of lines 81 to 95 in bdf314f:
I guess we can add those regardless, but if they are not needed for anything it may be better not to include them.
It looks like they are needed, because the llama tokenizer uses BPE. (I ran into this for Phi-3, which uses a BPE tokenizer, which in turn uses llama's tokenizer.)
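To illustrate why the merges list matters for a BPE tokenizer, here is a toy sketch of greedy BPE merge application (my own minimal illustration, not llama.cpp's implementation). Without the merge rules, BPE degenerates to emitting single characters:

```python
def bpe_apply(word: str, merges: list[str]) -> list[str]:
    """Greedily apply BPE merge rules (in priority order) to a word.

    Each merge rule is a space-joined pair, e.g. "h e", as stored in
    a Hugging Face tokenizer.json merges list.
    """
    symbols = list(word)
    # Earlier rules have higher priority (lower rank).
    ranks = {tuple(m.split()): i for i, m in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        best_rank, i = min(
            (ranks.get((a, b), float("inf")), j)
            for j, (a, b) in enumerate(zip(symbols, symbols[1:]))
        )
        if best_rank == float("inf"):
            break  # no applicable merge rule left
        symbols[i : i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# e.g. bpe_apply("hello", ["h e", "l l", "he ll", "hell o"]) -> ["hello"]
# but with an empty merges list -> ["h", "e", "l", "l", "o"]
```

This is why a GGUF file whose tokenizer metadata lacks the merges section cannot reproduce the original BPE segmentation.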
AFAIK llama v1 and v2 used SPM and then switched to BPE for v3. TinyLlama uses SPM. I believe Phi-3 is also an SPM tokenizer (at least this is how we list it in our conversion script): llama.cpp/convert_hf_to_gguf_update.py line 69 in bdf314f
What happened?
I ran into this issue while working on a PR on HF to add GGUF support for the Phi-3 model.
When using gguf-my-repo (or convert_hf_to_gguf.py) to convert from Hugging Face to GGUF, the merges section is missing from the resulting GGUF file.
Below is an already-converted TinyLlama-1.1B-Chat-v1.0-GGUF, and you can see there is a merges section in its GGUF tokenizer:
Here is a TinyLlama I converted a few days ago via gguf-my-repo, and it is missing merges from the tokenizer:
I was able to check out llama.cpp and reproduce via:
I am not familiar with the conversion script, but I investigated and I think I understand the issue; I also have a fix:
Case where `tokenizer.model` is present: This bug can happen for any model class that calls `_set_vocab_sentencepiece()`. When a `tokenizer.model` is present, `_create_vocab_sentencepiece()` never throws an exception, and back in `_set_vocab_sentencepiece()` `load_merges` is not passed as `True` either, so this is one place we would have to fix.

Case where `tokenizer.model` is not present and `tokenizer.json` is present: This happens for the Llama family of models only if `_set_vocab_llama_hf()` is invoked, i.e. when the `self._set_vocab_sentencepiece()` call that is wrapped in a try/except inside the `LlamaModel` class fails. It fails in my case because there is no `tokenizer.model` file for the Llama model or Phi-3, but there is a `tokenizer.json`. For this case we can fix it in convert_hf_to_gguf.py#L806 by passing `load_merges=True` on that line, like:

If the above fixes make sense, I can create a PR!
Name and Version
version: 3660 (b69a480)
built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.1.0
What operating system are you seeing the problem on?
Mac
Relevant log output
No response