
Bug: tokenizer is missing merges section when converting using convert_hf_to_gguf.py #9309

Open · a8nova opened this issue Sep 4, 2024 · 3 comments
Labels: bug-unconfirmed, medium severity, stale


a8nova commented Sep 4, 2024

What happened?

I ran into this issue while working on a Hugging Face PR adding GGUF support for the Phi-3 model.

When using gguf-my-repo (or convert_hf_to_gguf.py) to convert a Hugging Face model to GGUF, the merges section is missing from the resulting GGUF file.

Below is an already-converted TinyLlama-1.1B-Chat-v1.0-GGUF, and you can see there is a merges section in the GGUF tokenizer:

(screenshot: older_tinyllama_has_merges)

Here is a TinyLlama I converted a few days ago via gguf-my-repo; it is missing merges from the tokenizer:

(screenshot: missing_merges)

I was able to check out llama.cpp and reproduce via:

python3.10 ./convert_hf_to_gguf.py TinyLlama-1.1B-Chat-v1.0 --outtype f16 --outfile TinyLlama-1.1B-Chat-v1.0-fp16.gguf
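To confirm the regression, one can inspect the output file's metadata directly. A minimal sketch using llama.cpp's gguf-py package (the key name tokenizer.ggml.merges and the GGUFReader interface are my assumptions about gguf-py's current API):

```python
# Minimal sketch: check a converted GGUF file for the merges key.
# Assumes llama.cpp's gguf-py package is installed (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("TinyLlama-1.1B-Chat-v1.0-fp16.gguf")

# reader.fields maps metadata key names to ReaderField entries;
# "tokenizer.ggml.merges" should be present for BPE-style tokenizers.
if "tokenizer.ggml.merges" in reader.fields:
    print("merges present in tokenizer metadata")
else:
    print("merges missing from tokenizer metadata")
```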

I am not familiar with the conversion script, but after investigating I think I understand the issue, and I have a fix for both cases:

  • Case where tokenizer.model is present:
    This bug can happen for any model class that calls _set_vocab_sentencepiece(). When a tokenizer.model is present, _create_vocab_sentencepiece() never throws, and back in _set_vocab_sentencepiece() load_merges is never passed as True when the special vocab is built, so that call site is the first place to fix (see the sketch after this list).

  • Case where tokenizer.model is not present and tokenizer.json is present:
    For the Llama-family models this happens only if _set_vocab_llama_hf() is invoked: the self._set_vocab_sentencepiece() call, which is wrapped in a try/except inside the LlamaModel class, fails (as it does in my case, since Llama and Phi-3 ship no tokenizer.model but do ship a tokenizer.json), and the fallback path builds the special vocab without merges. For this case we can fix it at convert_hf_to_gguf.py#L806 by passing load_merges=True on that line:

    special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True, n_vocab=len(tokens))
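Here is a hedged sketch of what the flag changes (paraphrased, not an exact diff; gguf.SpecialVocab's signature and merges attribute are taken from gguf-py as I understand it):

```python
# Hedged sketch of the effect of the fix: gguf.SpecialVocab only reads
# merges out of tokenizer.json when asked, and add_to_gguf() then emits
# them as the "tokenizer.ggml.merges" metadata array.
import gguf

special_vocab = gguf.SpecialVocab(
    "TinyLlama-1.1B-Chat-v1.0",  # model directory containing tokenizer.json
    load_merges=True,            # without this flag the merges list stays empty
    n_vocab=32000,               # the script passes len(tokens) here
)
print(len(special_vocab.merges), "merges loaded")
# special_vocab.add_to_gguf(writer) would then write tokenizer.ggml.merges
```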

If the above fixes make sense, I can create a PR!

Name and Version

version: 3660 (b69a480)
built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.1.0

What operating system are you seeing the problem on?

Mac

Relevant log output

No response

a8nova added the bug-unconfirmed and medium severity labels on Sep 4, 2024
ggerganov (Owner) commented
Most likely the merges are not present because llama.cpp does not use them with SPM-type tokenizers. We currently use the merges information only for BPE-type tokenizers:

```cpp
int llama_vocab::find_bpe_rank(const std::string & token_left, const std::string & token_right) const {
    GGML_ASSERT(token_left.find(' ')  == std::string::npos);
    GGML_ASSERT(token_left.find('\n') == std::string::npos);
    GGML_ASSERT(token_right.find(' ')  == std::string::npos);
    GGML_ASSERT(token_right.find('\n') == std::string::npos);

    auto it = bpe_ranks.find(std::make_pair(token_left, token_right));
    if (it == bpe_ranks.end()) {
        return -1;
    }

    return it->second;
}
```
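For context, the adjacent pair with the lowest rank is merged first, so a BPE tokenizer cannot run without this table. A rough Python analogue of the merge loop (a hedged sketch, not llama.cpp's implementation):

```python
# Rough Python analogue of rank-based BPE merging (hedged sketch,
# not llama.cpp's code). bpe_ranks plays the role of the C++ map above:
# lower rank means the pair is merged earlier.
bpe_ranks = {("h", "e"): 0, ("he", "llo"): 1, ("l", "lo"): 2, ("l", "o"): 3}

def bpe_merge(symbols: list[str]) -> list[str]:
    while True:
        # Collect (rank, index) for every adjacent pair found in the table.
        ranked = [(bpe_ranks[(a, b)], i)
                  for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
                  if (a, b) in bpe_ranks]
        if not ranked:
            return symbols  # no known pair left to merge
        _, i = min(ranked)  # lowest rank wins
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]

print(bpe_merge(["h", "e", "l", "l", "o"]))  # ['hello']
```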

I guess we can add those regardless, but if they are not needed for anything it may be better not to include them. What does transformers need them for?

a8nova (Author) commented Sep 5, 2024

It looks like they are needed because the Llama tokenizer in transformers uses BPE. (I ran into this for Phi-3, which uses a BPE tokenizer and reuses Llama's tokenizer code.) Adding HF engineer @SunMarc so he can confirm.
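For illustration, here is roughly how transformers ends up depending on the merges: loading a GGUF checkpoint there rebuilds a fast (BPE) tokenizer from the embedded metadata, so a file without tokenizer.ggml.merges cannot round-trip. A hedged sketch (the gguf_file argument is the entry point I am referring to; the repo and filename below are hypothetical):

```python
# Hedged sketch: transformers reconstructs a fast BPE tokenizer from
# GGUF metadata, including tokenizer.ggml.merges. Repo name and file
# name here are hypothetical placeholders.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "someuser/TinyLlama-1.1B-Chat-v1.0-GGUF",        # hypothetical GGUF repo
    gguf_file="TinyLlama-1.1B-Chat-v1.0-fp16.gguf",  # file inside that repo
)
print(tok("hello world").input_ids)
```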

ggerganov (Owner) commented Sep 5, 2024

AFAIK llama v1 and v2 used SPM and then switched to BPE for v3. TinyLlama uses SPM. I believe Phi-3 is also an SPM tokenizer (at least that is how it is listed in our conversion script):

{"name": "phi-3", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct", },

github-actions bot added the stale label on Oct 6, 2024