
Bug: tokenizer is missing merges section when converting using convert_hf_to_gguf.py #9309

Open · a8nova opened this issue Sep 4, 2024 · 3 comments
Labels: bug-unconfirmed, medium severity, stale


a8nova commented Sep 4, 2024

What happened?

I ran into this issue while working on a Hugging Face PR adding GGUF support for the Phi-3 model.

When using gguf-my-repo (or convert_hf_to_gguf.py) to convert a Hugging Face model to GGUF, the merges section is missing from the resulting GGUF file.

Below is an already-converted TinyLlama-1.1B-Chat-v1.0-GGUF, and you can see there is a merges section in the GGUF tokenizer:

(screenshot: older_tinyllama_has_merges)

Here is a TinyLlama I converted a few days ago via gguf-my-repo; it is missing merges from the tokenizer:

(screenshot: missing_merges)

I was able to check out llama.cpp and reproduce via:

python3.10 ./convert_hf_to_gguf.py TinyLlama-1.1B-Chat-v1.0 --outtype f16 --outfile TinyLlama-1.1B-Chat-v1.0-fp16.gguf
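To confirm the regression, one can inspect the output file's metadata directly. A minimal sketch using llama.cpp's gguf-py package (the key name tokenizer.ggml.merges and the GGUFReader interface are my assumptions about gguf-py's current API):

```python
# Minimal sketch: check a converted GGUF file for the merges key.
# Assumes llama.cpp's gguf-py package is installed (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("TinyLlama-1.1B-Chat-v1.0-fp16.gguf")

# reader.fields maps metadata key names to ReaderField entries;
# "tokenizer.ggml.merges" should be present for BPE-style tokenizers.
if "tokenizer.ggml.merges" in reader.fields:
    print("merges present in tokenizer metadata")
else:
    print("merges missing from tokenizer metadata")
```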

I am not familiar with the conversion script, but after investigating I think I understand the issue, and I have a fix for both cases:

  • Case where tokenizer.model is present:
    This bug can happen for any model class that calls _set_vocab_sentencepiece(). When a tokenizer.model is present, _create_vocab_sentencepiece() never throws, and back in _set_vocab_sentencepiece() load_merges is never passed as True when the special vocab is built, so that call site is the first place to fix (see the sketch after this list).

  • Case where tokenizer.model is not present and tokenizer.json is present:
    For the Llama-family models this happens only if _set_vocab_llama_hf() is invoked: the self._set_vocab_sentencepiece() call, which is wrapped in a try/except inside the LlamaModel class, fails (as it does in my case, since Llama and Phi-3 ship no tokenizer.model but do ship a tokenizer.json), and the fallback path builds the special vocab without merges. For this case we can fix it at convert_hf_to_gguf.py#L806 by passing load_merges=True on that line:

    special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True, n_vocab=len(tokens))
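Here is a hedged sketch of what the flag changes (paraphrased, not an exact diff; gguf.SpecialVocab's signature and merges attribute are taken from gguf-py as I understand it):

```python
# Hedged sketch of the effect of the fix: gguf.SpecialVocab only reads
# merges out of tokenizer.json when asked, and add_to_gguf() then emits
# them as the "tokenizer.ggml.merges" metadata array.
import gguf

special_vocab = gguf.SpecialVocab(
    "TinyLlama-1.1B-Chat-v1.0",  # model directory containing tokenizer.json
    load_merges=True,            # without this flag the merges list stays empty
    n_vocab=32000,               # the script passes len(tokens) here
)
print(len(special_vocab.merges), "merges loaded")
# special_vocab.add_to_gguf(writer) would then write tokenizer.ggml.merges
```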

If the above fixes make sense, I can create a PR!

Name and Version

version: 3660 (b69a480)
built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.1.0

What operating system are you seeing the problem on?

Mac

Relevant log output

No response

a8nova added the bug-unconfirmed and medium severity labels on Sep 4, 2024
ggerganov (Owner) commented
Most likely the merges are not present because llama.cpp does not use them with SPM-type tokenizers. We currently use the merges information only for BPE-type tokenizers:

```cpp
int llama_vocab::find_bpe_rank(const std::string & token_left, const std::string & token_right) const {
    GGML_ASSERT(token_left.find(' ')  == std::string::npos);
    GGML_ASSERT(token_left.find('\n') == std::string::npos);
    GGML_ASSERT(token_right.find(' ')  == std::string::npos);
    GGML_ASSERT(token_right.find('\n') == std::string::npos);

    auto it = bpe_ranks.find(std::make_pair(token_left, token_right));
    if (it == bpe_ranks.end()) {
        return -1;
    }

    return it->second;
}
```
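For context, the adjacent pair with the lowest rank is merged first, so a BPE tokenizer cannot run without this table. A rough Python analogue of the merge loop (a hedged sketch, not llama.cpp's implementation):

```python
# Rough Python analogue of rank-based BPE merging (hedged sketch,
# not llama.cpp's code). bpe_ranks plays the role of the C++ map above:
# lower rank means the pair is merged earlier.
bpe_ranks = {("h", "e"): 0, ("he", "llo"): 1, ("l", "lo"): 2, ("l", "o"): 3}

def bpe_merge(symbols: list[str]) -> list[str]:
    while True:
        # Collect (rank, index) for every adjacent pair found in the table.
        ranked = [(bpe_ranks[(a, b)], i)
                  for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
                  if (a, b) in bpe_ranks]
        if not ranked:
            return symbols  # no known pair left to merge
        _, i = min(ranked)  # lowest rank wins
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]

print(bpe_merge(["h", "e", "l", "l", "o"]))  # ['hello']
```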

I guess we can add those regardless, but if they are not needed for anything it may be better not to include them. What does transformers need them for?

a8nova (Author) commented Sep 5, 2024

It looks like they are needed because the Llama tokenizer in transformers uses BPE. (I ran into this for Phi-3, which uses a BPE tokenizer and reuses Llama's tokenizer code.) Adding HF engineer @SunMarc so he can confirm.
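For illustration, here is roughly how transformers ends up depending on the merges: loading a GGUF checkpoint there rebuilds a fast (BPE) tokenizer from the embedded metadata, so a file without tokenizer.ggml.merges cannot round-trip. A hedged sketch (the gguf_file argument is the entry point I am referring to; the repo and filename below are hypothetical):

```python
# Hedged sketch: transformers reconstructs a fast BPE tokenizer from
# GGUF metadata, including tokenizer.ggml.merges. Repo name and file
# name here are hypothetical placeholders.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "someuser/TinyLlama-1.1B-Chat-v1.0-GGUF",        # hypothetical GGUF repo
    gguf_file="TinyLlama-1.1B-Chat-v1.0-fp16.gguf",  # file inside that repo
)
print(tok("hello world").input_ids)
```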

ggerganov (Owner) commented Sep 5, 2024

AFAIK llama v1 and v2 used SPM and then switched to BPE for v3. TinyLlama uses SPM. I believe Phi-3 is also an SPM tokenizer (at least that is how it is listed in our conversion script):

{"name": "phi-3", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct", },

github-actions bot added the stale label on Oct 6, 2024