Align the tokenized result between deepseek coder python model and gguf model #3986

Closed

Conversation

@DOGEwbx commented Nov 8, 2023

Thanks for your efforts to support converting the DeepSeek Coder model per #3633.
I found that the tokenized result of the quantized model is slightly different from the original, because DeepSeek Coder uses a different pre-tokenizer than the GPT-2-style preprocessing implemented in your code. So I added the DeepSeek Coder processing pipeline to your work.
The convert script is modified from strutive07's work.

python convert-deepseek-coder-hf-to-gguf.py <MODEL_PATH> --outfile <OUTPUT_NAME> --vocab-dir <MODEL_PATH> --padvocab
./quantize <OUTPUT_NAME> <OUTPUT_NAME_q4.0> q4_0

I ran the above commands and did some tests to make sure the tokenized results are the same.
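
For reference, the check was along these lines (a rough sketch using the llama-cpp-python bindings; the paths are placeholders and the test strings are illustrative):

from transformers import AutoTokenizer
from llama_cpp import Llama

hf_tok = AutoTokenizer.from_pretrained("<MODEL_PATH>")
llm = Llama(model_path="<OUTPUT_NAME>", vocab_only=True)  # load the vocab only, no weights

for text in ["def quick_sort(arr):", "print('hello world')"]:
    hf_ids = hf_tok.encode(text, add_special_tokens=False)
    gguf_ids = llm.tokenize(text.encode("utf-8"), add_bos=False)
    assert hf_ids == gguf_ids, (text, hf_ids, gguf_ids)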

@DOGEwbx (Author) commented Nov 8, 2023

I'm still working on porting the whole pre-tokenizer pipeline to C++; hopefully we can read the pre-tokenization process directly from tokenizer.json instead of using the current hard-coded solution.
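
To sketch what that could look like: the pre_tokenizer section of a Hugging Face tokenizer.json describes the splits applied before BPE, so the regexes could be pulled from there rather than hard-coded. Illustrative Python, assuming the standard tokenizers JSON schema:

import json

with open("<MODEL_PATH>/tokenizer.json", encoding="utf-8") as f:
    cfg = json.load(f)

pre = cfg.get("pre_tokenizer") or {}
# a "Sequence" pre-tokenizer nests its steps; a single step stands alone
steps = pre.get("pretokenizers", [pre])
for step in steps:
    if step.get("type") == "Split":
        # the split regex the C++ side would need to reproduce
        print(step["pattern"]["Regex"])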

llama.cpp Outdated
// forward/backward lookups
const std::string & utf_char_next = (i + 1 < (int)text_utf.size()) ? text_utf[i + 1] : "";
const std::string & utf_char_next_next = (i + 2 < (int)text_utf.size()) ? text_utf[i + 2] : "";
@cebtenzzre (Collaborator) commented Nov 8, 2023

I don't think C++ allows you to bind a reference to a temporary std::string (converted from "") like this and then store it in a variable.
Edit: I guess we're doing that too, so it must work well enough (binding a temporary to a const lvalue reference is legal and extends the temporary's lifetime to that of the reference). There's quite a lot of copy-pasting going on here...

@cebtenzzre (Collaborator) commented Nov 8, 2023

I'm sure you don't need to copy-paste the whole convert.py for this. What version of convert.py did you base this on? It's not a very recent one.

@DOGEwbx (Author) replied

I used #3633 as the reference for the convert script.

@cebtenzzre (Collaborator) left a comment

This is way too much extra (mostly duplicate) code to maintain for a single model (with only tokenizer changes). You need to find a better way to do this.

@Galunid (Collaborator) commented Nov 8, 2023

I'd suggest you take a look at #3838, which introduces convert-hf-to-gguf.py. It should be relatively easy to just create a new class and override the set_vocab and maybe set_gguf_parameters methods to do what you need.
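
Roughly like this (a sketch of a class added inside convert-hf-to-gguf.py from #3838; the bodies are placeholders, not the actual DeepSeek Coder logic):

class DeepseekCoderModel(Model):
    def set_vocab(self):
        # read tokenizer.json and register the BPE vocab here instead of
        # relying on the default GPT-2-style handling
        ...

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        # then override only the hyperparameters that differ for DeepSeek Coder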
