Align the tokenized result between deepseek coder python model and gguf model #3986
Conversation
I'm still working on transferring the whole pre-tokenizer pipeline to C++; hopefully we can read the pre-tokenization process directly from tokenizer.json instead of using the current hard-coded solution.
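A minimal sketch of what reading the pre-tokenizer configuration could look like, assuming nlohmann::json (not part of this PR) and the Hugging Face tokenizers layout for tokenizer.json; the file path is hypothetical:

```cpp
// Sketch: read the pre-tokenizer spec from tokenizer.json instead of
// hard-coding it. Assumes nlohmann::json and the HF tokenizers format,
// where "pre_tokenizer" holds a "type" and, for "Sequence", nested steps.
#include <fstream>
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
    std::ifstream f("tokenizer.json"); // hypothetical path
    nlohmann::json j = nlohmann::json::parse(f);

    const auto & pre = j["pre_tokenizer"];
    std::cout << "pre_tokenizer type: " << pre["type"] << "\n";

    // A "Sequence" pre-tokenizer nests its steps under "pretokenizers".
    if (pre["type"] == "Sequence") {
        for (const auto & step : pre["pretokenizers"]) {
            std::cout << "  step: " << step["type"] << "\n";
        }
    }
}
```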
llama.cpp (Outdated)
// std::cout<<utf_char<<std::endl;
// forward backward lookups
const std::string & utf_char_next = (i + 1 < (int)text_utf.size()) ? text_utf[i + 1] : "";
const std::string & utf_char_next_next = (i + 2 < (int)text_utf.size()) ? text_utf[i + 2] : "";
I don't think C++ allows you to bind a reference to a temporary std::string (converted from "") like this, and then store it in a variable.
edit: I guess we're doing that too, so it must work well enough. There's quite a lot of copy-pasting going on here...
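For reference, a self-contained sketch of why the pattern in the diff works: binding a const reference to a temporary extends the temporary's lifetime to that of the reference. The subtle cost is that when the ternary mixes a std::string lvalue and a string literal, the common type is a std::string prvalue, so even the in-bounds branch materializes a copy:

```cpp
// Lifetime-extension sketch (names mirror the diff above).
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> text_utf = {"a", "b"};
    int i = 1;

    // OK: the ternary yields a std::string temporary (either a copy of
    // text_utf[i + 1] or one constructed from ""); binding it to a const
    // reference extends its lifetime to match utf_char_next.
    const std::string & utf_char_next =
        (i + 1 < (int)text_utf.size()) ? text_utf[i + 1] : "";

    std::cout << "next: '" << utf_char_next << "'\n"; // prints: next: ''
}
```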
convert-deepseek-coder-hf-to-gguf.py (Outdated)
I'm sure you don't need to copy-paste the whole convert.py for this. What version of convert.py did you base this on? It's not a very recent one.
I referenced #3633 as the base for the convert script.
This is way too much extra (mostly duplicate) code to maintain for a single model (with only tokenizer changes). You need to find a better way to do this.
I'd suggest you take a look at #3838, which introduces …
Thanks for your efforts in #3633 to support converting the deepseek coder model.
I found that the tokenized result of the quantized model is slightly different from the original, because deepseek coder uses a different pre-tokenizer from the GPT-2 style preprocessing implemented in your code. So I added the deepseek coder pre-tokenization pipeline on top of your work.
The convert script is modified from strutive07's work.
I ran the above commands and did some tests to make sure the tokenized results are the same.
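A sketch of the kind of check described above: tokenize a sample with the GGUF model and print the token ids, to diff against the Python tokenizer's output. The llama.h signatures have shifted between versions, so this follows roughly the API of this PR's era and is an assumption, not the PR's actual test code; the model path is hypothetical:

```cpp
// Dump token ids from a GGUF model for comparison with the HF tokenizer.
#include <cstdio>
#include <string>
#include <vector>
#include "llama.h"

int main(int argc, char ** argv) {
    const char * model_path = argc > 1 ? argv[1] : "deepseek-coder.gguf"; // hypothetical
    const std::string text  = "def hello_world():\n    print(\"hi\")";

    llama_backend_init(false);
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(model_path, mparams);

    // Over-allocate: the id count can never exceed the byte count plus BOS.
    std::vector<llama_token> toks(text.size() + 8);
    const int n = llama_tokenize(model, text.c_str(), (int)text.size(),
                                 toks.data(), (int)toks.size(),
                                 /*add_bos*/ true, /*special*/ false);

    for (int i = 0; i < n; ++i) {
        printf("%d ", toks[i]);
    }
    printf("\n");

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```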