Fix missing bos token #6050
Conversation
…y other models too).
Oh shit.. is this why this was happening to me? I too observed it in some models and not others, and resorted to running in verbose mode to double-check. The missing BOS made me think verbose mode wasn't outputting the entire prompt when enabled. When I would manually add the BOS token, it would often get removed by the code prior to your PR.
That's a very important fix, thank you.
This breaks everything in some situations (e.g. using alfred-40B-1023-GGUF with llamacpp_HF), checkbox on or off, doesn't matter. I'm sure it's something small...
Thanks for reporting, @TheLounger! Based on your error log, it seems that there are situations where bos_token_id exists but is None? I added a quick fix here: #6061
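To illustrate that edge case, here is a contrived stand-in class, not real TGW code: the attribute can exist and still be None, so checking for its existence alone is not enough.

```python
class FakeTokenizer:
    # Models hit by the follow-up fix: the attribute exists but is None.
    bos_token_id = None

tok = FakeTokenizer()
# hasattr() alone passes even though there is no usable BOS id:
print(hasattr(tok, "bos_token_id"))   # True
print(tok.bos_token_id is not None)   # False -- this is the check that's needed
```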
Expected behavior
When the UI checkbox to add the BOS token is checked, the user expects the BOS token to be added.
Actual behavior
When the UI checkbox to add the BOS token is checked, some models add the BOS token and some don't. Whether it gets added depends on the (mis)configuration of the model's tokenizer config file.
Scope
Affects both UI users and API users (the API case was tested with the API parameter that adds the BOS token rather than the UI checkbox; see the request sketch at the end of this section).
Affects at least the model loaders that use the Hugging Face transformers tokenizer (e.g. ExllamaV2_HF).
I checked various models I had on disk; about half of my Llama 3 models were affected by this issue. I also tried some Llama 2 models, and none of them were affected.
A few examples of affected models:
I don't know if it makes a difference, but I didn't use the downloader script to download the models; I downloaded them manually.
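As a hedged illustration of the API case: the request below assumes TGW's OpenAI-compatible completions endpoint on the default port, and treats the exact endpoint and field names as assumptions rather than a definitive reference.

```python
import requests

# Assumed local text-generation-webui API endpoint and parameter name;
# adjust to your setup. add_bos_token mirrors the UI checkbox.
response = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    json={
        "prompt": "Hello",
        "max_tokens": 8,
        "add_bos_token": True,  # expected: BOS prepended before generation
    },
    timeout=60,
)
print(response.json())
```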
Cause
My understanding is that Meta distributed some misconfigured tokenizer.json files and silently updated them later. Copies of the misconfigured tokenizer.json files are now circulating among fine-tuners and quantizers.
How are the tokenizer.json files misconfigured? They have the BOS token defined in some places, but not in all the places where it's needed. I don't really know what these files are supposed to contain, but I looked at one Llama 3 tokenizer file that appeared to work correctly, and it contained a definition, missing from the misconfigured tokenizer files, that causes the BOS token to be prepended when special tokens are added.
The way the BOS token is implemented in TGW is that the transformers tokenizer is called with add_special_tokens=True, so if the BOS token is not defined among the tokenizer's special tokens, it won't be added.
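A minimal sketch of that mechanism, assuming a local model path (the path is a placeholder):

```python
from transformers import AutoTokenizer

# "path/to/model" is a placeholder for a local checkpoint directory.
tokenizer = AutoTokenizer.from_pretrained("path/to/model")

# This is how TGW requests special tokens; with a misconfigured
# tokenizer.json the BOS token is silently not prepended.
ids = tokenizer.encode("Hello", add_special_tokens=True)
print("bos_token_id:", tokenizer.bos_token_id)
print("BOS was added:", bool(ids) and ids[0] == tokenizer.bos_token_id)
```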
Do we need a fix in TGW if it's a model issue?
We need a fix in TGW because we have a UI checkbox and an API parameter that set user expectations.
Also, many models' output quality is severely degraded in TGW if this is not fixed (even if it's technically somebody else's fault).
How to fix this
After the transformers tokenizer encodes the prompt, we check that the BOS token was added and manually prepend it if it was not.
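A minimal sketch of this approach, not the exact diff in this PR; the model path and the surrounding call site are assumptions:

```python
import torch
from transformers import AutoTokenizer

# Placeholder path; in TGW this would be the already-loaded tokenizer.
tokenizer = AutoTokenizer.from_pretrained("path/to/model")

prompt = "Hello"
input_ids = tokenizer.encode(prompt, add_special_tokens=True, return_tensors="pt")

# Post-encode check: if the tokenizer defines a BOS token but did not
# prepend it (misconfigured tokenizer.json), add it manually.
if (
    tokenizer.bos_token_id is not None
    and input_ids.shape[-1] > 0
    and input_ids[0][0] != tokenizer.bos_token_id
):
    bos = torch.tensor([[tokenizer.bos_token_id]], dtype=input_ids.dtype)
    input_ids = torch.cat((bos, input_ids), dim=1)

print(input_ids)
```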