Support for additional_special_tokens #1221
Conversation
Slight clarification:
Also ensures that the tokens are treated special by the tokenizer, even if the token is already in the vocabulary!
However, at least during training they should be handled fine:

```python
from transformers import AutoTokenizer, AddedToken

yi_tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B-200K")
print(yi_tokenizer("<|im_start|>system\nX<|im_end|>")["input_ids"])
print(
    yi_tokenizer.add_tokens(
        [
            AddedToken("<|im_start|>", lstrip=False, rstrip=False, normalized=False),
            AddedToken("<|im_end|>", lstrip=False, rstrip=False, normalized=False),
        ]
    )
)
print(yi_tokenizer("<|im_start|>system\nX<|im_end|>")["input_ids"])
```

Results in the expected:

```
[59666, 59705, 622, 59593, 5858, 46826, 10707, 144, 59733, 59666, 59705, 622, 59593, 701, 46826]
2
[6, 10707, 144, 59733, 7]
```
Thank you for the PR. I ran into this issue today as well. Running your code on the Mistral base model with this config:

```yaml
special_tokens:
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
  additional_special_tokens: ["<|im_start|>"]
```

did yield an error.
Changing the config to this worked:

```yaml
special_tokens:
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
  additional_special_tokens: ["<|im_start|>"]
tokens:
  - "<|im_start|>"
```

I believe it may have to do with the fact that Mistral doesn't have these tokens in its vocabulary already.
Thank you for testing! Do you actually see tokenization differences? I ran over the whole training set with these two options: (A) only using `tokens` and (B) using `additional_special_tokens`, and I did not find a single example where the token ids would differ. If you did, can you share an example so I can test with it?

On the issue you mention, I tried to reproduce it with the following:

```python
# %%
from transformers import AutoTokenizer, AddedToken

tokens = ["<|im_start|>", "<|im_end|>"]

tok1 = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tok2 = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

tok1.add_tokens(
    [
        AddedToken(token, rstrip=False, lstrip=False, normalized=False)
        for token in tokens
    ]
)
tok2.add_special_tokens({"additional_special_tokens": tokens})

# %%
inputs = [
    "<|im_start|>user\nhello<|im_end|>",
    "<|im_start|>user\nhello<|im_end|>\n<|im_start|> assistant\nworld<|im_end|>",
]

for input in inputs:
    print(tok1.tokenize(input))
    print(tok1(input)["input_ids"])
    print(tok2.tokenize(input))
    print(tok2(input)["input_ids"])
    print()
```

It does not raise an exception, and the outputs are the same (in this case the ids are swapped, but that's fine):

```
['<|im_start|>', '▁user', '<0x0A>', 'hello', '<|im_end|>']
[1, 32000, 2188, 13, 21558, 32001]
['<|im_start|>', '▁user', '<0x0A>', 'hello', '<|im_end|>']
[1, 32001, 2188, 13, 21558, 32000]
['<|im_start|>', '▁user', '<0x0A>', 'hello', '<|im_end|>', '▁', '<0x0A>', '<|im_start|>', '▁', '▁assistant', '<0x0A>', 'world', '<|im_end|>']
[1, 32000, 2188, 13, 21558, 32001, 28705, 13, 32000, 28705, 13892, 13, 9471, 32001]
['<|im_start|>', '▁user', '<0x0A>', 'hello', '<|im_end|>', '▁', '<0x0A>', '<|im_start|>', '▁', '▁assistant', '<0x0A>', 'world', '<|im_end|>']
[1, 32001, 2188, 13, 21558, 32000, 28705, 13, 32001, 28705, 13892, 13, 9471, 32000]
```

So I am not sure what the issue is. The stack trace does not have much information, unfortunately :-/
@DreamGenX, I think the difference between the actual implementation and your test code above is that you are adding both tokens to the vocabulary and then setting them as special; setting `additional_special_tokens` on its own does not add the tokens to the vocabulary in the same way.
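One way to sanity-check this is to compare the two approaches directly. A minimal sketch, reusing the Mistral checkpoint from the thread (the printed checks are illustrative, not from the original comment):

```python
from transformers import AutoTokenizer, AddedToken

tokens = ["<|im_start|>", "<|im_end|>"]

tok_a = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tok_b = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Approach A: add the tokens to the vocabulary explicitly (not marked special).
added_a = tok_a.add_tokens(
    [AddedToken(t, lstrip=False, rstrip=False, normalized=False) for t in tokens]
)

# Approach B: register them only as additional special tokens.
added_b = tok_b.add_special_tokens({"additional_special_tokens": tokens})

# Compare how many entries each call reports as added, the resulting tokenizer
# sizes, and which tokens end up registered as additional special tokens.
print(added_a, len(tok_a), tok_a.additional_special_tokens)
print(added_b, len(tok_b), tok_b.additional_special_tokens)
```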
I also get this error if the `additional_special_tokens` are not already in the vocabulary or included in `tokens`.
Alright, found the bug and added a test for it: dea2d31
Thank you for the fix! Just to clarify, you can see in the example that
Rebasing from main to pick up fixes for CI failures caused by the torch 2.2.0 release.
What if the `additional_special_tokens` are already present in the `tokenizer_config.json` of the model that is being fine-tuned? Is it still necessary to put them in the axolotl YAML manifest, or will they just be picked up as expected from the tokenizer config?
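For checking what a base model already registers, a minimal sketch (the Yi checkpoint is just an example, not part of the original question):

```python
from transformers import AutoTokenizer

# Substitute the base model you are fine-tuning.
tok = AutoTokenizer.from_pretrained("01-ai/Yi-34B-200K")

# Shows which tokens the tokenizer already treats as special, including any
# additional_special_tokens loaded from the model's tokenizer config.
print(tok.special_tokens_map)
print(tok.additional_special_tokens)
print(tok.all_special_tokens)
```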
@winglian @DreamGenX Based on all the discussion and changes above, it is now quite confusing to know how to properly set up your config.yml for training on a ShareGPT dataset with the ChatML chat template. Would the following stripped-down example work as intended? It mirrors those found in other issues, but it's hard to know what's current at this point. Note this is for a model whose tokenizer does not already contain `<|im_start|>` and `<|im_end|>`.
* Support for additional_special_tokens
* Support for additional_special_tokens. Adjust whitespace.
* Support for additional_special_tokens. Use correct quotes.
* Support for additional_special_tokens. Safe pop.
* Support for additional_special_tokens. nt.
* Support for additional_special_tokens. cfg.special_tokens may be None.
* add token if not in vocabulary when adding additional_special_tokens
* fix logic for copy/pasta
* bugfix for popping from config and tokenizer reload
* no need to add tokens manually now with previous bugfix

---------

Co-authored-by: Wing Lian <[email protected]>
Description
This lets users specify the `additional_special_tokens` key of the `add_special_tokens` method, via the `special_tokens` section of the axolotl config.

Motivation and Context
Special tokens are treated differently by the tokenizer, ensuring that they are never broken down into smaller pieces. This is important for some use cases, like ChatML.
Consider Yi-200K, which has `<|im_start|>` and `<|im_end|>` in its vocabulary already. They are, however, not marked as special in the base variant. This means that tokenizing a ChatML-formatted string splits these markers into several ordinary tokens (see the sketch below).
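A minimal sketch of this behavior, adapted from the Yi snippet in the first comment of this thread (the ids in the comment are the ones reported there):

```python
from transformers import AutoTokenizer

yi_tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B-200K")

# With the stock tokenizer, the ChatML markers are split into several pieces:
print(yi_tokenizer("<|im_start|>system\nX<|im_end|>")["input_ids"])
# -> [59666, 59705, 622, 59593, 5858, 46826, 10707, 144, 59733, 59666, 59705, 622, 59593, 701, 46826]
```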
However, adding these tokens as special fixes the issue: the markers stay intact and map to their existing ids (see the sketch below).
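Again a sketch adapted from that snippet; it uses `add_tokens` with `AddedToken` objects, whereas this PR instead exposes the `additional_special_tokens` route through the config:

```python
from transformers import AutoTokenizer, AddedToken

yi_tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B-200K")

# Re-register the ChatML markers so the tokenizer never splits them.
yi_tokenizer.add_tokens(
    [
        AddedToken("<|im_start|>", lstrip=False, rstrip=False, normalized=False),
        AddedToken("<|im_end|>", lstrip=False, rstrip=False, normalized=False),
    ]
)

# The markers now stay intact and map to their existing ids (6 and 7 for Yi-200K):
print(yi_tokenizer("<|im_start|>system\nX<|im_end|>")["input_ids"])
# -> [6, 10707, 144, 59733, 7]
```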
This change will let users handle these cases correctly.
How has this been tested?
I ran a training for 1 step after adding the new `additional_special_tokens` entry under `special_tokens` to my config.
Without this, it's likely that many ChatML models are in fact semi-broken, because it's common practice to add `<|im_start|>` and `<|im_end|>` as ordinary, non-special tokens (see the sketch below). This means that the tokenizer might incorrectly tokenize `<|im_start|>` strings.
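A guess at the kind of setup meant here (an assumption, not taken from the original description): adding the marker strings as plain tokens, which puts them in the vocabulary without registering them as special:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example base model

# Plain strings go in as regular added tokens: they are not listed among the
# tokenizer's special tokens, so special-token handling does not apply to them.
tok.add_tokens(["<|im_start|>", "<|im_end|>"])

print(tok.additional_special_tokens)  # the ChatML markers are not listed here
```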
Social Handles (Optional)
dreamgen on discord
https://twitter.com/DreamGenAI