Duplicate Token `<s>` in Tokenizer Encoded Token ids #2899

zxybazh · 2024-02-17T06:39:51Z

While inspecting the tokenizer output for the llama-2-7b-chat-hf model, I noticed that the prompt_token_ids generated in this place contain an extra <s> token at the beginning of the sentence.

For example, given the following prompt <s>[INST] what is the color of the snow? [/INST] , the hf tokenizer tokenizes it directly to

['<s>', '▁[', 'INST', ']', '▁what', '▁is', '▁the', '▁color', '▁of', '▁the', '▁snow', '?', '▁[', '/', 'INST', ']']
[1, 518, 25580, 29962, 825, 338, 278, 2927, 310, 278, 15007, 29973, 518, 29914, 25580, 29962]

but for the very same prompt, vllm generates the following tokenized prompt ids

[1, 1, 518, 25580, 29962, 825, 338, 278, 2927, 310, 278, 15007, 29973, 518, 29914, 25580, 29962]

which has an extra token 1, i.e. <s>, at the beginning.
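The difference between the two id lists can be checked, and worked around after the fact, with a small helper. This is only a sketch: `collapse_leading_bos` is a hypothetical name, not part of vllm or transformers, and it assumes that id 1 is `<s>`, as the lists above suggest.

```python
from typing import List

# Assumption: 1 is the id of <s>, as in the token lists quoted in this issue.
BOS_ID = 1


def collapse_leading_bos(token_ids: List[int], bos_id: int = BOS_ID) -> List[int]:
    """Return a copy of token_ids with any run of leading BOS ids reduced to one."""
    run = 0
    while run < len(token_ids) and token_ids[run] == bos_id:
        run += 1
    if run <= 1:
        # Zero or one leading BOS: nothing to collapse.
        return list(token_ids)
    return [bos_id] + list(token_ids[run:])


# Ids copied from the issue text.
hf_ids = [1, 518, 25580, 29962, 825, 338, 278, 2927, 310, 278,
          15007, 29973, 518, 29914, 25580, 29962]
vllm_ids = [1, 1, 518, 25580, 29962, 825, 338, 278, 2927, 310, 278,
            15007, 29973, 518, 29914, 25580, 29962]

# After collapsing the duplicated <s>, both lists are identical.
assert collapse_leading_bos(vllm_ids) == hf_ids
```

Post-processing the ids like this only hides the symptom; it does not explain where the second `<s>` comes from.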

Could someone help me confirm whether this is intended behaviour or caused by some of the model options?


CatherineSue · 2024-06-03T23:48:07Z

I think this is because the tokenizer_config.json of llama-2-7b-chat-hf sets add_bos_token to true.
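If that is the cause, the tokenizer prepends `<s>` on its own, so a prompt that already starts with a literal `<s>` ends up with two. One possible caller-side workaround is to drop the literal token from the prompt text before it is tokenized. The helper below is a hypothetical sketch, not vllm's fix or API; the `add_bos_token` argument simply mirrors the config flag mentioned above.

```python
def strip_literal_bos(prompt: str, bos_token: str = "<s>",
                      add_bos_token: bool = True) -> str:
    """Remove one leading literal BOS token when the tokenizer adds its own.

    Hypothetical helper: add_bos_token mirrors the flag in tokenizer_config.json.
    """
    if add_bos_token and prompt.startswith(bos_token):
        return prompt[len(bos_token):]
    return prompt


prompt = "<s>[INST] what is the color of the snow? [/INST]"

# With add_bos_token true, the literal <s> is removed so only the tokenizer's
# own BOS remains after encoding.
cleaned = strip_literal_bos(prompt)
assert cleaned == "[INST] what is the color of the snow? [/INST]"

# With add_bos_token false, the prompt is left untouched.
assert strip_literal_bos(prompt, add_bos_token=False) == prompt
```

When calling a Hugging Face tokenizer directly, passing `add_special_tokens=False` is another way to keep it from adding `<s>` to a prompt that already contains it.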

zxybazh · 2024-06-04T00:17:02Z

Thanks for the clarification. In that case, it should be intended behaviour.

CatherineSue · 2024-06-04T00:56:40Z

A fix has been merged: #4688

DarkLight1337 added the `usage` (How to use vllm) label on May 31, 2024

zxybazh closed this as completed Jun 4, 2024
