
question of chat templates #16

Open
trueRosun opened this issue Jun 13, 2024 · 6 comments

Comments

@trueRosun

Nice work! Starred already.
Sorry for asking, but why replace the bos_token with an empty string?

sample['positive'] = tokenizer.apply_chat_template(
    sample['chosen'], tokenize=False, add_generation_prompt=False
).replace(tokenizer.bos_token, "")
sample['negative'] = tokenizer.apply_chat_template(
    sample['rejected'], tokenize=False, add_generation_prompt=False
).replace(tokenizer.bos_token, "")
@WeiXiongUST
Collaborator

Because when we serve the Bradley-Terry RM with the pipeline, the pipeline automatically adds a bos_token when tokenizing.

For the pair-wise preference model, it is because we trained the model without a bos_token (this was indeed an issue with llama3 at the time). But the influence of the bos token is generally mild.
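
For concreteness, here is a minimal sketch (mine, not from this repo) of the double-bos behavior described above, assuming a Llama-3-style tokenizer whose chat template embeds the bos token in the rendered string:

# Minimal sketch of the double-bos issue; the model name is just an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "hello"}]

# apply_chat_template(tokenize=False) already puts bos at the front of the string.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text.startswith(tokenizer.bos_token))  # True

# Tokenizing that string again (as the serving pipeline does) prepends bos once
# more, so without the .replace you would end up with two leading bos ids.
ids = tokenizer(text).input_ids
print(ids[:2] == [tokenizer.bos_token_id] * 2)  # True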

@trueRosun
Author

Thank you for answering!

I will further check the outputs after tokenization.
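
For reference, one quick way to do that check (my own sketch; tokenizer and sample as in the snippet above):

# Inspect the first few tokens by name, so a duplicated bos such as
# '<|begin_of_text|>' is easy to spot at the front of the sequence.
ids = tokenizer(sample['positive']).input_ids
print(tokenizer.convert_ids_to_tokens(ids[:5]))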

@hunterlang

> Because when we serve the Bradley-Terry RM with the pipeline, the pipeline automatically adds a bos_token when tokenizing.

I don't fully understand...if the inference-time pipeline adds the bos_token automatically, doesn't that mean we should train with the bos token?

@WeiXiongUST
Collaborator

> > Because when we serve the Bradley-Terry RM with the pipeline, the pipeline automatically adds a bos_token when tokenizing.
>
> I don't fully understand... if the inference-time pipeline adds the bos_token automatically, doesn't that mean we should train with the bos token?

Yes, you are correct. Unfortunately, when we trained the model, there was a bug in the llama3 tokenizer, so the model was trained WITHOUT a bos token.

We have tested with and without bos; it can lead to a ~1% difference in RewardBench accuracy. To fix the issue, I guess you could modify the tokenizer to prevent it from adding a bos token automatically...
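
A hedged sketch of that suggestion (whether the flag exists depends on the tokenizer class; for Llama-3 fast tokenizers the bos comes from the post-processor, so opting out per call is the portable route):

# Stop the tokenizer from prepending bos so inference matches a model
# trained without it.
if hasattr(tokenizer, "add_bos_token"):
    # Slow SentencePiece-style tokenizers expose this flag directly.
    tokenizer.add_bos_token = False

# The portable alternative: opt out of special tokens per call.
ids = tokenizer("hello", add_special_tokens=False).input_ids
print(ids[:1] != [tokenizer.bos_token_id])  # True: no bos prepended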

@hunterlang

Thanks for the reply! Just to clarify:

1. If I remove those .replace(tokenizer.bos_token, "") calls, then training should match inference, because the inference pipeline adds BOS automatically.
2. If I modify the tokenizer, then the inference pipeline will match the off-the-shelf models you already released, which were trained without BOS?

@WeiXiongUST
Collaborator

We get a bos token in the string when we apply the chat template. Then, inside the pipeline, we get another bos token when tokenizing.

If you keep .replace(tokenizer.bos_token, "") (stripping the template's bos), you still get the one bos token added inside the pipeline. If you drop .replace(tokenizer.bos_token, ""), you will get two bos tokens.

If we modify the tokenizer so that it does not add a bos token, then we never get a bos token at all, which matches the training (no bos token).
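
Putting the three cases together in one sketch (same assumptions and tokenizer as above):

def leading_bos(ids):
    # Count how many consecutive bos ids start the sequence.
    n = 0
    while n < len(ids) and ids[n] == tokenizer.bos_token_id:
        n += 1
    return n

text = tokenizer.apply_chat_template(messages, tokenize=False)

# Keep the template's bos and let the pipeline add another: two bos tokens.
print(leading_bos(tokenizer(text).input_ids))  # 2
# Strip bos from the string (the .replace above): one bos, from the pipeline.
stripped = text.replace(tokenizer.bos_token, "")
print(leading_bos(tokenizer(stripped).input_ids))  # 1
# Also stop the tokenizer from adding bos: zero, matching the training setup.
print(leading_bos(tokenizer(stripped, add_special_tokens=False).input_ids))  # 0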
