Training and evaluating for pair_pm model. #21

Open
t-sifanwu opened this issue Jul 9, 2024 · 5 comments

@t-sifanwu

t-sifanwu commented Jul 9, 2024

Hi,

I have replicated the training and evaluation for the pair_pm model, but I could not reproduce the results reported in Table 2 of the paper. The best results I obtained were with the checkpoint pm_models/llama3-8b-it_bs128_lr1e-5/checkpoint-1306:

Chat: 63.55
Chat Hard: 63.27
Safety: 82.59
Reasoning: 53.53

The main difference I've noticed is that base_model in your pair_pm/llama3-8b-it.yaml is set to /home/wx/axtool/models/llama3_it_with_padding_token. I couldn't find this model on Hugging Face or anywhere else, so I trained the pair_pm model starting from meta-llama/Meta-Llama-3-8B-Instruct.

Another difference is in eval_reward_bench_pm.py: you load tokenizer and tokenizer_plain from /home/cyeab/axtool/models/llama3_it_427_update, while I used meta-llama/Meta-Llama-3-8B-Instruct instead.

Could you please share the llama3_it_with_padding_token and llama3_it_427_update models with me? Additionally, could you provide details on how you trained them?

Thank you!

@WayXG

WayXG commented Jul 9, 2024

I think llama3_it_with_padding_token is obtained by adding a pad token to the original Llama 3 model. This can be done by running the pair-pm/prepare_model.py script. I did so and the resulting model works as expected.
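
For readers without access to the local model path, here is a minimal sketch of what adding a pad token usually looks like with Hugging Face transformers. The function name and the "[PAD]" token string are assumptions made for illustration and are not taken from prepare_model.py, so check the script for the exact values it uses.

```python
def add_pad_token(tokenizer, model, pad_token="[PAD]"):
    """Register a dedicated pad token and grow the embedding matrix to match.

    `tokenizer` and `model` are expected to be Hugging Face objects, e.g. loaded
    with AutoTokenizer.from_pretrained(...) and AutoModelForCausalLM.from_pretrained(...).
    The default "[PAD]" string is an assumption, not taken from prepare_model.py.
    """
    # add_special_tokens returns how many new tokens were added to the vocabulary.
    num_added = tokenizer.add_special_tokens({"pad_token": pad_token})
    if num_added > 0:
        # Newly added token ids need rows in the embedding matrices.
        model.resize_token_embeddings(len(tokenizer))
    # Keep the model config in sync with the tokenizer.
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```

After this step, calling save_pretrained on both objects with the same output directory would give a local model that base_model in the YAML config can point to.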

axolotl masks some tokens and stops their gradients, so I think the model's padding token needs to be set appropriately to get the expected performance.
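
To illustrate why the pad token can matter here, the following is a toy sketch of label masking in general, not axolotl's actual code. It assumes masking is done by comparing token ids against the pad id, which is one common approach; the token ids and the helper name build_labels are made up for the example.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(input_ids, prompt_len, pad_token_id):
    """Return labels where prompt and padding positions are excluded from the loss."""
    labels = []
    for pos, tok in enumerate(input_ids):
        if pos < prompt_len or tok == pad_token_id:
            labels.append(IGNORE_INDEX)  # no gradient flows from this position
        else:
            labels.append(tok)
    return labels

# Toy ids: 1-3 are the prompt, 7-8 the response, 9 is EOS, 0 is a dedicated pad id.
EOS, PAD = 9, 0
ids = [1, 2, 3, 7, 8, EOS, PAD, PAD]
with_dedicated_pad = build_labels(ids, prompt_len=3, pad_token_id=PAD)

# If EOS is reused as the pad token, the real EOS is masked along with the padding.
ids_eos_pad = [1, 2, 3, 7, 8, EOS, EOS, EOS]
with_eos_as_pad = build_labels(ids_eos_pad, prompt_len=3, pad_token_id=EOS)
```

In the second case the model never receives a training signal for the end-of-sequence position, which is one way a missing dedicated pad token could hurt results. Whether this is the cause of the gap reported above is an assumption.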

@t-sifanwu
Author

t-sifanwu commented Jul 10, 2024

Thanks for your reply! I have another question, about training the bradley-terry-rm models. In bradley-terry-rm/llama3_rm.py, you use the dataset "hendrydong/preference_700K". Is that the same as the mix2 you mentioned in the paper?

@WeiXiongUST
Collaborator

Yes, you can use hendrydong/preference_700K and the script we provide to process it into the format used by the pairwise preference dataset!
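
As a rough illustration of what converting a chosen/rejected record into a pairwise example can involve, here is a hypothetical sketch. The prompt template, the output field names, and the helper name are all assumptions made for this example; the repository's own processing script defines the actual format.

```python
import random

def to_pairwise_example(context, chosen, rejected, rng):
    """Build one pairwise-preference example with a randomized A/B order.

    The "[CONTEXT] ... [RESPONSE A] ... [RESPONSE B] ..." template and the
    "prompt"/"label" keys are hypothetical, not taken from the repository.
    """
    # Randomize which slot holds the preferred response to avoid position bias.
    chosen_is_a = rng.random() < 0.5
    resp_a, resp_b = (chosen, rejected) if chosen_is_a else (rejected, chosen)
    prompt = f"[CONTEXT] {context} [RESPONSE A] {resp_a} [RESPONSE B] {resp_b}"
    return {"prompt": prompt, "label": "A" if chosen_is_a else "B"}

rng = random.Random(0)  # fixed seed so the conversion is reproducible
example = to_pairwise_example("What is 2 + 2?", "4", "5", rng)
```

Randomizing the slot of the preferred response keeps the label from being predictable from position alone.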

@t-sifanwu
Author

> Yes, you can use hendrydong/preference_700K and the script we provide to process it into the format used by the pairwise preference dataset!

Thanks for your reply! The provided data processing script takes input that is already in the standard format. Would it be possible to share the script that extracts the pairs? For example, the script that transforms the original UltraFeedback 63k dataset into the RLHF-format 340k standard dataset.

@WeiXiongUST
Collaborator

> > Yes, you can use hendrydong/preference_700K and the script we provide to process it into the format used by the pairwise preference dataset!
>
> Thanks for your reply! The provided data processing script takes input that is already in the standard format. Would it be possible to share the script that extracts the pairs? For example, the script that transforms the original UltraFeedback 63k dataset into the RLHF-format 340k standard dataset.

Hi, you can check the datasets we provide in the RLHFlow organization on Hugging Face. The dataset card of each dataset includes the script used to build it.
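
Until the dataset-card scripts are checked, one plausible way to extract pairs from prompts with several scored responses can be sketched as follows. This is an assumption about the general approach, not the authors' script; the function name, the input layout, and the choice to skip ties are hypothetical.

```python
from itertools import combinations

def extract_pairs(prompt, scored_responses):
    """Turn one prompt with several scored responses into chosen/rejected pairs.

    `scored_responses` is a list of (response_text, score) tuples.
    Pairs with equal scores are skipped because they carry no preference.
    """
    pairs = []
    for (resp_x, score_x), (resp_y, score_y) in combinations(scored_responses, 2):
        if score_x == score_y:
            continue  # tie: no preference signal
        chosen, rejected = (resp_x, resp_y) if score_x > score_y else (resp_y, resp_x)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

example = extract_pairs(
    "What is 2 + 2?",
    [("4", 9.0), ("5", 2.0), ("four", 9.0), ("22", 1.0)],
)
```

If each prompt has four scored responses, this yields at most 6 pairs per prompt, so roughly 63k prompts would give at most about 378k pairs before ties are dropped, which is in the same range as the 340k figure mentioned above.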
