Phi3 has a nearly constant DPO loss of 0.69xx #17
Comments
This is likely an error in TRL's DPO implementation. Could you check whether Phi3 uses the same padding token and EOS token? If so, could you try to prepare the model with the following code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = 'xxx'
tokenizer_name = name

model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# Register a dedicated [PAD] token so padding is no longer tied to the EOS token
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.config.pad_token_id = tokenizer.pad_token_id
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained("the dir to store the model")
tokenizer.save_pretrained("the dir to store the model")

Then re-run the DPO with this modified model.
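As a quick sanity check for the point above, the token setup can be inspected directly (the model ID is assumed to be microsoft/Phi-3-mini-4k-instruct for illustration):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
print("pad:", tok.pad_token, tok.pad_token_id)
print("eos:", tok.eos_token, tok.eos_token_id)
print("pad == eos:", tok.pad_token_id == tok.eos_token_id)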
I tried both variants: 1. setting the pad token equal to the EOS token, and 2. using L218 in run_dpo.py to add an explicit [PAD] token. In both cases the issue remains.
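For completeness, a compressed sketch of the two variants I tried (the tokenizer ID is assumed, not taken from the repo):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Variant 1: reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

# Variant 2: add an explicit [PAD] token instead (this requires resizing the
# model's embeddings afterwards, as in the snippet earlier in the thread)
# tokenizer.add_special_tokens({'pad_token': '[PAD]'})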
Maybe you can try the latest TRL to train the DPO. Essentially, the inference, data annotation, and DPO training are separate, so you can modify them independently.
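For reference, a minimal standalone sketch of a DPO run with TRL's DPOTrainer, assuming the trl 0.9.x API; the dataset file, output directory, and hyperparameters below are placeholders, not settings from this repo:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "the dir with the pad-token-patched model"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The dataset is expected to provide "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("json", data_files="preference_pairs.json", split="train")

args = DPOConfig(
    output_dir="phi3-dpo-iter1",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=128,  # effective batch size of 128
    learning_rate=5e-7,
    beta=0.1,
    bf16=True,
)

# With ref_model=None, TRL builds the reference model from a copy of `model`.
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()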
Which version of transformers and trl are you using?
I am using transformers 4.42.4 and trl 0.9.6. I tried using the DPOTrainer from TRL and still faced the same issue. @WeiXiongUST if you get time, could you please try Phi3? That would be helpful!
Issue: Implementing Iterative DPO on Phi3-4k-instruct
Hi, thanks for the great work and open source!
I am trying to implement iterative DPO on Phi3-4k-instruct. The following outlines my approach:
1. Generation Step
2. Reward Annotation
Note: I have commented line 124 and uncommented line 123 in this file to handle the chat template of Phi3 differently from the Llama3-based reward model. This might be incorrect, as I have not modified the change_of_format() function (see the chat-template sketch below)!
3. DPO Iteration
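Regarding the chat-template note in step 2, here is an illustration only (this is not the repository's change_of_format() function, and both model identifiers are placeholders): the two templates can be rendered through each tokenizer rather than hand-written strings.

from transformers import AutoTokenizer

# Hypothetical identifiers for illustration; substitute the models actually used.
policy_tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
reward_tok = AutoTokenizer.from_pretrained("path-to-the-llama3-based-reward-model")

conversation = [
    {"role": "user", "content": "Explain DPO in one sentence."},
    {"role": "assistant", "content": "DPO fine-tunes a model directly on preference pairs."},
]

# Phi3 and Llama3 use different special tokens, so the same conversation must be
# rendered with each model's own template before generation or reward scoring.
phi3_text = policy_tok.apply_chat_template(conversation, tokenize=False)
reward_text = reward_tok.apply_chat_template(conversation, tokenize=False)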
After performing these steps, the DPO loss is stuck at 0.69xx. I am running at a batch size of 128 and a learning rate of 5e-7.
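For context, 0.69 is essentially ln 2 ≈ 0.693, which is what the standard DPO loss evaluates to whenever the implied reward margin between the chosen and rejected responses is zero, for example when the policy never moves away from the reference or when token/masking issues make the chosen and rejected log-ratios cancel:

\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\Big(\beta\big[(\log \pi_\theta(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_w \mid x)) - (\log \pi_\theta(y_l \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x))\big]\Big)

With a zero margin inside the brackets, the loss is -\log \sigma(0) = \log 2 \approx 0.693, so a loss pinned at 0.69xx suggests the trainer is receiving no useful preference signal.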
Any insights to help get a Phi3 variant of iterative DPO would be greatly appreciated.
Thanks!