
Phi3 has a nearly constant DPO loss of 0.69xx #17

Open
Arnav0400 opened this issue Jul 18, 2024 · 6 comments

Comments

@Arnav0400

Issue: Implementing Iterative DPO on Phi3-4k-instruct

Hi, thanks for the great work and open source!

I am trying to implement iterative DPO on Phi3-4k-instruct. The following outlines my approach:

  1. Generation Step:

    python generation/gen_hf.py --ports 8000 8001 8002 8003 --tokenizer microsoft/Phi-3-mini-4k-instruct --dataset_name_or_path $jsonl_input --output_dir $json_output --K 8 --temperature 1.0
  2. Reward Annotation:

    accelerate launch annotate_data/get_rewards.py --dataset_name_or_path $json_output --output_dir $model_output

    Note: I have commented out line 124 and uncommented line 123 in this file to handle Phi3's chat template differently from the Llama3-based reward model's. This might be incorrect, as I have not modified the change_of_format() function (see the sketch after this list).

  3. DPO Iteration:

    accelerate launch dpo_iteration/run_dpo.py --run_name $iteration --output_dir $iteration --model_name_or_path microsoft/Phi-3-mini-4k-instruct --ref_model microsoft/Phi-3-mini-4k-instruct --learning_rate 5e-7 --max_steps 1200 --choose_type max_min --train_dir $model_output --eval_dir $model_output --loss_type sigmoid --lr_scheduler_type cosine
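
Here is the sketch referenced in the note under step 2: a hypothetical adaptation of change_of_format() for Phi3 generations. This is an illustration only; the reward-model path is a placeholder, and the actual function in annotate_data/get_rewards.py may have a different signature.

# Hypothetical adaptation of change_of_format() for Phi3 generations
# (a sketch, not the repository's actual implementation).
from transformers import AutoTokenizer

# Tokenizer of the Llama3-based reward model; substitute the real model path.
rm_tokenizer = AutoTokenizer.from_pretrained("path/to/llama3-reward-model")

def change_of_format(prompt, response):
    # Strip Phi3 chat-template markers that may have leaked in from generation.
    for marker in ("<|user|>", "<|assistant|>", "<|end|>", "<|endoftext|>"):
        prompt = prompt.replace(marker, "")
        response = response.replace(marker, "")
    messages = [
        {"role": "user", "content": prompt.strip()},
        {"role": "assistant", "content": response.strip()},
    ]
    # Re-render the pair with the reward model's own chat template before scoring.
    return rm_tokenizer.apply_chat_template(messages, tokenize=False)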

After performing these steps, the DPO loss is stuck at 0.69xx. I am running with a batch size of 128 and a learning rate of 5e-7.
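
For context on why 0.69xx specifically is suspicious: with loss_type sigmoid, the DPO loss is -log sigmoid(beta * margin), so a zero implicit-reward margin gives exactly ln 2 ≈ 0.6931. A loss pinned at that value usually means the policy's log-probabilities are not moving relative to the reference model (for example, because the relevant tokens are being masked out).

# Sanity check: the sigmoid DPO loss at zero implicit-reward margin is ln(2).
import math

beta = 0.1    # DPO temperature (TRL's default)
margin = 0.0  # (logp_chosen - logp_ref_chosen) - (logp_rejected - logp_ref_rejected)
loss = -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
print(loss)   # 0.6931..., i.e. ln(2)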

Any insights to help get a Phi3 variant of iterative DPO working would be greatly appreciated.

Thanks!

@WeiXiongUST
Contributor

This is likely an issue with TRL's DPO implementation. Could you check whether Phi3 has the same padding token and EOS token? If so, could you try preparing the model with the following code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = 'xxx'
tokenizer_name = name

model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# Add a dedicated [PAD] token so the pad token no longer collides with the EOS
# token, register it on the model config, and resize the embeddings to match.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.config.pad_token_id = tokenizer.pad_token_id
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained("the dir to store the model")
tokenizer.save_pretrained("the dir to store the model")

Then, re-run the DPO training with this modified model.
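
A quick way to verify the first part of this suggestion, i.e. whether Phi3's pad token collides with its EOS token, is a check along these lines (a sketch; the exact consequence depends on how the trainer masks padding):

# Check whether Phi-3 reuses its EOS token as the pad token.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
print("pad:", tok.pad_token, tok.pad_token_id)
print("eos:", tok.eos_token, tok.eos_token_id)
if tok.pad_token_id == tok.eos_token_id:
    # If they collide, label masking on padding can also hide real EOS tokens,
    # which is one common way the DPO loss ends up flat.
    print("pad == eos: consider adding a dedicated [PAD] token as above")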

@Arnav0400
Author

I tried both variants: 1. setting the pad token equal to the EOS token, and 2. using L218 in run_dpo.py to add an explicit [PAD] token. In both cases the issue remains.

@WeiXiongUST
Contributor

Maybe you can try the latest TRL to train the DPO. Essentially, the inference, data annotation, and DPO training are separate, so you can modify each of them independently.
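
For reference, a minimal standalone DPO run with a recent TRL version might look like the sketch below. The dataset path and hyperparameters are placeholders, and the DPOTrainer keyword for the tokenizer differs across TRL versions (tokenizer= in 0.9.x, processing_class= in newer releases).

# Minimal standalone DPO sanity check on Phi-3 with recent TRL (a sketch).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data with "prompt", "chosen", "rejected" columns (placeholder file).
train_dataset = load_dataset("json", data_files="preferences.json", split="train")

args = DPOConfig(
    output_dir="phi3-dpo-sanity-check",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,
    loss_type="sigmoid",
    max_steps=100,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # processing_class=tokenizer on newer TRL versions
)
trainer.train()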


@srzer

srzer commented Jul 24, 2024

Which version of transformers did you use for training Phi3?

@Arnav0400
Author

I am using transformers 4.42.4 and trl 0.9.6. I tried using the DPOTrainer from TRL and still faced the same issue.

@WeiXiongUST, if you get time, could you please try Phi3? That would be helpful!
