
Questions About Chosen-Data Strategies #29

Open
nantenT opened this issue Dec 9, 2024 · 2 comments

Comments


nantenT commented Dec 9, 2024

Hi, amazing work, and thank you for making it open source!

  1. After reviewing your code, I noticed that multiple pair-selection strategies are included for constructing DPO preference pairs. Have you compared these strategies, and if so, which one tends to perform better?

  2. When incorporating chosen preference data (via SFT) into the original model, if the original model's output distribution is completely inconsistent with the chosen data and of lower quality, would you recommend using OOD chosen responses plus generated responses as preference pairs for training, or only preference pairs generated by the original model?

Thanks in advance for your insights!

hendrydong (Contributor) commented:

Hi, thanks for your interest in our work.

For 1, we have found that the max-min pair performs the best in our experiments.
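For reference, here is a minimal sketch of what max-min pair selection could look like, assuming n sampled responses per prompt and a reward-model scoring function (the names `score_fn`, `prompt`, and `responses` are illustrative assumptions, not the repository's actual API):

```python
from typing import Callable, List, Tuple

def max_min_pair(
    prompt: str,
    responses: List[str],
    score_fn: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Max-min pair: the highest-reward response becomes `chosen`,
    the lowest-reward response becomes `rejected`."""
    scores = [score_fn(prompt, r) for r in responses]
    chosen = responses[scores.index(max(scores))]
    rejected = responses[scores.index(min(scores))]
    return chosen, rejected
```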

For 2, we'd suggest conducting SFT first and then performing online DPO, so that your policy model is good enough to generate reasonable samples. If your policy model is not good enough, the sample efficiency will be very low (best-of-n with a large n just to obtain a good example).
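As a rough illustration of the sample-efficiency point (purely hypothetical numbers, not from the paper): if the policy produces a reasonable sample with probability p per draw, the chance that a best-of-n batch contains at least one is 1 - (1 - p)^n, so a weak policy forces a much larger n.

```python
import math

def n_needed(p: float, target: float = 0.95) -> int:
    """Smallest n such that P(at least one good sample among n) >= target,
    assuming each draw is independently 'good' with probability p."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

print(n_needed(0.30))  # stronger policy -> 9 samples
print(n_needed(0.02))  # weak policy    -> 149 samples
```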

@nantenT nantenT closed this as completed Dec 10, 2024
@nantenT nantenT reopened this Dec 10, 2024
nantenT (Author) commented Dec 10, 2024

Thank you so much for taking the time to respond! I truly appreciate your insights.

Regarding point 2, I wanted to seek further clarification: does the process involve performing SFT first, followed by DPO? Specifically, is the SFT step meant to align the distribution by fine-tuning on the chosen outputs from the DPO preference pairs? If so, does this imply that the chosen outputs in the DPO pairs need to be of particularly high quality?

Or is it sufficient to use open-source instruction-tuning datasets to bring the model to a usable level, without worrying about the differences between the SFT data and the DPO pairs? In that case, would the primary criterion for deciding whether a DPO pair is usable simply be the RM's scores for the chosen and rejected samples?

Thank you again for your patience and for sharing your expertise!
