Question about KL Divergence Evaluation in DPO Implementation
I read the paper "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint" and noticed your comparison of different methods and iterations using reward-KLD metrics.
In the implementation, I see that during training, you compute the KL divergence in DPO as follows:
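(I may be paraphrasing; the snippet below is a simplified sketch of what I believe the computation looks like rather than a verbatim copy of your code, and `sequence_logps`, `policy_logits`, `ref_logits`, and `loss_mask` are my own placeholder names.)

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of the realized tokens over the masked (response) span."""
    logps = F.log_softmax(logits, dim=-1)                                   # (batch, seq, vocab)
    token_logps = torch.gather(logps, 2, labels.unsqueeze(-1)).squeeze(-1)  # (batch, seq)
    return (token_logps * mask).sum(dim=-1)                                 # (batch,)

def batch_kl_estimate(policy_logits, ref_logits, labels, loss_mask):
    """Mean sequence-level log-ratio log pi_theta(y|x) - log pi_ref(y|x); when the
    responses are sampled from pi_theta, this is a Monte Carlo estimate of
    KL(pi_theta || pi_ref)."""
    policy_logps = sequence_logps(policy_logits, labels, loss_mask)
    ref_logps = sequence_logps(ref_logits, labels, loss_mask)
    return (policy_logps - ref_logps).mean()
```

If that is not how the logged KL is computed during training, please correct me.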
I have a few questions about how KLD is evaluated after training:
1. What is your evaluation methodology for computing KLD? Do you sample responses from the trained policy and estimate the KL from policy/reference log-probability ratios, or compute it some other way?
2. Regarding the evaluation inputs: do you use a fixed prompt set (for example, the test set from the paper) or the training prompts, and how many responses do you sample per prompt?
This information would help clarify the practical aspects of implementing and evaluating the KL divergence constraints described in the paper.
Thanks!

We use a fixed test prompt set to evaluate the KL divergence; it is the test set described in the paper. For each prompt, we sample only one response. We also tried sampling multiple responses, but the results were similar, so for simplicity we use a single response.
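A minimal sketch of this procedure, assuming the KL is estimated from the policy/reference log-probability ratio of each sampled response (`policy`, `ref_model`, `tokenizer`, and `test_prompts` are placeholder names, not the repository's actual evaluation script):

```python
import torch

@torch.no_grad()
def eval_kl(policy, ref_model, tokenizer, test_prompts, device="cuda", max_new_tokens=512):
    """Estimate KL(pi_theta || pi_ref) on a fixed prompt set, one sampled response per prompt."""
    log_ratios = []
    for prompt in test_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        prompt_len = inputs["input_ids"].shape[1]

        # One response per prompt, sampled from the trained policy.
        full_ids = policy.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens)

        # Score the full sequence under both the policy and the frozen reference model.
        labels = full_ids[:, 1:]
        policy_logps = torch.log_softmax(policy(full_ids).logits[:, :-1], dim=-1)
        ref_logps = torch.log_softmax(ref_model(full_ids).logits[:, :-1], dim=-1)
        token_policy = torch.gather(policy_logps, 2, labels.unsqueeze(-1)).squeeze(-1)
        token_ref = torch.gather(ref_logps, 2, labels.unsqueeze(-1)).squeeze(-1)

        # Only the generated (response) tokens contribute to the KL estimate.
        resp = slice(prompt_len - 1, None)
        log_ratios.append((token_policy[:, resp] - token_ref[:, resp]).sum().item())

    return sum(log_ratios) / len(log_ratios)
```

With one response per prompt, each term is a single-sample Monte Carlo estimate of the per-prompt KL; averaging over the fixed test prompts gives the reported value.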