
Reward-KL Comparison #27

Open
vincezh2000 opened this issue Nov 10, 2024 · 1 comment

@vincezh2000

Question about KL Divergence Evaluation in DPO Implementation

I read the paper "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint" and noticed your comparison of different methods and iterations using reward-KLD metrics.

In the implementation, I see that during training, you compute the KL divergence in DPO as follows:

# Compute policy log-ratios
pi_logratios = policy_chosen_logps - policy_rejected_logps

# Handle reference model computations
if reference_free:
    ref_logratios = torch.tensor([0], dtype=pi_logratios.dtype, device=pi_logratios.device)
else:
    ref_logratios = reference_chosen_logps - reference_rejected_logps

# Move tensors to device
pi_logratios = pi_logratios.to(device)
ref_logratios = ref_logratios.to(device)

# Compute final preference logits
logits = pi_logratios - ref_logratios
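
For context, these preference logits then enter the DPO objective as -log σ(β · logits). A minimal sketch of that final step, assuming a β hyperparameter and no label smoothing (illustrative, not the repo's exact code):

import torch.nn.functional as F

beta = 0.1  # KL-penalty strength; illustrative value only
# DPO loss per preference pair: -log(sigmoid(beta * (pi_logratios - ref_logratios)))
losses = -F.logsigmoid(beta * logits)
loss = losses.mean()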

I have a few questions about how KLD is evaluated after training:

  1. What is your evaluation methodology for computing KLD? Do you:
     • Sample multiple responses for each input?
     • Average KLD across these samples?
  2. Regarding the evaluation inputs:
     • Do you use different inputs for each response?
     • Is there a specific sampling strategy for the inputs?

This information would help clarify the practical aspects of implementing and evaluating the KL divergence constraints described in the paper.

Thanks!


@WeiXiongUST
Contributor

We use a fixed test prompt set to evaluate the KL divergence, namely the test set described in the paper. For each prompt we sample only one response. We also tried sampling multiple responses per prompt, but the results were similar, so for simplicity we use a single response.
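
In other words, the per-prompt estimate is just log πθ(y|x) − log π_ref(y|x) for the single sampled response y, averaged over the test prompts. A minimal sketch of such a single-sample estimator, assuming per-token logits for the sampled response are already available (function and variable names are illustrative, not this repo's evaluation code):

import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_estimate(policy_logits, ref_logits, response_ids, response_mask):
    # Per-token log-probs of the sampled response under the policy and the reference model.
    policy_logps = F.log_softmax(policy_logits, dim=-1).gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    ref_logps = F.log_softmax(ref_logits, dim=-1).gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    # Single-sample Monte Carlo estimate of KL(pi_theta || pi_ref) for this prompt:
    # log pi_theta(y|x) - log pi_ref(y|x), summed over the response tokens.
    return ((policy_logps - ref_logps) * response_mask).sum()

# The reported number is the average of kl_estimate over the fixed test prompt set.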
