
Reward-KL Comparison #27

Open
vincezh2000 opened this issue Nov 10, 2024 · 1 comment

@vincezh2000

Question about KL Divergence Evaluation in DPO Implementation

I read the paper "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint" and noticed your comparison of different methods and iterations using reward-KLD metrics.

In the implementation, I see that during training, you compute the KL divergence in DPO as follows:

# Compute policy log-ratios
pi_logratios = policy_chosen_logps - policy_rejected_logps

# Handle reference model computations
if reference_free:
    ref_logratios = torch.tensor([0], dtype=pi_logratios.dtype, device=pi_logratios.device)
else:
    ref_logratios = reference_chosen_logps - reference_rejected_logps

# Move tensors to device
pi_logratios = pi_logratios.to(device)
ref_logratios = ref_logratios.to(device)

# Compute final preference logits
logits = pi_logratios - ref_logratios
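
For context, these preference logits then enter the DPO objective as -log σ(β · logits). A minimal sketch of that final step, assuming a β hyperparameter and no label smoothing (illustrative, not the repo's exact code):

import torch.nn.functional as F

beta = 0.1  # KL-penalty strength; illustrative value only
# DPO loss per preference pair: -log(sigmoid(beta * (pi_logratios - ref_logratios)))
losses = -F.logsigmoid(beta * logits)
loss = losses.mean()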

I have a few questions about how KLD is evaluated after training:

  1. What is your evaluation methodology for computing KLD? Do you:
     • Sample multiple responses for each input?
     • Average KLD across these samples?
  2. Regarding the evaluation inputs:
     • Do you use different inputs for each response?
     • Is there a specific sampling strategy for the inputs?

This information would help clarify the practical aspects of implementing and evaluating the KL divergence constraints described in the paper.

Thanks!


@WeiXiongUST
Contributor

We use a fixed test prompt set to evaluate the KL divergence, namely the test set described in the paper. For each prompt we sample only one response. We also tried sampling multiple responses per prompt, but the results were similar, so for simplicity we use a single response.
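
In other words, the per-prompt estimate is just log πθ(y|x) − log π_ref(y|x) for the single sampled response y, averaged over the test prompts. A minimal sketch of such a single-sample estimator, assuming per-token logits for the sampled response are already available (function and variable names are illustrative, not this repo's evaluation code):

import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_estimate(policy_logits, ref_logits, response_ids, response_mask):
    # Per-token log-probs of the sampled response under the policy and the reference model.
    policy_logps = F.log_softmax(policy_logits, dim=-1).gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    ref_logps = F.log_softmax(ref_logits, dim=-1).gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    # Single-sample Monte Carlo estimate of KL(pi_theta || pi_ref) for this prompt:
    # log pi_theta(y|x) - log pi_ref(y|x), summed over the response tokens.
    return ((policy_logps - ref_logps) * response_mask).sum()

# The reported number is the average of kl_estimate over the fixed test prompt set.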
