Attentively reading papers and deliberate practice.
Motivation: RLHF is complex and unstable, so DPO collapses reward-model (RM) training and the RL step into a single stage; preference data then leads directly to the final LM.
Key Insight: Leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies.
-
Supervised fine-tuning (SFT): fine-tune on a high-quality dataset of dialogue, instruction following, and summarization (on the order of ~1,000 examples?).
SFT: learning from the textbook; straightforward imitation. RLHF: learning by trying and getting feedback; this is what ultimately works.
-
Preference sampling and Reward learning
-
Assuming the human preference distribution can be written as $$\begin{equation} p^{*}\left(y_1 \succ y_2 \mid x\right) = \frac{\exp \left(r^{*}\left(x, y_1\right)\right)}{\exp \left(r^{*}\left(x, y_1\right)\right)+\exp \left(r^{*}\left(x, y_2\right)\right)} \end{equation}$$
-
The loss function used to train the RM is: $$\begin{equation} \mathcal{L}_R\left(r_{\phi}, \mathcal{D}\right) = -\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi\left(x, y_w\right)-r_\phi\left(x, y_l\right)\right)\right], \quad \sigma(x)=\frac{1}{1+e^{-x}} \end{equation}$$ Questions:
- Why is RM accuracy only around 64%?
- Why the log? Does it make the difference more pronounced, or just easier to differentiate?
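A minimal sketch of this pairwise RM loss (my own illustration, not code from the paper), assuming the reward model has already produced scalar scores for the preferred and dispreferred completions; the log is just the negative log-likelihood of the Bradley-Terry model, and `logsigmoid` keeps it numerically stable:
```
import torch.nn.functional as F

def rm_pairwise_loss(r_w, r_l):
    # r_w, r_l: scalar rewards r_phi(x, y_w), r_phi(x, y_l), shape (B,)
    # Bradley-Terry negative log-likelihood: -log sigmoid(r_w - r_l)
    return -F.logsigmoid(r_w - r_l).mean()
```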
-
-
Reinforcement-learning optimization.
- Optimization goal: $$\begin{equation} \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)} \left[r_\phi(x, y)\right] -\beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\right], \quad \mathbb{D}_{\mathrm{KL}} = \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[\log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right] \end{equation}$$ Without the $\mathbb{D}_{\mathrm{KL}}$ term, the optimum collapses to $\pi_\theta(\mathop{\arg \max}_{y} r_\phi(x,y) \mid x) = 1.$
- The sampled reward is not directly differentiable with respect to $\theta$, so standard RLHF maximizes, typically with PPO, the following reward:
$$\begin{equation}
r(x, y) = r_\phi(x, y)-\beta\left(\log \pi_\theta(y \mid x)-\log \pi_{\mathrm{ref}}(y \mid x)\right)
\end{equation}$$
Questions:
- Why is this one differentiable?
- Could we fine-tune with the parameters of certain layers frozen?
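A small sketch (my own; sequence-level rather than the per-token bookkeeping used in practice) of how this KL-penalized reward is assembled before being handed to PPO:
```
def kl_penalized_reward(r_phi, logp_theta, logp_ref, beta):
    # r_phi: reward-model score for (x, y), shape (B,)
    # logp_theta, logp_ref: sequence log-probs log pi_theta(y|x) and log pi_ref(y|x), shape (B,)
    # per-sample reward handed to PPO: r_phi - beta * (log pi_theta - log pi_ref)
    return r_phi - beta * (logp_theta - logp_ref)
```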
Motivation: Leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies.
-
Deriving the Objective
We derive Eq. 4 analogously to Eq. 3. We optimize the following objective:
$$\begin{equation} \max_{\pi} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi}[r(x, y)]-\beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\right] \end{equation}$$
under any reward function $r(x, y)$, reference model $\pi_{\text{ref}}$, and a general non-parametric policy class. We now have: $$\begin{aligned} \max_{\pi} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi} & \left[r(x, y)\right]-\beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\right] \\ & =\max_{\pi} \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi(y \mid x)}\left[r(x, y)-\beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right] \\ & =-\beta \min_{\pi} \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi(y \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}-\frac{1}{\beta} r(x, y)\right] \\ & =-\beta \min_{\pi} \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi(y \mid x)}\left[\log \frac{\pi(y \mid x)}{\frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right)}-\log Z(x)\right] \end{aligned}$$
where we have the partition function (an important concept in probability theory and statistical physics):
$$\begin{equation} Z(x)=\sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right) . \end{equation}$$
Note that the partition function is a function of only $x$ and the reference policy $\pi_{\text{ref}}$, but does not depend on the policy $\pi$. We can now define $$\begin{equation} \pi^{*}(y \mid x)=\frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right) \end{equation}$$
which is a valid probability distribution, as $\pi^{*}(y \mid x) \geq 0$ for all $y$ and $\sum_{y} \pi^{*}(y \mid x)=1$. Since
$Z(x)$ is not a function of $y$, we can then re-organize the final objective in Eq. 12 as: $$\begin{equation} \min_{\pi} \mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{E}_{y \sim \pi(y \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi^{*}(y \mid x)}\right]-\log Z(x)\right]= \min_{\pi} \mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{D}_{\mathrm{KL}}\left(\pi(y \mid x) \,\|\, \pi^{*}(y \mid x)\right)-\log Z(x)\right] \end{equation}$$
Now, since $Z(x)$ does not depend on $\pi$, the minimum is achieved by the policy that minimizes the first KL term. Gibbs' inequality tells us that the KL divergence is minimized at 0 if and only if the two distributions are identical (a well-known fact). Hence we have the optimal solution: $$\begin{equation} \pi(y \mid x)=\pi^{*}(y \mid x)=\frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right) \end{equation}$$
for all $x \in \mathcal{D}$. This completes the derivation. Motivation: rewrite the objective as a $\mathbb{D}_{\mathrm{KL}}$ term plus a term that is independent of $y$ (and hence of $\pi$).
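To make the closed form concrete, here is a toy numerical check (my own, with made-up numbers) over a discrete set of three candidate completions: build $\pi^{*}$ from $\pi_{\mathrm{ref}}$ and $r$ via the equation above, then verify the inverse mapping $r = \beta \log \frac{\pi^{*}}{\pi_{\mathrm{ref}}} + \beta \log Z$:
```
import torch

beta = 0.5
pi_ref = torch.tensor([0.5, 0.3, 0.2])   # reference policy over 3 candidate completions
r = torch.tensor([1.0, 2.0, 0.0])        # arbitrary rewards r(x, y)

unnorm = pi_ref * torch.exp(r / beta)    # pi_ref(y|x) * exp(r(x, y) / beta)
Z = unnorm.sum()                         # partition function Z(x)
pi_star = unnorm / Z                     # optimal policy, sums to 1

# inverse mapping recovers the reward exactly
r_back = beta * torch.log(pi_star / pi_ref) + beta * torch.log(Z)
assert torch.allclose(r_back, r, atol=1e-5)
```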
A.2 Deriving the DPO Objective Under the Bradley-Terry Model
It is straightforward to derive the DPO objective under the Bradley-Terry preference model, as we have
$$\begin{equation} p^{*}\left(y_{1} \succ y_{2} \mid x\right)=\frac{\exp \left(r^{*}\left(x, y_{1}\right)\right)}{\exp \left(r^{*}\left(x, y_{1}\right)\right)+\exp \left(r^{*}\left(x, y_{2}\right)\right)} \end{equation}$$
In Section 4 we showed that we can express the (unavailable) ground-truth reward through its corresponding optimal policy:
$$\begin{equation} r^{*}(x, y)=\beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}+\beta \log Z(x) \end{equation}$$
Substituting Eq. 17 into Eq. 16 we obtain:
$$\begin{aligned} p^{*}\left(y_{1} \succ y_{2} \mid x\right) & =\frac{\exp \left(\beta \log \frac{\pi^{*}\left(y_{1} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{1} \mid x\right)}+ \beta \log Z(x)\right)}{\exp \left(\beta \log \frac{\pi^{*}\left(y_{1} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{1} \mid x\right)}+\beta \log Z(x)\right)+\exp \left(\beta \log \frac{\pi^{*}\left(y_{2} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{2} \mid x\right)}+\beta \log Z(x)\right)} \\ & =\frac{1}{1+\exp \left(\beta \log \frac{\pi^{*}\left(y_{2} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{2} \mid x\right)}-\beta \log \frac{\pi^{*}\left(y_{1} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{1} \mid x\right)}\right)} \\ & =\sigma\left(\beta \log \frac{\pi^{*}\left(y_{1} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{1} \mid x\right)}-\beta \log \frac{\pi^{*}\left(y_{2} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{2} \mid x\right)}\right) . \end{aligned}$$
The last line gives the per-instance preference probability; its negative log-likelihood over the dataset is the loss to be minimized: $$\begin{equation} \mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right) = -\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}} \left[\log \sigma \left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} -\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right] \end{equation}$$
-
Update. In this section we derive the gradient of the DPO objective: $$\begin{equation} \nabla_{\theta} \mathcal{L}_{\mathrm{DPO}}\left(\pi_{\theta} ; \pi_{\text{ref}}\right)=-\nabla_{\theta} \mathbb{E}_{\left(x, y_{w}, y_{l}\right) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}\left(y_{w} \mid x\right)}{\pi_{\text{ref}}\left(y_{w} \mid x\right)}-\beta \log \frac{\pi_{\theta}\left(y_{l} \mid x\right)}{\pi_{\text{ref}}\left(y_{l} \mid x\right)}\right)\right] \end{equation}$$
We can rewrite the RHS of Equation 21 as $$\begin{equation} \nabla_{\theta} \mathcal{L}_{\mathrm{DPO}}\left(\pi_{\theta} ; \pi_{\mathrm{ref}}\right)=-\mathbb{E}_{\left(x, y_{w}, y_{l}\right) \sim \mathcal{D}}\left[\frac{\sigma^{\prime}(u)}{\sigma(u)} \nabla_{\theta} u\right], \end{equation}$$
where $u=\beta \log \frac{\pi_{\theta}\left(y_{w} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{w} \mid x\right)}-\beta \log \frac{\pi_{\theta}\left(y_{l} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{l} \mid x\right)}$. Using the properties of the sigmoid function
$\sigma^{\prime}(x)=\sigma(x)(1-\sigma(x))$ and $\sigma(-x)=1-\sigma(x)$, we obtain the final form $$\begin{aligned} & \nabla_\theta \mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right)= \\ & -\beta\, \mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\Big[\underbrace{\sigma\left(\hat{r}_\theta\left(x, y_l\right)-\hat{r}_\theta\left(x, y_w\right)\right)}_{\text{higher weight when reward estimate is wrong}}\big[\underbrace{\nabla_\theta \log \pi\left(y_w \mid x\right)}_{\text{increase likelihood of } y_w}-\underbrace{\nabla_\theta \log \pi\left(y_l \mid x\right)}_{\text{decrease likelihood of } y_l}\big]\Big], \end{aligned}$$ where $\hat{r}_\theta(x, y)=\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the reward implicitly defined by the language model $\pi_\theta$ and the reference model $\pi_{\mathrm{ref}}$.
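A quick autograd sanity check of this gradient on toy numbers (my own sketch, treating the two log-probabilities of $\pi_\theta$ as free parameters):
```
import torch
import torch.nn.functional as F

beta = 0.1
# index 0 -> log pi_theta(y_w|x), index 1 -> log pi_theta(y_l|x)
logp_theta = torch.tensor([-1.0, -2.0], requires_grad=True)
logp_ref = torch.tensor([-1.2, -1.8])

r_hat = beta * (logp_theta - logp_ref)        # implicit rewards r_hat(x, y_w), r_hat(x, y_l)
loss = -F.logsigmoid(r_hat[0] - r_hat[1])     # DPO loss for a single preference pair
loss.backward()

# analytic gradient: -beta * sigma(r_hat_l - r_hat_w) for y_w, +beta * sigma(...) for y_l
w = torch.sigmoid(r_hat[1] - r_hat[0]).detach()
expected = torch.stack([-beta * w, beta * w])
assert torch.allclose(logp_theta.grad, expected, atol=1e-6)
```
-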
DPO Outline
- Sample completions $y_1, y_2 \sim \pi_{\mathrm{ref}}(\cdot \mid x)$ for every prompt $x$, label them with human preferences to construct the offline dataset of preferences $\mathcal{D} = \left\{x^{(i)}, y_w^{(i)}, y_l^{(i)} \right\}_{i=1}^N$, and
- Optimize the language model $\pi_\theta$ to minimize $\mathcal{L}_{\mathrm{DPO}}$ for the given $\pi_{\text{ref}}$, $\mathcal{D}$, and desired $\beta$. In practice, one would like to reuse publicly available preference datasets rather than generating samples and gathering human preferences. Since the preference datasets are sampled using $\pi^{\mathrm{SFT}}$, we initialize $\pi_{\mathrm{ref}}=\pi^{\mathrm{SFT}}$ whenever available. However, when $\pi^{\mathrm{SFT}}$ is not available, we initialize $\pi_{\text{ref}}$ by maximizing the likelihood of preferred completions $\left(x, y_w\right)$, that is, $\pi_{\text{ref}}=\arg\max_{\pi} \mathbb{E}_{x, y_w \sim \mathcal{D}}\left[\log \pi\left(y_w \mid x\right)\right]$ (a minimal sketch follows this list). This procedure helps mitigate the distribution shift between the true reference distribution, which is unavailable, and the $\pi_{\mathrm{ref}}$ used by DPO.
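When $\pi^{\mathrm{SFT}}$ is unavailable, the $\pi_{\mathrm{ref}}$ initialization above is just maximum-likelihood training on the preferred completions. A minimal sketch (my own), assuming a Hugging Face-style causal LM that returns `.loss` when given `labels`, and a hypothetical `preferred_loader` yielding tokenized $(x, y_w)$ batches:
```
import torch

def init_ref_by_mle(model, preferred_loader, lr=1e-5, epochs=1):
    # maximize E[log pi(y_w | x)], i.e. minimize cross-entropy on preferred completions
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in preferred_loader:
            # input_ids: prompt + preferred completion; labels: same ids with prompt tokens set to -100
            out = model(input_ids=batch["input_ids"], labels=batch["labels"])
            opt.zero_grad()
            out.loss.backward()
            opt.step()
    return model
```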
Leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies.
- Analytical mapping: $$\begin{equation} r^{*}(x, y)=\beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}+\beta \log Z(x) \end{equation}$$ Every equivalence class of reward functions maps to exactly one policy, and the policy that is optimal for the RL objective under a given reward function is exactly the policy given by this mapping.
- The transformation of the loss function: using the mapping between reward and policy, we go from
$$\begin{equation} \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)} \left[r_\phi(x, y)\right] -\beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\right] \end{equation}$$ to
$$\begin{equation} -\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}} \left[\log \sigma \left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} -\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right] = \mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right) \end{equation}$$
-
Your LM is secretly a Reward Model
Def 1. Two reward functions $r(x, y)$ and $r^{\prime}(x, y)$ are equivalent iff $r(x, y)-r^{\prime}(x, y)=f(x)$ for some function $f$, i.e. $r(x,y_1) - r(x,y_2) = r^{\prime}(x,y_1) - r^{\prime}(x,y_2), \forall y_1,y_2.$
Lem 1. Under the PL, and in particular the BT, preference framework, two reward functions from the same class induce the same preference distribution.
Proof. We consider the general Plackett-Luce model (with the Bradley-Terry model as the special case for $K=2$) and denote the probability distribution over rankings induced by a particular reward function $r(x, y)$ as $p_{r}$: $$\begin{aligned} p_{r^{\prime}}\left(\tau \mid y_{1}, \ldots, y_{K}, x\right) & =\prod_{k=1}^{K} \frac{\exp \left(r^{\prime}\left(x, y_{\tau(k)}\right)\right)}{\sum_{j=k}^{K} \exp \left(r^{\prime}\left(x, y_{\tau(j)}\right)\right)} \\ & =\prod_{k=1}^{K} \frac{\exp \left(r\left(x, y_{\tau(k)}\right)+f(x)\right)}{\sum_{j=k}^{K} \exp \left(r\left(x, y_{\tau(j)}\right)+f(x)\right)} \\ & =\prod_{k=1}^{K} \frac{\exp (f(x)) \exp \left(r\left(x, y_{\tau(k)}\right)\right)}{\exp (f(x)) \sum_{j=k}^{K} \exp \left(r\left(x, y_{\tau(j)}\right)\right)} \\ & =\prod_{k=1}^{K} \frac{\exp \left(r\left(x, y_{\tau(k)}\right)\right)}{\sum_{j=k}^{K} \exp \left(r\left(x, y_{\tau(j)}\right)\right)} \\ & =p_{r}\left(\tau \mid y_{1}, \ldots, y_{K}, x\right). \end{aligned}$$
Lem 2. Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem.
Proof. Let us consider two reward functions from the same class, such that $r^{\prime}(x, y)=r(x, y)+f(x)$, and let us denote by $\pi_{r}$ and $\pi_{r^{\prime}}$ the corresponding optimal policies. By Eq. 4, for all $x, y$ we have $$\begin{aligned} \pi_{r^{\prime}}(y \mid x) & =\frac{1}{\sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r^{\prime}(x, y)\right)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r^{\prime}(x, y)\right) \\ & =\frac{1}{\sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta}(r(x, y)+f(x))\right)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta}(r(x, y)+f(x))\right) \\ & =\frac{1}{\exp \left(\frac{1}{\beta} f(x)\right) \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right) \exp \left(\frac{1}{\beta} f(x)\right) \\ & =\frac{1}{\sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right) \\ & =\pi_{r}(y \mid x). \end{aligned}$$
Thm 1. Under mild assumptions, all reward classes consistent with the PL (and BT in particular) models can be represented with the reparameterization
$r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ for some model $\pi(y \mid x)$ and a given reference model $\pi_{\mathrm{ref}}(y \mid x)$.
Proof. Consider any reward function $r(x, y)$, which induces an optimal model $\pi_{r}(y \mid x)$ under the KL-constrained RL problem, with the analytical solution given by Eq. 4. Following Eq. 5, when we log-linearize both sides we obtain: $$\begin{equation} r(x, y)=\beta \log \frac{\pi_{r}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}+\beta \log Z(x) \end{equation}$$
where $Z(x)=\sum_{y} \pi_{\text{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right)$ (notice that $Z(x)$ also depends on the reward function $r$). Using the operator $r^{\prime}(x, y)=f\left(r, \pi_{\text{ref}}, \beta\right)(x, y)=r(x, y)-\beta \log Z(x)$, we see that this new reward function is within the equivalence class of $r$, and we have: $$\begin{equation} r^{\prime}(x, y)=\beta \log \frac{\pi_{r}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \end{equation}$$
which completes the proof.
We can further expand on these results. We can see that if $r$ and $r^{\prime}$ are two reward functions in the same class, then $$\begin{equation} f\left(r, \pi_{\text{ref}}, \beta\right)(x, y)=\beta \log \frac{\pi_{r}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}=\beta \log \frac{\pi_{r^{\prime}}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}=f\left(r^{\prime}, \pi_{\text{ref}}, \beta\right)(x, y) \end{equation}$$
where the second equality follows from Lemma 2. We have proven that the operator $f$ maps all reward functions from a particular equivalence class to the same reward function. Next, we show that for every equivalence class of reward functions, the reparameterization outlined in Theorem 1 is unique.
Proposition 1. Assume we have a reference model such that $\pi_{\text{ref}}(y \mid x)>0$ for all pairs of prompts $x$ and answers $y$, and a parameter $\beta>0$. Then every equivalence class of reward functions, as defined in Section 5, has a unique reward function $r(x, y)$ which can be reparameterized as $r(x, y)=\beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ for some model $\pi(y \mid x)$.
Proof. We argue by contradiction, a classic and straightforward way to prove uniqueness. Assume we have two reward functions from the same class, such that $r^{\prime}(x, y)=r(x, y)+f(x)$. Moreover, assume that $r^{\prime}(x, y)=\beta \log \frac{\pi^{\prime}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ for some model $\pi^{\prime}(y \mid x)$ and $r(x, y)=\beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ for some model $\pi(y \mid x)$, such that $\pi \neq \pi^{\prime}$. We then have $r^{\prime}(x, y)=r(x, y)+f(x)=\beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}+f(x)=\beta \log \frac{\pi(y \mid x) \exp \left(\frac{1}{\beta} f(x)\right)}{\pi_{\text{ref}}(y \mid x)}=\beta \log \frac{\pi^{\prime}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ for all prompts $x$ and completions $y$. Then we must have $\pi(y \mid x) \exp \left(\frac{1}{\beta} f(x)\right)=\pi^{\prime}(y \mid x)$. Since these are distributions, summing over $y$ on both sides, we obtain $\exp \left(\frac{1}{\beta} f(x)\right)=1$, and since $\beta>0$, we must have $f(x)=0$ for all $x$. Therefore $r(x, y)=r^{\prime}(x, y)$ and $\pi(y \mid x)=\pi^{\prime}(y \mid x)$, contradicting the assumption $\pi \neq \pi^{\prime}$.
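A toy numeric check (my own, with made-up numbers) of Lemmas 1 and 2: shifting the reward by a prompt-only term $f(x)$ changes neither the Bradley-Terry preference probability nor the induced optimal policy:
```
import torch

beta, f_x = 0.5, 3.0
pi_ref = torch.tensor([0.5, 0.3, 0.2])
r = torch.tensor([1.0, 2.0, 0.0])
r_prime = r + f_x                        # same equivalence class: r'(x, y) = r(x, y) + f(x)

def bt_prob(rew):                        # Bradley-Terry p(y_0 > y_1 | x)
    return torch.sigmoid(rew[0] - rew[1])

def opt_policy(rew):                     # pi*(y|x) proportional to pi_ref(y|x) * exp(rew / beta)
    unnorm = pi_ref * torch.exp(rew / beta)
    return unnorm / unnorm.sum()

assert torch.allclose(bt_prob(r), bt_prob(r_prime))
assert torch.allclose(opt_policy(r), opt_policy(r_prime))
```
-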
Instability of Actor-Critic Algorithms
Code (Hugging Face blog)
DPO is relatively straightforward to implement; PyTorch code for the DPO loss is provided below:
```
import torch.nn.functional as F
def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta):
    """
    pi_logps: policy logprobs, shape (B,)
    ref_logps: reference model logprobs, shape (B,)
    yw_idxs: preferred completion indices in [0, B-1], shape (T,)
    yl_idxs: dispreferred completion indices in [0, B-1], shape (T,)
    beta: temperature controlling strength of KL penalty

    Each pair of (yw_idxs[i], yl_idxs[i]) represents the
    indices of a single preference pair.
    """
    # gather log-probs of the preferred / dispreferred completions
    pi_yw_logps, pi_yl_logps = pi_logps[yw_idxs], pi_logps[yl_idxs]
    ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs]

    # log-ratio margins for the policy and the reference model
    pi_logratios = pi_yw_logps - pi_yl_logps
    ref_logratios = ref_yw_logps - ref_yl_logps

    # DPO loss: -log sigmoid(beta * (pi_logratios - ref_logratios))
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    # implicit per-completion rewards, detached for logging only
    rewards = beta * (pi_logps - ref_logps).detach()

    return losses, rewards
```
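For instance, with toy tensors (B = 4 completion log-probs forming T = 2 preference pairs):
```
import torch

pi_logps = torch.tensor([-1.0, -2.0, -1.5, -0.5])    # policy log-probs for 4 completions
ref_logps = torch.tensor([-1.1, -1.9, -1.4, -0.6])   # reference log-probs for the same completions
yw_idxs = torch.tensor([0, 2])                        # preferred completions
yl_idxs = torch.tensor([1, 3])                        # dispreferred completions

losses, rewards = dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta=0.1)
print(losses.shape, rewards.shape)                    # torch.Size([2]) torch.Size([4])
```
The Hugging Face blog also shows how to turn a public preference dataset into prompt/chosen/rejected triples: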
```
from typing import Dict
from datasets import load_dataset

def return_prompt_and_responses(samples) -> Dict[str, str]:
    return {
        "prompt": [
            "Question: " + question + "\n\nAnswer: "
            for question in samples["question"]
        ],
        "chosen": samples["response_j"],   # rated better than k
        "rejected": samples["response_k"], # rated worse than j
    }

dataset = load_dataset(
    "lvwerra/stack-exchange-paired",
    split="train",
    data_dir="data/rl"
)
original_columns = dataset.column_names

dataset = dataset.map(
    return_prompt_and_responses,
    batched=True,
    remove_columns=original_columns
)
```
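The blog then hands the mapped dataset to `trl`'s `DPOTrainer`. The `DPOTrainer` signature has changed across `trl` versions, so treat the following as a rough, assumption-laden sketch of the blog-era usage rather than the current API; the model path is a placeholder:
```
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# placeholders: point these at your SFT checkpoint
model = AutoModelForCausalLM.from_pretrained("path/to/sft-model")      # policy pi_theta
model_ref = AutoModelForCausalLM.from_pretrained("path/to/sft-model")  # frozen reference pi_ref
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-model")

training_args = TrainingArguments(output_dir="dpo-output", per_device_train_batch_size=2)

dpo_trainer = DPOTrainer(
    model,
    model_ref,
    args=training_args,
    beta=0.1,                # DPO temperature / KL strength
    train_dataset=dataset,   # the prompt/chosen/rejected dataset built above
    tokenizer=tokenizer,
)
dpo_trainer.train()
```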
A.3 Deriving the DPO Objective Under the Plackett-Luce Model
Q: Is it more efficient to train using rankings (the Plackett-Luce objective) rather than pairwise comparisons?
The Plackett-Luce model is a generalization of the Bradley-Terry model over rankings (rather than just pair-wise comparisons). Similar to the Bradley-Terry model, it stipulates that when presented with a set of possible choices, people prefer a choice with probability proportional to the value of some latent reward function for that choice. In our context, when presented with a prompt $x$ and a set of $K$ answers $y_1, \ldots, y_K$, a user outputs a permutation $\tau: [K] \rightarrow [K]$ giving their ranking of the answers, and the model stipulates that
$$ p^{*}\left(\tau \mid y_{1}, \ldots, y_{K}, x\right)=\prod_{k=1}^{K} \frac{\exp \left(r^{*}\left(x, y_{\tau(k)}\right)\right)}{\sum_{j=k}^{K} \exp \left(r^{*}\left(x, y_{\tau(j)}\right)\right)} $$
Notice that when $K=2$, this reduces to the Bradley-Terry model. For the general Plackett-Luce model, we can still substitute the reward parameterized by its optimal policy (Eq. 5); as in Appendix A.2, the partition function $Z(x)$ cancels, and we are left with
$$ p^{*}\left(\tau \mid y_{1}, \ldots, y_{K}, x\right)=\prod_{k=1}^{K} \frac{\exp \left(\beta \log \frac{\pi^{*}\left(y_{\tau(k)} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{\tau(k)} \mid x\right)}\right)}{\sum_{j=k}^{K} \exp \left(\beta \log \frac{\pi^{*}\left(y_{\tau(j)} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{\tau(j)} \mid x\right)}\right)} $$
Similarly to the approach of Section 4, if we have access to a dataset of prompts and user-specified rankings, we can optimize a parameterized policy by maximum likelihood, minimizing:
$$ \begin{aligned} \mathcal{L}_{\mathrm{DPO}}\left(\pi_{\theta}, \pi_{\mathrm{ref}}\right)=-\mathbb{E}_{\tau, y_{1}, \ldots, y_{K}, x \sim \mathcal{D}}\left[\log \prod_{k=1}^{K} \frac{\exp \left(\beta \log \frac{\pi_{\theta}\left(y_{\tau(k)} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{\tau(k)} \mid x\right)}\right)}{\sum_{j=k}^{K} \exp \left(\beta \log \frac{\pi_{\theta}\left(y_{\tau(j)} \mid x\right)}{\pi_{\mathrm{ref}}\left(y_{\tau(j)} \mid x\right)}\right)}\right] \end{aligned} $$
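A sketch of this K-wise loss for a single ranking (my own, not the paper's code), assuming the completion log-probabilities are already permuted by $\tau$ so that index 0 is the most preferred; for $K=2$ it reduces to the pairwise $-\log\sigma(\cdot)$ loss above:
```
import torch

def pl_dpo_loss(pi_logps, ref_logps, beta):
    # pi_logps, ref_logps: log-probs of the K ranked completions under pi_theta / pi_ref,
    # shape (K,), ordered so that index 0 is the most preferred completion
    scores = beta * (pi_logps - ref_logps)           # beta * log(pi_theta / pi_ref) per completion
    nll = scores.new_zeros(())
    for k in range(scores.shape[0]):
        # negative log of exp(s_k) / sum_{j >= k} exp(s_j)
        nll = nll - (scores[k] - torch.logsumexp(scores[k:], dim=0))
    return nll
```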