Commit 51778ee — Update ArmoRM blog
Haoxiang-Wang committed May 29, 2024 · 1 parent 23d1188
Showing 1 changed file with 9 additions and 1 deletion.
The adjusted reward vector is denoted as $r'\in \mathbb{R}^k$.
Finally, we multiply the gating coefficients with the adjusted multi-objective rewards to obtain a scalar preference score $R$ for the response $y$ given the prompt $x$:

$$
R = g_\phi(f_\theta(x))^\top r'
$$
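As a minimal sketch, the scoring step above could look like the following. The `armorm_score` helper, the layer sizes ($d=8$, $k=3$), and the linear-plus-softmax gating head are illustrative assumptions for this example, not the exact ArmoRM implementation:

```python
import torch

def armorm_score(features, reward_vector, gating_layer):
    """Scalar preference score R = g_phi(f_theta(x))^T r'.

    features:      f_theta(x), the backbone embedding of the prompt, shape (d,)
    reward_vector: r', the adjusted multi-objective rewards, shape (k,)
    gating_layer:  g_phi, mapping features to k gating coefficients
    """
    coeffs = gating_layer(features)          # shape (k,)
    return torch.dot(coeffs, reward_vector)  # scalar R

# Hypothetical sizes for illustration: d = 8 feature dims, k = 3 objectives.
d, k = 8, 3
gating = torch.nn.Sequential(torch.nn.Linear(d, k), torch.nn.Softmax(dim=-1))
x_feat = torch.randn(d)   # stands in for a frozen-backbone feature f_theta(x)
r_adj = torch.randn(k)    # stands in for the adjusted reward vector r'
R = armorm_score(x_feat, r_adj, gating)
```

The softmax keeps the gating coefficients non-negative and summing to one, so $R$ is a convex combination of the per-objective rewards.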

To train the gating layer, we freeze the parameters of the backbone and the regression layer and optimize only the gating parameters $\phi$ with the Bradley-Terry loss,

$$
\min_\phi \mathbb{E} \left[ -\log \frac{\exp(R_{\mathrm{chosen}})}{\exp(R_{\mathrm{chosen}})+\exp(R_{\mathrm{rejected}})} \right]
$$

where $R_{\mathrm{chosen}}$ and $R_{\mathrm{rejected}}$ are the preference scores for the chosen and rejected responses in each pairwise example, $(x, y_{\mathrm{chosen}}, y_{\mathrm{rejected}})$.
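This objective equals $-\log\sigma(R_{\mathrm{chosen}} - R_{\mathrm{rejected}})$, which suggests a gating-only training step like the sketch below. The batch size, layer shapes, random toy data, and the `bt_loss` helper are all illustrative assumptions; in the actual setup the features and adjusted rewards $r'$ come from the frozen stage-1 model:

```python
import torch

def bt_loss(R_chosen, R_rejected):
    """Bradley-Terry loss: -log sigmoid(R_chosen - R_rejected), averaged."""
    return -torch.nn.functional.logsigmoid(R_chosen - R_rejected).mean()

# Hypothetical sizes: d = 8 feature dims, k = 3 objectives, batch of 16 pairs.
d, k = 8, 3
gating = torch.nn.Sequential(torch.nn.Linear(d, k), torch.nn.Softmax(dim=-1))
opt = torch.optim.AdamW(gating.parameters(), lr=1e-3)

# Toy batch: frozen-backbone prompt features and adjusted reward vectors r'
# for the chosen / rejected responses (random stand-in data).
feats = torch.randn(16, d)
r_chosen, r_rejected = torch.randn(16, k), torch.randn(16, k)

coeffs = gating(feats)                 # (16, k), rows sum to 1
R_c = (coeffs * r_chosen).sum(-1)      # scalar score per chosen response
R_r = (coeffs * r_rejected).sum(-1)    # scalar score per rejected response
loss = bt_loss(R_c, R_r)

opt.zero_grad()
loss.backward()                        # gradients flow only into the gating layer
opt.step()
```

Because the backbone and regression layer are frozen, only `gating.parameters()` are registered with the optimizer, matching the training recipe described above.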

### Implementation of ArmoRM-MoE

The gating layer is trained on top of the ArmoRM obtained from stage-1. Here we provide implementation details: