Free Process Rewards without Process Labels
- [2024/12/02] Our paper is out! We release our implicit PRMs trained with DPO and CE respectively, the best PRMs trained from Llama-3.1-Instruct to date. We also open-source the corresponding training dataset: response-level rollouts to UltraInteract instructions, sampled from Llama-3.1-8B-Instruct.
Training a PRM with conventional approaches requires labels annotated at every intermediate step, which poses significant challenges for both manual and automatic data collection.
In contrast, we show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. The only assumption is to parameterize the outcome reward as the log-likelihood ratio of the policy and reference models.
Notably, this conclusion still holds when the ORM is trained with different objectives, including plain cross-entropy on unpaired data.
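Concretely, the parameterization can be sketched as follows, where $\pi_\theta$ is the model being trained, $\pi_{\text{ref}}$ is the reference model, and $\beta$ is a reward scaling coefficient; the process reward at step $t$ then falls out for free as a partial sum of token-level log-likelihood ratios (a rough sketch of the formulation, see the paper for the exact derivation):

```math
r_\theta(\mathbf{y}) := \beta \log \frac{\pi_\theta(\mathbf{y})}{\pi_{\text{ref}}(\mathbf{y})},
\qquad
q_\theta^{t} := \sum_{i=1}^{t} \beta \log \frac{\pi_\theta(y_i \mid \mathbf{y}_{<i})}{\pi_{\text{ref}}(y_i \mid \mathbf{y}_{<i})}
```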
We evaluate our implicit PRM with best-of-N sampling. We construct the test sets by sampling responses to MATH500 problems with three different generation models.
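For illustration, below is a minimal sketch of how candidates can be scored and selected with the implicit PRM, assuming per-token log-probabilities under the trained model and the reference model are already available; the `beta` value and the `min` aggregation are placeholders rather than the exact settings of our evaluation code:

```python
# Minimal sketch (not our exact evaluation code): score each candidate response
# with the implicit PRM and keep the best one. The step-wise process rewards are
# cumulative sums of token-level log-likelihood ratios between the trained model
# and the reference model.
import numpy as np

BETA = 0.05  # reward scaling coefficient; placeholder value for illustration

def process_rewards(policy_logprobs, ref_logprobs, beta=BETA):
    """Implicit process reward q_t at every token t of a response."""
    ratios = np.asarray(policy_logprobs) - np.asarray(ref_logprobs)
    return beta * np.cumsum(ratios)

def response_score(policy_logprobs, ref_logprobs, agg=min):
    """Aggregate step rewards into one response-level score (min is one common choice)."""
    return agg(process_rewards(policy_logprobs, ref_logprobs))

def best_of_n(candidates):
    """candidates: list of (response_text, policy_logprobs, ref_logprobs) tuples."""
    return max(candidates, key=lambda c: response_score(c[1], c[2]))[0]
```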
We implement Math-Shepherd and AutoPSV and train them on our dataset. We also compare against six off-the-shelf ORMs and PRMs, including the (previous) SOTA PRMs of the Llama-3.1 class, RLHFlow/Llama3.1-8B-PRM-Mistral-Data and RLHFlow/Llama3.1-8B-PRM-Deepseek-Data.
We instantiate our proposition with various reward modeling objectives, including DPO, NCA, KTO, and cross-entropy (CE). In particular, the CE variant only requires a response-level label for each individual response and therefore does not depend on paired data.
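As a rough illustration, with a response-level label $l \in \{0, 1\}$ the CE variant reduces to binary cross-entropy on the implicit outcome reward $r_\theta(\mathbf{y}) = \beta \log \frac{\pi_\theta(\mathbf{y})}{\pi_{\text{ref}}(\mathbf{y})}$ (see the paper for the exact form of each objective):

```math
\mathcal{L}_{\text{CE}} = -\Big[\, l \cdot \log \sigma\big(r_\theta(\mathbf{y})\big) + (1 - l) \cdot \log \big(1 - \sigma(r_\theta(\mathbf{y}))\big) \Big]
```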
It is noteworthy that our implicit PRM (DPO) achieves the best overall performance, surpassing the previous SOTA of this backbone. Our implicit PRM (CE) also outperforms all baselines except RLHFlow-8B-Mistral-Data and RLHFlow-8B-DS-Data, which indicates its potential for real-world applications where pairwise data is hard to collect.
KTO and CE gain the most from integrating the PRM with majority voting: both fail to surpass majority voting alone but outperform it through weighted best-of-N. It is also noteworthy that the CE loss becomes the most effective when augmented with majority voting, once again demonstrating its potential.
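Below is a minimal sketch of weighted best-of-N, i.e., majority voting weighted by reward; `extract_answer` is a hypothetical helper that parses the final answer out of a response, and mapping scores through a sigmoid is an illustrative choice rather than our exact setup:

```python
# Sketch of weighted best-of-N: majority voting where each vote is weighted by the
# reward its response receives from the implicit PRM.
import math
from collections import defaultdict

def weighted_best_of_n(responses, scores, extract_answer):
    """responses: candidate solutions; scores: response-level rewards;
    extract_answer: hypothetical helper that parses the final answer from a response."""
    votes = defaultdict(float)
    for resp, score in zip(responses, scores):
        # Map the (possibly negative) reward to a positive weight; the sigmoid here
        # is an illustrative choice, not necessarily the exact weighting we use.
        votes[extract_answer(resp)] += 1.0 / (1.0 + math.exp(-score))
    return max(votes, key=votes.get)
```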
- Scaling up both instructions and responses consistently improves the performance of our implicit PRM.
- Compared to instructions, scaling up responses appears more influential for implicit PRMs, as reflected by the larger performance variation between the minimum and maximum data setups.
- DPO requires more data than CE to reach decent performance. DPO is under-trained with two responses per instruction, which can be partly attributed to an insufficient number of usable instructions: two responses may not form a valid pair, so many instructions cannot be used to train our DPO variant. In contrast, CE generally performs better with insufficient data and consistently improves different generation models, even when trained with one response per instruction without pairs, the extreme case of the unpaired setup (see the sketch after this list). This is a huge advantage in real-world data-scarcity scenarios.
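The sketch below illustrates this point with a hypothetical data layout: DPO needs at least one correct and one incorrect response per instruction to form a pair, whereas CE can consume every labeled response individually:

```python
# Hypothetical data layout: for one instruction, `responses` is a list of
# (response_text, label) tuples with label 1 for correct and 0 for incorrect.

def build_dpo_pairs(responses):
    """DPO needs a (chosen, rejected) pair; none exists if all labels agree."""
    chosen = [r for r, l in responses if l == 1]
    rejected = [r for r, l in responses if l == 0]
    return [(c, r) for c in chosen for r in rejected]

def build_ce_examples(responses):
    """CE consumes every (response, label) individually, even without any pair."""
    return list(responses)

# With two responses that are both correct (or both incorrect), build_dpo_pairs
# returns an empty list, so the instruction is wasted for DPO but not for CE.
```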
If you find our model, data, or evaluation code useful, please kindly cite our paper:
```bibtex
@misc{yuan2024implicitprm,
      title={Free Process Rewards without Process Labels},
      author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
      year={2024},
      eprint={2412.01981},
      url={https://arxiv.org/abs/2412.01981},
}
```