Negative reward when serving ArmoRM-Llama3-8B-v0.1 #23
Comments
Could you try the serving example in https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1?
You may also want to check the Hugging Face documentation on the difference between `AutoModel` and the task-specific model classes (e.g., `AutoModelForSequenceClassification`): the generic class does not load the custom reward head.
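For reference, here is a minimal sketch of loading the model with the task-specific class, patterned after the usage example on the model card. The exact output attributes (`score`, `rewards`) are assumed from that card and are provided by the model's custom code, not by the base `transformers` API.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "RLHFlow/ArmoRM-Llama3-8B-v0.1"

# The multi-objective reward head only loads via the sequence-classification
# class with trust_remote_code=True; plain AutoModel would drop it.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model(input_ids)

# Assumption from the model card: `output.score` is the gating-network
# preference score used as the scalar reward; `output.rewards` holds the
# per-objective values.
print(output.score.float().item())
```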
Thanks a lot! By the way, could you please provide an example that runs inference on a batch of inputs and takes an attention mask as input? The example you provided on Hugging Face only covers inference on a single input.
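In the meantime, one way to batch inputs with padding and an explicit attention mask is sketched below. This is an assumption-laden sketch, not an official example: it assumes the model's custom forward accepts `attention_mask` and uses it to locate each sequence's final token, and it reuses the `model` and `tokenizer` objects loaded as above.

```python
# Reward models often have no pad token defined; reuse EOS for padding (assumption).
tokenizer.pad_token = tokenizer.eos_token

conversations = [
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
    [{"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "4"}],
]

# Render each conversation with the chat template, then tokenize as a padded
# batch so we get an attention_mask alongside input_ids. add_special_tokens=False
# avoids a duplicate BOS, since the template already inserts one (assumption).
texts = [tokenizer.apply_chat_template(c, tokenize=False) for c in conversations]
batch = tokenizer(
    texts, return_tensors="pt", padding=True, add_special_tokens=False
).to(model.device)

with torch.no_grad():
    output = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    )

# One preference score per conversation, assuming the custom head exposes `score`.
print(output.score.float().tolist())
```

If the custom forward instead always reads the last position of each (right-padded) sequence, left padding (`tokenizer.padding_side = "left"`) may be needed so that the final token of every sequence is a real token rather than padding.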
Hello! When I serve ArmoRM-Llama3-8B-v0.1 using OpenRLHF, the output rewards are almost all negative (around -2.0). I've attached some pictures of how I serve the reward model. Is the output of this RM naturally around -2.0, or is the way I serve the RM wrong? (The prompt dataset is also from RLHFlow, e.g. "RLHFlow/iterative-prompt-v1-iter7-20K", and the responses are generated with "RLHFlow/LLaMA3-iterative-DPO-final". We also apply the chat template when creating the prompt-response dataset.)