
Negative reward when serving ArmoRM-Llama3-8B-v0.1 #23

Open
maoliyuan opened this issue Aug 29, 2024 · 4 comments

Comments

@maoliyuan

Hello! When I serve ArmoRM-Llama3-8B-v0.1 using OpenRLHF, the output rewards are almost all negative (around -2.0). I've attached some screenshots of how I serve the reward model. Is the output of this RM naturally around -2.0, or is the way I serve the RM wrong? (The prompt dataset is also from RLHFlow, e.g. "RLHFlow/iterative-prompt-v1-iter7-20K", and the responses are generated by "RLHFlow/LLaMA3-iterative-DPO-final". We also apply the chat template when creating the prompt-response dataset.)

[Screenshots: serve-armo-reward-model, serve-armo-reward-model1, serve-armo-reward-model2]

@WeiXiongUST
Contributor

Could you try the usage example on the model card: https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1
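(For reference, the model card's snippet is roughly along the lines below; the `output.score` / `output.rewards` attributes come from the repo's custom modeling code, so treat this as a sketch and check the model card for the authoritative version.)

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
device = "cuda"

# The repo ships custom modeling code, so trust_remote_code=True is required
# to get the ArmoRM reward/gating heads instead of a vanilla classifier head.
model = AutoModelForSequenceClassification.from_pretrained(
    path, device_map=device, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

with torch.no_grad():
    output = model(input_ids)
    preference_score = output.score.cpu().float()     # scalar preference score
    multi_obj_rewards = output.rewards.cpu().float()  # per-objective rewards
print(preference_score)
```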

@maoliyuan
Author

Perhaps I've found the reason... I built the model with AutoModel.from_pretrained rather than AutoModelForSequenceClassification.from_pretrained. When I try the example you gave, the model outputs something like this:

[Screenshot: armo-rm-custom-output]

It's a correct output and has everything I want. However, when I build the model from AutoModel.from_pretrained, the output becomes something like this:

[Screenshot: armo-rm-auto-output]

Could you please explain the reason behind this? Thanks a lot.

@WeiXiongUST
Contributor

You may want to check the Hugging Face documentation on the difference between AutoModel and the task-specific Auto classes (e.g. AutoModelForSequenceClassification).
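(Roughly, and assuming the repo registers its custom class for the sequence-classification entry point: AutoModel resolves to the plain backbone and returns hidden states, so the ArmoRM reward and gating heads are never built, while AutoModelForSequenceClassification with trust_remote_code=True loads the custom class that adds those heads and returns a score. A sketch:)

```python
import torch
from transformers import AutoModel, AutoModelForSequenceClassification

path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"

# Backbone only: forward returns hidden states, not a reward. Head weights in the
# checkpoint that don't match the backbone are typically skipped with a warning.
backbone = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16)

# Custom ArmoRM class: backbone + reward/gating heads, returns a per-sequence score.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    path, trust_remote_code=True, torch_dtype=torch.bfloat16
)
```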

@maoliyuan
Author

Thanks a lot! By the way, could you please provide an example that runs inference on a batch of inputs and takes an attention mask as input? The example you provided on Hugging Face only covers inference for a single input.
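(For reference, a minimal sketch of what batched scoring might look like, assuming the custom ArmoRM forward accepts attention_mask and handles padded batches the way a standard sequence-classification head does; verify against the repo's modeling code before relying on it. The conversations below are hypothetical examples.)

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
device = "cuda"

model = AutoModelForSequenceClassification.from_pretrained(
    path, device_map=device, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS for padding

# Hypothetical batch of prompt/response pairs.
conversations = [
    [{"role": "user", "content": "Say hi."}, {"role": "assistant", "content": "Hi there!"}],
    [{"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "4."}],
]

# Render each conversation with the chat template, then tokenize as one padded
# batch so we get both input_ids and attention_mask.
texts = [tokenizer.apply_chat_template(c, tokenize=False) for c in conversations]
batch = tokenizer(texts, padding=True, return_tensors="pt").to(device)

with torch.no_grad():
    output = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    scores = output.score.float().cpu()  # one preference score per sequence
print(scores)
```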
