
Evaluate QRM reward models #195

Merged
8 commits merged into allenai:main on Oct 11, 2024

Conversation

Nicolinho
Contributor

Hi, could you please evaluate the QRM reward model: https://huggingface.co/nicolinho/QRM-Llama3.1-8B
I had to add an argument to the script so that no model kwargs are passed to the model_builder, since they otherwise interfere with the model's datatypes.
You can run the evaluation with the following command:

```bash
export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2 --no_model_kwargs
# {'Chat': 0.946927374301676, 'Chat Hard': 0.8991228070175439, 'Safety': 0.922972972972973, 'Reasoning': 0.9578115621760245}
```
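For context, here is a minimal sketch of how such a flag could gate the kwargs handed to the model builder. The flag name comes from the command above; everything else is an illustrative assumption, not the actual run_rm.py code.

```python
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument(
    "--no_model_kwargs",
    action="store_true",
    help="pass no kwargs to the model builder so remote code controls its own dtypes",
)
args, _ = parser.parse_known_args()

if args.no_model_kwargs:
    # Let the trust_remote_code model manage dtype and device placement itself.
    model_kwargs = {}
else:
    # Default path: force a dtype/device, which can clash with custom heads.
    model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}

# model = model_builder(model_name, **model_kwargs, trust_remote_code=True)
```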

Thank you!

@natolambert
Collaborator

Hey @Nicolinho, which specific arg is causing the issue? I was wondering if we could handle this in a more general way, e.g. by adding a model config to rewardbench/models/__init__.py?

Also, I'll have some comments on the Skywork dataset soon; it seems like there is some contamination.

@Nicolinho
Contributor Author

@natolambert Both torch_dtype and device_map caused problems for me.
I updated the PR to load the model manually via rewardbench/models/__init__.py.
You should be able to run it with:

```bash
export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2
```
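For reference, the kind of per-model entry this points at might look like the sketch below. The key names are assumptions about the config schema in rewardbench/models/__init__.py, not the repo's actual code.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REWARD_MODEL_CONFIG = {
    "nicolinho/QRM-Llama3.1-8B": {
        # trust_remote_code lets the custom quantile-regression architecture
        # (and its dtype handling) come from the model repo itself.
        "model_builder": AutoModelForSequenceClassification.from_pretrained,
        "tokenizer_builder": AutoTokenizer.from_pretrained,
        # No forced torch_dtype / device_map kwargs for this model.
        "model_kwargs": {},
    },
}
```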

@natolambert
Collaborator

natolambert commented Oct 2, 2024

@Nicolinho have you tried other models too? I'm just trying to understand the device-map issue on your setup. I do know that better multi-GPU handling would help.

Second, if the other code is no longer needed, can you remove it?

Third, can you run `make style` and `make quality`?

@Nicolinho
Contributor Author

  1. I did not try other models.
  2. I removed the code that is no longer needed.
  3. I updated the code style and quality.

To evaluate the model trained with the Skywork dataset, you can run:

```bash
export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2 --no_model_kwargs
# {'Chat': 0.946927374301676, 'Chat Hard': 0.8991228070175439, 'Safety': 0.922972972972973, 'Reasoning': 0.9578115621760245}
```

To evaluate the model trained without the Skywork dataset, using Llama 3 as the base model, you can run:

```bash
export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2
# {'Chat': 0.9581005586592178, 'Chat Hard': 0.8048245614035088, 'Safety': 0.8986486486486487, 'Reasoning': 0.9753028318873792}
```

@natolambert
Collaborator

Thanks @Nicolinho! Looks good; we should be able to merge this shortly :)

@natolambert
Collaborator

@Nicolinho do I need ACCELERATE_MIXED_PRECISION=bf16? I don't like one-off ways to run models. I'll try setting the datatype to bfloat16 instead.

natolambert merged commit ce6c89f into allenai:main on Oct 11, 2024.
3 checks passed
@Nicolinho
Contributor Author

@natolambert The ACCELERATE_MIXED_PRECISION=bf16 setting is needed because the quantile regression head is trained in fp32; casting everything to bfloat16 degrades the performance somewhat.
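To make the distinction concrete, here is a hedged sketch (assuming the model loads via AutoModelForSequenceClassification with trust_remote_code, as in the commands above):

```python
import torch
from transformers import AutoModelForSequenceClassification

name = "nicolinho/QRM-Llama3.1-8B"

# Mixed precision (what ACCELERATE_MIXED_PRECISION=bf16 effectively gives):
# weights stay fp32, including the quantile regression head, and only the
# forward computation is autocast to bf16.
model = AutoModelForSequenceClassification.from_pretrained(
    name, trust_remote_code=True
)
# with torch.autocast("cuda", dtype=torch.bfloat16):
#     rewards = model(**inputs).logits

# A blanket cast instead downcasts every weight, including the fp32-trained
# head, which is what degrades the scores:
# model = AutoModelForSequenceClassification.from_pretrained(
#     name, torch_dtype=torch.bfloat16, trust_remote_code=True
# )
```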
