
Evaluate QRM reward models #195

Merged
8 commits merged into allenai:main on Oct 11, 2024

Conversation

Nicolinho
Contributor

Hi, could you please evaluate the QRM reward model: https://huggingface.co/nicolinho/QRM-Llama3.1-8B
I had to add an argument to the script so that no model kwargs are passed to the model_builder, since they otherwise interfere with the model's datatypes.
You can run the evaluation with the following command:

```bash
export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2 --no_model_kwargs
# {'Chat': 0.946927374301676, 'Chat Hard': 0.8991228070175439, 'Safety': 0.922972972972973, 'Reasoning': 0.9578115621760245}
```
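For context, here is a minimal sketch of how such a flag could gate the kwargs handed to the model builder. The flag name comes from the command above; everything else is an illustrative assumption, not the actual run_rm.py code.

```python
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument(
    "--no_model_kwargs",
    action="store_true",
    help="pass no kwargs to the model builder so remote code controls its own dtypes",
)
args, _ = parser.parse_known_args()

if args.no_model_kwargs:
    # Let the trust_remote_code model manage dtype and device placement itself.
    model_kwargs = {}
else:
    # Default path: force a dtype/device, which can clash with custom heads.
    model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}

# model = model_builder(model_name, **model_kwargs, trust_remote_code=True)
```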

Thank you!

@natolambert
Collaborator

Hey @Nicolinho, which specific arg is causing the issue? I was wondering if we could handle this in a more general way, e.g. by adding a model config to rewardbench/models/__init__.py?

Also, I'll have some comments on the Skywork dataset soon; it seems like there is some contamination.

@Nicolinho
Contributor Author

@natolambert Both torch_dtype and device_map caused problems for me.
I updated the PR to load the model manually via rewardbench/models/__init__.py.
You should be able to run it with:

```bash
export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2
```
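For reference, the kind of per-model entry this points at might look like the sketch below. The key names are assumptions about the config schema in rewardbench/models/__init__.py, not the repo's actual code.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REWARD_MODEL_CONFIG = {
    "nicolinho/QRM-Llama3.1-8B": {
        # trust_remote_code lets the custom quantile-regression architecture
        # (and its dtype handling) come from the model repo itself.
        "model_builder": AutoModelForSequenceClassification.from_pretrained,
        "tokenizer_builder": AutoTokenizer.from_pretrained,
        # No forced torch_dtype / device_map kwargs for this model.
        "model_kwargs": {},
    },
}
```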

@natolambert
Collaborator

natolambert commented Oct 2, 2024

@Nicolinho have you tried other models too? I'm just trying to understand the device-map issue on your setup. I do know that better multi-GPU handling would help.

Second, if the other code is no longer needed, can you remove it?

Third, can you run `make style` and `make quality`?

@Nicolinho
Contributor Author

  1. I did not try other models.
  2. I removed the code that is no longer needed.
  3. I updated the code style and quality.

To evaluate the model trained with the Skywork dataset, you can run:

```bash
export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2 --no_model_kwargs
# {'Chat': 0.946927374301676, 'Chat Hard': 0.8991228070175439, 'Safety': 0.922972972972973, 'Reasoning': 0.9578115621760245}
```

To evaluate the model trained without the Skywork dataset, using Llama 3 as the base model, you can run:

```bash
export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2
# {'Chat': 0.9581005586592178, 'Chat Hard': 0.8048245614035088, 'Safety': 0.8986486486486487, 'Reasoning': 0.9753028318873792}
```

@natolambert
Collaborator

Thanks @Nicolinho! Looks good; we should be able to merge this shortly :)

@natolambert
Collaborator

@Nicolinho do I need ACCELERATE_MIXED_PRECISION=bf16? I don't like one-off ways to run models. I'll try setting the datatype to bfloat16 instead.

natolambert merged commit ce6c89f into allenai:main on Oct 11, 2024.
3 checks passed
@Nicolinho
Contributor Author

@natolambert The ACCELERATE_MIXED_PRECISION=bf16 setting is needed because the quantile regression head is trained in fp32; casting everything to bfloat16 degrades the performance somewhat.
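To make the distinction concrete, here is a hedged sketch (assuming the model loads via AutoModelForSequenceClassification with trust_remote_code, as in the commands above):

```python
import torch
from transformers import AutoModelForSequenceClassification

name = "nicolinho/QRM-Llama3.1-8B"

# Mixed precision (what ACCELERATE_MIXED_PRECISION=bf16 effectively gives):
# weights stay fp32, including the quantile regression head, and only the
# forward computation is autocast to bf16.
model = AutoModelForSequenceClassification.from_pretrained(
    name, trust_remote_code=True
)
# with torch.autocast("cuda", dtype=torch.bfloat16):
#     rewards = model(**inputs).logits

# A blanket cast instead downcasts every weight, including the fp32-trained
# head, which is what degrades the scores:
# model = AutoModelForSequenceClassification.from_pretrained(
#     name, torch_dtype=torch.bfloat16, trust_remote_code=True
# )
```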
