update global batch size in eval model compatible to ring-attn-size #590

ShomyLiu · 2024-12-16T11:35:05Z

No description provided.

hijkzzz · 2024-12-16T11:41:41Z

I have a question: Does train_batch_size affect DeepSpeed in eval mode?

ShomyLiu · 2024-12-16T11:52:14Z

For example, in train_dpo.py:
https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/cli/train_dpo.py#L39-L46
the ds_config used to initialize ref_model comes from the modified get_eval_ds_config function. This config can affect model initialization in DeepSpeed, especially when using ZeRO-3 optimization
(https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/models/actor.py#L57).

If the batch size in that config is not set properly, DeepSpeed raises this assertion error:

AssertionError: Gradient accumulation steps:0 has to be greater than 0
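
A rough sketch of why the step count can end up at 0, assuming DeepSpeed derives gradient accumulation steps by integer-dividing train_batch_size by the micro batch size and then by the world size. The function names and the numbers below are hypothetical and only for illustration. Multiplying the global batch size by ring_attn_size is my reading of this change, not code copied from the repository.

```python
def infer_grad_acc_steps(train_batch_size, micro_batch_per_gpu, world_size):
    """Approximate the gradient accumulation steps DeepSpeed would infer
    when only train_batch_size and the micro batch size are given."""
    return train_batch_size // micro_batch_per_gpu // world_size


def check_batch_config(train_batch_size, micro_batch_per_gpu, world_size):
    """Mimic the failing check; the message mirrors the error quoted above."""
    steps = infer_grad_acc_steps(train_batch_size, micro_batch_per_gpu, world_size)
    assert steps > 0, f"Gradient accumulation steps:{steps} has to be greater than 0"
    return steps


# Illustrative values (assumptions, not taken from the PR).
world_size = 16        # total GPUs
ring_attn_size = 2     # GPUs that share one sequence
train_batch_size = 8   # global batch size counted in sequences
micro_batch = 1

# Unscaled: 8 // 1 // 16 == 0, which would trip the assertion.
unscaled = infer_grad_acc_steps(train_batch_size, micro_batch, world_size)

# Scaled by ring_attn_size: 16 // 1 // 16 == 1, which passes.
scaled = check_batch_config(train_batch_size * ring_attn_size, micro_batch, world_size)
```

With ring attention, each group of ring_attn_size GPUs processes the same sequences, so a global batch size counted in sequences can be smaller than micro batch size times world size unless it is scaled up.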

update global batch size in eval model compatible to ring-attn-size

e244fd4

hijkzzz merged commit 8317ca9 into OpenRLHF:main Dec 16, 2024
