
deepspeed-chat: calculate loss in fp32 when using bf16 #754

Merged 1 commit into microsoft:master on Oct 4, 2023

Conversation

mosheisland (Contributor)

Computing the loss in fp32 improved accuracy for bf16 training in all 3 stages. By default, all 3 stages calculate the loss in fp32 when using bf16. This can be disabled with --no_bf16_to_fp32_loss.

While at it, fix stage2 reward model creation: pass zero_stage to create_critic_model.

Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405
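
For context, a minimal sketch of the technique, assuming a standard PyTorch causal-LM cross-entropy loss; the function name and flag wiring are illustrative, not the actual DeepSpeed-Chat code:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor,
                   compute_fp32_loss: bool = True) -> torch.Tensor:
    """Next-token cross-entropy loss.

    When the model runs in bf16 (or fp16), `logits` arrive in that dtype.
    Upcasting to fp32 before the softmax/cross-entropy keeps the loss
    computation in full precision, which is the idea behind this PR.
    """
    # Shift so that tokens < n predict token n.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    if compute_fp32_loss:
        # Upcast only the loss inputs; the model weights stay in bf16.
        shift_logits = shift_logits.float()

    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```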

@tjruwase (Contributor) commented Oct 3, 2023

@misland-habana, this is great! But I am curious, is there any reason that this cannot be enabled for fp16 as well (even fp8)? I think other dtypes could benefit from fp32 loss computation.

@mosheisland (Contributor, Author)

@tjruwase, I enabled Bloom with bf16 and did not test the impact of using fp32 loss for fp16; I only tested it for bf16. I would guess that fp16 would also benefit (though maybe less than bf16, since fp16 has more mantissa bits). As for fp8, I agree that it may help.

@tjruwase (Contributor) commented Oct 3, 2023

@mosheisland, in that case, do you mind changing the CLI flag to something like --compute_fp32_loss and enabling it for other low-precision dtypes? For now, I think it would be sufficient to remove all the bf16 guards.
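
For illustration, a hedged sketch of what the renamed, dtype-agnostic flag might look like in an argparse setup (the actual wiring in the DeepSpeed-Chat training scripts may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# Dtype-agnostic flag: applies whenever training runs in a low-precision
# dtype (bf16, fp16, ...), with no bf16-only guard around it.
parser.add_argument(
    "--compute_fp32_loss",
    action="store_true",
    help="Compute the loss in fp32 regardless of the model dtype.",
)
args = parser.parse_args()
```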

@tjruwase (Contributor) commented Oct 3, 2023

@lekurile, @awan-10 FYI

@mosheisland (Contributor, Author)

@tjruwase - sure, I will do it.

Computing the loss in fp32 can improve training accuracy for all 3 stages.
This was tested with the Bloom model using the bf16 dtype.

While at it, fix stage2 reward model creation: pass zero_stage to
create_critic_model.

Also, in stage3, when using bf16 with tensorboard enabled, we record the actor
and critic loss. Tensorboard accepts a scalar bf16 loss tensor and converts
it to numpy, which fails since numpy does not support the bf16 dtype.
Fix it by logging loss.item() to tensorboard.

Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405
Signed-off-by: Moshe Island <[email protected]>
@mosheisland force-pushed the 5_calc_loss_fp32_for_bf16 branch from 5dda2d4 to 044bd98 on October 4, 2023 12:23
@mosheisland (Contributor, Author)

@tjruwase, I have uploaded a new commit with --compute_fp32_loss.

While testing it, I encountered an upstream issue that occurs when using bf16 with tensorboard.
In stage3 we log the actor and critic loss to tensorboard, and when the logged scalar is a tensor, tensorboard converts it to numpy.
When using bf16 without --compute_fp32_loss, the loss is a bf16 tensor, and numpy does not support converting from bf16.
I fixed this issue as part of this commit.
I only discovered it now because, until now, I always forced the bf16 loss to fp32...
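
A small sketch of the failure mode and the fix, assuming a standard torch.utils.tensorboard SummaryWriter; the exact stage3 call site differs:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
# Without --compute_fp32_loss, the stage3 loss stays a bf16 tensor.
actor_loss = torch.tensor(0.5, dtype=torch.bfloat16)

# Fails as reported above: add_scalar() converts a tensor input to numpy,
# and numpy has no bf16 dtype.
# writer.add_scalar("actor_loss", actor_loss, global_step=0)

# Works: .item() converts the bf16 tensor to a Python float first.
writer.add_scalar("actor_loss", actor_loss.item(), global_step=0)
writer.close()
```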

@tjruwase merged commit e6f400a into microsoft:master on Oct 4, 2023
@mosheisland deleted the 5_calc_loss_fp32_for_bf16 branch on October 5, 2023 05:23