deepspeed-chat: calculate loss in fp32 when using bf16 #754
Conversation
@misland-habana, this is great! But I am curious, is there any reason this cannot be enabled for fp16 as well (even fp8)? I think other dtypes could benefit from fp32 loss computation.
@tjruwase, I enabled Bloom with bf16 and did not test the impact of using fp32 loss for fp16; I only tested it for bf16. I would guess that fp16 would also benefit (maybe less than bf16, since fp16 has more mantissa bits). As for fp8, I agree that it may help.
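For illustration, here is a minimal sketch of the technique being discussed: upcasting reduced-precision logits to fp32 before the cross-entropy reduction. The function and argument names are illustrative, not the PR's actual code.

```python
import torch
import torch.nn.functional as F

def compute_loss(logits: torch.Tensor, labels: torch.Tensor,
                 compute_fp32_loss: bool = True) -> torch.Tensor:
    """Cross-entropy over bf16/fp16 logits, optionally accumulated in fp32."""
    if compute_fp32_loss:
        # Upcast before the softmax/log-sum-exp so the reductions run in fp32;
        # with bf16's short mantissa this reduces rounding error in the loss.
        logits = logits.float()
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

# Example usage with stand-in shapes: (batch, seq, vocab)
logits = torch.randn(2, 5, 10, dtype=torch.bfloat16)
labels = torch.randint(0, 10, (2, 5))
loss = compute_loss(logits, labels)  # fp32 scalar tensor
```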
@mosheisland, in that case do you mind changing the cli to something like --compute_fp32_loss?
@tjruwase - sure, I will do it.
Using loss in fp32 can improve training accuracy for all 3 stages. This was tested with the Bloom model using bf16 dtype.

While at it, fix stage2 reward model creation: pass zero_stage to create_critic_model.

Also, in stage3, when using bf16 with tensorboard enabled, we record the actor and critic loss. Tensorboard accepts a scalar bf16 loss tensor and converts it to numpy; this fails since numpy does not support bf16. Fix it by logging loss.item() to tensorboard.

Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405
Signed-off-by: Moshe Island <[email protected]>
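A hedged reproduction of the tensorboard issue described in the commit message, and the fix; the tag name and loss value are illustrative:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
loss = torch.tensor(0.5, dtype=torch.bfloat16)  # stand-in for an actor/critic loss

# Logging the bf16 tensor directly fails inside tensorboard's tensor->numpy
# conversion, because numpy has no bf16 dtype:
#   writer.add_scalar("actor_loss", loss, 0)
# Logging the Python float works:
writer.add_scalar("actor_loss", loss.item(), 0)
writer.close()
```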
Force-pushed from 5dda2d4 to 044bd98.
@tjruwase, I have uploaded a new commit with --compute_fp32_loss. While testing it, I encountered an upstream issue that happens when you use bf16 and tensorboard.
Using loss in fp32 improved accuracy for bf16 training for all 3 stages. By default, all 3 stages will calculate loss in fp32 when using bf16. This can be disabled by using --no_bf16_to_fp32_loss.
While at it, fix stage2 reward model creation: pass zero_stage to create_critic_model.
Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405
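For reference, a minimal sketch of how a flag like --compute_fp32_loss could be wired up; the flag name comes from the thread above, while the help text and default are assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
# Flag name taken from the comment above; help text and default are assumptions.
parser.add_argument(
    "--compute_fp32_loss",
    action="store_true",
    default=False,
    help="Compute the loss in fp32 even when the model runs in bf16/fp16.")

args = parser.parse_args()
# args.compute_fp32_loss would then be passed to a loss helper such as the
# compute_loss() sketch shown earlier in this thread.
```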