
deepspeed-chat: calculate loss in fp32 when using bf16 #754

Merged 1 commit into microsoft:master on Oct 4, 2023

Conversation

mosheisland (Contributor)

Computing the loss in fp32 improved accuracy for bf16 training in all 3 stages. By default, all 3 stages calculate the loss in fp32 when using bf16. This can be disabled with --no_bf16_to_fp32_loss.

While at it, fix stage2 reward model creation: pass zero_stage to create_critic_model.

Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405
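
For context, a minimal sketch of the technique, assuming a standard PyTorch causal-LM cross-entropy loss; the function name and flag wiring are illustrative, not the actual DeepSpeed-Chat code:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor,
                   compute_fp32_loss: bool = True) -> torch.Tensor:
    """Next-token cross-entropy loss.

    When the model runs in bf16 (or fp16), `logits` arrive in that dtype.
    Upcasting to fp32 before the softmax/cross-entropy keeps the loss
    computation in full precision, which is the idea behind this PR.
    """
    # Shift so that tokens < n predict token n.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    if compute_fp32_loss:
        # Upcast only the loss inputs; the model weights stay in bf16.
        shift_logits = shift_logits.float()

    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```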

@tjruwase (Contributor) commented Oct 3, 2023

@misland-habana, this is great! But I am curious, is there any reason that this cannot be enabled for fp16 as well (even fp8)? I think other dtypes could benefit from fp32 loss computation.

@mosheisland (Contributor, Author)

@tjruwase, I enabled Bloom with bf16 and did not test the impact of using fp32 loss for fp16; I only tested it for bf16. I would guess that fp16 would also benefit (though maybe less than bf16, since fp16 has more mantissa bits). As for fp8, I agree that it may help.

@tjruwase (Contributor) commented Oct 3, 2023

@mosheisland, in that case, do you mind changing the CLI flag to something like --compute_fp32_loss and enabling it for other low-precision dtypes? For now, I think it would be sufficient to remove all the bf16 guards.
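
For illustration, a hedged sketch of what the renamed, dtype-agnostic flag might look like in an argparse setup (the actual wiring in the DeepSpeed-Chat training scripts may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# Dtype-agnostic flag: applies whenever training runs in a low-precision
# dtype (bf16, fp16, ...), with no bf16-only guard around it.
parser.add_argument(
    "--compute_fp32_loss",
    action="store_true",
    help="Compute the loss in fp32 regardless of the model dtype.",
)
args = parser.parse_args()
```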

@tjruwase (Contributor) commented Oct 3, 2023

@lekurile, @awan-10 FYI

@mosheisland (Contributor, Author)

@tjruwase - sure, I will do it.

Computing the loss in fp32 can improve training accuracy for all 3 stages.
This was tested with the Bloom model using the bf16 dtype.

While at it, fix stage2 reward model creation: pass zero_stage to
create_critic_model.

Also, in stage3, when using bf16 with tensorboard enabled, we record the actor
and critic loss. Tensorboard accepts a scalar bf16 loss tensor and converts
it to numpy, which fails since numpy does not support the bf16 dtype.
Fix it by logging loss.item() to tensorboard.

Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405
Signed-off-by: Moshe Island <[email protected]>
@mosheisland force-pushed the 5_calc_loss_fp32_for_bf16 branch from 5dda2d4 to 044bd98 on October 4, 2023 12:23
@mosheisland (Contributor, Author)

@tjruwase, I have uploaded a new commit with --compute_fp32_loss.

While testing it, I encountered an upstream issue that occurs when using bf16 with tensorboard.
In stage3 we log the actor and critic loss to tensorboard, and when the logged scalar is a tensor, tensorboard converts it to numpy.
When using bf16 without --compute_fp32_loss, the loss is a bf16 tensor, and numpy does not support converting from bf16.
I fixed this issue as part of this commit.
I only discovered it now because, until now, I always forced the bf16 loss to fp32...
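
A small sketch of the failure mode and the fix, assuming a standard torch.utils.tensorboard SummaryWriter; the exact stage3 call site differs:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
# Without --compute_fp32_loss, the stage3 loss stays a bf16 tensor.
actor_loss = torch.tensor(0.5, dtype=torch.bfloat16)

# Fails as reported above: add_scalar() converts a tensor input to numpy,
# and numpy has no bf16 dtype.
# writer.add_scalar("actor_loss", actor_loss, global_step=0)

# Works: .item() converts the bf16 tensor to a Python float first.
writer.add_scalar("actor_loss", actor_loss.item(), global_step=0)
writer.close()
```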

@tjruwase merged commit e6f400a into microsoft:master on Oct 4, 2023
@mosheisland deleted the 5_calc_loss_fp32_for_bf16 branch on October 5, 2023 05:23