[NaN] Fix nan print issue when running Megatron-Deepspeed with DeepSpeed #434
LGTM!
@ys950902, can you please share a bit more details about why |
Hi @tjruwase, thanks for your reply. When you run Megatron-DeepSpeed with DeepSpeed for 3D parallelism: |
@ys950902, thanks for the explanation. I think the correct solution is to use the Megatron-DeepSpeed/megatron/training.py Line 746 in 53b241f
The problem is that Megatron-DeepSpeed/megatron/training.py Lines 773 to 778 in 53b241f
Can you try setting |
Got it, I will fix it as you suggested!
Hi @tjruwase, could you please take a look at this PR, together with the DeepSpeed change that adds bfloat16 support (microsoft/DeepSpeed#5879)?
Hi @tjruwase, will you merge this PR?
When running Megatron-DeepSpeed with DeepSpeed and the loss becomes NaN, the only way to notice is in the log line shown below: no lm loss is printed, and the number of nan iterations is still 0, which is not correct:
iteration 9/ 10 | consumed samples: 108 | consumed tokens: 442368 | elapsed time per iteration (ms): 1979.2 | learning rate: 4.219E-07 | global batch size: 12 | loss scale: 1.0 | actual seqlen: 4096 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 6.063 | tokens per gpu per second (tgs): 2069.506 | TFLOPs: 127.00 |
This PR fixes the issue: the NaN check should run whether or not the iteration was skipped.
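To illustrate the intended behavior, here is a minimal sketch of a NaN counter that runs on every iteration, not only on skipped ones. It is not the code in `megatron/training.py`: the helper name `update_nan_counter`, the plain-float `loss_dict`, and the `totals` dictionary keys are assumptions made for illustration, and treating `inf` as a NaN iteration is a choice of this sketch.

```python
import math


def update_nan_counter(loss_dict, skipped_iter, totals):
    """Hypothetical helper, not the actual Megatron-DeepSpeed function.

    loss_dict    -- mapping of loss name to a Python float for this iteration
    skipped_iter -- truthy if the optimizer step was skipped this iteration
    totals       -- running counters, updated in place
    """
    # Check every reported loss on every iteration, independent of
    # skipped_iter. This sketch also counts +/-inf as a "nan" iteration.
    got_nan = any(
        math.isnan(value) or math.isinf(value) for value in loss_dict.values()
    )

    totals["skipped iterations"] = (
        totals.get("skipped iterations", 0) + int(bool(skipped_iter))
    )
    totals["nan iterations"] = totals.get("nan iterations", 0) + int(got_nan)
    return got_nan


if __name__ == "__main__":
    totals = {}
    # The iteration was not skipped, but the loss is NaN: it is still counted.
    update_nan_counter({"lm loss": float("nan")}, skipped_iter=False, totals=totals)
    update_nan_counter({"lm loss": 2.5}, skipped_iter=False, totals=totals)
    print(totals)
```

With a check of this shape, a log line like the one above would report a non-zero number of nan iterations even when the number of skipped iterations is 0.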