fix sequence parallel(Ulysses) grad scale for zero0 #5555
Conversation
deepspeed/runtime/engine.py (outdated diff)

```diff
     def _reduce_expert_gradients(self, expert_grads, elements_per_buffer):
         # to maintain the gradients value unaffected by ep_size setting,
         # utilize dp_world_size for allreduce average
-        dp_world_size = dist.get_world_size(groups._get_data_parallel_group())
+        dp_world_size = dist.get_world_size(groups._get_data_parallel_group()) / float(self.sequence_parallel_size)
```
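The divisor question behind this diff can be illustrated without DeepSpeed. The sketch below (a hypothetical pure-Python model of the reduction, with made-up sizes `dp_size=2`, `sp_size=4`) shows why an allreduce-average over the combined sequence+data parallel group should divide by the data-parallel world size only: sequence-parallel ranks hold partial gradients for different sequence chunks, so their contributions must be summed, not averaged.

```python
# Sketch (no DeepSpeed): gradient reduction with sequence parallelism.
# Assumed toy sizes: 2 data-parallel replicas, each split into 4 sequence shards.
dp_size, sp_size = 2, 4

# grads[dp][sp] = partial gradient held by one rank (its sequence chunk).
grads = [[1.0 for _ in range(sp_size)] for _ in range(dp_size)]

# The full gradient of one replica is the SUM over its sequence shards.
per_replica = [sum(shards) for shards in grads]

# Correct data-parallel average: global sum divided by dp_size only.
global_sum = sum(per_replica)
correct = global_sum / dp_size          # 4.0

# Dividing by the combined seq+data world size (dp_size * sp_size)
# shrinks the gradient by a factor of sp_size.
buggy = global_sum / (dp_size * sp_size)  # 1.0

assert correct == buggy * sp_size
```

This factor-of-`sp_size` shrinkage is exactly the pattern in the grad-norm table in the PR description, where the un-patched zero0 norms are about 1/4 of the zero1 norms at sp=4.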
@inkcherry, can you help me understand why scale by sp_size? `get_data_parallel_group != get_sequence_data_parallel_group`, so you should already have the correct value, no?
Thanks for the review, @samadejacobs! Yes, this should be the correct value. We should only need to modify the `dp_world_size` in the instance above.

Hi @samadejacobs, I have removed the modification you mentioned on that line. Could you please help review the other parts again? Thanks!
Use dp_world_size for grad reduction, instead of seq_dp_world_size. Currently, for zero0, only sparse tensors use the correct world_size.

Tiny model with sp=4, grad norm test:

| grad_norm | step1 | step2 | step3 | step4 | step5 | step100 |
| -- | -- | -- | -- | -- | -- | -- |
| zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555 |
| zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889 |
| zero0 (this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554 |
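As a quick sanity check on the table (a pure-Python sketch, not part of the patch), the un-patched zero0 norms match the zero1 reference divided by the sequence-parallel degree sp=4 at every step:

```python
# Verify: zero0 grad norms (before this patch) ~= zero1 grad norms / sp.
sp = 4
zero1 = [15.825, 16.646, 15.853, 16.159, 17.333, 15.555]
zero0 = [3.956, 4.161, 3.963, 4.040, 4.333, 3.889]

for z1, z0 in zip(zero1, zero0):
    # Agreement to rounding precision of the reported values.
    assert abs(z1 / sp - z0) < 0.01, (z1, z0)
```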