Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix sequence parallel(Ulysses) grad scale for zero0 #5555

Merged
merged 2 commits into from
Jun 5, 2024

Conversation

inkcherry
Copy link
Contributor

@inkcherry inkcherry commented May 21, 2024

use dp_world_size for grad reduction, instead of seq_dp_world_size.
Currently, for zero0, only sparse tensors use the correct world_size.

tiny model with sp=4 grad norm test:

grad_norm step1 step2 step3 step4 step5 step100
zero1 15.825 16.646 15.853 16.159 17.333 15.555
zero0 3.956 4.161 3.963 4.040 4.333 3.889
zero0(this patch) 15.825 16.646 15.853 16.159 17.333 15.554

@tjruwase tjruwase requested review from samadejacobs and tohtana and removed request for tjruwase and mrwyattii May 21, 2024 15:14

def _reduce_expert_gradients(self, expert_grads, elements_per_buffer):
# to maintain the gradients value unaffected by ep_size setting,
# utilize dp_world_size for allreduce average
dp_world_size = dist.get_world_size(groups._get_data_parallel_group())
dp_world_size = dist.get_world_size(groups._get_data_parallel_group()) / float(self.sequence_parallel_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inkcherry, can you help me understand why scale by sp_size? get_data_parallel_group != get_sequence_data_parallel_group, you should have correct value already, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! @samadejacobs. Yes, this should be the correct value. We should only need to modify the dp_world_size in the above instance.

@inkcherry
Copy link
Contributor Author

Hi, @samadejacobs I have removed the modifications you mentioned in that line. Could you please help review the other parts again? Thanks!

@samadejacobs samadejacobs added this pull request to the merge queue Jun 5, 2024
Merged via the queue into microsoft:master with commit 6b6d641 Jun 5, 2024
14 checks passed
sfc-gh-reyazda pushed a commit to Snowflake-Labs/DeepSpeed that referenced this pull request Jun 10, 2024
use dp_world_size for grad reduction, instead of seq_dp_world_size.
Currently, for zero0, only sparse tensors use the correct world_size.

tiny model with sp=4 grad norm test:
grad_norm | step1 | step2 | step3 | step4 |step5 | step100
-- | -- | -- | -- | -- | --| --
zero1 | 15.825 | 16.646|15.853 | 16.159 | 17.333 | 15.555
zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333| 3.889
zero0(this patch) | 15.825 | 16.646 | 15.853| 16.159 | 17.333 | 15.554
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants