fix sequence parallel(Ulysses) grad scale for zero0 #5555
Conversation
deepspeed/runtime/engine.py (outdated diff)

```diff
     def _reduce_expert_gradients(self, expert_grads, elements_per_buffer):
         # to maintain the gradients value unaffected by ep_size setting,
         # utilize dp_world_size for allreduce average
-        dp_world_size = dist.get_world_size(groups._get_data_parallel_group())
+        dp_world_size = dist.get_world_size(groups._get_data_parallel_group()) / float(self.sequence_parallel_size)
```
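The divisor question behind this diff can be illustrated without DeepSpeed. The sketch below (a hypothetical pure-Python model of the reduction, with made-up sizes `dp_size=2`, `sp_size=4`) shows why an allreduce-average over the combined sequence+data parallel group should divide by the data-parallel world size only: sequence-parallel ranks hold partial gradients for different sequence chunks, so their contributions must be summed, not averaged.

```python
# Sketch (no DeepSpeed): gradient reduction with sequence parallelism.
# Assumed toy sizes: 2 data-parallel replicas, each split into 4 sequence shards.
dp_size, sp_size = 2, 4

# grads[dp][sp] = partial gradient held by one rank (its sequence chunk).
grads = [[1.0 for _ in range(sp_size)] for _ in range(dp_size)]

# The full gradient of one replica is the SUM over its sequence shards.
per_replica = [sum(shards) for shards in grads]

# Correct data-parallel average: global sum divided by dp_size only.
global_sum = sum(per_replica)
correct = global_sum / dp_size          # 4.0

# Dividing by the combined seq+data world size (dp_size * sp_size)
# shrinks the gradient by a factor of sp_size.
buggy = global_sum / (dp_size * sp_size)  # 1.0

assert correct == buggy * sp_size
```

This factor-of-`sp_size` shrinkage is exactly the pattern in the grad-norm table in the PR description, where the un-patched zero0 norms are about 1/4 of the zero1 norms at sp=4.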
@inkcherry, can you help me understand why scale by sp_size? `get_data_parallel_group != get_sequence_data_parallel_group`, so you should already have the correct value, no?
Thanks for the review, @samadejacobs! Yes, this should be the correct value. We should only need to modify the `dp_world_size` in the instance above.

Hi @samadejacobs, I have removed the modification you mentioned on that line. Could you please help review the other parts again? Thanks!
Use dp_world_size for grad reduction, instead of seq_dp_world_size. Currently, for zero0, only sparse tensors use the correct world_size.

Tiny model with sp=4, grad norm test:

| grad_norm | step1 | step2 | step3 | step4 | step5 | step100 |
| -- | -- | -- | -- | -- | -- | -- |
| zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555 |
| zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889 |
| zero0 (this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554 |
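As a quick sanity check on the table (a pure-Python sketch, not part of the patch), the un-patched zero0 norms match the zero1 reference divided by the sequence-parallel degree sp=4 at every step:

```python
# Verify: zero0 grad norms (before this patch) ~= zero1 grad norms / sp.
sp = 4
zero1 = [15.825, 16.646, 15.853, 16.159, 17.333, 15.555]
zero0 = [3.956, 4.161, 3.963, 4.040, 4.333, 3.889]

for z1, z0 in zip(zero1, zero0):
    # Agreement to rounding precision of the reported values.
    assert abs(z1 / sp - z0) < 0.01, (z1, z0)
```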