
Is pipeline-parallel in conflict with ZeRO-stage2? #823

Open
gongjingcs opened this issue Mar 5, 2021 · 3 comments

@gongjingcs
I can't train GPT with 3D parallelism and ZeRO stage 2 at the same time.
It seems pipeline parallelism is in conflict with ZeRO stage 2. I use the pipeline example here: https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-3D_parallelism.

Looking forward to your reply.

@sdtblck (Contributor) commented Mar 5, 2021

Hi @gongjingcs. Pipeline parallelism can work with ZeRO stage 1 but not stage 2, because gradient accumulation in PP requires that all gradients be present across multiple forward/backward passes.

Since zero stage 2 partitions the gradients, they are simply incompatible unfortunately.
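To make the constraint concrete, here is a minimal sketch of a DeepSpeed config (written as a Python dict) that pairs pipeline parallelism with ZeRO stage 1. The batch sizes and fp16 settings are illustrative assumptions, not values from this thread; the only point is that `zero_optimization.stage` must stay at 0 or 1 when PP is in use.

```python
# Illustrative DeepSpeed config as a Python dict. Only the ZeRO stage
# matters for this issue: stage 2 partitions gradients across data-parallel
# ranks, which breaks pipeline parallelism's gradient accumulation over
# micro-batches, so the stage must be <= 1. All other values are placeholders.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # PP accumulates over micro-batches
    "gradient_accumulation_steps": 16,
    "zero_optimization": {
        "stage": 1  # stages 2 and 3 conflict with pipeline parallelism
    },
    "fp16": {"enabled": True},
}

# Sanity check before launching: reject configs that combine PP with ZeRO >= 2.
assert ds_config["zero_optimization"]["stage"] <= 1, \
    "Pipeline parallelism requires ZeRO stage 0 or 1"

print(json.dumps(ds_config, indent=2))
```

In practice this dict would be serialized to the JSON file passed via `--deepspeed_config`; the assertion mirrors the check one would want the framework to perform up front rather than failing mid-training.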

@sdtblck commented Mar 5, 2021

(Paraphrasing some communication with @samyam here, but I have also tried it myself in some experiments and got the same error. It would be a useful thing to add to the docs, imo.)

@gongjingcs (Author)

Thanks for your reply. Yes, I changed the config setting to ZeRO stage 1 and it doesn't report any error; however, ZeRO stage 1 does not actually take effect in this example: https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-3D_parallelism. In this example, it uses the `buffered_allreduce_fallback` function to all-reduce gradients. Could you help check it?
