
Is pipeline-parallel in conflict with ZeRO-stage2? #823

Open
gongjingcs opened this issue Mar 5, 2021 · 3 comments

@gongjingcs
I can't train GPT with 3D parallelism and ZeRO stage 2 at the same time.
It seems pipeline parallelism is in conflict with ZeRO stage 2. I use the pipeline example here: https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-3D_parallelism.

Looking forward to your reply.

@sdtblck (Contributor) commented Mar 5, 2021

Hi @gongjingcs. Pipeline parallelism can work with ZeRO stage 1 but not stage 2, because gradient accumulation in PP requires that all gradients be present across multiple forward/backward passes.

Since zero stage 2 partitions the gradients, they are simply incompatible unfortunately.
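To make the constraint concrete, here is a minimal sketch of a DeepSpeed config (written as a Python dict) that pairs pipeline parallelism with ZeRO stage 1. The batch sizes and fp16 settings are illustrative assumptions, not values from this thread; the only point is that `zero_optimization.stage` must stay at 0 or 1 when PP is in use.

```python
# Illustrative DeepSpeed config as a Python dict. Only the ZeRO stage
# matters for this issue: stage 2 partitions gradients across data-parallel
# ranks, which breaks pipeline parallelism's gradient accumulation over
# micro-batches, so the stage must be <= 1. All other values are placeholders.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # PP accumulates over micro-batches
    "gradient_accumulation_steps": 16,
    "zero_optimization": {
        "stage": 1  # stages 2 and 3 conflict with pipeline parallelism
    },
    "fp16": {"enabled": True},
}

# Sanity check before launching: reject configs that combine PP with ZeRO >= 2.
assert ds_config["zero_optimization"]["stage"] <= 1, \
    "Pipeline parallelism requires ZeRO stage 0 or 1"

print(json.dumps(ds_config, indent=2))
```

In practice this dict would be serialized to the JSON file passed via `--deepspeed_config`; the assertion mirrors the check one would want the framework to perform up front rather than failing mid-training.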

@sdtblck commented Mar 5, 2021

(Paraphrasing some communication with @samyam here, but I have also tried it myself in some experiments and got the same error. It would be a useful thing to add to the docs, imo.)

@gongjingcs (Author)

Thanks for your reply. Yes, I changed the config setting to ZeRO stage 1 and it doesn't report any error; however, ZeRO stage 1 does not actually take effect in this example: https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-3D_parallelism. In this example, it uses the `buffered_allreduce_fallback` function to all-reduce gradients. Could you help check it?
