-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to use one's existing vertical MP with DeepSpeed #674
Comments
@stas00 We do support both horizontal MP and vertical MP (Pipeline) with DeepSpeed in addition to ZeRO. In fact you can use all three together, but as you noticed we do not have an example released yet, but its in our to do list. But for the problem you are facing, we do have a Pipeline Parallelism example that might be helpful to get the tp5, gp2 , and bart that you mentioned going : https://github.com/microsoft/DeepSpeedExamples/tree/master/pipeline_parallelism Following this example, you should be able to combine pipeline parallelism with data parallelism or ZeRO Stage 1 in DeepSpeed. This uses DeepSpeed's own pipeline parallelism, but if your model has been sequentialized, DeepSpeed can do the rest for you. @ShadenSmith knows the details of PP in DeepSpeed better than I do so tagging him here. If you want to use your own implementation of pipeline parallelism or tensor slicing with DeepSpeed, you can do that too, but that requires you to implement a model parallel unit (mpu) object, and pass it to DeepSpeed. This mpu object should implement a few methods that tells DeepSpeed which GPUs are involved in model parallelism and which are involved in data parallelism. How to pass mpu to Deepspeed: https://github.com/microsoft/DeepSpeedExamples/blob/400cd1bc3524507301f39c659d3069672c4ab464/Megatron-LM/pretrain_gpt2.py#L172 What should mpu implement mpu.get_model_parallel_group() DeepSpeed/deepspeed/runtime/engine.py Line 520 in 82cecf6
In terms of if we ever need Pipeline Parallelism or Model Parallelism, when we already have ZeRO, I think there are situations when Pipeline Parallelism can be faster. Specifically if the network bandwidth is too slow for ZeRO-powered data parallel training or standard data parallel training. Combining pipeline parallel training with data parallel training can reduce the network traffic significantly, and help improve throughput when the network is a bottleneck. We actually did a comparative study on the pros and cons of these three techniques in the last press release: https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/ under the subsection Understanding tradeoffs between data, model and pipeline parallelism under 3D parallelism. |
Your comment is phenomenal - thank you so much for explicitly covering all the grounds, @samyam. This is very helpful to understand how things stack together. Here is current We have the naive implementation of vertical MP as can be seen from the top diagram from the GPipe paper: So we have everything as the pipeline except no rpc and we manage all the data and layer switching ourselves. Except it's very inefficient due to gpu idling, which is what PP solves. Wrt, pipeline, I'm working on trying to port So my goal is hopefully to find a way to make PP work with
That's super important - thank you for that insight. I'm looking forward to figuring out how to make these things work and then benchmark to see which is which. Thank again for your indepth commentary, @samyam |
I have a follow up question:
So basically what you're saying is that such setup will require at least 4 GPUs, since you can't use the same 2 gpus for MP and DP, correct? I think I remember seeing a diagram somewhere. I don't see how 2D or 3D parallelism can be done on 2 GPUs. If I'm not mistaken each 1D requires 2x gpus, so 4 gpus for 2D and 8 gpus at least for 3D. with DP+PP: We hide the fact that gpu 1 and 3 are used for PP, and so we tell DP that we only have gpu 0 and 2 and it's oblivious that the data it sends to gpus 0+2 also outsource to gpus 1+3. And if we do 3D, I think the same story repeats, where we have to hide from each extra 1D the gpus that are used by other dimensions. |
I have a similar problem - when i'm passing in mpu with a custom GPT2 Pipeline parallel model (building on megatron examples to get full 3d parallelism) to deepspeed.initialize, it gives me the following error:
Looking at the deepspeed initialize code, it seems it tries to grab mpu from the model?
So should mpu be passed in to the Pipeline Module for building the model? can you provide any examples of how to properly acheive 3D parallelism? |
The docs mention in several places that DeepSpeed can be used with a model that implements its own MP - which my guess it's referring to a vertical model slicing as taught at the pytorch tutorial., that is groups of layers spread out through several GPUs.
OK, we have implemented vertical slicing MP for t5, gpt2 and bart in transformers, using a naive slow approach, but I have no idea how to bolt DeepSpeed on top of such a model that spans several GPUs. The typical problem is that there is a conflict between the model doing its own device management and switching model's data and inputs to different devices, with any external solutions. e.g. I can't do DP or DDP over 2 GPUs which are also used for vertical MP.
I do find it unclear though when in some places the documentation talks about allowing one's own MP and no explanation or examples of how exactly that would work. I think it does suggest to look at Megatron-LM for such an example, but it's a complex project. What would have helped a lot is having a simple example on how to bolt DeepSpeed for an MP-enabled model with perhaps just a few layers, such as the simple toy MP model from the same pytorch tutorial.
Thank you!
And I also understand that the result of combing one's own vertical MP with DeepSpeed might actually be worse than just using the ZeRO Partitioning, and probably the whole setup will be much more complicated, but it's hard to appreciate or compare things if there is a barrier to trying various options.
My gut feeling, which is not yet supported by enough experience, is telling me that DeepSpeed's solution to the memory issue completely eliminates the need for vertical MP (almost definitely once stage 3 is implemented), but since our initiative to implement MP in transformers is sort of incomplete, since the naive MP doesn't scale, I'm trying to either find a way to make it efficient or perhaps remove it altogether if DeepSpeed is all one needs in all possible scenarios.
I'm trying to figure out if bolting just PP onto the naive MP will do the trick. But it's tricky since one needs to "sequentialize" the model's stack for PP to work.
The text was updated successfully, but these errors were encountered: