A request for clarity around 3D Parallelism in DeepSpeed #673
I guess the discussion continued at huggingface/blog#71 (review), but it's best to avoid discussing important things that others may benefit from in code suggestions, because as soon as a suggestion is resolved GitHub hides those comments. So I will at least re-paste them here for posterity: @jeffra wrote:
And then @samyam expanded:
Thank you for these clarifications, @jeffra and @samyam!
I think DeepSpeed would benefit a lot from having a sort of visual map showing how the different components fit together.
The blog posts have a lot of diagrams, but they only convey a partial picture.
Hi @stas00, thanks for the great suggestions :-). A detailed map of DeepSpeed would be great. Pipeline parallelism is not used unless you provide a PipelineModule as your model.
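For context, here is a minimal sketch of how pipeline parallelism is opted into; it assumes the deepspeed.pipe.PipelineModule API, and the layer sizes, stage count, and config path are illustrative only:

```python
# Minimal sketch: pipeline parallelism in DeepSpeed is engaged by wrapping the
# model's layers in a PipelineModule before calling deepspeed.initialize.
# Layer sizes, stage count, and the config path are illustrative assumptions.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

layers = [nn.Linear(1024, 1024) for _ in range(8)]   # toy stack of layers
model = PipelineModule(layers=layers, num_stages=2)  # partition layers over 2 pipeline stages

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",                          # hypothetical config file
)
# Without the PipelineModule wrapper, the same initialize call runs plain
# data-parallel (optionally ZeRO) training with no pipelining.
```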
This is a great question. The difference comes down to where the computation is performed. With ZeRO's sharding, we save memory by partitioning various tensors before and after the computations, but the actual computation itself is unchanged and still operates on the full tensors. In contrast, tensor slicing like Megatron-LM actually modifies the computations to work in a distributed manner. A rough example: instead of collecting the full data, doing a matrix multiplication, and then partitioning again, the module itself is modified to compute in a distributed manner. This has several key advantages, including reducing the size of activations for layers and also reducing the pressure on global batch size by splitting individual samples across model-parallel groups. Tensor slicing is a model-specific strategy, which is why DeepSpeed supports it but does not provide it. @samyam phrased it in a great way: we like to think of sharding as an optimization for data-parallel training, because we don't need the specifics of the model's computation.
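To make the distinction concrete, here is a rough sketch of the two approaches for a single linear layer split across 2 ranks; the function names and the column-wise sharding are my own illustration, not DeepSpeed or Megatron-LM APIs:

```python
# Rough sketch: ZeRO-style sharding vs. Megatron-style tensor slicing for one
# weight matrix W split column-wise across 2 ranks. Names and shapes are
# illustrative only; this is not the actual DeepSpeed or Megatron-LM code.
import torch
import torch.distributed as dist

def zero_style_forward(x, w_shard):
    # ZeRO-style sharding: each rank stores only a shard of W, but the full W
    # is re-assembled right before use, so the computation itself stays the
    # ordinary, unmodified y = x @ W.
    world = dist.get_world_size()
    gathered = [torch.empty_like(w_shard) for _ in range(world)]
    dist.all_gather(gathered, w_shard)
    w_full = torch.cat(gathered, dim=1)   # reconstruct the full weight
    return x @ w_full                      # unchanged computation on full tensors

def megatron_style_forward(x, w_shard):
    # Tensor slicing: the computation itself is rewritten. Each rank multiplies
    # by only its column slice of W and produces a slice of the output, so the
    # full weight and full activation never materialize on a single rank.
    return x @ w_shard                     # partial (column-block) output
```

The first version is model-agnostic, which is why it can be treated as a data-parallel optimization; the second requires knowing how to rewrite each module, which is why it is model-specific.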
I understood this part, thank you. I was misled by the 3D parallelism discussion in the Microsoft blog post and assumed that DS does this out of the box. I think that thanks to your clear answers I now have a pretty good understanding of the different parts.
Yes, of course, this is the DP micro-batch, which is different from the PP micro-batch. Let me try to summarize my understanding so far:
Unlike DP, PP feeds all micro-batches to the first GPU (in the simple case), and the other GPUs in the pipeline stack get to that same micro-batch in their turn. Here each GPU gets to see each and every micro-batch. So a PP micro-batch is different from a DP micro-batch.
Perhaps the first-level split should be called mini, and the second micro? Let's look at an example to clearly see which is which. Say you have 4 GPUs, 2 for DP and 2 for PP, and we want to run BS=24; let's ignore gradient accumulation for now. a. DP gets first dibs on splitting the batch, and since we have 2 stacks of GPUs visible to DP, each stack will get a micro-batch of 12. Am I doing well so far?
Perhaps the way this is done now is that each "P" calls its splits "micro-batches", and as they stack, the micro-micro-batches just work because the contexts don't overlap.
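To put numbers on the example above, here is a tiny worked sketch of the two levels of splitting; the PP chunk count of 4 is an arbitrary choice of mine for illustration, not something DeepSpeed prescribes:

```python
# Worked example of the two levels of batch splitting discussed above:
# global batch 24, 2 DP replicas, 2 PP stages. The PP chunk count (4) is an
# arbitrary illustrative choice, not a DeepSpeed default.
import torch

global_batch = torch.arange(24)                      # 24 samples, ids 0..23

# Level 1: DP splits the global batch across the replicas (12 samples each).
dp_world_size = 2
dp_shards = torch.chunk(global_batch, dp_world_size)
assert [len(s) for s in dp_shards] == [12, 12]

# Level 2: within one DP replica, PP further splits its 12 samples into
# micro-batches that are fed one after another into the first pipeline stage;
# every stage of that replica eventually sees every one of these micro-batches,
# but never the other replica's samples.
pp_chunks = 4
micro_batches = torch.chunk(dp_shards[0], pp_chunks)
assert [len(m) for m in micro_batches] == [3, 3, 3, 3]
```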
Your explanation is perfect, @ShadenSmith. Thank you! I will try to summarize all your various answers to draw a complete picture.
Let's start by saying that, based on my reading of various papers, Model Parallelism (MP) is a very inconsistently used term. One can slice the model vertically or horizontally. One can implement a naive slow version or speed it up with pipelining, and almost none of these schemes is truly parallel. I tried to summarize and demo a few of the basic options here: huggingface/transformers#8771 (comment)
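As a quick toy illustration of the vertical vs. horizontal distinction (my own example, not code from the linked thread):

```python
# Toy illustration of the two ways to slice a model; my own example, not code
# from the linked transformers thread.
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])

# Vertical slicing: whole layers go to different devices, e.g. layers 0-3 on
# GPU 0 and layers 4-7 on GPU 1. Done naively this is sequential (only one GPU
# is busy at a time); pipelining overlaps micro-batches to shrink that bubble.
vertical = {0: model[:4], 1: model[4:]}

# Horizontal (tensor) slicing: every layer is split internally, e.g. each GPU
# holds half of each weight matrix and the matmuls are rewritten to work on
# the slices (the Megatron-LM approach). Here each rank just holds half of the
# output features of every layer.
horizontal = {rank: [nn.Linear(1024, 512) for _ in range(8)] for rank in (0, 1)}
```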
So DeepSpeed talks a lot about 3D parallelism; blog posts like this one state multiple times that DeepSpeed uses 3D parallelism.
Then @samyam kindly reviewed the draft of the upcoming blog post about the DeepSpeed integration in transformers, where he suggested that, no, DeepSpeed doesn't do 3D parallelism.
To me it does look like DeepSpeed implements all 3:
So please correct me if I'm wrong and DeepSpeed isn't in fact already doing 3D.
Quotes from the blog post: https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
and then later: