[wip] [doc] Parallelism notes #9766
Perhaps this will end up in a blog post and/or a new document, for now collecting notes. This is a work in progress. Please give me some time to write the bulk of it and then you'll be welcome to ask questions, add contributions, etc.
Parallelism overview
In modern machine learning, the various approaches to parallelism are used to:
- fit ever bigger models onto limited hardware
- significantly speed up training
We will first discuss in depth various 1D parallelism techniques and their pros and cons, and then look at how they can be combined into 2D and 3D parallelism to enable even faster training and to support even bigger models.
While the main concepts most likely apply to any other framework, this article focuses on PyTorch-based implementations.
Data Parallel
Most users with just 2 GPUs already enjoy the increased training speed thanks to DataParallel (DP) and DistributedDataParallel (DDP), which are almost trivial to use.
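For example, a minimal sketch of wrapping a model in DDP, assuming a toy `nn.Linear` stands in for the real model and the script is launched with `torchrun --nproc_per_node=2`:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                  # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()       # stand-in for a real model
model = DDP(model, device_ids=[local_rank])      # each GPU holds a full replica

x = torch.randn(8, 1024).cuda()                  # each process gets its own mini-batch
loss = model(x).sum()
loss.backward()                                  # DDP all-reduces the gradients here
```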
ZeRO Data Parallel
ZeRO-powered data parallelism (ZeRO-DP) is described in the following diagram from this blog post:
It can be difficult to wrap one's head around it, but in reality the concept is quite simple. This is just the usual DataParallel (DP), except that instead of replicating the full model params, gradients and optimizer states, each GPU stores only a slice of them. Then at run-time, when the full layer params are needed just for the given layer, all GPUs synchronize to give each other the parts they are missing - that's it.
Consider this simple model with 3 layers, where each layer has 3 params: layer La has weights a0, a1 and a2, layer Lb has b0, b1 and b2, and layer Lc has c0, c1 and c2.
If we have 3 GPUs, the Sharded DDP (= ZeRO-DP) splits the model onto the 3 GPUs like so: GPU0 gets a0, b0 and c0, GPU1 gets a1, b1 and c1, and GPU2 gets a2, b2 and c2.
In a way this is horizontal slicing, if you imagine the typical DNN diagram. Vertical slicing is where one puts whole layer-groups on different GPUs. But it's just the starting point.
Now each of these GPUs will get the usual mini-batch as it works in DP: GPU0 gets mini-batch x0, GPU1 gets x1 and GPU2 gets x2.
The inputs are unmodified - they think they are going to be processed by the normal model.
So the inputs first hit the first layer La.
Let's focus just on GPU0: x0 needs the a0, a1, a2 params to do its forward pass, but GPU0 has only a0 - so it gets sent a1 from GPU1 and a2 from GPU2. Now the forward step can happen.
In parallel GPU1 gets mini-batch x1 and it only has a1, but needs a0 and a2 params, so it gets those from GPU0 and GPU2.
The same happens to GPU2, which gets input x2: it gets a0 and a1 from GPU0 and GPU1.
As soon as the calculation is done, the data that is no longer needed gets dropped - it's only used during the calculation.
The same is repeated at every other stage.
And the whole thing is repeated for layer Lb, then Lc forward-wise, and then backward Lc -> Lb -> La.
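To make the pattern concrete, here is a toy sketch of the gather-compute-drop cycle - purely illustrative, not how DeepSpeed or any other ZeRO implementation is actually written:

```python
import torch
import torch.distributed as dist

def forward_one_layer(x, my_shard):
    """x: activations on this GPU; my_shard: this GPU's slice of the layer's flattened weights."""
    world_size = dist.get_world_size()
    shards = [torch.empty_like(my_shard) for _ in range(world_size)]
    dist.all_gather(shards, my_shard)                       # fetch the missing slices from the other GPUs
    full_weight = torch.cat(shards).view(x.shape[-1], -1)   # reassemble the full layer weight
    out = x @ full_weight                                   # do this layer's forward computation
    del shards, full_weight                                 # drop everything but our own shard again
    return out
```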
To me this sounds like an efficient group backpacking weight distribution strategy: person A carries the tent, person B carries the stove, and person C carries the entertainment system.
Now each night they all share what they have with the others and get from the others what they don't have, and in the morning they pack up their allocated type of gear and continue on their way. This is Sharded DDP / ZeRO-DP.
Compare this strategy to the simple one where each person has to carry their own tent, stove and entertainment system, which would be far more inefficient. This is DataParallel in pytorch.
And I think pretty much everywhere I read Sharded == Partitioned, so I think those are synonyms in the context of distributed models.
If you pay close attention to the way ZeRO partitions the model's data, it looks very similar to horizontal model parallelism, which will be discussed later. This is because it partitions/shards each layer's data, unlike vertical model parallelism, which is discussed next.
Implementations:
- DeepSpeed ZeRO-DP
- `transformers` integration

Naive Model Parallel (Vertical) and Pipeline Parallel
Naive Model Parallel (MP) is where one spreads groups of model layers across multiple GPUs. The mechanism is relatively simple - switch the desired layers `.to()` the desired devices, and now whenever the data goes in and out of those layers, switch the data to the same device as the layer and leave the rest unmodified.

We refer to it as Vertical MP, because if you remember how most models are drawn, we slice the layers vertically. For example, take an 8-layer model and slice it in 2 vertically, placing layers 0-3 onto gpu0 and layers 4-7 onto gpu1.
Now while data travels from layer 0 to 1, 1 to 2 and 2 to 3 this is just the normal model. But when data needs to pass from layer 3 to layer 4 it needs to travel from gpu0 to gpu1 which introduces a communication overhead. If the participating GPUs are on the same node (e.g. same PC) this copying is pretty fast, but if the other gpus are on different nodes (e.g. another PC) the communication overhead could be significantly larger.
Then layers 4 to 5 to 6 to 7 work as in a normal model, and when the 7th layer completes we often need to send the data back to layer 0 where the labels are (or alternatively send the labels to the last layer).
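A minimal sketch of this mechanism, using a toy 8-layer stack of `nn.Linear` layers in place of a real model:

```python
import torch
import torch.nn as nn

# place layers 0-3 on gpu0 and layers 4-7 on gpu1
layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(8)])
for i, layer in enumerate(layers):
    layer.to("cuda:0" if i < 4 else "cuda:1")

def forward(x):
    x = x.to("cuda:0")                                 # inputs start on the first device
    for i, layer in enumerate(layers):
        x = x.to("cuda:0" if i < 4 else "cuda:1")      # copy activations when crossing the gpu boundary
        x = layer(x)
    return x.to("cuda:0")                              # send the result back to where the labels live
```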
Problems:
- all but one GPU are idle at any given moment, since only the GPU holding the currently computing layers does any work
- the overhead of copying data between devices at the slice boundary (discussed above)
Pipeline Parallel (PP) is almost identical to a naive MP, but it solves the idling problem to a degree, by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process.
The following illustration from the GPipe paper shows first the naive MP, then PP:
It's easy to see how PP has less dead zones where GPUs are idle.
PP introduces a new hyper-parameter to tune called `chunks`, which defines how many micro-batches the incoming batch is split into and pushed through the pipeline. e.g. in the 2nd diagram of the image above you can see that `chunks=4`.

With `chunks=1` you end up with the naive MP. With a very large value you will find that the overhead of slicing the tensors will slow everything down. So one has to experiment to find the best value. It's also important to remember that to take advantage of the GPU you need largish batches, ideally in multiples of 8.

So if the normal batch size is `bs=64` and `chunks=8`, then each stage will receive a micro-batch of `8`. However, if you're tight on memory in the first place you may end up with a normal `bs=8`, and then if you choose `chunks=4` you will end up with micro-batches of just `2` - which would be very inefficient. Also `bs=8` and `chunks=3` won't go too well together either, as you will end up with uneven micro-batches of `[3,3,2]`
While the diagram shows that there is a bubble of "dead" time that can't be parallelized, because the last `forward` stage has to wait for `backward` to complete the pipeline, the purpose of finding the best value for `chunks` is to enable a high concurrent GPU utilization across all participating GPUs.
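For illustration, a minimal sketch of the PyTorch implementation, assuming a PyTorch version that still ships the experimental `torch.distributed.pipeline.sync.Pipe` API - note that the model has to be expressed as an `nn.Sequential`, as discussed under Problems below:

```python
import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)     # Pipe requires the RPC framework to be initialized

# each pipeline stage must already live on its device
stage0 = nn.Linear(1024, 4096).to("cuda:0")
stage1 = nn.Linear(4096, 1024).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=8)   # bs=64 below -> micro-batches of 64/8 = 8

x = torch.randn(64, 1024).to("cuda:0")
out = model(x).local_value()                     # Pipe returns an RRef; fetch the local tensor
out.sum().backward()
```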
Problems:
- the model has to be modified quite heavily, because the pipeline requires rewriting the normal flow of modules into an `nn.Sequential` sequence of the same, which may require changes to the design of the model.

Implementations:
- PyTorch (`torch.distributed.pipeline.sync.Pipe`)
- DeepSpeed
- Megatron-LM
Other approaches:
SageMaker introduces the concept of an Interleaved Pipeline
Here the bubble (idle time) is further minimized by prioritizing backward passes.
According to the same document, it might be able to automate the conversion of a model into a pipeline.
The only problem is that this is currently only available at AWS, so you can't run it on your own hardware.
Model Parallel (Horizontal)
Megatron-LM
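The core idea of Megatron-LM-style horizontal (tensor) parallelism is to split individual weight matrices across GPUs, so that each GPU computes a slice of each layer's output and the slices are then combined. A conceptual sketch of a column-parallel linear layer - illustrative only, not Megatron-LM's actual code:

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each GPU holds a slice of the weight's output columns and computes a slice of the output."""
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features // world_size, in_features))

    def forward(self, x):
        partial = x @ self.weight.t()                        # this GPU's slice of the output
        slices = [torch.empty_like(partial) for _ in range(dist.get_world_size())]
        dist.all_gather(slices, partial)                     # collect the other GPUs' slices
        return torch.cat(slices, dim=-1)                     # full output, as if the layer weren't split
```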
2D Parallelism
The following diagram from the DeepSpeed pipeline tutorial demonstrates how one combines DP with PP.
Here it's important to see how DP rank 0 doesn't see gpu2 and DP rank 1 doesn't see gpu3. To DP there are just gpus 0 and 1, where it feeds data as if there were only 2 gpus. gpu0 "secretly" offloads some of its load to gpu2 using PP, and gpu1 does the same by enlisting gpu3 to its aid.
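A toy sketch of one possible rank-to-grid mapping that matches this picture (the exact mapping varies by framework):

```python
world_size = 4    # gpus 0..3
dp_size = 2       # 2 data-parallel replicas (what DP "sees")
pp_size = 2       # each replica is split into 2 pipeline stages

for rank in range(world_size):
    dp_rank = rank % dp_size      # which DP replica this GPU belongs to
    pp_stage = rank // dp_size    # which pipeline stage of that replica it holds
    print(f"gpu{rank}: DP rank {dp_rank}, PP stage {pp_stage}")

# gpu0: DP rank 0, PP stage 0   <- visible to DP
# gpu1: DP rank 1, PP stage 0   <- visible to DP
# gpu2: DP rank 0, PP stage 1   <- "secretly" helps gpu0 via PP
# gpu3: DP rank 1, PP stage 1   <- "secretly" helps gpu1 via PP
```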
XXX: will update this section once I get it working
3D Parallelism
FlexFlow
FlexFlow also solves the parallelization problem, taking a slightly different approach.
Paper: "Beyond Data and Model Parallelism for Deep Neural Networks" by Zhihao Jia, Matei Zaharia, Alex Aiken
It performs a sort of 4D Parallelism over Sample-Operator-Attribute-Parameter.
They are also working on Pipeline Parallelism. I guess ZeRO-DP would be Sample+Parameter in this context.
The significance of this framework is that it takes resources like (1) GPU/TPU/CPU vs. (2) RAM/DRAM vs. (3) fast intra-connect vs. slow inter-connect, and it automatically optimizes over all of these, algorithmically deciding which parallelization to use where.
One very important aspect is that FlexFlow is designed for optimizing DNN parallelizations for models with static and fixed workloads, since models with dynamic behavior may prefer different parallelization strategies across iterations.
So the promise is very attractive - it runs, say, a 30-minute simulation on the cluster of choice and comes up with the best strategy to utilise this specific environment. If you add/remove/replace any parts, it will re-run and re-optimize the plan for the new setup. And then you can train. A different setup will have its own custom optimization.