Questions about implementing model parallelism in the inference engine #1161
Comments
Hi @hyunwoongko, The tensor slicing for inference does not happen in the engine class; it is done in the replace_module utility and the ReplaceWithTensorSlicing class. The model parallelism is based on Megatron-style tensor slicing, and it works for various model architectures, not just GPT-Neo. Thanks,
For reducing the results across the different GPUs, we use all_reduce in the inference API. The model-parallel group is also created in the inference engine and passed to the inference module.
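To make the slicing-plus-all_reduce pattern concrete, here is a minimal, hypothetical sketch of Megatron-style intra-layer parallelism for a single linear layer. The class name, shapes, and slicing scheme are illustrative only and are not DeepSpeed's actual replace_module / ReplaceWithTensorSlicing code.

```python
import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    """Illustrative row-parallel linear layer: each rank holds a slice of the
    weight along the input dimension and the partial outputs are summed with
    all_reduce. Hypothetical sketch, not DeepSpeed's implementation."""

    def __init__(self, full_weight: torch.Tensor, mp_group=None):
        super().__init__()
        world_size = dist.get_world_size(group=mp_group)
        rank = dist.get_rank(group=mp_group)
        # Slice the weight along the input (column) dimension: [out, in // world_size]
        in_per_rank = full_weight.shape[1] // world_size
        shard = full_weight[:, rank * in_per_rank:(rank + 1) * in_per_rank]
        self.weight = torch.nn.Parameter(shard.contiguous())
        self.mp_group = mp_group

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        # x_shard is the matching slice of the input: [..., in // world_size]
        partial = torch.nn.functional.linear(x_shard, self.weight)
        # Sum the partial results across the model-parallel group
        dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=self.mp_group)
        return partial
```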
Thanks for your kind reply. :) My question is, why doesn't the memory on each GPU decrease after slicing the model in two? For example, if we slice a 30GB model into two, the amount of memory allocated on at least one device should be less than 30GB, even if it is not reduced all the way to 15GB. Thanks,
The problem is that the model gets created on both GPUs by HuggingFace initially, and that takes the model's total memory, which in your case is 30GB. However, in DeepSpeed we partition the parameters and send the corresponding parts to each GPU. So, some of the initially allocated memory is just cached on the GPU and never used. I have not spent time releasing that memory in the HuggingFace way of initializing a model. By the way, I think you can still use that allocated memory; the issue is that nvidia-smi does not show the amount of free memory precisely.
I have used torch's memory-management API (https://pytorch.org/docs/stable/cuda.html#memory-management) and it shows the memory reduction when using model-parallel 2.
I think nvidia-smi is just showing all the cached memory! Could you please print the allocated memory with that API on your side and check?
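For readers following along, a minimal sketch of how allocated versus reserved (cached) CUDA memory can be inspected with torch's memory-management API; the helper name and the per-rank printing are just one possible way to do it.

```python
import torch
import torch.distributed as dist

def report_gpu_memory(tag: str = "") -> None:
    """Print allocated vs. reserved (cached) CUDA memory for the current rank.

    nvidia-smi reports the reserved pool, so it can stay high even after the
    model parameters have been partitioned across GPUs.
    """
    rank = dist.get_rank() if dist.is_initialized() else 0
    device = torch.cuda.current_device()
    allocated_gb = torch.cuda.memory_allocated(device) / 1024 ** 3
    reserved_gb = torch.cuda.memory_reserved(device) / 1024 ** 3
    print(f"[rank {rank}] {tag} allocated={allocated_gb:.2f} GB "
          f"reserved={reserved_gb:.2f} GB")

# Optionally release cached-but-unused blocks so nvidia-smi reflects
# the partitioned size more closely:
# torch.cuda.empty_cache()
```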
Thanks for the very kind reply. Can't I just load the model on the CPU first? If I deploy a large model like Blender (9B+) or T5 (10B+) and the Hugging Face model is loaded onto the GPU first, memory allocation will fail. Thanks
I succeeded with CPU-to-GPU parallelization. Thanks!
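A rough sketch of the CPU-first loading pattern being asked about, assuming the deepspeed.init_inference entry point and the argument names used in the DeepSpeed inference examples of this period; the model name is a placeholder and the exact arguments may differ across versions.

```python
# Hypothetical sketch: load the HuggingFace model on CPU first, then let the
# inference engine shard it onto the GPUs. Argument names follow the DeepSpeed
# inference examples and may differ across versions.
import torch
import deepspeed
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-3b"  # placeholder; the discussion mentions 9B+/10B+ models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # stays on CPU here

# DeepSpeed partitions the parameters and moves each shard to its GPU.
engine = deepspeed.init_inference(
    model,
    mp_size=2,              # number of model-parallel GPUs (assumed arg name)
    dtype=torch.float16,
    replace_method="auto",  # inject the tensor-sliced / fused modules
)
```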
@hyunwoongko any insight into how you did it?
I developed totally new tools for model parallelism. |
ok, thanks! |
@hyunwoongko, thanks for pushing to solve all these issues. Please let me know when you have finished opening this up, as I am also eager to see your approach. Best,
https://github.com/tunib-ai/parallelformers We present parallelformers, a novel framework for model parallelization 🎉 |
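For context, parallelformers exposes a one-call parallelize() API along the lines of the sketch below (reconstructed from memory of its README; check the repository for the exact signature and input handling).

```python
# Minimal usage sketch of parallelformers, assuming the parallelize() API
# shown in its README; exact arguments may differ by version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

# Slice the model across 2 GPUs; the model is loaded on CPU first, so it
# never needs to fit on a single GPU in full.
parallelize(model, num_gpus=2, fp16=True)

# In the parallelformers examples the inputs are passed as-is; device
# handling may vary by version.
inputs = tokenizer("Model parallelism lets us", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40)
print(tokenizer.decode(outputs[0]))
```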
Thanks for sharing this. I will look into it and let you know if I have questions. |
Hi @hyunwoongko, Thanks for sharing the great work you did on parallelizing the transformers models. I think we can use a lot of this in DeepSpeed to parallelize the different models. I would really appreciate it if we could work together to bring the parallel implementation design on your side into DeepSpeed and merge it with the high-performance kernels for the different models. Thanks,
By the way, you will see that with this PR the GPU memory (even the cached portion) no longer increases the way it did before.
Thanks for the positive reply. @stas00 and I were discussing the integration of parallelformers and transformers here. We are also thinking of this as part of the DeepSpeed and Transformers integration, and we would like to work on it with the DeepSpeed team. For example, you and I could implement and commit it in DeepSpeed, while Stas and I make use of it in HuggingFace Transformers. Also, as you said, I'm thinking hard about how to use the fused kernels with the mechanism I'm currently implementing on my side. If this works, we can get the speed of the fused kernels and the scalability of my implementation at the same time. In any case, we need to discuss how to collaborate. How would you like to work? Thanks,
In addition, I also want to implement a training feature. What do you think? I think many people want to train Transformers models with the tensor-MP method. I hope that ultimately all models in Transformers will support 3D parallelization through ZeRO + pipeline parallelism + tensor MP.
@RezaYazdaniAminabadi |
I'm closing this issue because it's too old and I'll discuss it in a new issue. (#1248) |
Hi @hyunwoongko Thanks for the great discussion. I am certainly interested. Let me also discuss this internally, and we can go on with the collaboration soon :) Thanks, |
@RezaYazdaniAminabadi |
Hi @hyunwoongko Thanks for checking in :) Thanks, |
Sure! Why don't we arrange the meeting via email? |
Just sent an invite through email. Thanks |
Could you summarize what you said at the meeting? |
And since this issue is closed, how about discussing it in a new issue?
Yes, better to open another issue, I will send the summary in email. Thanks, |
I already opened a new issue!
Hello. I would like to ask about the model-parallelism feature in the inference engine. In general, the kinds of model parallelism I can think of are inter-layer model parallelism, as in GPipe (only the partitioning part, not the pipelining), and intra-layer model parallelism, as in Megatron-LM.
(See DeepSpeed/deepspeed/inference/engine.py, line 102 at commit aa16828.)
However, in the current implementation of the inference engine, it seems that all parameters are broadcast from device 0 when no mpu is passed in. As far as I know, that is not a form of parallelism in which the model is actually sliced.
(See DeepSpeed/deepspeed/inference/engine.py, line 213 at commit aa16828.)
In this part of the code, it seems that the input is also broadcast to all devices. So I wonder why DeepSpeed broadcasts everything in these places when no mpu is passed in.
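For readers following the question, the pattern being described is roughly the broadcast-from-rank-0 scheme sketched below; this is an illustration of the question, not the actual engine.py code.

```python
# Illustrative sketch of the broadcast pattern being questioned: when no mpu
# is supplied, every parameter (and the inputs) is simply replicated from
# rank 0 to all ranks, so each GPU ends up holding the full model rather
# than a slice of it. NOT the actual engine.py code.
import torch
import torch.distributed as dist

def replicate_from_rank0(model: torch.nn.Module, inputs: torch.Tensor) -> torch.Tensor:
    for p in model.parameters():
        dist.broadcast(p.data, src=0)  # full copy of every weight on every rank
    dist.broadcast(inputs, src=0)      # inputs replicated as well
    return model(inputs)               # every rank runs the full forward pass
```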