Questions about implementing model parallelism in the inference engine #1161
Comments
Hi @hyunwoongko, The tensor slicing for inference does not happen in the engine class; it is done in the replace_module utility and the ReplaceWithTensorSlicing class. The model parallelism is based on Megatron-style tensor slicing, and it works for various model architectures, not just GPT-Neo. Thanks,
For reducing the results across the different GPUs, we use all_reduce in the inference API. The model-parallel group is also created in the inference engine and passed to the inference module.
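To make the slicing-plus-all_reduce pattern concrete, here is a minimal, hypothetical sketch of Megatron-style intra-layer parallelism for a single linear layer. The class name, shapes, and slicing scheme are illustrative only and are not DeepSpeed's actual replace_module / ReplaceWithTensorSlicing code.

```python
import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    """Illustrative row-parallel linear layer: each rank holds a slice of the
    weight along the input dimension and the partial outputs are summed with
    all_reduce. Hypothetical sketch, not DeepSpeed's implementation."""

    def __init__(self, full_weight: torch.Tensor, mp_group=None):
        super().__init__()
        world_size = dist.get_world_size(group=mp_group)
        rank = dist.get_rank(group=mp_group)
        # Slice the weight along the input (column) dimension: [out, in // world_size]
        in_per_rank = full_weight.shape[1] // world_size
        shard = full_weight[:, rank * in_per_rank:(rank + 1) * in_per_rank]
        self.weight = torch.nn.Parameter(shard.contiguous())
        self.mp_group = mp_group

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        # x_shard is the matching slice of the input: [..., in // world_size]
        partial = torch.nn.functional.linear(x_shard, self.weight)
        # Sum the partial results across the model-parallel group
        dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=self.mp_group)
        return partial
```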
Thanks for your kind reply. :) My question is, why doesn't the memory on each GPU decrease after slicing the model in two? For example, if we slice a 30GB model into two, the amount of memory allocated on at least one device should be less than 30GB, even if it is not reduced all the way to 15GB. Thanks,
The problem is that the model gets created on both GPUs by HuggingFace initially, and that takes the model's total memory, which in your case is 30GB. However, in DeepSpeed we partition the parameters and send the corresponding parts to each GPU. So, some of the initially allocated memory is just cached on the GPU and never used. I have not spent time releasing that memory in the HuggingFace way of initializing a model. By the way, I think you can still use that allocated memory; the issue is that nvidia-smi does not show the amount of free memory precisely.
I have used torch's memory-management API (https://pytorch.org/docs/stable/cuda.html#memory-management) and it shows the memory reduction when using model-parallel 2.
I think nvidia-smi is just showing all the cached memory! Could you please print the allocated memory with that API on your side and check?
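For readers following along, a minimal sketch of how allocated versus reserved (cached) CUDA memory can be inspected with torch's memory-management API; the helper name and the per-rank printing are just one possible way to do it.

```python
import torch
import torch.distributed as dist

def report_gpu_memory(tag: str = "") -> None:
    """Print allocated vs. reserved (cached) CUDA memory for the current rank.

    nvidia-smi reports the reserved pool, so it can stay high even after the
    model parameters have been partitioned across GPUs.
    """
    rank = dist.get_rank() if dist.is_initialized() else 0
    device = torch.cuda.current_device()
    allocated_gb = torch.cuda.memory_allocated(device) / 1024 ** 3
    reserved_gb = torch.cuda.memory_reserved(device) / 1024 ** 3
    print(f"[rank {rank}] {tag} allocated={allocated_gb:.2f} GB "
          f"reserved={reserved_gb:.2f} GB")

# Optionally release cached-but-unused blocks so nvidia-smi reflects
# the partitioned size more closely:
# torch.cuda.empty_cache()
```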
Thanks for the very kind reply. Can't I just load the model on the CPU first? If I deploy a large model like Blender (9B+) or T5 (10B+) and the Hugging Face model is loaded onto the GPU first, memory allocation will fail. Thanks
I succeeded with CPU-to-GPU parallelization. Thanks!
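A rough sketch of the CPU-first loading pattern being asked about, assuming the deepspeed.init_inference entry point and the argument names used in the DeepSpeed inference examples of this period; the model name is a placeholder and the exact arguments may differ across versions.

```python
# Hypothetical sketch: load the HuggingFace model on CPU first, then let the
# inference engine shard it onto the GPUs. Argument names follow the DeepSpeed
# inference examples and may differ across versions.
import torch
import deepspeed
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-3b"  # placeholder; the discussion mentions 9B+/10B+ models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # stays on CPU here

# DeepSpeed partitions the parameters and moves each shard to its GPU.
engine = deepspeed.init_inference(
    model,
    mp_size=2,              # number of model-parallel GPUs (assumed arg name)
    dtype=torch.float16,
    replace_method="auto",  # inject the tensor-sliced / fused modules
)
```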
@hyunwoongko any insight into how you did it?
I developed totally new tools for model parallelism. |
ok, thanks! |
@hyunwoongko, thanks for pushing to solve all these issues. Please let me know when you have finished opening this up, as I am also eager to see your approach. Best,
https://github.com/tunib-ai/parallelformers We present parallelformers, a novel framework for model parallelization 🎉 |
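For context, parallelformers exposes a one-call parallelize() API along the lines of the sketch below (reconstructed from memory of its README; check the repository for the exact signature and input handling).

```python
# Minimal usage sketch of parallelformers, assuming the parallelize() API
# shown in its README; exact arguments may differ by version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

# Slice the model across 2 GPUs; the model is loaded on CPU first, so it
# never needs to fit on a single GPU in full.
parallelize(model, num_gpus=2, fp16=True)

# In the parallelformers examples the inputs are passed as-is; device
# handling may vary by version.
inputs = tokenizer("Model parallelism lets us", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40)
print(tokenizer.decode(outputs[0]))
```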
Thanks for sharing this. I will look into it and let you know if I have questions. |
Hi @hyunwoongko, Thanks for sharing the great work you did on parallelizing the transformers models. I think we can use a lot of this in DeepSpeed to parallelize the different models. I would really appreciate it if we could work together to bring the parallel implementation design on your side into DeepSpeed and merge it with the high-performance kernels for the different models. Thanks,
By the way, you will see that with this PR the GPU memory (even the cached portion) no longer increases the way it did before.
Thanks for the positive reply. @stas00 and I were discussing the integration of parallelformers and transformers here. We are also thinking of this as part of the DeepSpeed and Transformers integration, and we would like to work on it with the DeepSpeed team. For example, you and I could implement and commit it in DeepSpeed, while Stas and I make use of it in HuggingFace Transformers. Also, as you said, I'm thinking hard about how to use the fused kernels with the mechanism I'm currently implementing on my side. If this works, we can get the speed of the fused kernels and the scalability of my implementation at the same time. In any case, we need to discuss how to collaborate. How would you like to work? Thanks,
In addition, I also want to implement a training feature. What do you think? I think many people want to train Transformers models with the tensor-MP method. I hope that ultimately all models in Transformers will support 3D parallelization through ZeRO + pipeline parallelism + tensor MP.
@RezaYazdaniAminabadi |
I'm closing this issue because it's too old and I'll discuss it in a new issue. (#1248) |
Hi @hyunwoongko Thanks for the great discussion. I am certainly interested. Let me also discuss this internally, and we can go on with the collaboration soon :) Thanks, |
@RezaYazdaniAminabadi |
Hi @hyunwoongko Thanks for checking in :) Thanks, |
Sure! Why don't we arrange the meeting via email? |
Just sent an invite through email. Thanks |
Could you summarize what you said at the meeting? |
And since this issue is closed, how about discussing it in a new issue?
Yes, better to open another issue, I will send the summary in email. Thanks, |
I already opened a new issue!
Hello. I would like to ask about the model-parallelism feature in the inference engine. In general, the kinds of model parallelism I can think of are inter-layer model parallelism, as in GPipe (only the partitioning part, not the pipelining), and intra-layer model parallelism, as in Megatron-LM.
(See DeepSpeed/deepspeed/inference/engine.py, line 102 at commit aa16828.)
However, in the current implementation of the inference engine, it seems that all parameters are broadcast from device 0 when no mpu is passed in. As far as I know, that is not a form of parallelism in which the model is actually sliced.
(See DeepSpeed/deepspeed/inference/engine.py, line 213 at commit aa16828.)
In this part of the code, it seems that the input is also broadcast to all devices. So I wonder why DeepSpeed broadcasts everything in these places when no mpu is passed in.
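For readers following the question, the pattern being described is roughly the broadcast-from-rank-0 scheme sketched below; this is an illustration of the question, not the actual engine.py code.

```python
# Illustrative sketch of the broadcast pattern being questioned: when no mpu
# is supplied, every parameter (and the inputs) is simply replicated from
# rank 0 to all ranks, so each GPU ends up holding the full model rather
# than a slice of it. NOT the actual engine.py code.
import torch
import torch.distributed as dist

def replicate_from_rank0(model: torch.nn.Module, inputs: torch.Tensor) -> torch.Tensor:
    for p in model.parameters():
        dist.broadcast(p.data, src=0)  # full copy of every weight on every rank
    dist.broadcast(inputs, src=0)      # inputs replicated as well
    return model(inputs)               # every rank runs the full forward pass
```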