Tensor-Parallelism general support #1512
Conversation
I will review this today and apply it to oslo.
    new_embedding.weight.data.copy_(data)
    return new_embedding


def update_mp_params(child):
@RezaYazdaniAminabadi @stas00 This code works well for a few cases, but I don't think this structure will scale to 70 models. Is there a more efficient way?
cc @jaketae, do you have any ideas for this?
It's true that this part needs some refactoring. Please let me know if you have any ideas.
I left a comment on the issue; please take a look: huggingface/transformers#13690 (comment)
@@ -299,18 +338,126 @@ def transpose(data):
    new_module.output_b.data = _4hh_b
    return new_module


def replace_wo_policy(module, all_reduce_linears):
Is the strategy of this method to apply a column slice by default, and to apply a row slice to the specific layers whose names are passed in?
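(For readers following along, here is a minimal sketch of the column/row slicing being discussed, under the usual Megatron-style convention; shard_linear_weight is a hypothetical helper, not code from this PR.)

```python
import torch

def shard_linear_weight(weight: torch.Tensor, rank: int, world_size: int,
                        row_parallel: bool = False) -> torch.Tensor:
    # nn.Linear stores weight as [out_features, in_features].
    # Column-parallel layers split the output dimension (dim 0) and need no
    # communication on the way in; row-parallel layers split the input
    # dimension (dim 1) and require an all-reduce over their partial outputs.
    dim = 1 if row_parallel else 0
    return torch.chunk(weight, world_size, dim=dim)[rank].contiguous()
```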
Can't we automate this a bit more? It would be nice to have a strategy that doesn't require parameter names at all.
@RezaYazdaniAminabadi You probably thought about this more than I have while building it. I'm curious about your opinion.
I've been thinking briefly: what about a profiling strategy? (a rough sketch follows this list)
- First, replace the forward function of each Linear or Conv1D module with a profiling_forward function that measures when the layer is forwarded, and add a get_first_forwarded_time function that returns the time of the first forward. Once this value exists, profiling_forward no longer measures time.
- Second, forward the module once and read the timestamps via get_first_forwarded_time.
- All forwarded Linear or Conv1D layers except the last forwarded layer are considered column-parallel.
- The last forwarded Linear or Conv1D layer is considered row-parallel.
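A minimal sketch of this profiling idea, assuming plain PyTorch nn.Linear modules; the helper names (attach_profiling_forward, get_first_forwarded_time) mirror the proposal and are purely illustrative:

```python
import time
import torch.nn as nn

def attach_profiling_forward(model: nn.Module) -> None:
    # Wrap each Linear's forward so the first call records a wall-clock
    # timestamp; subsequent calls skip the measurement entirely.
    for child in model.modules():
        if isinstance(child, nn.Linear):
            child._first_forwarded_time = None
            orig_forward = child.forward

            def profiling_forward(*args, _child=child, _orig=orig_forward, **kwargs):
                if _child._first_forwarded_time is None:
                    _child._first_forwarded_time = time.time()
                return _orig(*args, **kwargs)

            child.forward = profiling_forward

def get_first_forwarded_time(child: nn.Module):
    # Timestamp of the layer's first forward, or None if it never ran.
    return getattr(child, "_first_forwarded_time", None)
```

After one dry-run forward pass, sorting the profiled layers by timestamp would mark all but the last as column-parallel and the last as row-parallel, as proposed.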
It seems more flexible to just use torch.fx for this. I'll start automating the whole process of tensor & pipeline parallelization using torch.fx.
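A hedged sketch of what the torch.fx route could look like: symbolically trace the model and read the Linear call order straight from the graph. Real HF models typically need their specialized fx tracer, so treat this as illustrative only, not the eventual implementation:

```python
import torch.fx
import torch.nn as nn

def ordered_linear_modules(model: nn.Module):
    # Symbolic tracing records module calls in execution order, so the graph
    # itself tells us which Linear runs last (the all-reduce candidate),
    # without a timing-based dry run.
    traced = torch.fx.symbolic_trace(model)
    return [node.target for node in traced.graph.nodes
            if node.op == "call_module"
            and isinstance(model.get_submodule(node.target), nn.Linear)]
    # e.g. slice all but names[-1] column-wise, and names[-1] row-wise
```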
I'm missing the full context. Do you suggest having a policy record for each model, like in the example you have shown here: #1512 (comment)?
It'd help to see several full examples; then it's much easier to see how they can be integrated.
For example, I started integrating DeepSpeed-Inference in huggingface/transformers#14426
after studying a few examples here: microsoft/DeepSpeedExamples#144
That way I can see what's common, what's unique, and which code sections are the driver and need to go into the Trainer loop.
Monkey-see, monkey-do style is the easiest, w/o needing to figure out all the low-level details.
Does that make sense?
> I will write some example code after deployment so that you can easily apply it.

Yes, please.
As I reported to you originally, it didn't appear that different OSLO components can be integrated separately; each seems to require all the other OSLO components to work.
So DeepSpeed-Inference I can integrate into the HF Trainer relatively easily, since it doesn't require anything other than wrapping the model; we just need to figure out a few MPU quirks. With OSLO I have no idea how to do it, because what I tried didn't work.
But let's not derail this PR; let's discuss OSLO either on OSLO or HF Transformers issues. This PR is about DeepSpeed-Inference.
This has nothing to do with DeepSpeed, so let's discuss it in the Transformers issue:
huggingface/transformers#13690
> I've been thinking briefly: what about a profiling strategy? […]
I think it is still not so easy to find which layer should use all-reduce, as it can depend on the architecture. But I may be missing something here. Maybe we can have an offline chat about this? Thanks
@RezaYazdaniAminabadi Yes, an offline chat would be better. When would suit you?
if len(linear_layer_setting) == 2:
    linear_policies.update({linear_layer_setting[1]: _slice_embedding})
else:
    if orig_layer_impl is HFGPT2LayerPolicy._orig_layer_class:
Rather than doing this, how about loading the module object and checking whether it is a Conv1D? In the future, models other than GPT2 may also use Conv1D modules.
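A small sketch of this suggestion, assuming nothing beyond what transformers exposes (the import location of Conv1D has moved across versions, hence the fallback):

```python
import torch.nn as nn

try:
    from transformers.pytorch_utils import Conv1D   # newer transformers releases
except ImportError:
    from transformers.modeling_utils import Conv1D  # older releases

def uses_conv1d(module: nn.Module) -> bool:
    # Detect HF's Conv1D (a transposed-weight linear) by type rather than
    # by special-casing the GPT2 policy class.
    return any(isinstance(m, Conv1D) for m in module.modules())
```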
That's a good point. Thanks @hyunwoongko
try:
    import transformers
    conv_linear_layer = True
    linear_policies = {transformers.model_utils.Conv1D: _replace}
Why did you set GPT2 not to slice embeddings?
Any special reason?
Please tell me if I misunderstood.
That embedding is not part of the layer, but of the model. What I am slicing here are the transformer layers; the embedding is just a small part of the model.
except ImportError:
    linear_policies = {nn.Linear: _replace}
else:
    linear_policies = {nn.Linear: _replace, nn.Embedding: _slice_embedding}
How do you differentiate between positional embedding and token embedding?
I used model.get_input_embeddings(); it is useful in this case.
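A sketch of that check, with should_slice_embedding as an illustrative helper name rather than code from this PR:

```python
import torch.nn as nn

def should_slice_embedding(model: nn.Module, child: nn.Module) -> bool:
    # Only the token-embedding table returned by get_input_embeddings() is
    # sliced; positional embeddings are also nn.Embedding but are left intact.
    return isinstance(child, nn.Embedding) and child is model.get_input_embeddings()
```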
Thanks a lot @hyunwoongko for the thorough review on this PR. I will use some of your feedback to make this stronger.
This PR provides support for model parallelism during inference without the need for injecting kernels.
To do: add the companion PR in the DeepSpeed-Examples branch and verify the tensor-parallelism functionality on different model architectures.
cc: @stas00 @hyunwoongko
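A hedged usage sketch of what this enables; the argument names reflect the init_inference API around the time of this PR and may differ in later releases:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
# Tensor-parallel inference without kernel injection: DeepSpeed slices the
# Linear/Conv1D/Embedding modules across mp_size ranks
# (launch with e.g. `deepspeed --num_gpus 2 script.py`).
model = deepspeed.init_inference(model,
                                 mp_size=2,
                                 dtype=torch.float16,
                                 replace_with_kernel_inject=False)
```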