Autotp training #6922

inkcherry · 2025-01-02T03:22:50Z

FYI @tjruwase @GuanhuaWang @delock @skyshine102. Context: #5445

Changes/support:

Auto tensor parallel (AutoTP) training for HF models (ZeRO compatible; I have only tested ZeRO-1 so far).
Distributed checkpoint save (UCP is not supported).
HF model file save (set gather_16bit_weights_on_model_save=True in the DS config).
Dataloader check.
Unit tests.
TP layer refactor based on an abstract layer design.

HF trainer dependency:
transformers: https://github.com/inkcherry/transformers/tree/ds_tp
accelerate: https://github.com/inkcherry/accelerate/tree/ds_tp
I can send these changes upstream once DeepSpeed supports these APIs.

Usage:
Users do not need to modify client code; enabling the feature only requires changing settings in the config file.
Below is example code for fine-tuning a LLaMA 2 model (SFT). It supports ZeRO-3/FSDP training and enables TP training by simply adjusting the configuration:

https://github.com/inkcherry/stanford_alpaca/commits/tp_demo_1127/
This branch contains three commits; the last two were added for quick experiments and logging.
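For illustration, here is a minimal sketch of what such a config change could look like. Only the option names gather_16bit_weights_on_model_save, autotp_size and the tensor_parallel scope appear in this PR; their exact nesting, the TP degree of 4, and the batch settings are assumptions made for the example, so please check them against the merged config schema.

```python
import json

# Hypothetical TP degree chosen for the example.
TP_SIZE = 4

# Sketch of a DeepSpeed config dict. The nesting of the TP-related keys is an
# assumption based on the option names mentioned in this PR.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,  # the only ZeRO stage reported as tested together with TP
        "gather_16bit_weights_on_model_save": True,  # needed to save HF model files
    },
    "tensor_parallel": {"autotp_size": TP_SIZE},
}


def to_json(config):
    """Serialize the config the way it would be stored in a ds_config.json file."""
    return json.dumps(config, indent=2)


print(to_json(ds_config))
```

With this approach the training script itself stays unchanged; switching between ZeRO-only and ZeRO+TP runs is a matter of editing the dict (or the JSON file) above.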
Results
Loss curve (gbs=16):
zero3 (baseline)

tp (this PR)

zero1 vs. zero1+tp (ZeRO compatible)

Performance (for your reference only):
zero3 (no acceleration enabled): 18GB, 2.3s/it
zero1: 38GB, 1.30s/it
zero1+tp: 24GB, 1.66s/it
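To put the numbers above in relative terms, the small calculation below compares zero1+tp against plain zero1. It uses only the figures listed in this description; the helper name is made up for the example.

```python
# (memory in GB, seconds per iteration), copied from the list above.
runs = {
    "zero3": (18, 2.3),
    "zero1": (38, 1.30),
    "zero1+tp": (24, 1.66),
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old` (negative means a reduction)."""
    return (new - old) / old * 100.0


mem_tp, time_tp = runs["zero1+tp"]
mem_z1, time_z1 = runs["zero1"]

mem_change = relative_change(mem_tp, mem_z1)     # roughly -36.8 %
time_change = relative_change(time_tp, time_z1)  # roughly +27.7 %

print(f"zero1+tp vs zero1: memory {mem_change:+.1f}%, step time {time_change:+.1f}%")
```

In this single measurement, adding TP to zero1 trades about a quarter more step time for roughly a third less memory.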
Extension:
I think async-TP, Domino, etc. can be implemented by inheriting from a layer class and overriding its fwd/bwd methods; the gather/partition logic can be reused. (Please correct me if I am wrong.)

Complex sharding can also be achieved through independent partition and gather implementations. Partitioning is mandatory, while gathering is required for training.
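A rough sketch of the partition/gather pairing described above, using plain Python lists in place of tensors. The class and method names here are hypothetical and do not mirror the classes in layers.py; the point is only that each sharding scheme supplies a partition and a matching gather that inverts it.

```python
from abc import ABC, abstractmethod


class ShardedLayerSketch(ABC):
    """Hypothetical base class: knows how to shard and un-shard a 2D weight."""

    def __init__(self, tp_index, tp_world_size):
        self.tp_index = tp_index
        self.tp_world_size = tp_world_size

    @abstractmethod
    def partition(self, weight):
        """Return this rank's shard of the full weight (a list of rows)."""

    @abstractmethod
    def gather(self, shards):
        """Rebuild the full weight from the shards of all ranks, in rank order."""


class RowShard(ShardedLayerSketch):
    """Splits the weight along rows (dim 0)."""

    def partition(self, weight):
        rows = len(weight) // self.tp_world_size
        start = self.tp_index * rows
        return weight[start:start + rows]

    def gather(self, shards):
        return [row for shard in shards for row in shard]


class ColumnShard(ShardedLayerSketch):
    """Splits the weight along columns (dim 1)."""

    def partition(self, weight):
        cols = len(weight[0]) // self.tp_world_size
        start = self.tp_index * cols
        return [row[start:start + cols] for row in weight]

    def gather(self, shards):
        full = []
        for i in range(len(shards[0])):
            row = []
            for shard in shards:
                row.extend(shard[i])
            full.append(row)
        return full


weight = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
for cls in (RowShard, ColumnShard):
    layers = [cls(rank, 2) for rank in range(2)]
    shards = [layer.partition(weight) for layer in layers]
    # gather must be the exact inverse of partition
    assert layers[0].gather(shards) == weight
```

A subclass that changes only the communication pattern in forward/backward could reuse these two methods unchanged.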
TODO:
Embedding vocab parallel
Currently, embedding parallelism is primarily hidden_dim parallelism combined with allreduce. This approach takes advantage of efficient reduction kernels, and its use is not mandatory.
In training, however, vocab parallelism is the more common method. Enabling it by default could save a certain amount of GPU memory.
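For readers unfamiliar with vocab parallelism, the single-process simulation below sketches the general idea: each rank stores only a contiguous slice of the vocabulary rows, returns zeros for tokens it does not own, and a sum all-reduce reconstructs the full lookup. This is a generic illustration with made-up names, not the implementation planned for this TODO.

```python
def vocab_parallel_lookup(token_ids, table, tp_size):
    """Simulate a vocab-parallel embedding lookup across `tp_size` ranks."""
    vocab, hidden = len(table), len(table[0])
    assert vocab % tp_size == 0, "sketch assumes an evenly divisible vocabulary"
    per_rank = vocab // tp_size

    partials = []
    for rank in range(tp_size):
        lo, hi = rank * per_rank, (rank + 1) * per_rank
        local_table = table[lo:hi]  # the only rows this rank keeps in memory
        out = []
        for t in token_ids:
            if lo <= t < hi:
                out.append(list(local_table[t - lo]))  # token owned by this rank
            else:
                out.append([0.0] * hidden)             # masked out: contributes zeros
        partials.append(out)

    # Stand-in for an all-reduce (sum) over the TP group.
    return [
        [sum(p[i][j] for p in partials) for j in range(hidden)]
        for i in range(len(token_ids))
    ]


# Toy table: 8 vocabulary entries, hidden size 3.
table = [[float(v * 10 + h) for h in range(3)] for v in range(8)]
tokens = [5, 0, 7]
assert vocab_parallel_lookup(tokens, table, tp_size=4) == [table[t] for t in tokens]
```

Each rank holds vocab/tp_size rows of the table, which is where the memory saving mentioned above would come from.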

Thanks to @delock for the guidance.
I also verified inference with CPU inference workloads (the Optimized Model List in https://github.com/intel/intel-extension-for-pytorch/tree/main).
Many thanks to @xuguangxin, @ikurtchen, @rogerxfeng8, @Yejing-Lai, @ys950902, and others for helping review and address matters related to inference.

GuanhuaWang

Hi @inkcherry, @delock,
Sorry for the delay. I just left some comments. Thanks

GuanhuaWang · 2025-01-16T00:43:47Z

deepspeed/module_inject/layers.py

+        if is_inference_mode:
+            dist.inference_all_reduce(input, group=group)
+        else:
+            dist.all_reduce(input.contiguous(), group=group)


Is there any reason for input.contiguous()?

inkcherry · Jan 16, 2025

It seems safer to add it, since it helps avoid problems with non-contiguous tensors produced by transpose/permute.
FYI: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/tensor_parallel/mappings.py#L23

I am not very clear on the implementation details of inference_all_reduce, so I have kept the original dist.inference_all_reduce code path.

hwchen2017 · Jan 24, 2025

If the tensor is already contiguous, .contiguous() will not launch an additional memory copy kernel, so it won't hurt performance.
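As background for this thread, the sketch below models why a transposed view is no longer contiguous, using only shapes and strides. It is a simplified pure-Python model with made-up helper names; the real PyTorch check also has special handling for size-1 dimensions, which this sketch ignores.

```python
def row_major_strides(shape):
    """Strides (in elements) of a densely packed row-major array with this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """Simplified check: the layout is contiguous if it matches row-major strides."""
    return tuple(strides) == row_major_strides(shape)


def transpose(shape, strides, d0, d1):
    """A transpose is a view: it swaps shape and stride entries, moving no data."""
    shape, strides = list(shape), list(strides)
    shape[d0], shape[d1] = shape[d1], shape[d0]
    strides[d0], strides[d1] = strides[d1], strides[d0]
    return tuple(shape), tuple(strides)


shape = (2, 3)
strides = row_major_strides(shape)          # (3, 1)
t_shape, t_strides = transpose(shape, strides, 0, 1)

print(is_contiguous(shape, strides))        # freshly allocated layout
print(is_contiguous(t_shape, t_strides))    # transposed view of the same memory
```

A collective that reads the buffer as one dense block would see elements in the wrong order for the transposed view, which is the situation the defensive .contiguous() call guards against.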

GuanhuaWang · 2025-01-16T00:58:40Z

deepspeed/module_inject/layers.py

+    @staticmethod
+    def symbolic(graph, input):
+        """Symbolic function for tracing."""
+        return dist.all_reduce(input.contiguous(), dist.get_tensor_model_parallel_group())


Similarly here, is this contiguous() necessary?

inkcherry · Jan 16, 2025

It is here for the same reason as in the previous case.

GuanhuaWang · 2025-01-16T03:04:39Z

tests/unit/model_parallelism/test_autotp_training.py

+
+    @pytest.mark.parametrize("layer_type", ["linear", "linearallreduce"])
+    def test(self, layer_type):
+        tp_size = 4


Can we parametrize and test tp_size of both 2 and 4?

inkcherry · Jan 17, 2025

Thanks for the reminder, added.
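Stacking a second pytest.mark.parametrize decorator on a test runs it once per element of the Cartesian product of the two parameter lists. The plain-Python sketch below just enumerates that product for the values discussed here; the helper name is made up, and the actual test code in the PR may be organized differently.

```python
from itertools import product

LAYER_TYPES = ["linear", "linearallreduce"]  # values from the existing parametrize
TP_SIZES = [2, 4]                            # values requested in this review


def expand_cases(tp_sizes, layer_types):
    """List every (tp_size, layer_type) combination the test would run with."""
    return [
        {"tp_size": tp, "layer_type": layer}
        for tp, layer in product(tp_sizes, layer_types)
    ]


for case in expand_cases(TP_SIZES, LAYER_TYPES):
    print(case)
```

Two layer types times two TP sizes gives four test invocations in total.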

GuanhuaWang · 2025-01-16T03:07:29Z

tests/unit/model_parallelism/test_autotp_training.py

+    reuse_dist_env = True
+
+    def test_save_original_weight(self):
+        tp_size = 4


Same here, could we parametrize both tp_size 2 and 4?

inkcherry · Jan 17, 2025

Thanks for the reminder, added.

GuanhuaWang · 2025-01-16T03:09:44Z

deepspeed/utils/groups.py

+        return
+
+    if data_parallel_size is None:
+        data_parallel_size = dist.get_world_size() // tensor_model_parallel_size


Do we need to consider pipeline_parallel_size?

inkcherry · Jan 16, 2025

Currently, this feature does not support pipeline parallelism, and pipeline-related logic will not reach this code. Perhaps we can consider adding pipeline support in the future.
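To make the data_parallel_size computation in the diff concrete, here is a small sketch of how ranks could be split into TP and DP groups when there is no pipeline dimension. The layout (consecutive ranks share a TP group) is an assumption made for the example, and the function name is hypothetical; groups.py may arrange ranks differently.

```python
def build_groups(world_size, tp_size):
    """Split ranks into tensor-parallel and data-parallel groups (no pipeline stage)."""
    assert world_size % tp_size == 0, "world size must be divisible by the TP size"
    dp_size = world_size // tp_size  # same formula as in the diff above

    # Assumed layout: consecutive ranks form one TP group.
    tp_groups = [list(range(i * tp_size, (i + 1) * tp_size)) for i in range(dp_size)]
    # Ranks holding the same shard position across TP groups form one DP group.
    dp_groups = [list(range(j, world_size, tp_size)) for j in range(tp_size)]
    return dp_size, tp_groups, dp_groups


dp_size, tp_groups, dp_groups = build_groups(world_size=8, tp_size=4)
print(dp_size)     # 2
print(tp_groups)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups)   # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With pipeline parallelism, the world size would have to be divided by the pipeline degree as well, which is the case this PR leaves for future work.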

GuanhuaWang · 2025-01-16T03:19:15Z

deepspeed/runtime/tensor_parallel/tp_manager.py

+        self.tp_config = TPConfig()
+        self.tp_config.tp_size = tp_size
+        if tp_size <= 1:
+            self.tp_config.enabled = False


I don't see this flag used anywhere (i.e. there seems to be no design/code for the case enabled == False). Is it needed?

inkcherry · Jan 16, 2025

Thank you for pointing that out. It's not necessary; I was following the inference config. I have removed it now.

GuanhuaWang · 2025-01-16T03:29:09Z

deepspeed/runtime/engine.py

+        Returns:
+        OrderedDict: The consolidated state dictionary if the current process rank is 0, otherwise None.
+        """
+        #TODO: If we use both Zero3 and tensor parallel simultaneously


I also don't see why we need to gather weights/params in TP training/inference. If it is only used to re-collect weights for a single-point checkpoint write, you can use our universal checkpoint feature to convert the model parallel strategy after training.

GuanhuaWang · 2025-01-16T03:30:17Z

deepspeed/runtime/tensor_parallel/config.py

+    tp_size: int = 1
+    """ Number of devices to split the model across using tensor parallelism. """
+
+    tp_grain_size: int = 64


I also did not see any use of this argument.

inkcherry · Jan 16, 2025

The variable is used in the autoTP parser to set tile boundaries to accelerate GEMM:

DeepSpeed/deepspeed/module_inject/replace_module.py, line 308 in 05eaf3d:

set_tp_grain_size(config.tensor_parallel.tp_grain_size)

It has not been activated in training yet, as it requires support for uneven gather. I have added clearer comments for better understanding.
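A rough sketch of what grain-aligned sharding means and why it leads to uneven shards. This is an illustrative helper with a made-up name and a simple distribution rule, not the function used by the autoTP parser: the dimension is cut into grains of tp_grain_size elements, and whole grains are dealt out to the ranks.

```python
def grain_aligned_shard_sizes(total, tp_size, grain=64):
    """Split `total` elements across ranks so shard boundaries fall on grain multiples."""
    grains, remainder = divmod(total, grain)
    base, extra = divmod(grains, tp_size)
    # The first `extra` ranks receive one additional grain.
    sizes = [(base + (1 if rank < extra else 0)) * grain for rank in range(tp_size)]
    # Elements that do not fill a whole grain go to the last rank.
    sizes[-1] += remainder
    return sizes


print(grain_aligned_shard_sizes(4096, 4))  # [1024, 1024, 1024, 1024]
print(grain_aligned_shard_sizes(448, 4))   # [128, 128, 128, 64]
```

The second case shows the uneven split: 448 elements are 7 grains of 64, which cannot be divided equally among 4 ranks, so gathering the shards back requires an uneven gather.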

GuanhuaWang · 2025-01-16T03:33:40Z

deepspeed/module_inject/layers.py

+class Yuan_LinearAllreduce(LinearAllreduce):
+
+    #Yuan2
+    @torch.no_grad()
+    def partition(self, params_list):
+        weight, bias = shard_value_with_share_qk(params_list[0].data, params_list[1], self.tp_index,
+                                                 self.tp_world_size, False)
+        params_list[0].data = weight
+        if bias is not None:
+            params_list[1].data = bias
+
+
+class Yuan_LinearLayer(LinearLayer):
+    #Yuan2
+    @torch.no_grad()
+    def partition(self, params_list):
+        weight, bias = shard_value_with_share_qk(params_list[0].data, params_list[1], self.tp_index,
+                                                 self.tp_world_size, True)
+        params_list[0].data = move(weight, get_accelerator().current_device_name()).detach()
+        if bias is not None:
+            params_list[1].data = move(bias, get_accelerator().current_device_name()).detach()


Is it possible to abstract the partition method, with arguments passed in for different models? That way we can avoid creating 2 new classes (e.g., Yuan_linear & Yuan_linear+allreduce) for every new model structure.

inkcherry · Jan 17, 2025

Yes, they currently only have one method. Every specific shard logic should have a corresponding reverse gather logic, and the current shard method hasn't implemented its gather yet. I think using a class helps reserve a placeholder for it and makes the code more consistent.

GuanhuaWang · 2025-01-16T03:34:41Z

deepspeed/module_inject/layers.py

+        return new_obj
+
+
+class GatherReplacedLayerParams:


Do we need to gather TP params during training or inference?

inkcherry · Jan 17, 2025

No. The reasons are explained in the comments above.

inkcherry · 2025-01-24T09:32:53Z

@tjruwase @GuanhuaWang Thank you for your review. I’ve added modifications or explanations. Could you take another look? Thanks!

hwchen2017

Hi @inkcherry, thanks for contributing. Just a heads-up: all the all_reduce calls in Domino are supposed to be asynchronous, and the current LinearAllreduce and LinearLayer need to be updated to work with Domino.
For example, in LinearAllreduce, we'd like to get the handle from the asynchronous all-reduce and synchronize it later, so that communication overlaps with computation.
The Domino work is still in progress and not finalized yet, so you don't need to worry about compatibility with Domino at this point. One thing you can easily support, though, is async TP, similar to Megatron here. Maybe it can be your next PR.
Thanks for your help!
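To illustrate the handle pattern mentioned above: in torch.distributed, calling all_reduce with async_op=True returns a work handle whose wait() is called only when the result is needed. The sketch below is a single-process analogy that uses a thread-pool future in place of that handle; the function names are made up and no real communication takes place.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_all_reduce(partials):
    """Stand-in for a sum all-reduce: element-wise sum of per-rank partial results."""
    time.sleep(0.05)  # pretend communication latency
    return [sum(values) for values in zip(*partials)]


def overlapped_step(partials, independent_work):
    """Launch the reduction, run independent compute, then synchronize."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        handle = pool.submit(fake_all_reduce, partials)  # launch without blocking
        local = independent_work()                        # overlaps with the "communication"
        reduced = handle.result()                         # synchronize only when needed
    return reduced, local


reduced, local = overlapped_step([[1, 2], [3, 4]], lambda: sum(range(10)))
print(reduced, local)
```

The design question for a TP layer is where to place the synchronization point: the later it sits relative to the launch, the more independent computation can hide the communication cost.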

inkcherry added 30 commits April 22, 2024 17:15

auto tp training

674a873

update parallel_states

a2e4c47

Merge branch 'master' into HEAD

f4eb142

WA skips assertions, the loss remains exactly consistent with the low-precision version before the rebase, but the grad norm differs (display issue)

dd081ed

save/load ckpt & save/load hf model basic POC

cdaed2f

finish all the basic functionalities

9aad0e7

update

2bb11fd

use groups for parallel_states

e75c1c2

enable bwd allreduce, enable scale loss by gas

840a5f2

add dataloader check

60bd6ab

refactor autoTP step1

9266383

rm parallel_states

07174a9

refactor autoTP step2

ee6323e

update ut step1

6461b84

update

4d73011

add uts

c79c3bb

finished all ut code base

97e659c

addllr scheduler test

a15905b

refine ut

e9802b0

fix bcast_objlist

88b8acf

refine layers.py

868be0b

refine gather

3788e07

pass codegen350M +TP2 ut

27b24f6

add mode choice

3d7b89f

fix chatglm

47a6b0b

fix chatglm2 with transformers=4.40 version

3a23997

uneven

e3ec46e

fix uneven

9685879

fix training

7b99b03

refine code

570645f

inkcherry added 13 commits January 13, 2025 19:56

format

a49e77e

use parameterized save path

0ef5274

Merge remote-tracking branch 'my/autotp_training' into autotp_training

481088d

refactor infer/training path

f740de0

format

726004d

remove empty line

bd8de77

remove autotp_size config from zero scope

c334da0

update

29eef07

format

ba47ed1

fix layer typo and rename

bbde63f

fix python3.9

bdca62c

refine code

5d89422

refine

0a9caff

GuanhuaWang reviewed Jan 16, 2025


inkcherry and others added 14 commits January 16, 2025 16:59

refine config

c923a3b

improve ut coverage for save

92be193

fix process exit early

23bd0fc

improve ut coverage

358f395

Merge remote-tracking branch 'origin/master' into autotp_training

cdfb54c

fix zero1 regression

6d030c4

Merge branch 'master' into autotp_training

f9e7756

fix ci

6e7f846

Merge branch 'autotp_training' of https://github.com/inkcherry/DeepSpeed into autotp_training

c4fde7e

skip overflow test

05bcecd

Merge branch 'master' into autotp_training

86f1c77

Skip xpu tests until the ci is updated

668cb1a

Merge branch 'autotp_training' of https://github.com/inkcherry/DeepSpeed into autotp_training

2e042a4

Merge branch 'master' into autotp_training

e08a234

hwchen2017 reviewed Jan 24, 2025



