Sharding Comm Optimization #48604
Conversation
Your PR has been submitted successfully. Thank you for your contribution to this open-source project!
* get default calc stream from execution ctx instead of global dev ctx pool.
LGTM
LGTM
PR types
Performance optimization
PR changes
Others
Describe
Optimize the communication for the sharding strategy.
Results:
A100 40GB, two nodes, 8 GPUs per node
Optim1: Bucket communication
Ring-based collectives need a sufficiently large buffer to maximize bandwidth utilization (ref).
Communicating many small tensors (e.g. the params and grads of LayerNorm and bias) therefore hurts communication performance.
Fusing several communications into one bucketed communication reduces the number of comm kernel calls, and with it the kernel-launch overhead and comm-preparation latency.
This PR offers independent configuration of the parameter-communication bucket and the gradient-communication bucket.
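As an illustration only (not the code added by this PR, which fuses buffers inside the executor), a minimal Python sketch of bucketed communication: small gradients are packed into one contiguous buffer, reduced with a single collective call, and scattered back. The helper name and the bucket-size threshold are made up for the example.

```python
import math

import paddle
import paddle.distributed as dist


def bucketed_all_reduce(grads, bucket_numel=8 * 1024 * 1024):
    """Fuse small gradients into buckets and issue one all_reduce per bucket.

    Sketch only: assumes dist.init_parallel_env() was called and that all
    grads in a bucket share the same dtype.
    """
    bucket, filled = [], 0

    def flush():
        if not bucket:
            return
        # Pack the small tensors into a single contiguous 1-D buffer.
        flat = paddle.concat([g.flatten() for g in bucket])
        dist.all_reduce(flat)  # one comm kernel for the whole bucket
        # Scatter the reduced values back into the original tensors.
        offset = 0
        for g in bucket:
            n = math.prod(g.shape)
            paddle.assign(flat[offset:offset + n].reshape(g.shape), output=g)
            offset += n

    for g in grads:
        bucket.append(g)
        filled += math.prod(g.shape)
        if filled >= bucket_numel:
            flush()
            bucket, filled = [], 0
    flush()
```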
Optim2: Timing for Bucketing
The timing of bucketing (tensor coalescence) is quite important for big-model training, for two reasons: it determines the extra memory the fused buffers occupy and the extra memcpy they may require.
This PR performs in-place tensor coalescence just before the first small tensor in a bucket is about to be communicated, avoiding redundant memory usage and memcpy.
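A hypothetical sketch of the "coalesce as late as possible" timing; the class below is illustrative only. Note that the PR performs in-place coalescence (the member tensors become views of the fused buffer), while this sketch only shows when the fused buffer gets materialized.

```python
import paddle


class LazyBucket:
    """Hypothetical illustration of delaying coalescence until needed."""

    def __init__(self, tensors):
        self.tensors = tensors
        self.flat = None  # no fused buffer yet -> no extra peak memory

    def coalesce_if_needed(self):
        # Called right before the first tensor of this bucket is communicated.
        # The PR does this in place (members become views of the fused buffer);
        # paddle.concat here copies and is only meant to show the timing,
        # not the zero-copy mechanics.
        if self.flat is None:
            self.flat = paddle.concat([t.flatten() for t in self.tensors])
        return self.flat
```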
Optim3: Gradient Communication Overlap
Baseline: without any overlap
All calc & comm kernels are scheduled on the default stream and executed sequentially.
Grad Comm overlap with backward computation:
Grad Comm overlap with update (optimizer) computation:
The Grad Comm can also be overlapped with the optimizer computation, since the optimizer update is launched as multiple kernels, one per grad bucket. This PR allows a bucket's Grad Comm to overlap with the optimizer kernels that run before that bucket's own optimizer kernel.
This optimization is not applied to the GPT model, since the GPT training strategy needs a global gradient synchronization (ClipByGlobalNorm). It can be applied to models such as BERT and ResNet, which do not need that synchronization.
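A minimal sketch of the issue-early / wait-late pattern behind this overlap, assuming the asynchronous form of paddle.distributed.all_reduce (sync_op=False) available in recent Paddle releases; in the PR the overlap is scheduled inside the standalone executor via per-bucket dependencies rather than in user code.

```python
import paddle.distributed as dist


def reduce_grad_buckets_async(grad_buckets):
    """Issue every bucket's all_reduce without blocking the calc stream."""
    # sync_op=False returns a task handle instead of waiting (recent Paddle).
    return [dist.all_reduce(b, sync_op=False) for b in grad_buckets]


def wait_grad_buckets(tasks):
    # Block only right before the reduced grads are actually consumed
    # (e.g. just before the bucket's optimizer kernel).
    for t in tasks:
        t.wait()
```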
Optim4: Parameter Communication Overlap
The current step's Param Comm overlaps with the next step's forward computation.
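A hedged sketch of the idea, again assuming the asynchronous collectives of recent Paddle releases; the helper functions and the src_ranks layout are hypothetical.

```python
import paddle.distributed as dist


def start_param_sync(sharded_params, src_ranks):
    """Kick off the parameter broadcast right after the optimizer step."""
    # Hypothetical layout: src_ranks[i] is the rank owning sharded_params[i].
    return [dist.broadcast(p, src=src, sync_op=False)
            for p, src in zip(sharded_params, src_ranks)]


def wait_param_sync(tasks):
    # Called lazily, just before the next step's forward touches the params.
    for t in tasks:
        t.wait()
```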
Optim5: Flag Communication Overlap
Several flags need to be synchronized in distributed training, such as found-nan-inf in mixed-precision training and the global norm in ClipByGlobalNorm.
This PR allows those communications to overlap with computation that has no data dependency on them. For example, if the FP16 grads are ready while the FP32 grads are still being calculated, the FP16 global-norm communication is issued so that it overlaps with the FP32 grad calculation.
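A sketch of the FP16 global-norm example, assuming asynchronous all_reduce (sync_op=False); the helper is illustrative and the actual dependency tracking is done by the executor.

```python
import paddle
import paddle.distributed as dist


def start_fp16_global_norm(fp16_grads):
    """Issue the FP16 part of the global-norm reduction as soon as the FP16
    grads are ready, so it overlaps with the ongoing FP32 grad calculation."""
    sq_sum = paddle.add_n([paddle.sum(g.astype('float32') ** 2)
                           for g in fp16_grads])
    task = dist.all_reduce(sq_sum, sync_op=False)  # async flag communication
    return sq_sum, task  # call task.wait() only when ClipByGlobalNorm needs it
```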
Optim6: Multi-Stream Communication
When communication crosses nodes over a low-bandwidth link, it becomes the bottleneck and the computation that depends on it is stalled, leaving large gaps in the calculation stream.
This PR uses multiple communicators and multiple streams so that communications can overlap with each other, reducing the gaps in the calculation stream caused by communication.
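A conceptual sketch, assuming paddle.distributed.new_group and asynchronous collectives; whether several groups over the same ranks map to distinct NCCL communicators and streams depends on the Paddle and NCCL setup, so treat this purely as an illustration of the round-robin idea.

```python
import paddle.distributed as dist

NUM_COMM_GROUPS = 2  # illustrative value

# Each group gets its own communicator (and hence its own comm stream),
# so collectives issued on different groups may overlap with each other.
world = list(range(dist.get_world_size()))
comm_groups = [dist.new_group(ranks=world) for _ in range(NUM_COMM_GROUPS)]


def multi_stream_all_reduce(buckets):
    # Round-robin buckets over the groups so several collectives are in flight.
    tasks = [dist.all_reduce(b,
                             group=comm_groups[i % NUM_COMM_GROUPS],
                             sync_op=False)
             for i, b in enumerate(buckets)]
    for t in tasks:
        t.wait()
```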
Optim7: Memcpy Overlap
Replace memcpy_sync with memcpy_async and schedule them on a dedicated memcpy stream.
Optim8: In-place memory reusing
Compared with no overlap, multi-stream overlapping leads to higher peak memory usage.
The reason is that allocations used across streams cannot be reused immediately, because the single-stream fast-garbage-collection assumption is broken (ref). This can lead to OOM when training big models.
To deal with this problem, we apply an in-place memory-reusing strategy. Some tensors are not used after a computation, so their allocations can be reused by the next computation. This alleviates the peak memory usage caused by overlapping and saves the allocator the overhead of searching for free space in the memory pool, especially when training big models, where there are thousands of small tensors and the pool is fragmented.
Supported patterns: elementwise-add (bias-add, residual-add) and reshape.
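For the memory effect of these patterns, a small illustration using Paddle's eager in-place (`*_`) APIs; the PR itself applies the reuse as an executor-level pass rather than through user-visible in-place calls.

```python
import paddle

x = paddle.randn([4, 1024])
bias = paddle.randn([1024])
residual = paddle.randn([4, 1024])

# In-place variants write the result into x's existing allocation instead of
# allocating a fresh output, lowering peak memory and skipping another trip
# through the (possibly fragmented) memory pool.
x.add_(bias)                   # bias-add reuses x's buffer
x.add_(residual)               # residual-add reuses it again
y = x.reshape_([2, 2, 1024])   # in-place reshape: y aliases x's storage
```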
All of the optimizations above can be freely combined with each other.