
Sharding Comm Optimization #48604

Merged

Conversation

JZ-LIANG
Contributor

@JZ-LIANG JZ-LIANG commented Dec 1, 2022

PR types

Performance optimization

PR changes

Others

Describe

Optimizing the communication for the Sharding strategy.

Results:
A100 40GB, two nodes, 8 GPUs per node

| GPT-3 with 6.7B parameters | Tokens/s | Memory (MB) |
| --- | --- | --- |
| AutoParallel Sharding stage 2, baseline (without any fuse or overlap) | 29884 | 29874 |
| DeepSpeed ZeRO stage 2 (with fuse & overlap) | 39936 | 35442 |
| AutoParallel Sharding stage 2 + this PR | 43778 | 36180 |

Optim1: Bucket Communication

  • Ring-based collectives need a sufficiently large buffer size to maximize bandwidth utilization (ref).
    Communicating small tensors (e.g. the params and grads of LayerNorm & Bias) therefore hurts communication performance.

  • Fusing several communications into one bucketed communication reduces the number of comm kernel calls, and therefore the kernel-launch overhead and comm-preparation latency (a small sketch of the packing idea follows this list).

  • This PR offers independent configuration of the parameter communication bucket and the gradient communication bucket.

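As a rough illustration (not the PR's actual kernels), the sketch below packs several small gradients into one flat NumPy buffer and unpacks them afterwards, which is the packing/unpacking that a single bucketed collective relies on; the helper names `coalesce` and `split` are made up for this example.

```python
import numpy as np

def coalesce(tensors):
    """Pack a list of small arrays into one contiguous flat buffer."""
    sizes = [t.size for t in tensors]
    flat = np.concatenate([t.ravel() for t in tensors])
    return flat, sizes

def split(flat, sizes, shapes):
    """Recover the individual tensors from the flat buffer after communication."""
    offsets = np.cumsum([0] + sizes)
    return [flat[offsets[i]:offsets[i + 1]].reshape(shapes[i]) for i in range(len(sizes))]

# e.g. LayerNorm scale / bias gradients: many tiny tensors per layer
grads = [np.random.rand(768), np.random.rand(768), np.random.rand(3072)]
flat, sizes = coalesce(grads)
# a single all-reduce / broadcast would be issued on `flat` here,
# instead of one small collective per tensor
restored = split(flat, sizes, [g.shape for g in grads])
assert all(np.allclose(a, b) for a, b in zip(grads, restored))
```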

Optim2: Timing for Bucketing

  • The timing of bucketed tensor coalescence is quite important for big-model training, for two reasons:

    • Coalescing too early, before computation: this materializes the memory for the tensors long before they are computed, which increases peak memory usage and can lead to OOM when training big models.
    • Coalescing too late, after computation: if coalescence happens after the small tensors have been generated, an additional buffer must be created and a D2D memcpy is needed to copy the data from the small tensors into the coalesced buffer; memory usage is doubled and the communication is delayed until the D2D memcpy finishes.

    This PR coalesces tensors in place, right before the first small tensor in the bucket is about to be communicated, avoiding the redundant memory usage and memcpy (a sketch of this idea follows).
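A minimal NumPy sketch of the "no extra copy" part of in-place coalescence, under the assumption that the small tensors can be made views into the bucket buffer so the producing computation writes straight into it; the lazy materialization of the bucket right before the first communication is what the PR handles on top and is not expressible here.

```python
import numpy as np

shapes = [(768,), (768,), (3072,)]
sizes = [int(np.prod(s)) for s in shapes]

# the bucket buffer: the PR materializes it only right before the comm is issued
bucket = np.empty(sum(sizes), dtype=np.float32)

# hand out views of the bucket: each "small tensor" aliases a slice of it
views, offset = [], 0
for shape, size in zip(shapes, sizes):
    views.append(bucket[offset:offset + size].reshape(shape))
    offset += size

# the producing computation writes into the bucket through the views,
# so no second buffer and no D2D memcpy are needed before the collective
for v in views:
    v[...] = np.random.rand(*v.shape)

assert all(v.base is not None for v in views)  # the views share the bucket's storage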

Optim3: Gradient Communication Overlap

  • Baseline: without any overlap.
    All calc & comm kernels are scheduled on the default stream and executed sequentially.

  • Grad Comm overlap with backward computation (a toy sketch of this pattern follows this list):

  • Grad Comm overlap with update (optimizer) computation:
    The Grad Comm can also be overlapped with the optimizer computation, since we launch multiple optimizer kernels, one per grad bucket. This PR allows a bucket's Grad Comm to overlap with the optimizer kernels that run before its own bucket's optimizer kernel.
    This optimization is not applied to the GPT model, since the GPT training strategy requires a global gradient synchronization (ClipByGlobalNorm). It can be applied to models like BERT and ResNet, which do not need that synchronization.
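The following is a toy, thread-based illustration of the overlap pattern, not the PR's scheduler: a worker thread stands in for the dedicated communication stream and `time.sleep` stands in for real kernels. It only shows the dependency structure, i.e. each bucket's communication is issued as soon as its gradients are ready while backward keeps running for earlier layers.

```python
import queue, threading, time

comm_queue = queue.Queue()

def comm_worker():
    # stands in for the dedicated comm stream
    while True:
        bucket = comm_queue.get()
        if bucket is None:
            break
        time.sleep(0.01)                      # stands in for the bucket's all-reduce
        print(f"all-reduced bucket {bucket}")

worker = threading.Thread(target=comm_worker)
worker.start()

for layer in reversed(range(4)):              # backward pass, last layer first
    time.sleep(0.01)                          # stands in for backward computation
    comm_queue.put(layer)                     # issue the bucket's comm; do not wait for it

comm_queue.put(None)
worker.join()                                 # equivalent of syncing the comm stream before the optimizer
```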

Optim4: Parameter Communication Overlap

  • Cross-iteration overlap:
    The current step's Param Comm overlaps with the next step's forward computation (a runnable toy of this pattern follows this list).
  • Param Comm overlap with Optimizer update computation:
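Below is a rough, runnable toy of the cross-iteration pattern, using a thread-pool future as a stand-in for an asynchronous parameter broadcast; the names `fake_broadcast`, `end_of_step`, and `forward_layer` are invented for this illustration. The comm is issued after the optimizer update and only waited on when the parameter is first touched in the next step's forward.

```python
from concurrent.futures import ThreadPoolExecutor
import time

import numpy as np

pool = ThreadPoolExecutor(max_workers=2)
pending = {}                                   # param name -> outstanding "comm" handle

def fake_broadcast(param):
    time.sleep(0.01)                           # stands in for the parameter broadcast
    return param

def end_of_step(params):                       # issued right after the optimizer update of step t
    for name, p in params.items():
        pending[name] = pool.submit(fake_broadcast, p)

def forward_layer(name, params, x):            # step t+1's forward
    if name in pending:
        params[name] = pending.pop(name).result()   # wait only when the param is needed
    return x @ params[name]

params = {"w0": np.eye(4), "w1": np.eye(4)}
end_of_step(params)                            # step t: param comm issued, not waited on
x = np.ones((2, 4))
x = forward_layer("w0", params, x)             # overlap for w0 ends only here
x = forward_layer("w1", params, x)
pool.shutdown()
```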

Optim5: Flag Communication Overlap
Several flags need to be synchronized in distributed training, such as found-nan-inf in mixed-precision training and the global norm in ClipByGlobalNorm.
This PR allows those communications to overlap with computation that has no data dependency on them. For example, if the FP16 grads are ready while the FP32 grads are still being computed, the FP16 global-norm communication is issued early so that it overlaps with the FP32 grad computation (a toy illustration follows).
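A runnable toy of the flag-overlap idea (again a future stands in for the collective, and the numbers are made up): the FP16 part of the global norm is all-reduced while the FP32 gradients are still being produced, and the two partial results are combined only when both are available.

```python
from concurrent.futures import ThreadPoolExecutor
import time

import numpy as np

pool = ThreadPoolExecutor(max_workers=1)

def fake_allreduce(value):
    time.sleep(0.01)            # stands in for the cross-rank sum of squared norms
    return value

fp16_grads = [np.random.rand(256).astype(np.float16) for _ in range(8)]
fp16_sq = sum(float(np.sum(g.astype(np.float32) ** 2)) for g in fp16_grads)
handle = pool.submit(fake_allreduce, fp16_sq)        # issued early, overlaps the work below

fp32_grads = [np.random.rand(256).astype(np.float32) for _ in range(8)]  # "still computing"
fp32_sq = sum(float(np.sum(g ** 2)) for g in fp32_grads)

global_norm = np.sqrt(handle.result() + fp32_sq)     # wait only when the result is needed
pool.shutdown()
```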

Optim6: Multi-Stream Communication
When communication goes across nodes via a low-bandwidth link, it becomes the bottleneck and the computation that depends on it is held up, leaving large gaps in the calculation stream.

This PR uses multiple communicators and multiple streams so that communications can overlap with each other, reducing the gaps in the calculation stream caused by communication (a toy simulation follows).
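A toy simulation of the multi-stream effect, assuming independent buckets can simply be spread over N communicators: N thread-pool workers stand in for N comm streams, and the wall time shows the transfers overlapping each other instead of queueing on a single stream. Illustration only; the PR creates real extra communicators and streams.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_cross_node_comm(bucket):
    time.sleep(0.05)          # stands in for a slow cross-node transfer
    return bucket

def run(num_comm_streams, num_buckets=8):
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_comm_streams) as pool:
        list(pool.map(fake_cross_node_comm, range(num_buckets)))
    return time.time() - start

print(f"1 comm stream : {run(1):.2f}s")   # ~0.40s: transfers queue up behind each other
print(f"4 comm streams: {run(4):.2f}s")   # ~0.10s: transfers overlap each other
```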

Optim7: Memcpy Overlap
Replace memcpy_sync with memcpy_async and schedule them on a dedicated memcpy stream.

Optim8: In-place memory reusing
Compared with no overlapping, overlapping multiple streams leads to higher peak memory usage.
The reason is that an allocation used across streams cannot be reused immediately, since the single-stream fast-garbage-collection assumption is broken (ref). This can lead to OOM when training big models.

To deal with this problem, we apply an in-place memory reusing strategy. Some tensors are no longer needed after their computation, so their allocations can be reused by the next computation. This alleviates the extra peak memory usage due to overlapping and saves the allocator overhead of searching for free space in the memory pool, which matters especially when training big models, where there are thousands of small tensors and the pool is fragmented (a minimal illustration follows).
Supported patterns: elementwise-add (bias-add, residual-add), reshape.
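A minimal NumPy illustration of the in-place reuse pattern for elementwise add (not the graph pass itself): when the input is dead after the op, writing the result into the input's own buffer avoids a fresh allocation.

```python
import numpy as np

x = np.random.rand(1024, 1024).astype(np.float32)        # activation, dead after the add
residual = np.random.rand(1024, 1024).astype(np.float32)

out_of_place = x + residual        # allocates a new 4 MB output buffer
np.add(x, residual, out=x)         # in-place: the result reuses x's buffer

assert np.allclose(out_of_place, x)
```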

All of the optimizations above can be freely combined with each other.

@paddle-bot

paddle-bot bot commented Dec 1, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@JZ-LIANG JZ-LIANG changed the title [Auto Parallel-Performance] Sharding Stage2-3 Gradient Comm Optimization [Auto Parallel-Performance] Sharding Comm Optimization Dec 30, 2022
Contributor

@JiabinYang JiabinYang left a comment


LGTM

Contributor

@aoyulong aoyulong left a comment


LGTM

@JZ-LIANG JZ-LIANG merged commit 5592f8a into PaddlePaddle:develop Jan 4, 2023
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel-Performance] Sharding Comm Optimization Sharding Comm Optimization Jun 25, 2024