Sharding Comm Optimization #48604
Conversation
Your PR has been submitted successfully. Thank you for your contribution to this open-source project!
* get default calc stream from execution ctx instead of global dev ctx pool.
LGTM
LGTM
PR types
Performance optimization
PR changes
Others
Describe
Optimize the communication for the sharding strategy.
Results:
A100 40GB, two nodes, 8 GPUs per node
Optim1: Bucket communication
Ring-based collectives need a sufficiently large buffer to maximize bandwidth utilization (ref).
Communicating many small tensors (e.g. the params and grads of LayerNorm and bias) therefore hurts communication performance.
Fusing several communications into one bucketed communication reduces the number of comm kernel calls, and with it the kernel-launch overhead and comm-preparation latency.
This PR offers independent configuration of the parameter-communication bucket and the gradient-communication bucket.
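As an illustration only (not the code added by this PR, which fuses buffers inside the executor), a minimal Python sketch of bucketed communication: small gradients are packed into one contiguous buffer, reduced with a single collective call, and scattered back. The helper name and the bucket-size threshold are made up for the example.

```python
import math

import paddle
import paddle.distributed as dist


def bucketed_all_reduce(grads, bucket_numel=8 * 1024 * 1024):
    """Fuse small gradients into buckets and issue one all_reduce per bucket.

    Sketch only: assumes dist.init_parallel_env() was called and that all
    grads in a bucket share the same dtype.
    """
    bucket, filled = [], 0

    def flush():
        if not bucket:
            return
        # Pack the small tensors into a single contiguous 1-D buffer.
        flat = paddle.concat([g.flatten() for g in bucket])
        dist.all_reduce(flat)  # one comm kernel for the whole bucket
        # Scatter the reduced values back into the original tensors.
        offset = 0
        for g in bucket:
            n = math.prod(g.shape)
            paddle.assign(flat[offset:offset + n].reshape(g.shape), output=g)
            offset += n

    for g in grads:
        bucket.append(g)
        filled += math.prod(g.shape)
        if filled >= bucket_numel:
            flush()
            bucket, filled = [], 0
    flush()
```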
Optim2: Timing for Bucketing
The timing of bucketing (tensor coalescence) is quite important for big-model training, for two reasons: it determines the extra memory the fused buffers occupy and the extra memcpy they may require.
This PR performs in-place tensor coalescence just before the first small tensor in a bucket is about to be communicated, avoiding redundant memory usage and memcpy.
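A hypothetical sketch of the "coalesce as late as possible" timing; the class below is illustrative only. Note that the PR performs in-place coalescence (the member tensors become views of the fused buffer), while this sketch only shows when the fused buffer gets materialized.

```python
import paddle


class LazyBucket:
    """Hypothetical illustration of delaying coalescence until needed."""

    def __init__(self, tensors):
        self.tensors = tensors
        self.flat = None  # no fused buffer yet -> no extra peak memory

    def coalesce_if_needed(self):
        # Called right before the first tensor of this bucket is communicated.
        # The PR does this in place (members become views of the fused buffer);
        # paddle.concat here copies and is only meant to show the timing,
        # not the zero-copy mechanics.
        if self.flat is None:
            self.flat = paddle.concat([t.flatten() for t in self.tensors])
        return self.flat
```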
Optim3: Gradient Communication Overlap
Baseline: without any overlap
All calc & comm kernels are scheduled on the default stream and executed sequentially.
Grad Comm overlap with backward computation:
Grad Comm overlap with update (optimizer) computation:
The Grad Comm can also be overlapped with the optimizer computation, since the optimizer update is launched as multiple kernels, one per grad bucket. This PR allows a bucket's Grad Comm to overlap with the optimizer kernels that run before that bucket's own optimizer kernel.
This optimization is not applied to the GPT model, since the GPT training strategy needs a global gradient synchronization (ClipByGlobalNorm). It can be applied to models such as BERT and ResNet, which do not need that synchronization.
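A minimal sketch of the issue-early / wait-late pattern behind this overlap, assuming the asynchronous form of paddle.distributed.all_reduce (sync_op=False) available in recent Paddle releases; in the PR the overlap is scheduled inside the standalone executor via per-bucket dependencies rather than in user code.

```python
import paddle.distributed as dist


def reduce_grad_buckets_async(grad_buckets):
    """Issue every bucket's all_reduce without blocking the calc stream."""
    # sync_op=False returns a task handle instead of waiting (recent Paddle).
    return [dist.all_reduce(b, sync_op=False) for b in grad_buckets]


def wait_grad_buckets(tasks):
    # Block only right before the reduced grads are actually consumed
    # (e.g. just before the bucket's optimizer kernel).
    for t in tasks:
        t.wait()
```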
Optim4: Parameter Communication Overlap
The current step's Param Comm overlaps with the next step's forward computation.
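A hedged sketch of the idea, again assuming the asynchronous collectives of recent Paddle releases; the helper functions and the src_ranks layout are hypothetical.

```python
import paddle.distributed as dist


def start_param_sync(sharded_params, src_ranks):
    """Kick off the parameter broadcast right after the optimizer step."""
    # Hypothetical layout: src_ranks[i] is the rank owning sharded_params[i].
    return [dist.broadcast(p, src=src, sync_op=False)
            for p, src in zip(sharded_params, src_ranks)]


def wait_param_sync(tasks):
    # Called lazily, just before the next step's forward touches the params.
    for t in tasks:
        t.wait()
```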
Optim5: Flag Communication Overlap
Several flags need to be synchronized in distributed training, such as found-nan-inf in mixed-precision training and the global norm in ClipByGlobalNorm.
This PR allows those communications to overlap with computation that has no data dependency on them. For example, if the FP16 grads are ready while the FP32 grads are still being calculated, the FP16 global-norm communication is issued so that it overlaps with the FP32 grad calculation.
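A sketch of the FP16 global-norm example, assuming asynchronous all_reduce (sync_op=False); the helper is illustrative and the actual dependency tracking is done by the executor.

```python
import paddle
import paddle.distributed as dist


def start_fp16_global_norm(fp16_grads):
    """Issue the FP16 part of the global-norm reduction as soon as the FP16
    grads are ready, so it overlaps with the ongoing FP32 grad calculation."""
    sq_sum = paddle.add_n([paddle.sum(g.astype('float32') ** 2)
                           for g in fp16_grads])
    task = dist.all_reduce(sq_sum, sync_op=False)  # async flag communication
    return sq_sum, task  # call task.wait() only when ClipByGlobalNorm needs it
```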
Optim6: Multi-Stream Communication
When communication crosses nodes over a low-bandwidth link, it becomes the bottleneck and the computation that depends on it is stalled, leaving large gaps in the calculation stream.
This PR uses multiple communicators and multiple streams so that communications can overlap with each other, reducing the gaps in the calculation stream caused by communication.
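A conceptual sketch, assuming paddle.distributed.new_group and asynchronous collectives; whether several groups over the same ranks map to distinct NCCL communicators and streams depends on the Paddle and NCCL setup, so treat this purely as an illustration of the round-robin idea.

```python
import paddle.distributed as dist

NUM_COMM_GROUPS = 2  # illustrative value

# Each group gets its own communicator (and hence its own comm stream),
# so collectives issued on different groups may overlap with each other.
world = list(range(dist.get_world_size()))
comm_groups = [dist.new_group(ranks=world) for _ in range(NUM_COMM_GROUPS)]


def multi_stream_all_reduce(buckets):
    # Round-robin buckets over the groups so several collectives are in flight.
    tasks = [dist.all_reduce(b,
                             group=comm_groups[i % NUM_COMM_GROUPS],
                             sync_op=False)
             for i, b in enumerate(buckets)]
    for t in tasks:
        t.wait()
```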
Optim7: Memcpy Overlap
Replace memcpy_sync with memcpy_async and schedule them on a dedicated memcpy stream.
Optim8: In-place memory reusing
Compared with no overlap, multi-stream overlapping leads to higher peak memory usage.
The reason is that allocations used across streams cannot be reused immediately, because the single-stream fast-garbage-collection assumption is broken (ref). This can lead to OOM when training big models.
To deal with this problem, we apply an in-place memory-reusing strategy. Some tensors are not used after a computation, so their allocations can be reused by the next computation. This alleviates the peak memory usage caused by overlapping and saves the allocator the overhead of searching for free space in the memory pool, especially when training big models, where there are thousands of small tensors and the pool is fragmented.
Supported patterns: elementwise-add (bias-add, residual-add) and reshape.
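For the memory effect of these patterns, a small illustration using Paddle's eager in-place (`*_`) APIs; the PR itself applies the reuse as an executor-level pass rather than through user-visible in-place calls.

```python
import paddle

x = paddle.randn([4, 1024])
bias = paddle.randn([1024])
residual = paddle.randn([4, 1024])

# In-place variants write the result into x's existing allocation instead of
# allocating a fresh output, lowering peak memory and skipping another trip
# through the (possibly fragmented) memory pool.
x.add_(bias)                   # bias-add reuses x's buffer
x.add_(residual)               # residual-add reuses it again
y = x.reshape_([2, 2, 1024])   # in-place reshape: y aliases x's storage
```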
All of the optimizations above can be freely combined with each other.