Increase NCCL timeout to 3 hours #12345
Merged
When training on a large dataset with DDP, the dataset scanning step can take long enough to trigger an NCCL timeout error. This PR raises the default timeout from 30 minutes to 3 hours, the same value used in YOLOv8 (ultralytics/ultralytics#3343).
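For illustration, a minimal sketch of what passing a longer timeout to process group initialization can look like. This is not the exact diff from this PR: the `NCCL_TIMEOUT` constant and the `init_ddp` helper are hypothetical names introduced here, and the backend selection is an assumption about how `train.py` chooses between NCCL and Gloo.

```python
from datetime import timedelta

# 3 hours, as stated in the PR title (equivalent to 10800 seconds).
NCCL_TIMEOUT = timedelta(hours=3)


def init_ddp(timeout: timedelta = NCCL_TIMEOUT) -> None:
    """Hypothetical helper: initialize the default process group with a long timeout."""
    # Imported lazily so the constant above can be inspected without PyTorch installed.
    import torch.distributed as dist

    # Assumption: fall back to Gloo when NCCL is not available.
    backend = "nccl" if dist.is_nccl_available() else "gloo"
    # `timeout` is a keyword argument of torch.distributed.init_process_group.
    dist.init_process_group(backend=backend, timeout=timeout)
```

`init_ddp()` would be called once per process, after the launcher has set the usual rank and world-size environment variables.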
🤖 Generated by Copilot at ccc374c
Summary
🕒🌐🔒
Improved distributed training reliability by adding a timeout for process group initialization in `train.py`. This prevents the training from hanging indefinitely if some processes fail to join or communicate.
Walkthrough
Import `timedelta` and add a `timeout` argument to the `dist.init_process_group` call to prevent hanging in distributed training
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Improved Distributed Training Resilience with a Timeout Adjustment 🛠️
📊 Key Changes
Imports `timedelta` from the `datetime` module in `train.py`.
🎯 Purpose & Impact