Increase NCCL timeout to 3 hours #12345

Merged (3 commits) on Nov 23, 2023

Contributor

@wudashuo wudashuo commented Nov 8, 2023

When training on a large dataset with DDP, the dataset scanning step can take long enough to trigger an NCCL timeout error. This PR raises the timeout from the default 30 minutes to 3 hours, the same value used in YOLOv8 (ultralytics/ultralytics#3343).

🤖 Generated by Copilot at ccc374c

Summary

🕒🌐🔒

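Improved distributed training reliability by setting an explicit 3-hour timeout for process group initialization in train.py. The longer limit keeps slow startup work such as dataset scanning from triggering an NCCL timeout, while still bounding how long processes wait for peers that fail to join or communicate.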

Sing, O Muse, of the valiant code warriors who trained
Their mighty models on distributed clusters, and gained
New insights from the data, but also faced a dread
Deadlock that stalled their progress and filled their hearts with dread.

Walkthrough

  • Import timedelta from the datetime module and add a timeout argument to the dist.init_process_group call in train.py
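
The walkthrough item above can be sketched roughly as follows. This is a minimal illustration, not the exact diff: the helper name `init_ddp`, the constant `DDP_TIMEOUT`, and the default `"nccl"` backend argument are assumptions made for the example. Only the `timedelta` import, the `timeout` argument to `dist.init_process_group`, and the 10800-second value come from the PR text.

```python
from datetime import timedelta

# 3 hours expressed in seconds, matching the value quoted in the PR summary.
DDP_TIMEOUT = timedelta(seconds=10800)


def init_ddp(backend: str = "nccl", timeout: timedelta = DDP_TIMEOUT) -> None:
    """Hypothetical helper: initialize the default process group with an explicit timeout."""
    # Imported lazily so this sketch can be loaded on machines without PyTorch.
    import torch.distributed as dist

    # Rank, world size, and master address are read from environment variables
    # (the default env:// init method), as set by launchers such as torchrun.
    dist.init_process_group(backend=backend, timeout=timeout)
```

Under a launcher that sets the usual rank and world-size environment variables, each process would call `init_ddp()` once before building the DDP model.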

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved Distributed Training Resilience with a Timeout Adjustment 🛠️

📊 Key Changes

  • Imported timedelta from the datetime module in train.py.
  • Set a timeout for initializing the distributed training process group to 3 hours (10800 seconds).
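
As a quick sanity check on the numbers in the list above, the snippet below compares the new value with the 30-minute default cited in the PR description. The variable names are illustrative only.

```python
from datetime import timedelta

default_timeout = timedelta(minutes=30)  # previous default cited in the PR description
new_timeout = timedelta(seconds=10800)   # value set by this PR

# 10800 seconds is exactly 3 hours.
assert new_timeout == timedelta(hours=3)

# Dividing one timedelta by another yields a plain float ratio.
ratio = new_timeout / default_timeout
print(new_timeout.total_seconds(), ratio)  # prints "10800.0 6.0"
```

The new limit is six times the old one, which gives the dataset scan considerably more headroom before the timeout fires.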

🎯 Purpose & Impact

  • Purpose: To keep distributed training jobs from aborting with an NCCL timeout while slow steps such as dataset scanning are still running, which saves time and resources during long training sessions.
  • Impact: Users running distributed training on multiple GPUs should see more robust and reliable training sessions, with a lower likelihood of undetected failures. 🤖🔗

wudashuo and others added 2 commits November 8, 2023 10:47
When training on a large dataset using DDP, the scanning process will be very long, and it will raise NCCL timeout error. Change the default timeout 30min to 3 hours, same as ultralytics yolov8 (ultralytics/ultralytics#3343)

Signed-off-by: Troy <[email protected]>
Contributor

@github-actions github-actions bot left a comment

👋 Hello @wudashuo, thank you for submitting a YOLOv5 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with ultralytics/yolov5 master branch. If your PR is behind you can update your code by clicking the 'Update branch' button or by running git pull and git merge master locally.
  • ✅ Verify all YOLOv5 Continuous Integration (CI) checks are passing.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee

@glenn-jocher
Member

@wudashuo thanks for your suggestion! We'll review this and consider it for the next release. It's always great to hear from the community and receive feedback on how we can improve the YOLOv5 experience. We appreciate your contribution and support.

Glenn Jocher

@glenn-jocher glenn-jocher merged commit cc232e3 into ultralytics:master Nov 23, 2023
7 checks passed
@glenn-jocher
Member

@wudashuo PR merged! Thank you for your contributions.

pleb631 pushed a commit to pleb631/yolov5 that referenced this pull request Jan 6, 2024
* Increase NCCL timeout to 3 hours

When training on a large dataset using DDP, the scanning process will be very long, and it will raise NCCL timeout error. Change the default timeout 30min to 3 hours, same as ultralytics yolov8 (ultralytics/ultralytics#3343)

Signed-off-by: Troy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Troy <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Glenn Jocher <[email protected]>