Using multi-GPU training reports errors #12213
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help. For additional resources and information, please see the links below:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
@jcluo1994 this issue appears to be related to the distributed training setup, specifically the NCCL communicator and the key-value store. Check that communication between the processes is set up correctly, and check the network setup for anything that could cause these errors during training. Let me know if you need further assistance with this.
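As a first check on the network side, a worker can test whether it can open a TCP connection to the address where rank 0 hosts the key-value store. The sketch below is a minimal example under assumptions: it reads MASTER_ADDR and MASTER_PORT from the environment (torch.distributed.run exports these to its workers), it falls back to 127.0.0.1:29500 when they are unset, and the helper names store_endpoint and can_reach are hypothetical. A successful connection only shows that the port is reachable; it does not prove that NCCL initialization will succeed.

```python
import os
import socket


def store_endpoint(env=None):
    """Return the (host, port) where rank 0 is expected to host the store."""
    env = os.environ if env is None else env
    host = env.get("MASTER_ADDR", "127.0.0.1")   # assumed fallback
    port = int(env.get("MASTER_PORT", "29500"))  # assumed fallback
    return host, port


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Calling can_reach(*store_endpoint()) from a second shell in the same container while rank 0 is waiting may help narrow things down: if it returns False, the Socket Timeout is more likely a port or network problem than an NCCL one.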
I ran into the same issue while using DeepSpeed to fine-tune an LLM. Have you found a solution?
Hello @ANYMS-A, Thank you for reaching out and sharing your experience. It seems like you're encountering a similar issue with the NCCL communicator and key-value store during multi-GPU training. Let's work together to resolve this. To better assist you, could you please provide a minimum reproducible code example? This will help us understand the exact setup and conditions under which the issue occurs. You can refer to our guide on creating a minimum reproducible example. This step is crucial for us to reproduce and investigate the bug effectively. Additionally, please ensure that you are using the latest versions of Here's a quick checklist to help you get started:
If you have already tried these steps and the issue persists, please share the details of your setup and any additional logs or error messages you encounter. This information will be invaluable in diagnosing and resolving the issue. Thank you for your patience and cooperation. We're here to help!
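For anyone preparing a minimum reproducible example, one possible starting point is a script that does nothing except initialize the process group and call a single barrier, which is the call that fails in the traceback. This is a sketch, not YOLOv5 code: the function names read_rank_env and run_barrier are hypothetical, the 60-second timeout is an arbitrary choice, and it assumes the script is started by torch.distributed.run so that RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are set.

```python
import datetime
import os


def read_rank_env(env=None):
    """Read the rank variables that torch.distributed.run exports to each worker."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def run_barrier(backend="nccl", timeout_s=60):
    """Initialize the process group, call one barrier, then clean up."""
    # Imported lazily so read_rank_env can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    info = read_rank_env()
    if backend == "nccl":
        # Bind this process to its own GPU before creating the communicator.
        torch.cuda.set_device(info["local_rank"])
    dist.init_process_group(
        backend=backend, timeout=datetime.timedelta(seconds=timeout_s)
    )
    try:
        if backend == "nccl":
            dist.barrier(device_ids=[info["local_rank"]])
        else:
            dist.barrier()
        print(f"rank {info['rank']}/{info['world_size']}: barrier passed ({backend})")
    finally:
        dist.destroy_process_group()
```

Saved as a hypothetical repro.py with a final line that calls run_barrier(), it could be launched the same way as training, for example python -m torch.distributed.run --nproc_per_node 2 repro.py. If run_barrier("gloo") passes but run_barrier("nccl") times out, that would suggest the problem is on the NCCL side rather than in the rendezvous itself.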
Question
Traceback (most recent call last):
File "train.py", line 647, in <module>
main(opt)
File "train.py", line 536, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 116, in train
with torch_distributed_zero_first(LOCAL_RANK):
File "/opt/conda/envs/train/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/home/bml/yolov5/utils/torch_utils.py", line 92, in torch_distributed_zero_first
dist.barrier(device_ids=[local_rank])
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:445 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3e54353617 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f3e5430ea56 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x32c (0x7f3e852c536c in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f3e852c64f2 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x55 (0x7f3e852c6915 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xb2 (0x7f3e553460b2 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x203 (0x7f3e5534ba83 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0xf19257 (0x7f3e5535a257 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f3e5535bf01 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #13: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3a7 (0x7f3e5535db27 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #14: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0xb25 (0x7f3e5536f7d5 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #15: <unknown function> + 0x55786a2 (0x7f3e852716a2 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x5582cc0 (0x7f3e8527bcc0 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x5582dc5 (0x7f3e8527bdc5 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x4bae85b (0x7f3e848a785b in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x4bac83c (0x7f3e848a583c in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #20: <unknown function> + 0x1904688 (0x7f3e815fd688 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0x558c284 (0x7f3e85285284 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x558d1ed (0x7f3e852861ed in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #23: <unknown function> + 0xc407b8 (0x7f3e9787e7b8 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #24: <unknown function> + 0x3ee82f (0x7f3e9702c82f in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #25: PyCFunction_Call + 0x52 (0x4f5572 in /opt/conda/envs/train/bin/python)
frame #26: _PyObject_MakeTpCall + 0x3bb (0x4e0e1b in /opt/conda/envs/train/bin/python)
frame #27: /opt/conda/envs/train/bin/python() [0x4f531d]
frame #28: _PyEval_EvalFrameDefault + 0x1153 (0x4d9263 in /opt/conda/envs/train/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #30: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #31: PyObject_Call + 0x34e (0x4f76ce in /opt/conda/envs/train/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2073 (0x4da183 in /opt/conda/envs/train/bin/python)
frame #33: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #34: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x1153 (0x4d9263 in /opt/conda/envs/train/bin/python)
frame #36: /opt/conda/envs/train/bin/python() [0x4fc29b]
frame #37: /opt/conda/envs/train/bin/python() [0x562b30]
frame #38: /opt/conda/envs/train/bin/python() [0x4e8cfb]
frame #39: _PyEval_EvalFrameDefault + 0x399 (0x4d84a9 in /opt/conda/envs/train/bin/python)
frame #40: _PyFunction_Vectorcall + 0x106 (0x4e81a6 in /opt/conda/envs/train/bin/python)
frame #41: /opt/conda/envs/train/bin/python() [0x4f5154]
frame #42: _PyEval_EvalFrameDefault + 0x2ab0 (0x4dabc0 in /opt/conda/envs/train/bin/python)
frame #43: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #44: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x399 (0x4d84a9 in /opt/conda/envs/train/bin/python)
frame #46: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #47: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x399 (0x4d84a9 in /opt/conda/envs/train/bin/python)
frame #49: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #50: PyEval_EvalCodeEx + 0x39 (0x585e29 in /opt/conda/envs/train/bin/python)
frame #51: PyEval_EvalCode + 0x1b (0x585deb in /opt/conda/envs/train/bin/python)
frame #52: /opt/conda/envs/train/bin/python() [0x5a5bd1]
frame #53: /opt/conda/envs/train/bin/python() [0x5a4bdf]
frame #54: /opt/conda/envs/train/bin/python() [0x45c538]
frame #55: PyRun_SimpleFileExFlags + 0x340 (0x45c0d9 in /opt/conda/envs/train/bin/python)
frame #56: /opt/conda/envs/train/bin/python() [0x44fe8f]
frame #57: Py_BytesMain + 0x39 (0x579e89 in /opt/conda/envs/train/bin/python)
frame #58: __libc_start_main + 0xf0 (0x7f3ed8afe840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #59: /opt/conda/envs/train/bin/python() [0x579d3d]
. This may indicate a possible application crash on rank 0 or a network set up issue.
[2023-10-10 14:56:33,514] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 83 closing signal SIGTERM
[2023-10-10 14:56:33,629] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 84) of binary: /opt/conda/envs/train/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/train/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/train/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 810, in <module>
main()
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-10-10_14:56:33
host : 14064c861bcc
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 84)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Additional
I used the following command:
python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1
Thank you for your help.
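To collect more detail for a report like this one, the same command can be rerun with extra logging turned on. The sketch below only builds the environment and the argument list; it assumes that NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL are the logging settings wanted here, and the helper names diagnostic_env and diagnostic_argv are hypothetical. It does not change how training runs and is not a fix for the timeout.

```python
import os
import shlex

# The command from the report, unchanged.
REPORTED_COMMAND = (
    "python -m torch.distributed.run --nproc_per_node 2 train.py "
    "--batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1"
)


def diagnostic_env(base=None):
    """Return a copy of the environment with verbose distributed logging enabled."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"                 # NCCL prints its setup steps
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra checks and logs from torch.distributed
    return env


def diagnostic_argv(command=REPORTED_COMMAND):
    """Split the command string into an argument list without invoking a shell."""
    return shlex.split(command)
```

The two results can be passed to subprocess.run(diagnostic_argv(), env=diagnostic_env()) from the yolov5 directory. The NCCL lines printed before the Socket Timeout are usually the most useful part to attach to the issue.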