CUDA error: the launch timed out and was terminated #4851

Closed
coallar opened this issue Sep 18, 2021 · 2 comments
Labels
bug Something isn't working Stale Stale and schedule for closing soon

coallar commented Sep 18, 2021

SYSTEM: Ubuntu 20.04

driver info:

CUDA:0 (NVIDIA GeForce GTX TITAN X, 12204.4375MB)
CUDA:1 (NVIDIA GeForce GTX TITAN X, 12212.875MB)

torch & CUDA info:
torch.__version__ ====> '1.8.0+cu111'

Command: python3 -m torch.distributed.launch --nproc_per_node 2 train.py --batch 32 --data coco.yaml --weights yolov5x.pt --device 0,1 --imgsz 560 --cfg yolov5x.yaml

error:

Image sizes 576 train, 576 val
Using 8 dataloader workers
Logging results to runs/train/exp5
Starting training for 60 epochs...

 Epoch   gpu_mem       box       obj       cls    labels  img_size
  0/59     10.6G    0.1132   0.02993         0        29       576:   3%|██▎                                                                                      | 1/39 [00:10<06:48, 10.75s/it]Reducer buckets have been rebuilt in this iteration.
  0/59     10.6G   0.09944   0.03163         0        19       576: 100%|████████████████████████████████████████████████████████████████████████████████████████| 39/39 [01:38<00:00,  2.54s/it]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:   3%|█▊                                                                       | 1/39 [00:05<03:43,  5.89s/it]
Traceback (most recent call last):

File "train.py", line 620, in <module>
main(opt)
File "train.py", line 518, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 312, in train
pred = model(imgs) # forward
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/yolo.py", line 123, in forward
return self.forward_once(x, profile, visualize) # single-scale inference, train
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/yolo.py", line 155, in forward_once
x = m(x) # run
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/common.py", line 137, in forward
return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/common.py", line 103, in forward
return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/common.py", line 45, in forward
return self.act(self.bn(self.conv(x)))
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 113, in forward
self.num_batches_tracked = self.num_batches_tracked + 1 # type: ignore
RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa2a2d962f2 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fa2a2d9367b in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7fa2a2fee1f9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fa2a2d7e3a4 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7fa316ec0ac9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7fa316eb5a8a in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fa316edcd22 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fa316818df6 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa2201f (0x7fa316ee001f in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369f00 (0x7fa316827f00 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b16e (0x7fa31682916e in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0xfa96c (0x560dc73e296c in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #12: <unknown function> + 0x18f2f5 (0x560dc74772f5 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #13: <unknown function> + 0xfaef8 (0x560dc73e2ef8 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #14: <unknown function> + 0xfd538 (0x560dc73e5538 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #15: <unknown function> + 0xfd5d9 (0x560dc73e55d9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #16: <unknown function> + 0xfd5d9 (0x560dc73e55d9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #17: PyDict_SetItemString + 0x401 (0x560dc74893d1 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #18: PyImport_Cleanup + 0xa4 (0x560dc75574e4 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #19: Py_FinalizeEx + 0x7a (0x560dc7557a9a in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #20: Py_RunMain + 0x1b8 (0x560dc755c5c8 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #21: Py_BytesMain + 0x39 (0x560dc755c939 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #22: __libc_start_main + 0xf3 (0x7fa31e2ce0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: <unknown function> + 0x1e8f39 (0x560dc74d0f39 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)

Killing subprocess 160871
Killing subprocess 160872
Traceback (most recent call last):
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3', '-u', 'train.py', '--local_rank=1', '--batch', '32', '--data', 'coco.yaml', '--weights', 'yolov5x.pt', '--device', '0,1', '--imgsz', '560', '--cfg', 'yolov5x.yaml']' died with <Signals.SIGABRT: 6>.

How can I solve this problem?
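
A side note on the log above: the command requests --imgsz 560, yet training reports "Image sizes 576 train, 576 val". That is consistent with the requested size being rounded up to the next multiple of the model stride. The sketch below reproduces the arithmetic, assuming a stride of 32; the helper name round_up_to_stride is hypothetical and is not taken from train.py.

```python
import math


def round_up_to_stride(img_size: int, stride: int = 32) -> int:
    """Round img_size up to the nearest multiple of stride.

    The stride of 32 is an assumption for this sketch, not a value
    read from the model in the report.
    """
    return math.ceil(img_size / stride) * stride


requested = 560
effective = round_up_to_stride(requested)
print(f"--imgsz {requested} -> {effective}")
```

Passing a size that is already a multiple of the stride (for example 576 or 640) leaves it unchanged, so the value on the command line then matches the value in the log.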

@coallar coallar added the bug Something isn't working label Sep 18, 2021
glenn-jocher commented Sep 18, 2021

@coallar your command seems fine, though --cfg yolov5x.yaml is redundant with your --weights yolov5x.pt. For best Multi-GPU performance, we always recommend training DDP inside our Docker Image.
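
As an illustration of the point about the redundant --cfg, here is a minimal sketch that assembles the same launch command as an argument list and leaves --cfg out whenever pretrained weights are given. The helper build_launch_cmd and its parameters are hypothetical names for this example; only the flags themselves come from the command in the report. Dropping --cfg silently is a design choice of the sketch, not documented behavior of train.py.

```python
from typing import List, Optional


def build_launch_cmd(
    nproc: int,
    batch: int,
    data: str,
    weights: str,
    devices: str,
    imgsz: int,
    cfg: Optional[str] = None,
) -> List[str]:
    """Assemble a torch.distributed.launch command for train.py (hypothetical helper)."""
    cmd = [
        "python3", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(nproc),
        "train.py",
        "--batch", str(batch),
        "--data", data,
        "--weights", weights,
        "--device", devices,
        "--imgsz", str(imgsz),
    ]
    # Assumption of this sketch: --cfg is only passed when no pretrained
    # weights are supplied, since it is redundant otherwise.
    if cfg and not weights:
        cmd += ["--cfg", cfg]
    return cmd


cmd = build_launch_cmd(2, 32, "coco.yaml", "yolov5x.pt", "0,1", 560, cfg="yolov5x.yaml")
print(" ".join(cmd))
```

The list form can be handed to subprocess.run(cmd) as-is, which avoids shell quoting issues with values such as the device list.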

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

github-actions bot commented Oct 19, 2021

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Access additional Ultralytics ⚡ resources:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@github-actions github-actions bot added the Stale Stale and schedule for closing soon label Oct 19, 2021