RuntimeError: CUDA error: invalid device ordinal #17

Closed

mjanddy opened this issue Apr 23, 2019 · 4 comments
mjanddy commented Apr 23, 2019

Traceback (most recent call last):
File "train.py", line 253, in
main(1, ngpus_per_node, args)
File "train.py", line 236, in main
train(training_dbs, validation_db, system_config, model, args)
File "train.py", line 186, in train
nnet.set_lr(learning_rate)
File "/usr/lib/python3.5/contextlib.py", line 77, in exit
self.gen.throw(type, value, traceback)
File "/data/mj/CornerNet-Lite/core/utils/tqdm.py", line 23, in stdout_to_tqdm
raise exc
File "/data/mj/CornerNet-Lite/core/utils/tqdm.py", line 21, in stdout_to_tqdm
yield save_stdout
File "train.py", line 168, in train
training_loss = nnet.train(**training)
File "/data/mj/CornerNet-Lite/core/nnet/py_factory.py", line 93, in train
loss = self.network(xs, ys)
File "/home/mj/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/data/mj/CornerNet-Lite/core/models/py_utils/data_parallel.py", line 66, in forward
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids, self.chunk_sizes)
File "/data/mj/CornerNet-Lite/core/models/py_utils/data_parallel.py", line 77, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim, chunk_sizes=self.chunk_sizes)
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 30, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim, chunk_sizes) if inputs else []
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 25, in scatter
return scatter_map(inputs)
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 18, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 20, in scatter_map
return list(map(list, zip(*map(scatter_map, obj))))
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 15, in scatter_map
return Scatter.apply(target_gpus, chunk_sizes, dim, obj)
File "/home/mj/.local/lib/python3.5/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
File "/home/mj/.local/lib/python3.5/site-packages/torch/cuda/comm.py", line 148, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: invalid device ordinal (exchangeDevice at /pytorch/aten/src/ATen/cuda/detail/CUDAGuardImpl.h:28)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f08e8ee5021 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f08e8ee48ea in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #2: + 0x4e426f (0x7f09235f526f in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x8cdfa2 (0x7f08e99c3fa2 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #4: + 0xa14ae5 (0x7f08e9b0aae5 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #5: at::TypeDefault::copy(at::Tensor const&, bool, c10::optional<c10::Device>) const + 0x56 (0x7f08e9c47c76 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #6: + 0x977f47 (0x7f08e9a6df47 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #7: at::native::to(at::Tensor const&, at::TensorOptions const&, bool, bool) + 0x295 (0x7f08e9a6faf5 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #8: at::TypeDefault::to(at::Tensor const&, at::TensorOptions const&, bool, bool) const + 0x17 (0x7f08e9c0e4f7 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #9: torch::autograd::VariableType::to(at::Tensor const&, at::TensorOptions const&, bool, bool) const + 0x17a (0x7f08e814ebaa in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #10: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>, std::allocator<c10::optional<at::cuda::CUDAStream> > > > const&) + 0x391 (0x7f09235f75f1 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x4ebd4f (0x7f09235fcd4f in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x11642e (0x7f092322742e in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)

frame #15: python() [0x53fc97]
frame #18: python() [0x4ec2e3]
frame #21: THPFunction_apply(_object*, _object*) + 0x581 (0x7f0923423ab1 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #25: python() [0x4ec2e3]
frame #27: python() [0x535f0e]
frame #32: python() [0x4ec2e3]
frame #34: python() [0x535f0e]
frame #38: python() [0x5401ef]
frame #40: python() [0x5401ef]
frame #42: python() [0x53fc97]
frame #46: python() [0x4ec3f7]
frame #50: python() [0x4ec2e3]
frame #52: python() [0x4fbfce]
frame #54: python() [0x574db6]
frame #58: python() [0x4ec3f7]
frame #62: python() [0x5401ef]

What is wrong?

heilaw (Contributor) commented Apr 24, 2019

This error usually happens when the number of available GPUs is less than the number of GPUs requested in the configuration file. By default, CornerNet-Saccade and CornerNet-Squeeze require 4 GPUs to train. You can adjust that by changing the batch size and chunk sizes in the config files. The chunk sizes indicate the number of images per GPU, and the sum of the chunk sizes should equal the batch size.
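
If it helps, here is a quick sanity check you can run before launching train.py: it compares the chunk sizes in a config against the number of GPUs visible to PyTorch. This is only a minimal sketch; it assumes the JSON config keeps batch_size and chunk_sizes under a "system" section (as the CornerNet-Lite configs do), and the path configs/CornerNet_Saccade.json is just an example, so adjust both to your setup.

```python
import json

import torch

# Example path -- point this at whichever config you actually train with.
config_path = "configs/CornerNet_Saccade.json"

with open(config_path) as f:
    config = json.load(f)

# Assumption: batch_size and chunk_sizes live under the "system" section
# of the JSON config, as in the CornerNet-Lite config files.
batch_size = config["system"]["batch_size"]
chunk_sizes = config["system"]["chunk_sizes"]
num_gpus = torch.cuda.device_count()

# Each chunk is scattered to one GPU, so there must be at least as many
# visible GPUs as chunks, and the chunks must add up to the full batch.
assert len(chunk_sizes) <= num_gpus, \
    "config expects {} GPUs but only {} are visible".format(len(chunk_sizes), num_gpus)
assert sum(chunk_sizes) == batch_size, \
    "sum(chunk_sizes)={} does not match batch_size={}".format(sum(chunk_sizes), batch_size)

print("chunk_sizes and batch_size are consistent with the available GPUs")
```

For example, to train on a single GPU you could set the batch size to 4 and the chunk sizes to [4]; the exact values depend on how much memory your GPU has.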

mjanddy (Author) commented Apr 24, 2019

Thanks

mjanddy closed this as completed Apr 24, 2019
@SeeeeShiwei

Thanks


FMsunyh commented Sep 11, 2019

@heilaw
Thanks
