RuntimeError: CUDA error: invalid device ordinal #17
Comments
This error usually happens when the number of available GPUs is less than the number of GPUs requested in the configuration file. By default, CornerNet-Saccade and CornerNet-Squeeze require 4 GPUs to train. You can adjust that by changing the batch size and the chunk sizes in the config files. The chunk sizes indicate the number of images per GPU, and the sum of the chunk sizes should equal the batch size.
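As a rough sketch of that rule for a single-GPU machine (the values below are only illustrative; the real settings live in the JSON files under configs/, typically as "batch_size" and "chunk_sizes"):

import torch

# Illustrative sanity check only, not part of the repo:
# one chunk size per GPU you actually have, and the chunk sizes must sum to the batch size.
batch_size  = 4      # reduced from the 4-GPU default
chunk_sizes = [4]    # one entry per visible GPU; here a single GPU

assert len(chunk_sizes) <= torch.cuda.device_count(), \
    "requesting more GPUs than are visible -> 'invalid device ordinal'"
assert sum(chunk_sizes) == batch_size, \
    "chunk sizes must sum to the batch size"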
thanks
Thanks
@heilaw |
Traceback (most recent call last):
File "train.py", line 253, in
main(1, ngpus_per_node, args)
File "train.py", line 236, in main
train(training_dbs, validation_db, system_config, model, args)
File "train.py", line 186, in train
nnet.set_lr(learning_rate)
File "/usr/lib/python3.5/contextlib.py", line 77, in exit
self.gen.throw(type, value, traceback)
File "/data/mj/CornerNet-Lite/core/utils/tqdm.py", line 23, in stdout_to_tqdm
raise exc
File "/data/mj/CornerNet-Lite/core/utils/tqdm.py", line 21, in stdout_to_tqdm
yield save_stdout
File "train.py", line 168, in train
training_loss = nnet.train(**training)
File "/data/mj/CornerNet-Lite/core/nnet/py_factory.py", line 93, in train
loss = self.network(xs, ys)
File "/home/mj/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/data/mj/CornerNet-Lite/core/models/py_utils/data_parallel.py", line 66, in forward
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids, self.chunk_sizes)
File "/data/mj/CornerNet-Lite/core/models/py_utils/data_parallel.py", line 77, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim, chunk_sizes=self.chunk_sizes)
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 30, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim, chunk_sizes) if inputs else []
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 25, in scatter
return scatter_map(inputs)
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 18, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 20, in scatter_map
return list(map(list, zip(*map(scatter_map, obj))))
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 15, in scatter_map
return Scatter.apply(target_gpus, chunk_sizes, dim, obj)
File "/home/mj/.local/lib/python3.5/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
File "/home/mj/.local/lib/python3.5/site-packages/torch/cuda/comm.py", line 148, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: invalid device ordinal (exchangeDevice at /pytorch/aten/src/ATen/cuda/detail/CUDAGuardImpl.h:28)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f08e8ee5021 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f08e8ee48ea in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #2: + 0x4e426f (0x7f09235f526f in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x8cdfa2 (0x7f08e99c3fa2 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #4: + 0xa14ae5 (0x7f08e9b0aae5 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #5: at::TypeDefault::copy(at::Tensor const&, bool, c10::optional<c10::Device>) const + 0x56 (0x7f08e9c47c76 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #6: + 0x977f47 (0x7f08e9a6df47 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #7: at::native::to(at::Tensor const&, at::TensorOptions const&, bool, bool) + 0x295 (0x7f08e9a6faf5 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #8: at::TypeDefault::to(at::Tensor const&, at::TensorOptions const&, bool, bool) const + 0x17 (0x7f08e9c0e4f7 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #9: torch::autograd::VariableType::to(at::Tensor const&, at::TensorOptions const&, bool, bool) const + 0x17a (0x7f08e814ebaa in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #10: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef<long>, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>, std::allocator<c10::optional<at::cuda::CUDAStream> > > > const&) + 0x391 (0x7f09235f75f1 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x4ebd4f (0x7f09235fcd4f in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x11642e (0x7f092322742e in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #15: python() [0x53fc97]
frame #18: python() [0x4ec2e3]
frame #21: THPFunction_apply(_object*, _object*) + 0x581 (0x7f0923423ab1 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #25: python() [0x4ec2e3]
frame #27: python() [0x535f0e]
frame #32: python() [0x4ec2e3]
frame #34: python() [0x535f0e]
frame #38: python() [0x5401ef]
frame #40: python() [0x5401ef]
frame #42: python() [0x53fc97]
frame #46: python() [0x4ec3f7]
frame #50: python() [0x4ec2e3]
frame #52: python() [0x4fbfce]
frame #54: python() [0x574db6]
frame #58: python() [0x4ec3f7]
frame #62: python() [0x5401ef]
What is wrong?