RuntimeError: CUDA error: invalid device ordinal #17

Closed

mjanddy opened this issue Apr 23, 2019 · 4 comments
mjanddy commented Apr 23, 2019

Traceback (most recent call last):
File "train.py", line 253, in
main(1, ngpus_per_node, args)
File "train.py", line 236, in main
train(training_dbs, validation_db, system_config, model, args)
File "train.py", line 186, in train
nnet.set_lr(learning_rate)
File "/usr/lib/python3.5/contextlib.py", line 77, in exit
self.gen.throw(type, value, traceback)
File "/data/mj/CornerNet-Lite/core/utils/tqdm.py", line 23, in stdout_to_tqdm
raise exc
File "/data/mj/CornerNet-Lite/core/utils/tqdm.py", line 21, in stdout_to_tqdm
yield save_stdout
File "train.py", line 168, in train
training_loss = nnet.train(**training)
File "/data/mj/CornerNet-Lite/core/nnet/py_factory.py", line 93, in train
loss = self.network(xs, ys)
File "/home/mj/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/data/mj/CornerNet-Lite/core/models/py_utils/data_parallel.py", line 66, in forward
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids, self.chunk_sizes)
File "/data/mj/CornerNet-Lite/core/models/py_utils/data_parallel.py", line 77, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim, chunk_sizes=self.chunk_sizes)
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 30, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim, chunk_sizes) if inputs else []
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 25, in scatter
return scatter_map(inputs)
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 18, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 20, in scatter_map
return list(map(list, zip(*map(scatter_map, obj))))
File "/data/mj/CornerNet-Lite/core/models/py_utils/scatter_gather.py", line 15, in scatter_map
return Scatter.apply(target_gpus, chunk_sizes, dim, obj)
File "/home/mj/.local/lib/python3.5/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
File "/home/mj/.local/lib/python3.5/site-packages/torch/cuda/comm.py", line 148, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: invalid device ordinal (exchangeDevice at /pytorch/aten/src/ATen/cuda/detail/CUDAGuardImpl.h:28)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f08e8ee5021 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f08e8ee48ea in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #2: + 0x4e426f (0x7f09235f526f in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x8cdfa2 (0x7f08e99c3fa2 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #4: + 0xa14ae5 (0x7f08e9b0aae5 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #5: at::TypeDefault::copy(at::Tensor const&, bool, c10::optional<c10::Device>) const + 0x56 (0x7f08e9c47c76 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #6: + 0x977f47 (0x7f08e9a6df47 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #7: at::native::to(at::Tensor const&, at::TensorOptions const&, bool, bool) + 0x295 (0x7f08e9a6faf5 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #8: at::TypeDefault::to(at::Tensor const&, at::TensorOptions const&, bool, bool) const + 0x17 (0x7f08e9c0e4f7 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #9: torch::autograd::VariableType::to(at::Tensor const&, at::TensorOptions const&, bool, bool) const + 0x17a (0x7f08e814ebaa in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #10: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>, std::allocator<c10::optional<at::cuda::CUDAStream> > > > const&) + 0x391 (0x7f09235f75f1 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x4ebd4f (0x7f09235fcd4f in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x11642e (0x7f092322742e in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)

frame #15: python() [0x53fc97]
frame #18: python() [0x4ec2e3]
frame #21: THPFunction_apply(_object*, _object*) + 0x581 (0x7f0923423ab1 in /home/mj/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #25: python() [0x4ec2e3]
frame #27: python() [0x535f0e]
frame #32: python() [0x4ec2e3]
frame #34: python() [0x535f0e]
frame #38: python() [0x5401ef]
frame #40: python() [0x5401ef]
frame #42: python() [0x53fc97]
frame #46: python() [0x4ec3f7]
frame #50: python() [0x4ec2e3]
frame #52: python() [0x4fbfce]
frame #54: python() [0x574db6]
frame #58: python() [0x4ec3f7]
frame #62: python() [0x5401ef]

What is wrong?

heilaw (Contributor) commented Apr 24, 2019

This error usually happens when the number of available GPUs is less than the number of GPUs requested in the configuration file. By default, CornerNet-Saccade and CornerNet-Squeeze require 4 GPUs to train. You can adjust that by changing the batch size and chunk sizes in the config files. The chunk sizes indicate the number of images per GPU, and the sum of the chunk sizes should equal the batch size.
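
If it helps, here is a quick sanity check you can run before launching train.py: it compares the chunk sizes in a config against the number of GPUs visible to PyTorch. This is only a minimal sketch; it assumes the JSON config keeps batch_size and chunk_sizes under a "system" section (as the CornerNet-Lite configs do), and the path configs/CornerNet_Saccade.json is just an example, so adjust both to your setup.

```python
import json

import torch

# Example path -- point this at whichever config you actually train with.
config_path = "configs/CornerNet_Saccade.json"

with open(config_path) as f:
    config = json.load(f)

# Assumption: batch_size and chunk_sizes live under the "system" section
# of the JSON config, as in the CornerNet-Lite config files.
batch_size = config["system"]["batch_size"]
chunk_sizes = config["system"]["chunk_sizes"]
num_gpus = torch.cuda.device_count()

# Each chunk is scattered to one GPU, so there must be at least as many
# visible GPUs as chunks, and the chunks must add up to the full batch.
assert len(chunk_sizes) <= num_gpus, \
    "config expects {} GPUs but only {} are visible".format(len(chunk_sizes), num_gpus)
assert sum(chunk_sizes) == batch_size, \
    "sum(chunk_sizes)={} does not match batch_size={}".format(sum(chunk_sizes), batch_size)

print("chunk_sizes and batch_size are consistent with the available GPUs")
```

For example, to train on a single GPU you could set the batch size to 4 and the chunk sizes to [4]; the exact values depend on how much memory your GPU has.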

mjanddy (Author) commented Apr 24, 2019

Thanks

mjanddy closed this as completed Apr 24, 2019
@SeeeeShiwei

Thanks


FMsunyh commented Sep 11, 2019

@heilaw
Thanks
