How to train with multiple GPUs #62
Have you solved this problem?
Following. @HaiJuntang @shawnnnkb
Reference: #4
@jing-turing get_loss
Thanks for your reply, but how do I run the inference code with multiple GPUs? I ran test.sh, which executes CUDA_VISIBLE_DEVICES=4,5 python tools/train_val.py --config $@ -e, but it fails with: Traceback (most recent call last):
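One detail worth checking for a command like the one above: CUDA_VISIBLE_DEVICES=4,5 exposes physical GPUs 4 and 5 to the process as cuda:0 and cuda:1, so any device index used inside the program is logical, not physical. The sketch below uses hypothetical helpers (visible_device_map and logical_gpu_ids are illustrative names, not part of MonoDETR) to show that remapping. It assumes the variable holds plain integer indices (CUDA also accepts GPU UUIDs, which this ignores) and that the config's gpu_ids are handed to PyTorch unchanged; check how tools/train_val.py actually consumes them.

```python
import os
from typing import Dict, List


def visible_device_map(env_value: str) -> Dict[int, int]:
    """Map each physical GPU index in a CUDA_VISIBLE_DEVICES string to the
    logical index the process sees (cuda:0, cuda:1, ...).

    Simplification: only plain integer indices are handled, not GPU UUIDs.
    """
    physical = [int(tok) for tok in env_value.split(",") if tok.strip()]
    return {phys: logical for logical, phys in enumerate(physical)}


def logical_gpu_ids(env_value: str) -> List[int]:
    """Device ids to list (e.g. as gpu_ids) when every visible GPU is used."""
    return list(range(len(visible_device_map(env_value))))


if __name__ == "__main__":
    print(visible_device_map("4,5"))  # {4: 0, 5: 1}
    print(logical_gpu_ids("4,5"))     # [0, 1]
    # Inspect the current process's setting, if any.
    print(os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
```

Under these assumptions, running with CUDA_VISIBLE_DEVICES=4,5 would pair with gpu_ids: [0,1] rather than [4,5].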
After setting gpu_ids: [0,1,2,3] in monodetr.yaml to train on multiple GPUs, I get the following error:
Traceback (most recent call last):
  File "tools/train_val.py", line 113, in <module>
    main()
  File "tools/train_val.py", line 100, in main
    trainer.train()
  File "/media/data2/tanghaijun/newMonoDETR/MonoDETR-main/lib/helpers/trainer_helper.py", line 76, in train
    self.train_one_epoch(epoch)
  File "/media/data2/tanghaijun/newMonoDETR/MonoDETR-main/lib/helpers/trainer_helper.py", line 137, in train_one_epoch
    outputs = self.model(inputs, calibs, targets, img_sizes, dn_args=dn_args)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() missing 4 required positional arguments: 'images', 'calibs', 'targets', and 'img_sizes'
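The thread does not identify the root cause. One plausible mechanism, stated here as an assumption: nn.DataParallel splits tensor inputs along the batch dimension with torch.chunk, which can return fewer chunks than there are GPUs (6 samples over 4 GPUs gives 3 chunks of 2), while a keyword argument that is not a tensor, such as dn_args if it is None or a plain Python object, is copied once per GPU. In PyTorch 1.x, scatter_kwargs then pads the shorter positional-argument list with empty tuples, so the surplus replicas call forward() with no positional arguments, which is consistent with the error above. The torch-free model below (chunk_count, simulate_scatter_kwargs and failing_replicas are hypothetical names) simplifies by assuming every positional input shares the same batch size.

```python
import math
from typing import List, Tuple

ARG_NAMES = ("images", "calibs", "targets", "img_sizes")


def chunk_count(batch_size: int, num_devices: int) -> int:
    """How many pieces torch.chunk() yields when splitting `batch_size`
    samples into `num_devices` parts; this can be fewer than num_devices."""
    if batch_size < 1 or num_devices < 1:
        raise ValueError("batch_size and num_devices must be positive")
    chunk_size = math.ceil(batch_size / num_devices)
    return math.ceil(batch_size / chunk_size)


def simulate_scatter_kwargs(batch_size: int, num_devices: int) -> List[Tuple[str, ...]]:
    """Return the positional-argument tuple each replica would receive.

    Simplified model of scatter_kwargs: tensor inputs are split into
    chunk_count() pieces, a non-tensor kwarg is replicated per device, and
    the shorter inputs list is padded with empty tuples.
    """
    inputs = [ARG_NAMES] * chunk_count(batch_size, num_devices)
    kwargs = [{"dn_args": None}] * num_devices  # assumed non-tensor kwarg: one copy per device
    if len(inputs) < len(kwargs):
        inputs.extend(() for _ in range(len(kwargs) - len(inputs)))
    return inputs


def failing_replicas(batch_size: int, num_devices: int) -> List[int]:
    """Replica indices whose forward() would receive no positional arguments."""
    per_replica = simulate_scatter_kwargs(batch_size, num_devices)
    return [i for i, args in enumerate(per_replica) if not args]


if __name__ == "__main__":
    for bs in (1, 2, 4, 6, 8):
        print(f"batch_size={bs}, 4 GPUs -> failing replicas: {failing_replicas(bs, 4)}")
```

Under this model, a per-step batch size that is an exact multiple of the GPU count is always split into one chunk per GPU; if the DataLoader keeps a smaller final batch, that batch could still trigger the error. Whether this explains this particular report has not been verified here.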