
How to train with multiple GPUs #62

Open · HaiJuntang opened this issue Apr 13, 2024 · 5 comments

@HaiJuntang

After configuring gpu_ids: [0,1,2,3] in monodetr.yaml, multi-GPU training fails with the following error:

```
Traceback (most recent call last):
  File "tools/train_val.py", line 113, in <module>
    main()
  File "tools/train_val.py", line 100, in main
    trainer.train()
  File "/media/data2/tanghaijun/newMonoDETR/MonoDETR-main/lib/helpers/trainer_helper.py", line 76, in train
    self.train_one_epoch(epoch)
  File "/media/data2/tanghaijun/newMonoDETR/MonoDETR-main/lib/helpers/trainer_helper.py", line 137, in train_one_epoch
    outputs = self.model(inputs, calibs, targets, img_sizes, dn_args=dn_args)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() missing 4 required positional arguments: 'images', 'calibs', 'targets', and 'img_sizes'
```
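
For context, this exact TypeError appears with nn.DataParallel whenever one positional argument scatters into fewer chunks than the others: torch's scatter zips the per-argument chunk lists (truncating to the shortest) and then pads the missing replicas with empty tuples, so replica 1 calls forward() with no positional arguments at all. A minimal reproduction sketch, assuming at least 2 visible GPUs (`Toy` and the argument names just mirror the traceback above, not MonoDETR's real signatures):

```python
import torch
import torch.nn as nn

# Toy module whose forward mirrors the signature from the traceback above.
class Toy(nn.Module):
    def forward(self, images, calibs, targets, img_sizes, dn_args=None):
        return images.mean()

model = nn.DataParallel(Toy().cuda(), device_ids=[0, 1])

images    = torch.randn(4, 3, 32, 32, device="cuda")  # splits into 2 chunks
calibs    = torch.randn(4, 3, 4, device="cuda")       # splits into 2 chunks
targets   = torch.randn(1, 5, device="cuda")          # dim 0 == 1 -> only 1 chunk
img_sizes = torch.full((4, 2), 32, device="cuda")     # splits into 2 chunks

# scatter() zips the per-argument chunk lists, truncating the positional
# tuple to 1 chunk, while the kwargs dict is replicated to both devices;
# scatter_kwargs() then pads the positional side with empty tuples, so
# replica 1 raises:
#   TypeError: forward() missing 4 required positional arguments: ...
model(images, calibs, targets, img_sizes, dn_args=(1, 2))
```

If `targets` (a per-image structure rather than a full batch tensor) is the argument that fails to split evenly, that would produce exactly this pattern.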

@shawnnnkb

Have you solved this problem?

@simoneagliano

Following @HaiJuntang @shawnnnkb

@jing-turing

Reference #4

@simoneagliano

@jing-turing
Hello, thanks for pointing that out.
I ran into some issues because the model's init function expects the targets as input and we're not providing them.
I guess I should also remove it from each function where it shows up (see the sketch after this list):

get_loss
forward
loss_labels
loss_cardinality
and so on...
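
If it helps, the usual DETR-style workaround is to keep `targets` out of the DataParallel forward entirely and hand them to the criterion on the main device, so scatter never has to split them. A minimal sketch with toy stand-ins (`ToyDetector` and `ToyCriterion` are illustrative, not MonoDETR's actual classes; it assumes 2 GPUs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The wrapped forward takes tensors only, so every positional argument
# scatters cleanly across replicas; note: no `targets` parameter.
class ToyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(3 * 32 * 32, 8)

    def forward(self, images, calibs, img_sizes):
        return self.head(images.flatten(1))

# The criterion consumes the list-of-dicts targets on the main device,
# after DataParallel has already gathered the outputs.
class ToyCriterion(nn.Module):
    def forward(self, outputs, targets):
        labels = torch.stack([t["label"] for t in targets]).to(outputs.device)
        return {"loss_ce": F.cross_entropy(outputs, labels)}

model = nn.DataParallel(ToyDetector().cuda(), device_ids=[0, 1])
criterion = ToyCriterion()

images    = torch.randn(4, 3, 32, 32, device="cuda")
calibs    = torch.randn(4, 3, 4, device="cuda")
img_sizes = torch.full((4, 2), 32, device="cuda")
targets   = [{"label": torch.tensor(i % 8)} for i in range(4)]  # never scattered

outputs = model(images, calibs, img_sizes)          # replicas see tensors only
loss = sum(criterion(outputs, targets).values())    # loss outside the replicas
loss.backward()
```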

@Carmen279252

> @jing-turing Hello, thanks for pointing that out. I ran into some issues because the model's init function expects the targets as input and we're not providing them. I guess I should also remove it from each function where it shows up:
>
> get_loss, forward, loss_labels, loss_cardinality, and so on...

Thanks for your reply, but how can I run the inference code with multiple GPUs? I run test.sh, which does CUDA_VISIBLE_DEVICES=4,5 python tools/train_val.py --config $@ -e, but it errors out as:

```
Traceback (most recent call last):
  File "tools/train_val.py", line 111, in <module>
    main()
  File "tools/train_val.py", line 66, in main
    tester.test()
  File "/data1/xmx/MonoDETR/lib/helpers/tester_helper.py", line 36, in test
    load_checkpoint(model=self.model,
  File "/data1/xmx/MonoDETR/lib/helpers/save_helper.py", line 39, in load_checkpoint
    model.load_state_dict(checkpoint['model_state'])
  File "/data1/xmx/anaconda/envs/monodetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DataParallel:
Missing key(s) in state_dict: "module.depthaware_transformer.level_embed", "module.depthaware_transformer.encoder.layers.0.self_attn.sampling_offsets.weight", "module.depthaware_transformer.encoder.laye .........
```

It seems like loading the checkpoint under DataParallel with multiple GPUs is what errors out.
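
The "Missing key(s) in state_dict for DataParallel: module.…" message usually means the checkpoint was saved from a bare (unwrapped) model but is being loaded into an nn.DataParallel wrapper. A hedged sketch of the usual prefix fix (the 'model_state' key is taken from the save_helper.py line in the traceback above; whether MonoDETR saves with or without the wrapper is an assumption to verify):

```python
import torch
import torch.nn as nn

def load_into_data_parallel(model: nn.DataParallel, ckpt_path: str) -> None:
    """Align the "module." prefix between a checkpoint saved from a bare
    model and a DataParallel-wrapped one."""
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    state = checkpoint["model_state"]
    # Option 1: add the "module." prefix that the wrapper expects.
    state = {k if k.startswith("module.") else "module." + k: v
             for k, v in state.items()}
    model.load_state_dict(state)
    # Option 2: bypass the wrapper and load into the bare module instead:
    # model.module.load_state_dict(checkpoint["model_state"])
```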
