
How to train with multiple GPUs #62

Open · HaiJuntang opened this issue Apr 13, 2024 · 5 comments

@HaiJuntang

After configuring gpu_ids: [0,1,2,3] in monodetr.yaml, multi-GPU training fails with the following error:

```
Traceback (most recent call last):
  File "tools/train_val.py", line 113, in <module>
    main()
  File "tools/train_val.py", line 100, in main
    trainer.train()
  File "/media/data2/tanghaijun/newMonoDETR/MonoDETR-main/lib/helpers/trainer_helper.py", line 76, in train
    self.train_one_epoch(epoch)
  File "/media/data2/tanghaijun/newMonoDETR/MonoDETR-main/lib/helpers/trainer_helper.py", line 137, in train_one_epoch
    outputs = self.model(inputs, calibs, targets, img_sizes, dn_args=dn_args)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/media/data2/tanghaijun/anaconda3/envs/newMonoDETR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() missing 4 required positional arguments: 'images', 'calibs', 'targets', and 'img_sizes'
```
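
For context, this exact TypeError appears with nn.DataParallel whenever one positional argument scatters into fewer chunks than the others: torch's scatter zips the per-argument chunk lists (truncating to the shortest) and then pads the missing replicas with empty tuples, so replica 1 calls forward() with no positional arguments at all. A minimal reproduction sketch, assuming at least 2 visible GPUs (`Toy` and the argument names just mirror the traceback above, not MonoDETR's real signatures):

```python
import torch
import torch.nn as nn

# Toy module whose forward mirrors the signature from the traceback above.
class Toy(nn.Module):
    def forward(self, images, calibs, targets, img_sizes, dn_args=None):
        return images.mean()

model = nn.DataParallel(Toy().cuda(), device_ids=[0, 1])

images    = torch.randn(4, 3, 32, 32, device="cuda")  # splits into 2 chunks
calibs    = torch.randn(4, 3, 4, device="cuda")       # splits into 2 chunks
targets   = torch.randn(1, 5, device="cuda")          # dim 0 == 1 -> only 1 chunk
img_sizes = torch.full((4, 2), 32, device="cuda")     # splits into 2 chunks

# scatter() zips the per-argument chunk lists, truncating the positional
# tuple to 1 chunk, while the kwargs dict is replicated to both devices;
# scatter_kwargs() then pads the positional side with empty tuples, so
# replica 1 raises:
#   TypeError: forward() missing 4 required positional arguments: ...
model(images, calibs, targets, img_sizes, dn_args=(1, 2))
```

If `targets` (a per-image structure rather than a full batch tensor) is the argument that fails to split evenly, that would produce exactly this pattern.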

@shawnnnkb

Have you solved this problem?

@simoneagliano

Following @HaiJuntang @shawnnnkb

@jing-turing

Reference #4

@simoneagliano

@jing-turing
Hello, thanks for pointing that out.
I ran into some issues because the model's init function expects the targets as input and we're not providing them.
I guess I should also remove it from each function where it shows up (see the sketch after this list):

get_loss
forward
loss_labels
loss_cardinality
and so on...
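
If it helps, the usual DETR-style workaround is to keep `targets` out of the DataParallel forward entirely and hand them to the criterion on the main device, so scatter never has to split them. A minimal sketch with toy stand-ins (`ToyDetector` and `ToyCriterion` are illustrative, not MonoDETR's actual classes; it assumes 2 GPUs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The wrapped forward takes tensors only, so every positional argument
# scatters cleanly across replicas; note: no `targets` parameter.
class ToyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(3 * 32 * 32, 8)

    def forward(self, images, calibs, img_sizes):
        return self.head(images.flatten(1))

# The criterion consumes the list-of-dicts targets on the main device,
# after DataParallel has already gathered the outputs.
class ToyCriterion(nn.Module):
    def forward(self, outputs, targets):
        labels = torch.stack([t["label"] for t in targets]).to(outputs.device)
        return {"loss_ce": F.cross_entropy(outputs, labels)}

model = nn.DataParallel(ToyDetector().cuda(), device_ids=[0, 1])
criterion = ToyCriterion()

images    = torch.randn(4, 3, 32, 32, device="cuda")
calibs    = torch.randn(4, 3, 4, device="cuda")
img_sizes = torch.full((4, 2), 32, device="cuda")
targets   = [{"label": torch.tensor(i % 8)} for i in range(4)]  # never scattered

outputs = model(images, calibs, img_sizes)          # replicas see tensors only
loss = sum(criterion(outputs, targets).values())    # loss outside the replicas
loss.backward()
```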

@Carmen279252

> @jing-turing Hello, thanks for pointing that out. I ran into some issues because the model's init function expects the targets as input and we're not providing them. I guess I should also remove it from each function where it shows up:
>
> get_loss, forward, loss_labels, loss_cardinality, and so on...

Thanks for your reply, but how can I run the inference code with multiple GPUs? I run test.sh, which does CUDA_VISIBLE_DEVICES=4,5 python tools/train_val.py --config $@ -e, but it errors out as:

```
Traceback (most recent call last):
  File "tools/train_val.py", line 111, in <module>
    main()
  File "tools/train_val.py", line 66, in main
    tester.test()
  File "/data1/xmx/MonoDETR/lib/helpers/tester_helper.py", line 36, in test
    load_checkpoint(model=self.model,
  File "/data1/xmx/MonoDETR/lib/helpers/save_helper.py", line 39, in load_checkpoint
    model.load_state_dict(checkpoint['model_state'])
  File "/data1/xmx/anaconda/envs/monodetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DataParallel:
Missing key(s) in state_dict: "module.depthaware_transformer.level_embed", "module.depthaware_transformer.encoder.layers.0.self_attn.sampling_offsets.weight", "module.depthaware_transformer.encoder.laye .........
```

It seems like loading the checkpoint under DataParallel with multiple GPUs is what errors out.
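
The "Missing key(s) in state_dict for DataParallel: module.…" message usually means the checkpoint was saved from a bare (unwrapped) model but is being loaded into an nn.DataParallel wrapper. A hedged sketch of the usual prefix fix (the 'model_state' key is taken from the save_helper.py line in the traceback above; whether MonoDETR saves with or without the wrapper is an assumption to verify):

```python
import torch
import torch.nn as nn

def load_into_data_parallel(model: nn.DataParallel, ckpt_path: str) -> None:
    """Align the "module." prefix between a checkpoint saved from a bare
    model and a DataParallel-wrapped one."""
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    state = checkpoint["model_state"]
    # Option 1: add the "module." prefix that the wrapper expects.
    state = {k if k.startswith("module.") else "module." + k: v
             for k, v in state.items()}
    model.load_state_dict(state)
    # Option 2: bypass the wrapper and load into the bare module instead:
    # model.module.load_state_dict(checkpoint["model_state"])
```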
