
Multiple GPU training #4

Closed
Muyiyunzi opened this issue Apr 28, 2022 · 12 comments

Muyiyunzi commented Apr 28, 2022

Hi,
First of all, thanks a lot for your efforts on MonoDETR!
This multi-GPU training problem has nearly driven me crazy these past few days. I intended to reproduce your work on two 16 GB GPUs with batchsize=16, so I set 'gpu_ids' to '0,1' in the trainer section of the config .yaml file and CUDA_VISIBLE_DEVICES=0,1 in train.sh. However, training failed in matcher.py with an error saying the batch size there is 8 while the targets have 16, so the matching clearly breaks.
I also tried adding python -m torch.distributed.launch --nproc_per_node=8 --use_env to train.sh, but that failed as well.
Then I debugged it: right before the inputs are fed into the model, i.e. at line 131 of trainer_helper.py, their size is 16x3x384x1280, yet the outputs are a dict of 7 entries containing tensors of size 8, 50, .... I also added print statements in the model's forward method and they appeared only once, meaning forward was called only once.
This machine has run the monodle code with 2 GPUs successfully many times, and forward should theoretically be called twice if the data were split into two halves, one per GPU.
I wonder if this is a version problem, since monodle uses a much older version of PyTorch.
By the way, my environment is Python 3.8.13, PyTorch 1.11.0, CUDA 10.2. Looking forward to your insights on this! Or, if someone has successfully trained this work with multiple GPUs, please enlighten me. Many thanks!

czy341181 commented Apr 29, 2022

@Muyiyunzi, I solved this with the following changes.

We don't need to pass the 'targets' parameter to the model; it is a list (although I don't know why that matters).
So at line 131 of trainer_helper.py, change

outputs = self.model(inputs, calibs, targets, img_sizes) ---> outputs = self.model(inputs, calibs, img_sizes)

You should also modify the forward method accordingly in monodetr.py.
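
For context, a toy sketch (not MonoDETR code) of why this helps: nn.DataParallel replicates the model on each GPU and splits every tensor argument along dim 0, then gathers the replica outputs back into the full batch. Non-tensor arguments such as a list of per-image target dicts are scattered by their own rules rather than left intact, so keeping targets out of the wrapped forward avoids the batch-size mismatch seen in matcher.py.

```python
# Toy illustration (not MonoDETR code): each replica sees only its chunk of the batch.
import torch
import torch.nn as nn

class Toy(nn.Module):
    def forward(self, images, calibs, img_sizes):
        print("replica batch size:", images.shape[0])  # prints 8 twice for a batch of 16 on 2 GPUs
        return images.mean(dim=(1, 2, 3))

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(Toy().cuda(), device_ids=[0, 1])
    images = torch.randn(16, 3, 384, 1280, device="cuda")  # same shape as the inputs reported above
    calibs = torch.randn(16, 3, 4, device="cuda")           # shapes here are illustrative only
    img_sizes = torch.randn(16, 2, device="cuda")
    out = model(images, calibs, img_sizes)
    print("gathered output:", out.shape)                    # torch.Size([16]) after gathering
```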

Muyiyunzi (Author) commented Apr 29, 2022

@czy341181 Great, it worked!
I did notice that the targets parameter is not used inside the MonoDETR model; it now seems that passing those dict values interferes with how the data is split across the GPUs. Thanks a lot!
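
For anyone curious, this splitting behaviour can be checked directly with torch.nn.parallel.scatter, which is what DataParallel applies to its forward arguments. A small sketch with toy tensors (requires 2 GPUs):

```python
# Toy check: a dict of batch tensors is rebuilt per GPU with every tensor split along
# dim 0, which is why batch-sized target structures get chopped up when passed through
# a DataParallel forward.
import torch
from torch.nn.parallel import scatter

if torch.cuda.device_count() >= 2:
    targets = {"labels": torch.arange(16, device="cuda"), "boxes": torch.randn(16, 4, device="cuda")}
    per_gpu = scatter(targets, [0, 1])                             # one dict per GPU
    print(per_gpu[0]["labels"].shape, per_gpu[1]["labels"].shape)  # torch.Size([8]) twice
```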

@Muyiyunzi (Author)

To sum up, if you want to train the model with multiple GPUs (using 2 GPUs here as an example):

  1. In monodetr.py, line 129, delete 'targets' from the forward method.
  2. In trainer_helper.py, line 131, delete 'targets' from the model call.
  3. In tester_helper.py, line 79, delete 'targets' from the model call.
  4. Set 'gpu_ids' in the trainer section of your config .yaml file to '0,1' (see the sketch after this list).
  5. Train via the .sh file with python tools/train_val.py --config $@ as in the readme.md. It works either way, whether or not you prepend CUDA_VISIBLE_DEVICES=0,1.
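
Regarding step 4, here is a hypothetical sketch of how a gpu_ids string such as '0,1' is commonly consumed by this kind of trainer (the repo's actual code may differ): parse it into device indices and wrap the model in nn.DataParallel.

```python
# Hypothetical sketch only -- not the repo's exact code.
import torch
import torch.nn as nn

gpu_ids = [int(i) for i in "0,1".split(",")]  # value taken from the trainer section of the .yaml
model = nn.Linear(10, 2)                      # stand-in for the MonoDETR model
if torch.cuda.is_available() and len(gpu_ids) > 1:
    # Every call to model(...) now splits the batch across GPUs 0 and 1.
    model = nn.DataParallel(model.cuda(gpu_ids[0]), device_ids=gpu_ids)
```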

@ZrrSkywalker (Owner)

Thanks for your helpful summary!

@zehuichen123

Hi @Muyiyunzi! Thanks for your multi-GPU tips! I wonder whether the performance stays the same when training with multiple GPUs. Have you tried it?

@Muyiyunzi (Author)

Hi @zehuichen123, I've run it twice with 2 GPUs and batchsize 16, leaving the other params unchanged. I got 18.62 (val) at epoch 163 and 19.09 at epoch 164, which were the best results of the two runs respectively.

@zehuichen123

Thanks for your reply! I will give it a try!

@simoneagliano

Hello, thanks for pointing that out.
I'm running into issues because the model's init function expects targets as an input and we're no longer providing it.
I guess I should also remove it from every function where it shows up:

  • get_loss
  • forward
  • loss_labels
  • loss_cardinality
    and so on...
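
For what it's worth, the summary above only removes targets from the model's forward call. A toy sketch of the implied pattern (placeholder names, not MonoDETR's exact API): the loss side can keep receiving targets, because it runs on the gathered outputs on a single device.

```python
# Toy sketch with placeholder names (not MonoDETR's exact API): the wrapped model no
# longer takes targets, but the loss computation still receives them directly.
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 4).to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)                       # splits the batch in forward

inputs = torch.randn(16, 8, device=device)
targets = torch.randint(0, 4, (16,), device=device)      # stand-in for the per-image target dicts

outputs = model(inputs)                                  # forward called without targets
loss = F.cross_entropy(outputs, targets)                 # the criterion still gets targets
```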

simoneagliano commented Nov 27, 2024

Hello, shall I open a new issue? @Muyiyunzi

@simoneagliano

@zehuichen123 did you give it a shot?

@simoneagliano

@czy341181 should I run it as CUDA_VISIBLE_DEVICES=0,1 python tools/train_val.py?

@Carmen279252

(quoting @Muyiyunzi's multi-GPU training summary above)

Thanks for your reply, but how do I run the inference code with multiple GPUs? I run test.sh, which calls CUDA_VISIBLE_DEVICES=4,5 python tools/train_val.py --config $@ -e, but it fails with:

Traceback (most recent call last):
  File "tools/train_val.py", line 111, in <module>
    main()
  File "tools/train_val.py", line 66, in main
    tester.test()
  File "/data1/xmx/MonoDETR/lib/helpers/tester_helper.py", line 36, in test
    load_checkpoint(model=self.model,
  File "/data1/xmx/MonoDETR/lib/helpers/save_helper.py", line 39, in load_checkpoint
    model.load_state_dict(checkpoint['model_state'])
  File "/data1/xmx/anaconda/envs/monodetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DataParallel:
  Missing key(s) in state_dict: "module.depthaware_transformer.level_embed", "module.depthaware_transformer.encoder.layers.0.self_attn.sampling_offsets.weight", "module.depthaware_transformer.encoder.laye .........

It looks like loading the checkpoint into the DataParallel-wrapped model with multiple GPUs is what errors out.
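
A common workaround for this kind of key mismatch (independent of MonoDETR) is to normalize the 'module.' prefix that nn.DataParallel adds before calling load_state_dict. A hedged sketch, where the 'model_state' key comes from the traceback above and the helper name is made up:

```python
import torch

def load_matching_state_dict(model, checkpoint_path):
    """Hypothetical helper: align checkpoint keys with a possibly DataParallel-wrapped model."""
    state = torch.load(checkpoint_path, map_location="cpu")["model_state"]
    wants_prefix = isinstance(model, torch.nn.DataParallel)
    has_prefix = next(iter(state)).startswith("module.")
    if wants_prefix and not has_prefix:
        state = {"module." + k: v for k, v in state.items()}       # add the prefix DataParallel expects
    elif has_prefix and not wants_prefix:
        state = {k[len("module."):]: v for k, v in state.items()}  # strip it for a plain model
    model.load_state_dict(state)
```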
