
Multiple GPU training #4

Closed
Muyiyunzi opened this issue Apr 28, 2022 · 12 comments

Muyiyunzi commented Apr 28, 2022

Hi,
First of all, thanks a lot for your efforts on MonoDETR!
This multi-GPU training problem has nearly driven me crazy these past few days. I intended to reproduce your work on two 16 GB GPUs with batchsize=16, so I set 'gpu_ids' to '0,1' in the trainer section of the config .yaml file and CUDA_VISIBLE_DEVICES=0,1 in train.sh. However, training failed in matcher.py with an error saying the batch size there is 8 while the targets have 16, so the matching clearly breaks.
I also tried adding python -m torch.distributed.launch --nproc_per_node=8 --use_env to train.sh, but that failed as well.
Then I debugged it: right before the inputs are fed into the model, i.e. at line 131 of trainer_helper.py, their size is 16x3x384x1280, yet the outputs are a dict of 7 entries containing tensors of size 8, 50, .... I also added print statements in the model's forward method and they appeared only once, meaning forward was called only once.
This machine has run the monodle code with 2 GPUs successfully many times, and forward should theoretically be called twice if the data were split into two halves, one per GPU.
I wonder if this is a version problem, since monodle uses a much older version of PyTorch.
By the way, my environment is Python 3.8.13, PyTorch 1.11.0, CUDA 10.2. Looking forward to your insights on this! Or, if someone has successfully trained this work with multiple GPUs, please enlighten me. Many thanks!

czy341181 commented Apr 29, 2022

@Muyiyunzi, I solved this with the following changes.

We don't need to pass the 'targets' parameter to the model; it is a list (although I don't know why that matters).
So at line 131 of trainer_helper.py, change

outputs = self.model(inputs, calibs, targets, img_sizes) ---> outputs = self.model(inputs, calibs, img_sizes)

You should also modify the forward method accordingly in monodetr.py.
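
For context, a toy sketch (not MonoDETR code) of why this helps: nn.DataParallel replicates the model on each GPU and splits every tensor argument along dim 0, then gathers the replica outputs back into the full batch. Non-tensor arguments such as a list of per-image target dicts are scattered by their own rules rather than left intact, so keeping targets out of the wrapped forward avoids the batch-size mismatch seen in matcher.py.

```python
# Toy illustration (not MonoDETR code): each replica sees only its chunk of the batch.
import torch
import torch.nn as nn

class Toy(nn.Module):
    def forward(self, images, calibs, img_sizes):
        print("replica batch size:", images.shape[0])  # prints 8 twice for a batch of 16 on 2 GPUs
        return images.mean(dim=(1, 2, 3))

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(Toy().cuda(), device_ids=[0, 1])
    images = torch.randn(16, 3, 384, 1280, device="cuda")  # same shape as the inputs reported above
    calibs = torch.randn(16, 3, 4, device="cuda")           # shapes here are illustrative only
    img_sizes = torch.randn(16, 2, device="cuda")
    out = model(images, calibs, img_sizes)
    print("gathered output:", out.shape)                    # torch.Size([16]) after gathering
```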

Muyiyunzi (Author) commented Apr 29, 2022

@czy341181 Great, it worked!
I did notice that the targets parameter is not used inside the MonoDETR model; it now seems that passing those dict values interferes with how the data is split across the GPUs. Thanks a lot!
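
For anyone curious, this splitting behaviour can be checked directly with torch.nn.parallel.scatter, which is what DataParallel applies to its forward arguments. A small sketch with toy tensors (requires 2 GPUs):

```python
# Toy check: a dict of batch tensors is rebuilt per GPU with every tensor split along
# dim 0, which is why batch-sized target structures get chopped up when passed through
# a DataParallel forward.
import torch
from torch.nn.parallel import scatter

if torch.cuda.device_count() >= 2:
    targets = {"labels": torch.arange(16, device="cuda"), "boxes": torch.randn(16, 4, device="cuda")}
    per_gpu = scatter(targets, [0, 1])                             # one dict per GPU
    print(per_gpu[0]["labels"].shape, per_gpu[1]["labels"].shape)  # torch.Size([8]) twice
```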

@Muyiyunzi (Author)

To sum up, if you want to train the model with multiple GPUs (using 2 GPUs here as an example):

  1. In monodetr.py, line 129, delete 'targets' from the forward method.
  2. In trainer_helper.py, line 131, delete 'targets' from the model call.
  3. In tester_helper.py, line 79, delete 'targets' from the model call.
  4. Set 'gpu_ids' in the trainer section of your config .yaml file to '0,1' (see the sketch after this list).
  5. Train via the .sh file with python tools/train_val.py --config $@ as in the readme.md. It works either way, whether or not you prepend CUDA_VISIBLE_DEVICES=0,1.
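
Regarding step 4, here is a hypothetical sketch of how a gpu_ids string such as '0,1' is commonly consumed by this kind of trainer (the repo's actual code may differ): parse it into device indices and wrap the model in nn.DataParallel.

```python
# Hypothetical sketch only -- not the repo's exact code.
import torch
import torch.nn as nn

gpu_ids = [int(i) for i in "0,1".split(",")]  # value taken from the trainer section of the .yaml
model = nn.Linear(10, 2)                      # stand-in for the MonoDETR model
if torch.cuda.is_available() and len(gpu_ids) > 1:
    # Every call to model(...) now splits the batch across GPUs 0 and 1.
    model = nn.DataParallel(model.cuda(gpu_ids[0]), device_ids=gpu_ids)
```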

@ZrrSkywalker (Owner)

Thanks for your helpful summary!

@zehuichen123

Hi @Muyiyunzi! Thanks for your multi-GPU tips! I wonder whether the performance stays the same when training with multiple GPUs. Have you tried it?

@Muyiyunzi (Author)

Hi @zehuichen123, I've run it twice with 2 GPUs and batchsize 16, leaving the other params unchanged. I got 18.62 (val) at epoch 163 and 19.09 at epoch 164, which were the best results of the two runs respectively.

@zehuichen123

Thanks for your reply! I will give it a try!

@simoneagliano

Hello, thanks for pointing that out.
I'm running into issues because the model's init function expects targets as an input and we're no longer providing it.
I guess I should also remove it from every function where it shows up:

  • get_loss
  • forward
  • loss_labels
  • loss_cardinality
    and so on...
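
For what it's worth, the summary above only removes targets from the model's forward call. A toy sketch of the implied pattern (placeholder names, not MonoDETR's exact API): the loss side can keep receiving targets, because it runs on the gathered outputs on a single device.

```python
# Toy sketch with placeholder names (not MonoDETR's exact API): the wrapped model no
# longer takes targets, but the loss computation still receives them directly.
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 4).to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)                       # splits the batch in forward

inputs = torch.randn(16, 8, device=device)
targets = torch.randint(0, 4, (16,), device=device)      # stand-in for the per-image target dicts

outputs = model(inputs)                                  # forward called without targets
loss = F.cross_entropy(outputs, targets)                 # the criterion still gets targets
```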

simoneagliano commented Nov 27, 2024

Hello, shall I open a new issue? @Muyiyunzi

@simoneagliano

@zehuichen123 did you give it a shot?

@simoneagliano

@czy341181 should I run it as CUDA_VISIBLE_DEVICES=0,1 python tools/train_val.py?

@Carmen279252

(quoting @Muyiyunzi's multi-GPU training summary above)

Thanks for your reply, but how do I run the inference code with multiple GPUs? I run test.sh, which calls CUDA_VISIBLE_DEVICES=4,5 python tools/train_val.py --config $@ -e, but it fails with:

Traceback (most recent call last):
  File "tools/train_val.py", line 111, in <module>
    main()
  File "tools/train_val.py", line 66, in main
    tester.test()
  File "/data1/xmx/MonoDETR/lib/helpers/tester_helper.py", line 36, in test
    load_checkpoint(model=self.model,
  File "/data1/xmx/MonoDETR/lib/helpers/save_helper.py", line 39, in load_checkpoint
    model.load_state_dict(checkpoint['model_state'])
  File "/data1/xmx/anaconda/envs/monodetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DataParallel:
  Missing key(s) in state_dict: "module.depthaware_transformer.level_embed", "module.depthaware_transformer.encoder.layers.0.self_attn.sampling_offsets.weight", "module.depthaware_transformer.encoder.laye .........

It looks like loading the checkpoint into the DataParallel-wrapped model with multiple GPUs is what errors out.
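
A common workaround for this kind of key mismatch (independent of MonoDETR) is to normalize the 'module.' prefix that nn.DataParallel adds before calling load_state_dict. A hedged sketch, where the 'model_state' key comes from the traceback above and the helper name is made up:

```python
import torch

def load_matching_state_dict(model, checkpoint_path):
    """Hypothetical helper: align checkpoint keys with a possibly DataParallel-wrapped model."""
    state = torch.load(checkpoint_path, map_location="cpu")["model_state"]
    wants_prefix = isinstance(model, torch.nn.DataParallel)
    has_prefix = next(iter(state)).startswith("module.")
    if wants_prefix and not has_prefix:
        state = {"module." + k: v for k, v in state.items()}       # add the prefix DataParallel expects
    elif has_prefix and not wants_prefix:
        state = {k[len("module."):]: v for k, v in state.items()}  # strip it for a plain model
    model.load_state_dict(state)
```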
