Multiple GPU training #4
@Muyiyunzi , I solved this with the following method: we don't need to pass the 'targets' parameter, since it is a list (although I don't know why). Also, you should modify it in monodetr.py.
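For reference, here is a minimal sketch of that workaround, assuming a DETR-style setup where the dataloader yields a batched image tensor plus a Python list of per-image target dicts. It is not the repository's actual code: detector, criterion, optimizer, and loader below are hypothetical placeholders for the objects built in the training scripts.

```python
# Minimal sketch of the workaround, not MonoDETR's actual code.
# nn.DataParallel chunks tensor arguments along dim 0, but a Python list of
# per-image target dicts is not split into per-GPU halves, which is what made
# the matcher see a batch of 8 against 16 targets. Keeping `targets` out of
# the wrapped forward and giving it only to the loss avoids the mismatch.
import torch
import torch.nn as nn

model = nn.DataParallel(detector, device_ids=[0, 1]).cuda()  # detector: placeholder model

for inputs, targets in loader:                 # loader: placeholder dataloader
    inputs = inputs.cuda(non_blocking=True)
    outputs = model(inputs)                    # only tensors pass through forward -> split 8/8
    targets = [{k: v.cuda() for k, v in t.items()} for t in targets]
    loss_dict = criterion(outputs, targets)    # loss/matcher runs once on the full batch of 16
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```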
@czy341181 Great! It worked!
To sum up, if you want to train the model with multiple GPUs (using 2 GPUs here as an example):
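A hedged reconstruction of those steps, pieced together from the commands and fixes quoted elsewhere in this thread (the config path below is a placeholder, not the original snippet):

```bash
# 1. In the trainer section of the config .yaml, set gpu_ids to '0,1'.
# 2. Modify monodetr.py so the list-typed 'targets' is not passed through the
#    DataParallel forward (see the fix above).
# 3. Launch training with both GPUs visible:
CUDA_VISIBLE_DEVICES=0,1 python tools/train_val.py --config <path/to/config.yaml>
```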
Thanks for your helpful summary!
Hi, @Muyiyunzi ! Thanks for your multi-GPU tips! I wonder if the performance remains the same when training with multiple GPUs? Have you given it a try?
Hi @zehuichen123 , I've tried twice with 2 GPUs, batch size 16, leaving other params unchanged. I got 18.62 (val) at epoch 163 and 19.09 at epoch 164, which are the best results of the two runs.
Thanks for your reply! I will give it a try! |
Hello, thanks for pointing that out.
Hello, shall I open a new issue? |
@zehuichen123 did you give it a shot?
@czy341181 Shall I use CUDA_VISIBLE_DEVICES=0,1 python tools/train_val.py?
Thanks for your reply, but how can I run the inference code with multiple GPUs? I ran test.sh, which is CUDA_VISIBLE_DEVICES=4,5 python tools/train_val.py --config $@ -e, but it fails with: Traceback (most recent call last):
Hi,

First of all, thanks a lot for your efforts on MonoDETR!

I nearly went crazy these days over this multi-GPU training problem. I intended to reproduce your work using two 16G GPUs with batchsize=16, so I changed 'gpu_ids' to '0,1' in the trainer section of the config .yaml file and set CUDA_VISIBLE_DEVICES=0,1 in train.sh. However, it failed in matcher.py, complaining that the batch size there is 8 while the target is 16, so the matching clearly failed.

I also tried adding
python -m torch.distributed.launch --nproc_per_node=8 --use_env
to train.sh, but that failed as well. Then I turned on debug mode, and it seems that right before the inputs are fed into the model (i.e. line 131 of trainer_helper.py), the inputs have size 16,3,384,1280, while the outputs are a dict of 7 entries containing tensors of size 8, 50, .... I also added print statements to the model's forward method, and they appeared only once, which means forward was called only once.

This machine has run the monodle code with 2 GPUs successfully many times, and the forward method should theoretically be called twice if the data were split into 2 pieces and fed to the 2 GPUs accordingly. I wonder if this is a version problem, since monodle uses a much older version of PyTorch.

BTW, my env is Python 3.8.13, PyTorch 1.11.0, CUDA 10.2. Looking forward to your insights on this! Or, if someone has already trained this work successfully on multiple GPUs, please enlighten me. Many thanks!
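A small standalone sketch that reproduces the kind of 8-vs-16 mismatch described above, assuming the cause is how nn.DataParallel scatters arguments: tensor inputs are chunked along the batch dimension, but a plain Python list of per-image target dicts keeps its full length on every replica. This is an illustration only, not code from this repo; the Probe module and the "boxes" key are made up.

```python
# Illustration of the batch-size mismatch seen with nn.DataParallel:
# images get split 8/8 across two GPUs, but the targets list stays at 16.
import torch
import torch.nn as nn

class Probe(nn.Module):
    def forward(self, images, targets):
        # On 2 GPUs with a batch of 16, each replica sees images.shape[0] == 8,
        # while len(targets) is still 16 -- the mismatch matcher.py complains about.
        print(f"device={images.device}  images={images.shape[0]}  targets={len(targets)}")
        return images.sum().view(1)

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(Probe().cuda(), device_ids=[0, 1])
    images = torch.randn(16, 3, 384, 1280).cuda()
    # a DETR-style list of per-image target dicts
    targets = [{"boxes": torch.rand(4, 4).cuda()} for _ in range(16)]
    model(images, targets)
```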