
Issue with running on multiple GPUs #483

Open
mackpack opened this issue Jan 2, 2019 · 5 comments
mackpack commented Jan 2, 2019

I am having an issue with running CycleGAN on multiple GPUs. It works well when running on a single GPU (albeit very slowly, as expected) using

 python3 train.py --dataroot ./datasets/cezanne2photo --name cezanne2photo_cyclegan --model cycle_gan

Now when I try to train on multiple GPUs using

python3 train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan --gpu_ids 0,1,2,3 --batch_size 16 --norm instance

I have also tried running it with and without the --norm instance parameter, and with --batch_size 4. This always leads to the same result:

The program stops at "create web directory" (I have let it run for a couple of days at this point without any noticeable progress). A single python3 process puts one thread under full load; none of the other python3 processes get any CPU time. Killing that process also seems impossible - I have had to restart the machine every time. None of the GPUs are ever under load, and barely any of their memory is used.

I am using Python 3.5.2, CUDA 9.2, PyTorch 1.0, and cuDNN 7.4.1. The system has four GTX 1080 Ti GPUs and an AMD Ryzen Threadripper 1950X.

junyanz (Owner) commented Jan 2, 2019

Your command with multi-GPU training works for me. I am using Python 3.6.4, PyTorch 0.4.1, CUDA 9.0, and cuDNN 7.0.5. @taesungp

tangtao1999 commented

> (quotes @mackpack's original report in full)

It also happens to me. How can I solve this? Guys, I need your help!

banyet1 commented Mar 29, 2019

> (quotes the report above in full)

The same issue for me, any suggestions?

junyanz (Owner) commented Mar 30, 2019

I haven't been able to reproduce the error on my machine. @taesungp @ssnl

ssnl (Collaborator) commented Mar 30, 2019

Could you verify that the basic CUDA communication primitives work on your machine? E.g., try torch.cuda.comm.broadcast. Other primitives you can try are listed at https://pytorch.org/docs/stable/cuda.html#communication-collectives
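A minimal sketch of that check, assuming the four-GPU setup from the report (torch.cuda.comm.broadcast is the documented broadcast primitive; the function name check_broadcast and the device ids are illustrative):

```python
# Sanity-check PyTorch's CUDA communication primitives by broadcasting
# a small tensor to every requested GPU and verifying the copies.
import torch
import torch.cuda.comm


def check_broadcast(gpu_ids=(0, 1, 2, 3)):
    """Return True if broadcast works, False if it cannot be tested or fails."""
    if not torch.cuda.is_available() or torch.cuda.device_count() < len(gpu_ids):
        print("Not enough CUDA devices; skipping broadcast check.")
        return False
    # Source tensor on the first GPU (avoid f-strings for Python 3.5 compatibility).
    src = torch.arange(4.0, device="cuda:%d" % gpu_ids[0])
    # Broadcast to all requested devices; returns one copy per device.
    copies = torch.cuda.comm.broadcast(src, devices=list(gpu_ids))
    ok = all(c.cpu().equal(src.cpu()) for c in copies)
    print("broadcast OK" if ok else "broadcast MISMATCH")
    return ok


if __name__ == "__main__":
    check_broadcast()
```

If this script hangs the same way training does, the problem is likely in inter-GPU communication (e.g. a PCIe peer-to-peer or IOMMU issue, which has been reported on Threadripper boards) rather than in CycleGAN itself.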
