
Issue with running on multiple GPUs #483

Open
mackpack opened this issue Jan 2, 2019 · 5 comments
mackpack commented Jan 2, 2019

I am having an issue with running CycleGAN on multiple GPUs. It works well when running on a single GPU (albeit very slowly, as expected) using

 python3 train.py --dataroot ./datasets/cezanne2photo --name cezanne2photo_cyclegan --model cycle_gan

Now when I try to train on multiple GPUs using

python3 train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan --gpu_ids 0,1,2,3 --batch_size 16 --norm instance

I have also tried running it with and without the --norm instance parameter, and with --batch_size 4. This always leads to the same result:

The program stops at "create web directory" (I have let it run for a couple of days at this point without any noticeable progress). A single python3 process puts one thread under full load; none of the other python3 processes get any CPU time. Killing that process also seems impossible - I have had to restart the machine every time. None of the GPUs are ever under load, and barely any of their memory is used.

I am using Python 3.5.2, CUDA 9.2, PyTorch 1.0, and cuDNN 7.4.1. The system has four GTX 1080 Ti GPUs and an AMD Ryzen Threadripper 1950X.

junyanz (Owner) commented Jan 2, 2019

Your command with multi-GPU training works for me. I am using Python 3.6.4, PyTorch 0.4.1, CUDA 9.0, and cuDNN 7.0.5. @taesungp

tangtao1999 commented

> (quotes @mackpack's original report in full)

It also happens to me. How can I solve this? Guys, I need your help!

banyet1 commented Mar 29, 2019

> (quotes the report above in full)

The same issue for me, any suggestions?

junyanz (Owner) commented Mar 30, 2019

I haven't been able to reproduce the error on my machine. @taesungp @ssnl

ssnl (Collaborator) commented Mar 30, 2019

Could you verify that the basic CUDA communication primitives work on your machine? E.g., try torch.cuda.comm.broadcast. Other primitives you can try are listed at https://pytorch.org/docs/stable/cuda.html#communication-collectives
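A minimal sketch of that check, assuming the four-GPU setup from the report (torch.cuda.comm.broadcast is the documented broadcast primitive; the function name check_broadcast and the device ids are illustrative):

```python
# Sanity-check PyTorch's CUDA communication primitives by broadcasting
# a small tensor to every requested GPU and verifying the copies.
import torch
import torch.cuda.comm


def check_broadcast(gpu_ids=(0, 1, 2, 3)):
    """Return True if broadcast works, False if it cannot be tested or fails."""
    if not torch.cuda.is_available() or torch.cuda.device_count() < len(gpu_ids):
        print("Not enough CUDA devices; skipping broadcast check.")
        return False
    # Source tensor on the first GPU (avoid f-strings for Python 3.5 compatibility).
    src = torch.arange(4.0, device="cuda:%d" % gpu_ids[0])
    # Broadcast to all requested devices; returns one copy per device.
    copies = torch.cuda.comm.broadcast(src, devices=list(gpu_ids))
    ok = all(c.cpu().equal(src.cpu()) for c in copies)
    print("broadcast OK" if ok else "broadcast MISMATCH")
    return ok


if __name__ == "__main__":
    check_broadcast()
```

If this script hangs the same way training does, the problem is likely in inter-GPU communication (e.g. a PCIe peer-to-peer or IOMMU issue, which has been reported on Threadripper boards) rather than in CycleGAN itself.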
