Issue with running on multiple GPUs #483
Comments
Your command with multi-GPU training works for me. I am using Python 3.6.4, PyTorch 0.4.1, CUDA 9.0, cuDNN 7.0.5. @taesungp
It also happens to me. How can I solve this? I need your help!
Same issue for me; any suggestions?
Could you verify that basic CUDA comm primitives work on your machine? E.g., try
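The specific snippet was lost from this comment. As a rough sketch of such a check (assuming PyTorch's `torch.cuda.comm.broadcast`, the primitive that `nn.DataParallel` builds on; the function name `check_cuda_comm` is just illustrative):

```python
import torch


def check_cuda_comm():
    """Sanity-check basic CUDA communication: broadcast a tensor from
    GPU 0 to every visible GPU and verify the copies match."""
    if not torch.cuda.is_available():
        return "no CUDA"
    n = torch.cuda.device_count()
    if n < 2:
        return "need >= 2 GPUs"
    import torch.cuda.comm as comm
    src = torch.randn(4, 4, device="cuda:0")
    copies = comm.broadcast(src, devices=list(range(n)))
    ok = all(torch.equal(c.cpu(), src.cpu()) for c in copies)
    return "ok" if ok else "mismatch"


print(check_cuda_comm())
```

If this call itself hangs or crashes, the problem is in the CUDA/driver/NCCL stack rather than in this repository's code.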
I am having an issue with running CycleGAN on multiple GPUs. It works well when running on a single GPU (albeit very slowly, as expected) using
Now when I try to train on multiple GPUs using
I have also tried running it with and without the `--norm instance` parameter, and also with `--batch_size 4`. This always leads to the same result: the program stops at "create web directory" (I've let it run for a couple of days at this point without any noticeable progress). It looks like a single python3 process puts a single thread under full load; none of the other python3 processes get any CPU time. Killing that process also seems impossible; I have had to restart the machine every time. None of the GPUs are ever under load, and barely any of their memory is used.
I am using Python 3.5.2, CUDA 9.2, PyTorch 1.0, and cuDNN 7.4.1. The system has four 1080 Ti GPUs and an AMD Ryzen Threadripper 1950X.
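One way to narrow a hang like this down (not from the thread itself; the function name here is illustrative) is a minimal `nn.DataParallel` forward pass outside CycleGAN. If this tiny test also hangs before producing output, the problem lies in the multi-GPU stack (driver, peer-to-peer access, NCCL) rather than in the training code:

```python
import torch
import torch.nn as nn


def dataparallel_smoke_test():
    """Run one tiny forward pass through nn.DataParallel to see whether
    the hang reproduces with no CycleGAN code involved at all."""
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return "need >= 2 GPUs"
    model = nn.DataParallel(nn.Linear(8, 8).cuda())
    out = model(torch.randn(16, 8, device="cuda:0"))
    return "ok" if out.shape == (16, 8) else "bad shape"


print(dataparallel_smoke_test())
```

On Threadripper systems, hangs at this stage are sometimes related to inter-GPU peer-to-peer transfers, so a result here is a useful data point to report.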