Problem with multi-GPU training #58
Comments
I met the same issue. I'm using an 8-GPU machine and it works fine with NGPUS=2, but it gets stuck after the first iteration when using NGPUS=4 or 8.
Hi @robpal1990, so the training gets stuck once you use 3 GPUs, is that right? Did you try using 4 GPUs as well? In the past we had hangs as well, but those were due to a driver bug that was fixed in versions 384, 390 and 396. The driver versions with the fix are >=384.139 (for CUDA 9.0) and >=396.26 (for CUDA 9.2), and they are out already. If you can, it's better to move to 396; if not, update to the fixed 384. Could you check that your drivers satisfy those requirements?
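For reference, a quick way to check which driver, CUDA and cuDNN versions PyTorch actually sees is the sketch below (it assumes `nvidia-smi` is available on the PATH):

```python
# Print the driver / CUDA / cuDNN versions visible to PyTorch.
import subprocess
import torch

print("PyTorch:", torch.__version__)
print("CUDA runtime bundled with PyTorch:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPUs visible:", torch.cuda.device_count())

# Ask the driver itself for its version (assumes nvidia-smi is on the PATH).
driver = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
).decode().strip()
print("Driver:", driver)
```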
I also met the same issue. I am using a 3-GPU machine; it works fine when I use 2 GPUs, but gets stuck when using 3.
PyTorch version: 1.0.0.dev20181029
OS: Ubuntu 16.04.4 LTS
Python version: 3.7
Nvidia driver version: 390.59
I'll try reproducing this and report back.
I tried using 4 GPUs and ran into the same issue. Thanks a lot for your help.
@fmassa You need to do this manually, using the local .deb file from the Nvidia website; the cuda-9-2 package found in the Ubuntu repo automatically installs the nvidia-410 and nvidia-410-dev packages, overwriting the 396 drivers. This is clearly a driver issue; it seems the framework is not compatible with 410.
@robpal1990 thanks a lot for the info. So this seems to be related to the hangs we were facing in the past, which were due to a problem in the driver when certain cuDNN convolutions were selected. @slayton58 @ngimel, are you aware of this hang with the 410.48 drivers?
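If the hang really does come from cuDNN's convolution algorithm selection, one way to test that hypothesis is to temporarily disable cuDNN and see whether multi-GPU training still stalls. This is only a diagnostic sketch, not a fix proposed in the thread:

```python
# Diagnostic only: if the multi-GPU hang disappears with cuDNN disabled,
# the problem is likely in the cuDNN/driver combination rather than in the model.
import torch

torch.backends.cudnn.benchmark = False  # skip the autotuner's algorithm search
torch.backends.cudnn.enabled = False    # fall back to the native (non-cuDNN) convolutions

# ... build the model and run a few training iterations as usual ...
# Re-enable cuDNN afterwards, since training without it is much slower.
```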
I have the same problem too. The bug is weird to me: if I only use two GPUs, everything is fine, but if I try to use 4 GPUs it sometimes occurs.
A follow-up, since it seems that the issue is still present in some settings. I hadn't mentioned in my first post that I have everything installed without conda (simply with ). Then I created a Docker build, extending the image .
Well, in my situation, training on 3 GPUs works fine, while training on 4 GPUs gets stuck. The code runs on a server with 4 Titan Xp cards. I hope this issue can be fixed soon.
I think this is a bad mix of CUDA / cuDNN and the NVIDIA driver, and I'm not sure there is anything we can do; you might need to update a few things in your system. For info, here is the setup I use, which works fine:
Aren't CUDA and cuDNN built into PyTorch? I thought the versions I installed manually on my server would have no influence on the ones built into PyTorch itself. Here is my setup info.
So, I think 390.87 is a driver for CUDA 9.1, and it might predate the fix I mentioned. Updating the NVIDIA driver to the latest release should fix the issue.
It's weird. I did not touch the driver but only installed all the patches for my CUDA (9.0). After that was done, this issue was gone. I installed PyTorch using pip, and I thought CUDA was built into PyTorch when installed that way. If that is true, how would my local CUDA patches influence the behavior of PyTorch code?
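To make the distinction concrete: the pip/conda binaries ship their own CUDA runtime and cuDNN, while the system toolkit (nvcc) only matters for code compiled locally, such as the custom CUDA extensions this repository builds. A minimal sketch to compare the two (it assumes `nvcc` is on the PATH):

```python
# Compare the CUDA runtime bundled with the PyTorch binary to the system toolkit.
import subprocess
import torch

print("CUDA runtime bundled with PyTorch:", torch.version.cuda)
print("cuDNN bundled with PyTorch:", torch.backends.cudnn.version())

# System CUDA toolkit, used only when compiling extensions from source.
print(subprocess.check_output(["nvcc", "--version"]).decode().splitlines()[-1])
```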
Hi @robpal1990
I had it installed via apt on Ubuntu, so simply type . Then I downloaded and installed the drivers from the Nvidia website (https://www.nvidia.com/Download/index.aspx). In my case I also had to build CUDA with these drivers. Download the one from the Nvidia website, since the one in the apt repo will overwrite your drivers with the 410 version.
Hi @fmassa
PyTorch supports CUDA 10, but you need to compile PyTorch from source, I think.
Hi @chengyangfu @fmassa
May I ask what CUDA and driver versions you are using? I'm stuck with this issue on CUDA 8.0.61, cuDNN 7102, driver 390.97, even with 2 1080 Ti cards. I tried both the nightly and stable versions of PyTorch.
I solved my case: when no positive example is present in training, it blows up. 😞 I think it's related to the following issue.
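For context, a likely trigger for that failure mode is a batch containing images with no ground-truth boxes at all. Below is a hypothetical sketch of pre-filtering such images from a COCO-style annotation file; it is not the repository's own mechanism, and the annotation path is a placeholder:

```python
# Hypothetical pre-filter: keep only training images that have at least one
# non-crowd ground-truth annotation, so no process ever sees a batch with
# zero positive examples. The annotation path below is a placeholder.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")

keep = []
for img_id in coco.getImgIds():
    ann_ids = coco.getAnnIds(imgIds=img_id, iscrowd=False)
    if len(ann_ids) > 0:  # at least one usable ground-truth box
        keep.append(img_id)

print("kept %d of %d images" % (len(keep), len(coco.getImgIds())))
```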
Recently, I updated my PyTorch to v1.0.0 and it solved this problem. Driver version: 415.27.
Hello,
I have successfully built maskrcnn_benchmark on Ubuntu 16.04. My workstation has 4x 1080 Ti GPUs (CUDA 9.2, cuDNN 7, Nvidia driver 410.48) and I tried to train on the COCO dataset on multiple GPUs, using the script provided in the "Perform training on COCO dataset" section.
One GPU worked fine with:
Then I used
(the same train_net.py file and the same config, with images per batch changed to 4) and everything worked fine.
Next I tried the same thing with 3 GPUs (NGPUS=3, images per batch 6) and the training gets stuck during the first iteration. I have the following logging output and it does not change:
The GPU memory is used, the temperature goes up, but nothing is happening (I tried multiple times and then gave up).
Any ideas? I'd be grateful for help.
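One way to narrow a hang like this down is to take the model out of the picture and run a tiny NCCL all-reduce across the same number of GPUs with the same launcher. The sketch below uses an example file name and launch line; if it also stalls, the problem is in the NCCL/driver stack rather than in the training code:

```python
# allreduce_check.py -- example file name; launch with, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=4 allreduce_check.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

x = torch.ones(1, device="cuda") * args.local_rank
dist.all_reduce(x)  # default op is SUM; every rank should end up with the same value
print("rank %d: all_reduce result = %.1f" % (dist.get_rank(), x.item()))
```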