Problem with multi-GPU training #58
Comments
I met the same issue. I'm using an 8-GPU machine and it works fine with NGPUS=2, but it gets stuck after the first iteration when using NGPUS=4 or 8.
Hi @robpal1990, so the training gets stuck once you use 3 GPUs, is that right? Did you try using 4 GPUs as well? In the past we had hangs as well, but those were due to a driver bug that was fixed in versions 384, 390 and 396. The driver versions with the fix are >=384.139 (for CUDA 9.0) and >=396.26 (for CUDA 9.2), and they are out already. If you can, it's better to move to 396; if not, update to the fixed 384. Could you check that your drivers satisfy those requirements?
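For reference, a quick way to check which driver, CUDA and cuDNN versions PyTorch actually sees is the sketch below (it assumes `nvidia-smi` is available on the PATH):

```python
# Print the driver / CUDA / cuDNN versions visible to PyTorch.
import subprocess
import torch

print("PyTorch:", torch.__version__)
print("CUDA runtime bundled with PyTorch:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPUs visible:", torch.cuda.device_count())

# Ask the driver itself for its version (assumes nvidia-smi is on the PATH).
driver = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
).decode().strip()
print("Driver:", driver)
```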
I also met the same issue. I am using a 3-GPU machine; it works fine when I use 2 GPUs, but gets stuck when using 3.
PyTorch version: 1.0.0.dev20181029
OS: Ubuntu 16.04.4 LTS
Python version: 3.7
Nvidia driver version: 390.59
I'll try reproducing this and report back.
I tried using 4 GPUs and ran into the same issue. Thanks a lot for your help.
@fmassa You need to do this manually, using the local .deb file from the Nvidia website; the cuda-9-2 package found in the Ubuntu repo automatically installs the nvidia-410 and nvidia-410-dev packages, overwriting the 396 drivers. This is clearly a driver issue; it seems the framework is not compatible with 410.
@robpal1990 thanks a lot for the info. So this seems to be related to the hangs we were facing in the past, which were due to a problem in the driver when certain cuDNN convolutions were selected. @slayton58 @ngimel, are you aware of this hang with the 410.48 drivers?
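If the hang really does come from cuDNN's convolution algorithm selection, one way to test that hypothesis is to temporarily disable cuDNN and see whether multi-GPU training still stalls. This is only a diagnostic sketch, not a fix proposed in the thread:

```python
# Diagnostic only: if the multi-GPU hang disappears with cuDNN disabled,
# the problem is likely in the cuDNN/driver combination rather than in the model.
import torch

torch.backends.cudnn.benchmark = False  # skip the autotuner's algorithm search
torch.backends.cudnn.enabled = False    # fall back to the native (non-cuDNN) convolutions

# ... build the model and run a few training iterations as usual ...
# Re-enable cuDNN afterwards, since training without it is much slower.
```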
I have the same problem too. The bug is weird to me: if I only use two GPUs, everything is fine, but if I try to use 4 GPUs it sometimes occurs.
A follow-up, since it seems that the issue is still present in some settings. I hadn't mentioned in my first post that I have everything installed without conda (simply with ). Then I created a Docker build, extending the image .
Well, in my situation, training on 3 GPUs works fine, while training on 4 GPUs gets stuck. The code runs on a server with 4 Titan Xp cards. I hope this issue can be fixed soon.
I think this is a bad mix of CUDA / cuDNN and the NVIDIA driver, and I'm not sure there is anything we can do; you might need to update a few things in your system. For info, here is the setup I use, which works fine:
Aren't CUDA and cuDNN built into PyTorch? I thought the versions I installed manually on my server would have no influence on the ones built into PyTorch itself. Here is my setup info.
So, I think 390.87 is a driver for CUDA 9.1, and it might predate the fix I mentioned. Updating the NVIDIA driver to the latest release should fix the issue.
It's weird. I did not touch the driver but only installed all the patches for my CUDA (9.0). After that was done, this issue was gone. I installed PyTorch using pip, and I thought CUDA was built into PyTorch when installed that way. If that is true, how would my local CUDA patches influence the behavior of PyTorch code?
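To make the distinction concrete: the pip/conda binaries ship their own CUDA runtime and cuDNN, while the system toolkit (nvcc) only matters for code compiled locally, such as the custom CUDA extensions this repository builds. A minimal sketch to compare the two (it assumes `nvcc` is on the PATH):

```python
# Compare the CUDA runtime bundled with the PyTorch binary to the system toolkit.
import subprocess
import torch

print("CUDA runtime bundled with PyTorch:", torch.version.cuda)
print("cuDNN bundled with PyTorch:", torch.backends.cudnn.version())

# System CUDA toolkit, used only when compiling extensions from source.
print(subprocess.check_output(["nvcc", "--version"]).decode().splitlines()[-1])
```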
Hi @robpal1990
I had it installed via apt on Ubuntu, so simply type . Then I downloaded and installed the drivers from the Nvidia website (https://www.nvidia.com/Download/index.aspx). In my case I also had to build CUDA with these drivers. Download the one from the Nvidia website, since the one in the apt repo will overwrite your drivers with the 410 version.
Hi @fmassa
PyTorch supports CUDA 10, but you need to compile PyTorch from source, I think.
Hi @chengyangfu @fmassa
May I ask what CUDA and driver versions you are using? I'm stuck with this issue on CUDA 8.0.61, cuDNN 7102, driver 390.97, even with 2 1080 Ti cards. I tried both the nightly and stable versions of PyTorch.
I solved my case: when no positive example is present in training, it blows up. 😞 I think it's related to the following issue.
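For context, a likely trigger for that failure mode is a batch containing images with no ground-truth boxes at all. Below is a hypothetical sketch of pre-filtering such images from a COCO-style annotation file; it is not the repository's own mechanism, and the annotation path is a placeholder:

```python
# Hypothetical pre-filter: keep only training images that have at least one
# non-crowd ground-truth annotation, so no process ever sees a batch with
# zero positive examples. The annotation path below is a placeholder.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")

keep = []
for img_id in coco.getImgIds():
    ann_ids = coco.getAnnIds(imgIds=img_id, iscrowd=False)
    if len(ann_ids) > 0:  # at least one usable ground-truth box
        keep.append(img_id)

print("kept %d of %d images" % (len(keep), len(coco.getImgIds())))
```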
Recently, I updated my PyTorch to v1.0.0 and it solved this problem. Driver version: 415.27.
Hello,
I have successfully built maskrcnn_benchmark on Ubuntu 16.04. My workstation has 4x 1080 Ti GPUs (CUDA 9.2, cuDNN 7, Nvidia driver 410.48) and I tried to train on the COCO dataset on multiple GPUs, using the script provided in the "Perform training on COCO dataset" section.
One GPU worked fine with:
Then I used
(the same train_net.py file and the same config, with images per batch changed to 4) and everything worked fine.
Next I tried the same thing with 3 GPUs (NGPUS=3, images per batch 6) and the training gets stuck during the first iteration. I have the following logging output and it does not change:
The GPU memory is used, the temperature goes up, but nothing is happening (I tried multiple times and then gave up).
Any ideas? I'd be grateful for help.
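One way to narrow a hang like this down is to take the model out of the picture and run a tiny NCCL all-reduce across the same number of GPUs with the same launcher. The sketch below uses an example file name and launch line; if it also stalls, the problem is in the NCCL/driver stack rather than in the training code:

```python
# allreduce_check.py -- example file name; launch with, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=4 allreduce_check.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

x = torch.ones(1, device="cuda") * args.local_rank
dist.all_reduce(x)  # default op is SUM; every rank should end up with the same value
print("rank %d: all_reduce result = %.1f" % (dist.get_rank(), x.item()))
```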