
Deadlock on multi GPUs #410

Closed
BongBong87 opened this issue Oct 24, 2018 · 3 comments

Comments

@BongBong87

Hello,
I ran this command for my dataset (only 8 image pairs) on a server through the Jupyter interface:
=> python train.py --dataroot ./datasets/facades --name facades_pix2pix --model pix2pix --direction BtoA --gpu_ids 0,1,2,3 --batch_size 4
and also:
=> python train.py --dataroot ./datasets/facades --name facades_pix2pix --model pix2pix --direction BtoA --gpu_ids 0,1,2,3 --batch_size 4 --norm instance

In the first epoch, it ran up to this point:

[Network G] Total number of parameters : 54.410 M
[Network D] Total number of parameters : 2.768 M

and then it hung as if in an infinite loop (a deadlock). No error occurred, but the program could not continue. I checked the status of the four TITAN Xp cards: all of them were at 100% utilization.
Some info:
4 NVIDIA TITAN Xp
CUDA 9.2, cuDNN 7.1
Ubuntu 16.04
In addition, I used Python 3.5 and PyTorch 0.4.1.

P.S.: When I ran this model on each single GPU, it worked well.

Does anyone have an idea? Thanks so much for your help.
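A hang with no error message is hard to distinguish from slow progress. A minimal, library-agnostic sketch (nothing here is from the pix2pix repo; `run_with_timeout` and `step_fn` are hypothetical names) of a watchdog that flags a training step that takes suspiciously long:

```python
import threading
import time

def run_with_timeout(step_fn, timeout_s):
    """Run one step in a thread and report whether it finished in time.

    This cannot kill a stuck step (in-flight CUDA work cannot be safely
    interrupted from Python); it only tells you the step is hung, so you
    can attach a debugger such as py-spy or gdb, or kill the process.
    """
    done = threading.Event()

    def _worker():
        step_fn()
        done.set()

    threading.Thread(target=_worker, daemon=True).start()
    # Event.wait returns True if set before the timeout, else False.
    return done.wait(timeout_s)

# Toy demonstration: a fast step finishes, a "deadlocked" step does not.
print(run_with_timeout(lambda: time.sleep(0.01), timeout_s=1.0))  # True
print(run_with_timeout(lambda: time.sleep(10.0), timeout_s=0.1))  # False
```

In the real training loop, `step_fn` would wrap one forward/backward pass; a `False` result while all GPUs sit at 100% utilization is consistent with a stuck collective operation rather than slow computation.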

@BongBong87 (Author)

Some people said it is related to NVIDIA's NCCL. Is that right?
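NCCL is a plausible suspect but unconfirmed here. `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are real NCCL environment variables; whether disabling peer-to-peer transfers fixes this particular hang is only an assumption, but it is a cheap experiment:

```shell
# Log NCCL's setup decisions to stderr, and fall back from GPU
# peer-to-peer copies to shared-memory transport -- bad P2P paths
# were a known cause of silent multi-GPU hangs on some CUDA 9.x setups.
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1

# Then relaunch training as before, e.g.:
# python train.py --dataroot ./datasets/facades --name facades_pix2pix \
#     --model pix2pix --direction BtoA --gpu_ids 0,1,2,3 --batch_size 4
```

If the job runs with P2P disabled but hangs without it, that points at the GPU interconnect (check `nvidia-smi topo -m`) rather than the training code.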

@jiashu-zhu

Hi, I ran into this issue recently. Have you figured it out? @BongBong87

@Ellysian

@jiashu-zhu same issue, any updates?
