
Deadlock on multi GPUs #410

Closed
BongBong87 opened this issue Oct 24, 2018 · 3 comments

Comments

@BongBong87

Hello,
I ran this command for my dataset (only 8 image pairs) on a server through the Jupyter interface:
=> python train.py --dataroot ./datasets/facades --name facades_pix2pix --model pix2pix --direction BtoA --gpu_ids 0,1,2,3 --batch_size 4
and also:
=> python train.py --dataroot ./datasets/facades --name facades_pix2pix --model pix2pix --direction BtoA --gpu_ids 0,1,2,3 --batch_size 4 --norm instance

In the first epoch, it ran up to this point:

[Network G] Total number of parameters : 54.410 M
[Network D] Total number of parameters : 2.768 M

and then it hung as if in an infinite loop (a deadlock). No error occurred, but the program could not continue. I checked the status of the four TITAN Xp cards: all of them were at 100% utilization.
Some info:
4 NVIDIA TITAN Xp
CUDA 9.2, cuDNN 7.1
Ubuntu 16.04
In addition, I used Python 3.5 and PyTorch 0.4.1.

P.S.: When I ran this model on each single GPU, it worked well.

Does anyone have an idea? Thanks so much for your help.
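A hang with no error message is hard to distinguish from slow progress. A minimal, library-agnostic sketch (nothing here is from the pix2pix repo; `run_with_timeout` and `step_fn` are hypothetical names) of a watchdog that flags a training step that takes suspiciously long:

```python
import threading
import time

def run_with_timeout(step_fn, timeout_s):
    """Run one step in a thread and report whether it finished in time.

    This cannot kill a stuck step (in-flight CUDA work cannot be safely
    interrupted from Python); it only tells you the step is hung, so you
    can attach a debugger such as py-spy or gdb, or kill the process.
    """
    done = threading.Event()

    def _worker():
        step_fn()
        done.set()

    threading.Thread(target=_worker, daemon=True).start()
    # Event.wait returns True if set before the timeout, else False.
    return done.wait(timeout_s)

# Toy demonstration: a fast step finishes, a "deadlocked" step does not.
print(run_with_timeout(lambda: time.sleep(0.01), timeout_s=1.0))  # True
print(run_with_timeout(lambda: time.sleep(10.0), timeout_s=0.1))  # False
```

In the real training loop, `step_fn` would wrap one forward/backward pass; a `False` result while all GPUs sit at 100% utilization is consistent with a stuck collective operation rather than slow computation.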

@BongBong87 (Author)

Some people said it is related to NVIDIA's NCCL. Is that right?
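NCCL is a plausible suspect but unconfirmed here. `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are real NCCL environment variables; whether disabling peer-to-peer transfers fixes this particular hang is only an assumption, but it is a cheap experiment:

```shell
# Log NCCL's setup decisions to stderr, and fall back from GPU
# peer-to-peer copies to shared-memory transport -- bad P2P paths
# were a known cause of silent multi-GPU hangs on some CUDA 9.x setups.
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1

# Then relaunch training as before, e.g.:
# python train.py --dataroot ./datasets/facades --name facades_pix2pix \
#     --model pix2pix --direction BtoA --gpu_ids 0,1,2,3 --batch_size 4
```

If the job runs with P2P disabled but hangs without it, that points at the GPU interconnect (check `nvidia-smi topo -m`) rather than the training code.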

@jiashu-zhu

Hi, I ran into this issue recently. Have you figured it out? @BongBong87

@Ellysian

@jiashu-zhu same issue, any updates?
