My machine freezes with multi-gpu learning #685

sangrockEG · 2019-06-26T05:00:48Z

I think this is similar to issues #327, #410, and #483.

When I use a single GPU, everything is fine.
But when I use multiple GPUs, the machine freezes completely after a few iterations (around 200~300).
In the issues above, the system freezes before the first iteration starts.
But in my case, it freezes after a few iterations.

Even verification examples such as torch.cuda.broadcast work fine.
I know this kind of problem is hard to solve, but I really need some help.

fengyu19 · 2019-07-18T21:36:30Z

I have the same issue. When I use multiple GPUs to train 2 models, everything is fine at the beginning, but after about 10 epochs the GPU utilization drops to about 0 and training becomes really slow. Did you figure it out?

sangrockEG · 2019-07-19T05:36:46Z

Nope. I failed to fix it, so I just run on a single GPU.

And I think our issues are quite different.
In my case, the whole system literally freezes and crashes.
It is not a speed problem.
But anyway, multi-GPU training with this code does not seem that stable.

junyanz · 2019-07-22T19:33:49Z

I suspect that visdom is not stable with multiple GPUs, but I haven't tested it. Could you disable visdom with --display_id 0?
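
For anyone launching training from a script, here is a minimal sketch of forcing that flag onto an existing argument list. The helper name `with_display_disabled` and the example arguments (`train.py`, `--gpu_ids 0,1`) are hypothetical placeholders, not taken from this thread; only `--display_id 0` comes from the suggestion above.

```python
def with_display_disabled(argv):
    """Return a copy of argv with --display_id forced to 0.

    Hypothetical helper: it removes any existing --display_id setting
    (both "--display_id N" and "--display_id=N" forms) and appends
    "--display_id 0" at the end.
    """
    out = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This is the value that followed a bare --display_id flag.
            skip_next = False
            continue
        if arg == "--display_id":
            skip_next = True
            continue
        if arg.startswith("--display_id="):
            continue
        out.append(arg)
    return out + ["--display_id", "0"]


# Placeholder command line; substitute your own script and options.
args = with_display_disabled(["train.py", "--display_id", "1", "--gpu_ids", "0,1"])
```

If the freeze still happens with the display turned off, visdom is probably not the cause.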

sangrockEG · 2019-07-23T03:39:41Z

OK, I'll try it and let you know.
Thanks a lot!

jiashu-zhu · 2020-07-19T18:01:12Z

Hi, I ran into the same issue as you, @fengyu19 @sangrockEG. Have you figured it out?

yassineAlouini · 2024-10-14T15:45:15Z

You might fix your run if you set the following environment variable:

export NCCL_P2P_DISABLE=1
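
If you would rather set it from Python than from the shell, something like the sketch below should be equivalent, assuming the variable is set before any NCCL communication starts. The helper name `env_with_p2p_disabled` and the `train.py` command are hypothetical; whether disabling peer-to-peer transfers actually stops the freeze depends on your hardware and driver setup.

```python
import os


def env_with_p2p_disabled(base_env=None):
    """Return a copy of an environment mapping with NCCL_P2P_DISABLE=1.

    Hypothetical helper: by default it copies the current process
    environment, so the original mapping is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"  # same setting as the export line above
    return env


# Option 1: set it in the current process, before multi-GPU work begins.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Option 2: pass it to a child process (the command is a placeholder):
# import subprocess
# subprocess.run(["python", "train.py"], env=env_with_p2p_disabled())
```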
