My machine freezes with multi-gpu learning #685

Open
sangrockEG opened this issue Jun 26, 2019 · 6 comments

@sangrockEG

I think this is similar to issues #327, #410, and #483.

When I use a single GPU, everything is fine.
But when I use multiple GPUs, the machine freezes completely after a few iterations (around 200~300).
In the issues above, the system freezes before the first iteration starts.
In my case, it freezes after a few iterations.

Even verification examples such as torch.cuda.broadcast work fine.
I know this kind of problem is hard to solve, but I really need some help.
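For readers who want to repeat that kind of check, here is a minimal sketch of a broadcast sanity test. It assumes the function meant above is `torch.cuda.comm.broadcast` (the broadcast helper PyTorch exposes under `torch.cuda.comm`); the function name `broadcast_sanity_check` and the tensor size are made up for illustration. It returns `None` when PyTorch or a second GPU is not available.

```python
import importlib.util


def broadcast_sanity_check(n=1024):
    """Copy a tensor from GPU 0 to every visible GPU and compare the copies.

    Returns True/False if the check ran, or None if it could not run
    (PyTorch missing, CUDA unavailable, or fewer than 2 GPUs).
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    import torch.cuda.comm as comm

    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return None

    devices = list(range(torch.cuda.device_count()))
    src = torch.arange(n, dtype=torch.float32, device="cuda:0")
    # broadcast returns one copy of `src` per device in `devices`
    copies = comm.broadcast(src, devices)
    expected = src.cpu()
    # .cpu() waits for the device copy to finish before comparing
    return all(torch.equal(c.cpu(), expected) for c in copies)


print(broadcast_sanity_check())
```

As this thread shows, passing such a check only says that basic device-to-device copies work; it does not rule out a freeze later in training.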

@fengyu19

I have the same issue. When I use multiple GPUs to train 2 models, everything is fine at the beginning, but after about 10 epochs the GPU utilization drops to about 0 and training becomes really slow. Did you figure it out?

@sangrockEG
Author

Nope. I failed to fix it, so I just run on a single GPU.

I think our issues are quite different, though.
In my case, the whole system literally freezes and crashes.
It is not a speed problem.
Anyway, multi-GPU training with this code does not seem that stable.

@junyanz
Owner

junyanz commented Jul 22, 2019

I suspect that visdom is not stable with multiple GPUs, but I haven't tested it. Could you disable visdom with --display_id 0?
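A minimal sketch of how such a run could be assembled, for anyone scripting their experiments. Only `--display_id 0` comes from the comment above; the script name `train.py`, the `--gpu_ids` flag, and the helper `build_train_command` are assumptions made for illustration, so adjust them to your own setup.

```python
import shlex
import sys


def build_train_command(gpu_ids, script="train.py", extra_args=()):
    """Build a training command line with visdom disabled.

    `script` and `--gpu_ids` are assumed names, not taken from this thread.
    """
    cmd = [sys.executable, script, "--gpu_ids", ",".join(str(g) for g in gpu_ids)]
    cmd.extend(extra_args)
    # The one flag suggested in the thread: turn the visdom display off.
    cmd.extend(["--display_id", "0"])
    return cmd


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(build_train_command([0, 1])))
```

The resulting list can be passed to `subprocess.run` once the script name and flags match your checkout.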

@sangrockEG
Author

OK, I'll try it and let you know.
Thanks a lot!

@jiashu-zhu

Hi, I ran into the same issue as you, @fengyu19 @sangrockEG. Have you figured it out?

@yassineAlouini

You might fix your run if you add the following environment variable:

export NCCL_P2P_DISABLE=1
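If you launch training from a Python wrapper rather than a shell, the same setting can be passed through the child process environment. This is a minimal sketch; the helper name `env_with_p2p_disabled` is made up for illustration. `NCCL_P2P_DISABLE=1` tells NCCL not to use direct GPU-to-GPU (peer-to-peer) transfers, and it has to be in place before the training process initializes NCCL.

```python
import os


def env_with_p2p_disabled(base_env=None):
    """Return a copy of the environment with NCCL peer-to-peer disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"
    return env


env = env_with_p2p_disabled()
print(env["NCCL_P2P_DISABLE"])
```

The returned dictionary can be handed to `subprocess.run(..., env=env)`, which leaves the parent process environment untouched. Whether this helps with the freeze described here is untested in this thread.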
