My machine freezes with multi-gpu learning #685
Comments
I have the same issue as you. When I try to use multi-GPU to train two models, everything is fine at the beginning, but after about 10 epochs the GPU utilization drops to about 0 and training becomes really slow. Did you figure it out?
Nope. I failed to fix it and just run with a single GPU. And I think our issues are quite different.
I suspect that visdom is not stable with multiple GPUs, but I haven't tested it. Could you try disabling visdom and see if that helps?
OK, I'll try it and let you know.
Hi, I ran into the same issue as you @fengyu19 @sangrockEG. Have you figured it out?
You might be able to fix your run by setting the following environment variable:
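The specific variable referred to in that comment is not shown above, so the snippet below is only a hedged guess: `NCCL_P2P_DISABLE=1` is a setting often tried when multi-GPU training hangs, because it disables peer-to-peer GPU transfers that can deadlock on some machines. It is an assumption here, not necessarily the variable that was originally suggested.

```python
import os

# Hypothetical example: NCCL_P2P_DISABLE is a common candidate for multi-GPU
# hangs, not necessarily the variable suggested in the comment above.
# Set it before any CUDA/NCCL work is initialized.
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch  # imported after the environment variable is set
```

The same effect can be had by exporting the variable in the shell before launching the training script; setting it inside the script just makes the run self-contained.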
I think this is a similar issue to issue #327, issue #410, and issue #483.
When I use a single GPU, everything is fine.
But when I use multi-GPU, it freezes completely after a few iterations (around 200~300).
In the issues above, the system freezes before the iterations start.
But in my case, it freezes after a few iterations.
And even verification examples such as torch.cuda.broadcast work very well.
I know this kind of problem is hard to solve, but I really need some help.
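For reference, a minimal sketch of the kind of verification mentioned above (assuming the broadcast call refers to torch.cuda.comm.broadcast): it broadcasts a tensor across all GPUs and then runs a toy nn.DataParallel forward/backward loop. The model and sizes are made up; the only point is to see whether basic multi-GPU communication survives a few hundred iterations without hanging, which is roughly where the freeze is reported to appear.

```python
import torch
import torch.nn as nn

# Minimal multi-GPU sanity check (a sketch; the toy model and sizes are made up).
# If this hangs, the problem is in basic GPU-to-GPU communication rather than
# in the training code itself.
assert torch.cuda.device_count() >= 2, "needs at least 2 GPUs"
devices = list(range(torch.cuda.device_count()))

# 1) Broadcast a tensor to all GPUs, as mentioned above.
src = torch.randn(4, 4, device="cuda:0")
copies = torch.cuda.comm.broadcast(src, devices)
print("broadcast ok:", [c.device for c in copies])

# 2) A tiny DataParallel forward/backward loop, run past the ~200-300
#    iterations where the freeze is reported to show up.
model = nn.DataParallel(nn.Linear(32, 8).cuda(), device_ids=devices)
for step in range(300):
    x = torch.randn(64, 32, device="cuda:0")
    loss = model(x).sum()
    model.zero_grad()
    loss.backward()
print("data parallel ok")
```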