Regarding multi-GPU training #339
Comments
Can training run successfully on a single GPU?
Yes, single GPU works fine.
For multiple GPUs, the distributed version is recommended. As for the error, you can set num_workers to 0 in the config and try again.
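For context, here is a minimal sketch of where that setting usually lives. The exact key name is an assumption that depends on the MMPose version: older mmcv-style configs expose workers_per_gpu, while MMEngine-style configs expose num_workers, so both variants are shown.

```python
# Sketch of disabling dataloader worker subprocesses in the config.
# Key names depend on the MMPose version (assumption): pick the one your config uses.

# Older mmcv-style config:
data = dict(
    samples_per_gpu=32,
    workers_per_gpu=0,   # no worker subprocesses; data is loaded in the main process
)

# MMEngine-style config:
train_dataloader = dict(
    batch_size=32,
    num_workers=0,              # same idea for newer configs
    persistent_workers=False,   # must be False when num_workers is 0
)
```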
What's the version of PyTorch? Could you try an older version, say 1.3?
My version is 1.6.0, because I want to use automatic mixed precision training.
I've tried downgrading PyTorch to the older versions 1.3.0 and 1.4.0, but unfortunately neither of them works on my cluster. My assumption is that something is wrong with this cluster (4× M40 24GB), but I haven't figured out the reason. Then I cloned the code to another machine with 2 TITAN RTX 24GB and PyTorch 1.6.0, and distributed training works fine there.
It is nearly impossible to diagnose cluster issues without access to the cluster. Maybe you can ask IT to help?
Sure, it works anyway. Thanks for your reply.
We have some fp16 support in progress in #200. PyTorch AMP is one way to do it, and it might be included in the near future.
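For reference, this is a minimal sketch of the standard torch.cuda.amp pattern that PyTorch 1.6 introduced. It is plain PyTorch rather than the fp16 support tracked in #200, and the tiny model, optimizer, and random data are placeholder assumptions.

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Generic PyTorch 1.6+ automatic mixed precision loop (not MMPose-specific).
device = "cuda"
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
scaler = GradScaler()

for _ in range(10):
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randn(8, 4, device=device)
    optimizer.zero_grad()
    with autocast():                      # forward pass runs in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscales gradients, then steps the optimizer
    scaler.update()                       # adjusts the loss scale for the next iteration
```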
You may post a request in #9 to help us prioritize.
OK, thanks.
Hey, do you know if I can run multiple GPUs on one system without using the distributed approach? Does MMPose even support that? I checked the runner.py in MMEngine, and it seems like there is no multi-GPU training without opening ports or something like that.
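Whether MMPose or the MMEngine Runner wires up a non-distributed multi-GPU path depends on the version, so treat that part as unconfirmed. As a generic alternative, plain PyTorch nn.DataParallel runs on several GPUs in one process without any ports or process groups; a minimal sketch with a placeholder model:

```python
import torch
from torch import nn

# Single-machine multi-GPU without any distributed setup (generic PyTorch, not MMPose-specific).
model = nn.Linear(16, 4)

if torch.cuda.device_count() > 1:
    # Splits each input batch across the visible GPUs and gathers outputs on GPU 0.
    model = nn.DataParallel(model)

model = model.cuda()
out = model(torch.randn(8, 16).cuda())  # the batch of 8 is scattered across the GPUs
```

Note that DataParallel is generally slower than DistributedDataParallel, which is why the distributed approach is the one usually recommended above.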
Q1: I used normal DataParallel training with this command but got the error.
Q2: I used distributed training with this command but got the error.
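Since the actual commands and error messages are not shown here, the following is only a generic single-node DistributedDataParallel sketch in plain PyTorch; the address, port, and tiny model are placeholder assumptions. It illustrates why the distributed approach needs a process group (and hence an open local port), unlike DataParallel.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process joins a process group over a local rendezvous address/port.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"          # any free local port (assumption)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(nn.Linear(16, 4).cuda(rank), device_ids=[rank])
    out = model(torch.randn(8, 16).cuda(rank))   # each process handles its own data shard
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```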