
Regarding multi-gpu training #339

Closed
YibinXie opened this issue Dec 8, 2020 · 13 comments
YibinXie commented Dec 8, 2020

Q1: I use normal DataParallel training with this command but get the following error.

cd ..
python tools/train.py configs/top_down/hrnet/coco/hrnet_w32_coco_256x192.py --gpus 4

(error screenshot)

Q2: I use distributed training with this command but get the following error.

cd ..
bash tools/dist_train.sh configs/top_down/hrnet/coco/hrnet_w32_coco_256x192.py 4

(error screenshot)


innerlee commented Dec 8, 2020

Can a single GPU be trained successfully?


YibinXie commented Dec 8, 2020

> Can a single GPU be trained successfully?

Yes, a single GPU works fine.


innerlee commented Dec 8, 2020

For multiple GPUs, the distributed version is recommended.

As for the error, you can set `num_workers` to 0 in the config and try again.
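
For reference, a minimal sketch of that change, assuming the mmpose 0.x-style config where the dataloader settings live in a `data` dict and `workers_per_gpu` is the field that maps to each DataLoader's `num_workers` (the field names here are an assumption, not copied from the actual config):

```python
# Hypothetical excerpt from the config (e.g. hrnet_w32_coco_256x192.py); field names
# assume the mmpose/mmcv 0.x convention and may differ in your version.
data = dict(
    samples_per_gpu=64,  # per-GPU batch size, unchanged
    workers_per_gpu=0,   # maps to DataLoader num_workers; 0 = load data in the main process
    # train=dict(...), val=dict(...), test=dict(...) remain as in the original config
)
```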


YibinXie commented Dec 8, 2020

> For multiple GPUs, the distributed version is recommended.
>
> As for the error, you can set `num_workers` to 0 in the config and try again.

I set workers to 0 and it's stuck here.
I had never used distributed training before, and I think your code is right. Yesterday I learned how to use PyTorch distributed training and managed to run distributed training for a classification task on a very small dataset. But when I used the same approach to train on the COCO dataset for pose estimation, it failed. I think something may be wrong with my machine.
(screenshots showing where training gets stuck)
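
For context, a minimal sketch of the kind of plain-PyTorch DDP setup described above (a hypothetical toy classifier, launched with `python -m torch.distributed.launch --nproc_per_node=4 train_toy.py` on PyTorch 1.6; this is not the mmpose training script):

```python
# Minimal single-node DDP training sketch for a toy classification task.
# Assumes launch via: python -m torch.distributed.launch --nproc_per_node=4 train_toy.py
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
args = parser.parse_args()

dist.init_process_group(backend="nccl")           # one process per GPU, env:// init
torch.cuda.set_device(args.local_rank)

model = nn.Linear(128, 10).cuda(args.local_rank)  # stand-in for a real classifier
model = DDP(model, device_ids=[args.local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for step in range(10):                            # dummy data instead of a real dataset
    x = torch.randn(32, 128, device=args.local_rank)
    y = torch.randint(0, 10, (32,), device=args.local_rank)
    optimizer.zero_grad()
    criterion(model(x), y).backward()             # DDP all-reduces gradients here
    optimizer.step()

dist.destroy_process_group()
```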


innerlee commented Dec 8, 2020

What's your PyTorch version? Could you try an older version, say 1.3?

innerlee added the freeze label Dec 8, 2020

YibinXie commented Dec 8, 2020

> What's your PyTorch version? Could you try an older version, say 1.3?

My version is 1.6.0, because I want to use automatic mixed precision training.
OK, I will try an older version later.
Many thanks for your instant reply.


YibinXie commented Dec 9, 2020

> What's your PyTorch version? Could you try an older version, say 1.3?

I've tried downgrading PyTorch to the older versions 1.3.0 and 1.4.0; unfortunately, neither of them works on my cluster. My assumption is that something is wrong with this cluster (4× M40 24GB), but I haven't figured out the reason.

My M40 cluster info:
- 24× Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
- 120 GB memory
- 4× Tesla M40 24GB

I've listed the info above; maybe you could give some comments.

Then I cloned the code to another machine with 2× TITAN RTX 24GB and PyTorch 1.6.0, and distributed training works fine there.


innerlee commented Dec 9, 2020

It is nearly impossible to diagnose cluster issues without access to the machine. Maybe you can ask IT for help?


YibinXie commented Dec 9, 2020

> It is nearly impossible to diagnose cluster issues without access to the machine. Maybe you can ask IT for help?

Sure, it works anyway. Thanks for your reply.
By the way, have you thought about adding the PyTorch AMP feature to the project?


innerlee commented Dec 9, 2020

We have some fp16 support in progress in #200. PyTorch AMP is one way to do it, and it might be included in the near future.
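
For reference, a minimal sketch of the native PyTorch AMP training step (the `torch.cuda.amp` API available since 1.6); this is generic PyTorch usage, not the mmpose fp16 implementation tracked in #200:

```python
# Minimal sketch of a mixed-precision training step with torch.cuda.amp (PyTorch >= 1.6).
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(128, 10).cuda()          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()                            # scales the loss to avoid fp16 underflow

for step in range(10):                           # dummy batch instead of a real dataloader
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with autocast():                             # forward pass runs in mixed precision
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()                # backward on the scaled loss
    scaler.step(optimizer)                       # unscales grads, skips step on inf/nan
    scaler.update()
```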


innerlee commented Dec 9, 2020

You may post a request in #9 to help us prioritize.


YibinXie commented Dec 9, 2020

> You may post a request in #9 to help us prioritize.

Ok, thanks.

jin-s13 closed this as completed Dec 9, 2020
@MrRaptorious

Hey, do you know if I can run multiple GPUs on one system without using the distributed approach? I don't have access to the networking side of things and therefore can't run the dist_train script.

Does MMPose even support that? I checked runner.py in MMEngine and it seems like there is no multi-GPU training without opening ports or something like that.
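
(Not a maintainer answer, just a sketch for context: in plain PyTorch, `torch.nn.DataParallel` is the single-process multi-GPU path, splitting each batch across local GPUs with no process groups or open ports. Whether the current MMEngine runner still exposes an equivalent option is exactly the open question above.)

```python
# Sketch of single-process multi-GPU training with torch.nn.DataParallel (plain PyTorch,
# no process groups, no open ports). Illustrates the mechanism only; it does not answer
# whether the current MMEngine runner exposes an equivalent option.
import torch

model = torch.nn.Linear(128, 10).cuda()                  # placeholder model
model = torch.nn.DataParallel(model, device_ids=[0, 1])  # replicate across local GPUs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(64, 128, device="cuda")                  # batch is split across the GPUs
y = torch.randint(0, 10, (64,), device="cuda")
optimizer.zero_grad()
criterion(model(x), y).backward()                         # gradients gathered on GPU 0
optimizer.step()
```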
