
Regarding multi-gpu training #339

Closed
YibinXie opened this issue Dec 8, 2020 · 13 comments
YibinXie commented Dec 8, 2020

Q1: I use normal DataParallel training with this command but get the following error.

cd ..
python tools/train.py configs/top_down/hrnet/coco/hrnet_w32_coco_256x192.py --gpus 4

(error screenshot)

Q2: I use distributed training with this command but get the following error.

cd ..
bash tools/dist_train.sh configs/top_down/hrnet/coco/hrnet_w32_coco_256x192.py 4

(error screenshot)


innerlee commented Dec 8, 2020

Can a single GPU be trained successfully?


YibinXie commented Dec 8, 2020

> Can a single GPU be trained successfully?

Yes, a single GPU works fine.


innerlee commented Dec 8, 2020

For multiple GPUs, the distributed version is recommended.

As for the error, you can set `num_workers` to 0 in the config and try again.
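
For reference, a minimal sketch of that change, assuming the mmpose 0.x-style config where the dataloader settings live in a `data` dict and `workers_per_gpu` is the field that maps to each DataLoader's `num_workers` (the field names here are an assumption, not copied from the actual config):

```python
# Hypothetical excerpt from the config (e.g. hrnet_w32_coco_256x192.py); field names
# assume the mmpose/mmcv 0.x convention and may differ in your version.
data = dict(
    samples_per_gpu=64,  # per-GPU batch size, unchanged
    workers_per_gpu=0,   # maps to DataLoader num_workers; 0 = load data in the main process
    # train=dict(...), val=dict(...), test=dict(...) remain as in the original config
)
```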


YibinXie commented Dec 8, 2020

> For multiple GPUs, the distributed version is recommended.
>
> As for the error, you can set `num_workers` to 0 in the config and try again.

I set workers to 0 and it's stuck here.
I had never used distributed training before, and I think your code is right. Yesterday I learned how to use PyTorch distributed training and managed to run distributed training for a classification task on a very small dataset. But when I used the same approach to train on the COCO dataset for pose estimation, it failed. I think something may be wrong with my machine.
(screenshots showing where training gets stuck)
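
For context, a minimal sketch of the kind of plain-PyTorch DDP setup described above (a hypothetical toy classifier, launched with `python -m torch.distributed.launch --nproc_per_node=4 train_toy.py` on PyTorch 1.6; this is not the mmpose training script):

```python
# Minimal single-node DDP training sketch for a toy classification task.
# Assumes launch via: python -m torch.distributed.launch --nproc_per_node=4 train_toy.py
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
args = parser.parse_args()

dist.init_process_group(backend="nccl")           # one process per GPU, env:// init
torch.cuda.set_device(args.local_rank)

model = nn.Linear(128, 10).cuda(args.local_rank)  # stand-in for a real classifier
model = DDP(model, device_ids=[args.local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for step in range(10):                            # dummy data instead of a real dataset
    x = torch.randn(32, 128, device=args.local_rank)
    y = torch.randint(0, 10, (32,), device=args.local_rank)
    optimizer.zero_grad()
    criterion(model(x), y).backward()             # DDP all-reduces gradients here
    optimizer.step()

dist.destroy_process_group()
```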


innerlee commented Dec 8, 2020

What's your PyTorch version? Could you try an older version, say 1.3?

innerlee added the freeze label Dec 8, 2020

YibinXie commented Dec 8, 2020

> What's your PyTorch version? Could you try an older version, say 1.3?

My version is 1.6.0, because I want to use automatic mixed precision training.
OK, I will try an older version later.
Many thanks for your instant reply.


YibinXie commented Dec 9, 2020

> What's your PyTorch version? Could you try an older version, say 1.3?

I've tried downgrading PyTorch to the older versions 1.3.0 and 1.4.0; unfortunately, neither of them works on my cluster. My assumption is that something is wrong with this cluster (4× M40 24GB), but I haven't figured out the reason.

My M40 cluster info:
- 24× Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
- 120 GB memory
- 4× Tesla M40 24GB

I've listed the info above; maybe you could give some comments.

Then I cloned the code to another machine with 2× TITAN RTX 24GB and PyTorch 1.6.0, and distributed training works fine there.


innerlee commented Dec 9, 2020

It is nearly impossible to diagnose cluster issues without access to the machine. Maybe you can ask IT for help?


YibinXie commented Dec 9, 2020

> It is nearly impossible to diagnose cluster issues without access to the machine. Maybe you can ask IT for help?

Sure, it works anyway. Thanks for your reply.
By the way, have you thought about adding the PyTorch AMP feature to the project?


innerlee commented Dec 9, 2020

We have some fp16 support in progress in #200. PyTorch AMP is one way to do it, and it might be included in the near future.
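
For reference, a minimal sketch of the native PyTorch AMP training step (the `torch.cuda.amp` API available since 1.6); this is generic PyTorch usage, not the mmpose fp16 implementation tracked in #200:

```python
# Minimal sketch of a mixed-precision training step with torch.cuda.amp (PyTorch >= 1.6).
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(128, 10).cuda()          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()                            # scales the loss to avoid fp16 underflow

for step in range(10):                           # dummy batch instead of a real dataloader
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with autocast():                             # forward pass runs in mixed precision
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()                # backward on the scaled loss
    scaler.step(optimizer)                       # unscales grads, skips step on inf/nan
    scaler.update()
```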


innerlee commented Dec 9, 2020

You may post a request in #9 to help us prioritize.


YibinXie commented Dec 9, 2020

> You may post a request in #9 to help us prioritize.

Ok, thanks.

jin-s13 closed this as completed Dec 9, 2020
@MrRaptorious

Hey, do you know if I can run multiple GPUs on one system without using the distributed approach? I don't have access to the networking side of things and therefore can't run the dist_train script.

Does MMPose even support that? I checked runner.py in MMEngine and it seems like there is no multi-GPU training without opening ports or something like that.
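
(Not a maintainer answer, just a sketch for context: in plain PyTorch, `torch.nn.DataParallel` is the single-process multi-GPU path, splitting each batch across local GPUs with no process groups or open ports. Whether the current MMEngine runner still exposes an equivalent option is exactly the open question above.)

```python
# Sketch of single-process multi-GPU training with torch.nn.DataParallel (plain PyTorch,
# no process groups, no open ports). Illustrates the mechanism only; it does not answer
# whether the current MMEngine runner exposes an equivalent option.
import torch

model = torch.nn.Linear(128, 10).cuda()                  # placeholder model
model = torch.nn.DataParallel(model, device_ids=[0, 1])  # replicate across local GPUs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(64, 128, device="cuda")                  # batch is split across the GPUs
y = torch.randint(0, 10, (64,), device="cuda")
optimizer.zero_grad()
criterion(model(x), y).backward()                         # gradients gathered on GPU 0
optimizer.step()
```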
