A100 GPU MIG feature support for trainer #10529
Comments
Dear @MinWang1997, I would recommend setting the Trainer's gpus=-1 and using MIG to slice the A100 as you wish. Here is a guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ Did you slice your A100 into 3g.20gb or something else? Best,
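For context, a minimal sketch of what the `gpus=-1` suggestion looks like in practice; only the `Trainer(gpus=-1)` line is the actual recommendation, while the `TinyModel` module and the random dataset are placeholders added purely for illustration:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(pl.LightningModule):
    """Hypothetical toy model, only here to show where gpus=-1 goes."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


train_loader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=8
)

# gpus=-1 tells the Trainer to use every CUDA device it can currently see;
# with MIG, what it "sees" is controlled by CUDA_VISIBLE_DEVICES.
trainer = pl.Trainer(gpus=-1, max_epochs=1)
trainer.fit(TinyModel(), train_loader)
```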
Hi @tchaton, Our institute slices each A100 into 8 slices of 10 GB each. The admins told me that when I submit LSF jobs, I need to specify 'num=1:mig=4' to get 1 GPU with 40 GB. It seems that they treat num=1 as 1 GPU instead of 4 slices.
Do I need to …
Hey @MinWang1997, Yes, for the time being, I believe that would be the simplest approach.
Hi Tchaton, Thank you for your reply!
Dear @MinWang1997, I personally don't have experience with MIG. I would say yes, it seems like the right behavior to set the GPUs as an env var. Won't mig=4 map to 4 GPUs?
@MinWang1997 Could you run …
Any update on this? I currently have access to an A100 machine. Is there multi-GPU test code I can run to check compatibility with MIG?
@Borda Is Lightning capable of working with a non-integer-valued CUDA_VISIBLE_DEVICES list? The MIG feature requires UUIDs as opposed to integers.
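To make the env-var approach discussed above concrete, here is a hedged sketch; the MIG UUID shown is a made-up placeholder, and the real identifiers would come from `nvidia-smi -L`:

```python
import os

# CUDA_VISIBLE_DEVICES must be set before CUDA is initialised, i.e. before
# torch is imported in a fresh process (or at least before any CUDA call).
# The UUID below is a made-up placeholder; list real ones with `nvidia-smi -L`.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-12345678-abcd-ef01-2345-6789abcdef01"

import torch

print(torch.cuda.device_count())  # expected: 1 (only the selected MIG slice)
```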
🚀 Feature
A100 GPU MIG feature support for trainer
Motivation
Hi, could you update the Trainer class to support the GPU MIG feature?
For our LSF scheduler, -gpu "num=1:mig=1" will give you the single smallest possible slice (1/7th).
-gpu "num=1:mig=7" gives you a single entire A100 GPU (80 GB memory), equivalent to the previous "num=1".
What should the gpus argument of the Trainer be: the real number of GPUs, or the number of slices?
For example, with num=1:mig=4, should it be gpus=1 or gpus=4?
I found that torch.cuda.device_count() returns 1, but I am not sure whether, if I pass 1 to the Trainer's gpus argument, it will use 10 GB or 40 GB of memory (a quick check is sketched after these questions).
How about num=2:mig=4?
Can we pass -1 to use all available GPUs? How will PyTorch Lightning decide the number of GPUs? Can we get this information printed?
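As a quick check (plain PyTorch, not a Lightning feature) one could inspect how many devices are visible and how much memory each exposes, i.e. whether the slice shows up as roughly 10 GB or 40 GB:

```python
import torch

# How many devices does PyTorch see, and how large are they?
print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 1024 ** 3:.1f} GiB")
```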
Pitch
It would be better to make the gpus argument unambiguous,
for example: separate -gpus and -migs arguments, or treat the full set of MIG slices from one A100 as one GPU.
Alternatives