
A100 GPU MIG feature support for trainer #10529

Open
minwang-ai opened this issue Nov 14, 2021 · 9 comments

Labels: feature (Is an improvement or enhancement), trainer: argument
@minwang-ai

🚀 Feature

A100 GPU MIG feature support for trainer

Motivation

Hi, could you update the Trainer class to support the MIG (Multi-Instance GPU) feature of A100 GPUs?

For our LSF scheduler, `-gpu "num=1:mig=1"` gives you the single smallest possible slice (1/7th of an A100), and `-gpu "num=1:mig=7"` gives you a single entire A100 GPU (80 GB memory), equivalent to the previous `"num=1"`.

What should the gpus argument of the Trainer be: the real number of GPUs, or the number of slices? For example, with `num=1:mig=4`, should it be `gpus=1` or `gpus=4`?

I found that `torch.cuda.device_count()` returns 1, but if I pass 1 to the gpus argument of the Trainer, I am not sure whether it will use 10 GB or 40 GB of memory.

How about `num=2:mig=4`?

Can we pass -1 to use all available GPUs? How will PyTorch Lightning decide the number of GPUs? Can we get that information printed?
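
For reference, a minimal plain-PyTorch sketch (independent of Lightning) to check what a given MIG configuration actually exposes:

```python
import torch

# Check how many devices CUDA sees and how much memory each exposes;
# under num=1:mig=4 on an 80 GB A100 this should report one device
# with roughly 40 GB.
print(f"visible devices: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"device {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```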

Pitch

It would be better to make the gpus argument unambiguous, for example by introducing separate gpus and migs arguments, or by treating all MIG slices from one A100 as a single GPU.

Alternatives

akihironitta added the feature (Is an improvement or enhancement) label on Nov 15, 2021
kaushikb11 self-assigned this on Nov 15, 2021
tchaton (Contributor) commented Nov 15, 2021

Dear @MinWang1997,

I would recommend setting the Trainer's gpus=-1 and using the LSF `num:mig` option to slice the A100 as you wish. Here is a guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
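
For example, a minimal sketch, assuming a 1.x Lightning version where the Trainer takes a gpus argument:

```python
import pytorch_lightning as pl
import torch

# gpus=-1 asks Lightning to use every device CUDA exposes; under MIG that is
# the set of visible MIG instances, not the number of physical A100s.
trainer = pl.Trainer(gpus=-1)
print(f"CUDA sees {torch.cuda.device_count()} device(s)")
```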

Did you slice your A100 into 3g.20gb or anything else?

Best,
T.C

@minwang-ai (Author)

> Did you slice your A100 into 3g.20gb or anything else?

Hi tchaton,

Our institute slices the A100 into 8 slices of 10 GB each. The admins told me that when I submit LSF jobs, I need to specify `num=1:mig=4` for 1 GPU with 40 GB. It seems that they treat num=1 as 1 GPU rather than as 4 slices.

> I would recommend setting the Trainer gpus=-1

Do I need to export CUDA_VISIBLE_DEVICES=0 for 1 GPU, or CUDA_VISIBLE_DEVICES=0,1 for `num=2:mig=4`, in my .bashrc file, bsub file, or Python script when I set gpus=-1?

tchaton (Contributor) commented Nov 15, 2021

Hey @MinWang1997,

Yes, for the time being, I believe that would be the simplest approach.
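
For instance, a minimal sketch of setting it from the Python script itself. CUDA_VISIBLE_DEVICES has to be set before CUDA is initialized, so the safe convention is to set it before the first torch import; the integer indices here are an assumption for a setup that does not require MIG UUIDs:

```python
import os

# Select one visible device; use "0,1" for two devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # imported after setting the env var on purpose

print(torch.cuda.device_count())  # should report 1
```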

@minwang-ai (Author)

> Do I need to export CUDA_VISIBLE_DEVICES=0 for 1 GPU, or CUDA_VISIBLE_DEVICES=0,1 for `num=2:mig=4`, in my .bashrc file, bsub file, or Python script

Hi Tchaton,

Thank you for your reply!
Could you have a look at this question?

tchaton (Contributor) commented Nov 15, 2021

Dear @MinWang1997,

I personally don't have experience with MIG. I would say yes, setting the GPUs via the environment variable seems like the right behavior.

Won't mig=4 map to 4 GPUs?

@kaushikb11 (Contributor)

@MinWang1997

Could you run `sudo nvidia-smi mig -lgi` and let us know the list of available GPU instances?

Borda self-assigned this on Nov 7, 2022
Sinan81 commented Feb 9, 2023

Any update on this? I currently have access to an A100 machine. Is there multi-GPU test code I can run to check compatibility with MIG?

Sinan81 commented Feb 15, 2023

#16755

Sinan81 commented Feb 15, 2023

@Borda Is Lightning capable of working with a non-integer-valued CUDA_VISIBLE_DEVICES list? The MIG feature requires UUIDs as opposed to integers.
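
For context, a MIG-addressed device list looks something like the sketch below; MIG instances are selected by UUID rather than integer index, and the real UUIDs come from `nvidia-smi -L` (the one here is a hypothetical placeholder):

```python
import os

# Hypothetical MIG UUID for illustration only; list real ones with `nvidia-smi -L`.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-d4f7c2e1-9a8b-4c3d-b2e1-0f9a8b7c6d5e"
```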
