A100 GPU MIG feature support for trainer #10529
Comments
Dear @MinWang1997, I would recommend setting the Trainer's gpus=-1 and using MIG to slice the A100 as you wish. Here is a guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ Did you slice your A100 into 3g.20gb or something else? Best,
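For context, a minimal sketch of what the `gpus=-1` suggestion looks like in practice; only the `Trainer(gpus=-1)` line is the actual recommendation, while the `TinyModel` module and the random dataset are placeholders added purely for illustration:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(pl.LightningModule):
    """Hypothetical toy model, only here to show where gpus=-1 goes."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


train_loader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=8
)

# gpus=-1 tells the Trainer to use every CUDA device it can currently see;
# with MIG, what it "sees" is controlled by CUDA_VISIBLE_DEVICES.
trainer = pl.Trainer(gpus=-1, max_epochs=1)
trainer.fit(TinyModel(), train_loader)
```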
Hi @tchaton, Our institute slices each A100 into 8 slices of 10 GB each. The admins told me that when I submit LSF jobs, I need to specify 'num=1:mig=4' to get 1 GPU with 40 GB. It seems that they treat num=1 as 1 GPU instead of 4 slices.
Do I need to …
Hey @MinWang1997, Yes, for the time being, I believe that would be the simplest approach.
Hi Tchaton, Thank you for your reply!
Dear @MinWang1997, I personally don't have experience with MIG. I would say yes, it seems like the right behavior to set the GPUs as an env var. Won't mig=4 map to 4 GPUs?
@MinWang1997 Could you run …
Any update on this? I currently have access to an A100 machine. Is there multi-GPU test code I can run to check compatibility with MIG?
@Borda Is Lightning capable of working with a non-integer-valued CUDA_VISIBLE_DEVICES list? The MIG feature requires UUIDs as opposed to integers.
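To make the env-var approach discussed above concrete, here is a hedged sketch; the MIG UUID shown is a made-up placeholder, and the real identifiers would come from `nvidia-smi -L`:

```python
import os

# CUDA_VISIBLE_DEVICES must be set before CUDA is initialised, i.e. before
# torch is imported in a fresh process (or at least before any CUDA call).
# The UUID below is a made-up placeholder; list real ones with `nvidia-smi -L`.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-12345678-abcd-ef01-2345-6789abcdef01"

import torch

print(torch.cuda.device_count())  # expected: 1 (only the selected MIG slice)
```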
🚀 Feature
A100 GPU MIG feature support for trainer
Motivation
Hi, could you update the Trainer class to support the GPU MIG feature?
For our LSF scheduler, -gpu "num=1:mig=1" will give you the single smallest possible slice (1/7th).
-gpu "num=1:mig=7" gives you a single entire A100 GPU (80 GB memory), equivalent to the previous "num=1".
What should the gpus argument of the Trainer be: the real number of GPUs, or the number of slices?
For example, with num=1:mig=4, should it be gpus=1 or gpus=4?
I found that torch.cuda.device_count() returns 1, but I am not sure whether, if I pass 1 to the Trainer's gpus argument, it will use 10 GB or 40 GB of memory (a quick check is sketched after these questions).
How about num=2:mig=4?
Can we pass -1 to use all available GPUs? How will PyTorch Lightning decide the number of GPUs? Can we get this information printed?
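As a quick check (plain PyTorch, not a Lightning feature) one could inspect how many devices are visible and how much memory each exposes, i.e. whether the slice shows up as roughly 10 GB or 40 GB:

```python
import torch

# How many devices does PyTorch see, and how large are they?
print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 1024 ** 3:.1f} GiB")
```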
Pitch
It would be better to make the gpus argument unambiguous,
for example: separate -gpus and -migs arguments, or treat the full set of MIG slices from one A100 as one GPU.
Alternatives