Imbalanced GPU memory with DDP, single machine multiple GPUs #6568
Replies: 4 comments 1 reply
-
This fixed the same problem in a setup using plain PyTorch (no PL): https://discuss.pytorch.org/t/extra-10gb-memory-on-gpu-0-in-ddp-tutorial/118113
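For anyone landing here: as far as I recall, the fix in that thread comes down to having each rank pin itself to its own GPU before any CUDA work, and mapping checkpoint loads to the local device, so that stray allocations and CUDA contexts don't all pile up on cuda:0. A minimal sketch, assuming a hand-rolled DDP setup outside of Lightning:

```python
import torch
import torch.distributed as dist

def setup(rank: int, world_size: int):
    # Pin this process to its own GPU *before* any CUDA work, so lazily
    # created CUDA contexts and NCCL buffers don't all land on GPU 0.
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

# When restoring weights, map the checkpoint to the local GPU instead of
# letting torch.load default to the device the tensors were saved from
# (often cuda:0), which would otherwise inflate GPU 0's usage on every rank.
# state = torch.load("model.ckpt", map_location=f"cuda:{rank}")
```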
-
Did you manage to solve this issue in pytorch-lightning? I'm seeing something similar: one GPU maxes out its memory while the others don't. I'm using the ddp strategy, and I didn't see anything addressing this in the docs.
-
Is there a solution for this yet? I'm also running into this problem.
-
Passing this to the DataLoader constructor appears to work.
-
Hi all, multi-GPU question here. I have an imbalance in memory usage between my GPUs: GPU 0's memory usage is about 20% higher than the rest. This leads to an annoying bottleneck: while I have roughly 20% x (num_gpus - 1) of memory sitting unused, the peak on GPU 0 blocks me from utilizing it.
Reading through this blog post, which is aimed at plain PyTorch, I see that this can happen when not using a ModelParallel-style setup (attached image). I'm trying to figure out how to do the equivalent in Lightning, especially steps 2-3 in the backward row described in the attached image.
From the Lightning multi-GPU docs I couldn't figure it out; the model parallelism described there seems to be something different.
I have multiple GPUs on a single machine and I'm training with DDP and DDPPlugin(find_unused_parameters=True).
Any ideas / resources / solutions would be much appreciated.
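For concreteness, a minimal sketch of the Trainer setup described above; the GPU count and the commented-out fit call are assumptions, and the API shown is the 1.x-era Lightning interface that DDPPlugin belongs to:

```python
import pytorch_lightning as pl
from pytorch_lightning.plugins import DDPPlugin

trainer = pl.Trainer(
    gpus=4,             # assumed count: single machine, multiple GPUs
    accelerator="ddp",  # one process per GPU
    plugins=DDPPlugin(find_unused_parameters=True),
)
# trainer.fit(model, datamodule=dm)  # model/datamodule omitted; only the DDP wiring is shown
```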