Question about nccl p2p disable #631

hw-protein · 2022-02-04T06:55:04Z

Hi, I don't know much about NCCL.

I want to train a deep learning model on multiple GPUs within a single node using PyTorch.

I do not know the exact reason, but training freezes (hangs) when I use 4 or more GPUs. After trying various things, I confirmed that it works when I set the variable NCCL_P2P_DISABLE=1.
As far as I know, if NCCL_P2P_DISABLE is set to 1, communication between GPUs goes through shared memory instead of P2P/IPC.
I would like to know what potential problems can arise when NCCL_P2P_DISABLE is set to 1 like this. I'm guessing there won't be any problems, right?
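
For readers who want to apply the same workaround per launch rather than globally, here is a minimal sketch in Python. It only builds an environment dictionary; the helper name `nccl_env` and its parameters are hypothetical and not part of NCCL or PyTorch. `NCCL_P2P_DISABLE` is the variable discussed above, and `NCCL_DEBUG=INFO` is NCCL's logging variable, added here as an optional aid for seeing which transport is selected.

```python
import os
from typing import Dict, Optional


def nccl_env(
    base: Optional[Dict[str, str]] = None,
    disable_p2p: bool = True,
    debug: bool = False,
) -> Dict[str, str]:
    """Return a copy of an environment with NCCL settings applied.

    `nccl_env` is a hypothetical helper for illustration only.
    """
    # Copy so the caller's (or the process's) environment is not mutated.
    env = dict(os.environ if base is None else base)
    if disable_p2p:
        # Ask NCCL not to use the GPU peer-to-peer transport.
        env["NCCL_P2P_DISABLE"] = "1"
    else:
        # Remove the override so NCCL picks the transport on its own.
        env.pop("NCCL_P2P_DISABLE", None)
    if debug:
        # Make NCCL log its initialization and transport choices.
        env["NCCL_DEBUG"] = "INFO"
    return env
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when starting the training job, for example with a launcher such as `torchrun` and a script of your own (the script name would be specific to your setup). The variable has to be in place before NCCL initializes, which is why it is set on the launching side here.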

sjeaugey · 2022-02-04T16:17:31Z

There won't be any problem, just reduced performance (how much if any depends on the system).

P2P not being functional is usually tied to ACS being enabled on the system. Disabling VT-d in the BIOS, passing iommu=pt on the Linux kernel command line, or disabling ACS in the PCI switches usually helps.

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-direct
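
A rough way to check the two conditions mentioned above can be sketched in Python. This assumes the check described in the linked troubleshooting page (looking for `SrcValid+` on `ACSCtl` lines in `lspci -vvv` output) and assumes the kernel command line is available as text, for example from `/proc/cmdline`. The function names `has_iommu_pt` and `acs_enabled_lines` are hypothetical helpers, not part of any library.

```python
from typing import List


def has_iommu_pt(cmdline: str) -> bool:
    """Return True if `iommu=pt` appears as a kernel command-line argument."""
    # Kernel arguments are whitespace-separated tokens.
    return "iommu=pt" in cmdline.split()


def acs_enabled_lines(lspci_output: str) -> List[str]:
    """Return the ACSCtl lines that report SrcValid+ (ACS possibly enabled)."""
    return [
        line.strip()
        for line in lspci_output.splitlines()
        if "ACSCtl" in line and "SrcValid+" in line
    ]
```

On a Linux machine, `has_iommu_pt` could be fed the contents of `/proc/cmdline`, and `acs_enabled_lines` the output of `lspci -vvv` run with root privileges. A non-empty result from the second function suggests ACS may be enabled on some PCI bridges; it is a hint, not a definitive diagnosis. How `iommu=pt` is added to the command line depends on the bootloader; on GRUB-based systems it is typically appended to the kernel options in the GRUB defaults file and the GRUB configuration is then regenerated, but that is an assumption about the setup rather than something stated in this thread.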

hw-protein · 2022-02-05T05:55:35Z

Thank you!

mirachakshu · 2023-02-07T06:26:57Z

Check this issue, NVIDIA/nccl-tests#117, and this thread on the NVIDIA forum. As of now, it appears that NVIDIA has reproduced the issue and may be working on a fix.

https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/8

mirachakshu · 2023-02-21T21:41:05Z

As per the latest comment, it appears that they are not going to fix it for 4090s.

https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/16

Haoang97 · 2023-05-31T08:03:07Z

Hi @sjeaugey, I have also run into this issue, and it has bothered me for a long time. Thank you very much for your advice. NCCL_P2P_DISABLE=1 doesn't work for me. I wonder how to pass iommu=pt on the Linux kernel command line.

sjeaugey · 2023-05-31T08:09:37Z

If NCCL_P2P_DISABLE=1 doesn't work for you, then you likely have a different issue, and it would be better to open another issue and describe the problem in detail there.

Haoang97 · 2023-05-31T08:13:01Z

> If NCCL_P2P_DISABLE=1 doesn't work for you, then you likely have a different issue, and it would be better to open another issue and describe the problem in detail there.

Yeah, I will open a new issue. Thank you!

hw-protein closed this as completed Feb 5, 2022

kwen2501 mentioned this issue Feb 8, 2022

Cannot use DDP with NCCL backend on A100 GPUs pytorch/pytorch#68735

Closed

pineking mentioned this issue Aug 1, 2022

The official paddle 2.3.0 image hangs when running on multiple A40 GPUs PaddlePaddle/Paddle#44777

Closed

Macsim2 mentioned this issue May 19, 2023

A100 nccl-test normal bandwidth #841

Open

lvhan028 mentioned this issue Jun 12, 2024

[Bug] Official image doesn't work for 4090 on CUDA 12.3 (but works for all other CUDA versions, and works for 12.3 on other GPU types) InternLM/lmdeploy#1750

Open
