Question about nccl p2p disable #631

Closed
hw-protein opened this issue Feb 4, 2022 · 7 comments

Comments

@hw-protein

Hi, I don't know much about NCCL.

I want to train a deep learning model on multiple GPUs within a single node using PyTorch.

I do not know the exact reason, but the model freezes (gets stuck) when using 4 or more GPUs. After trying various things, I confirmed that the model works when the environment variable NCCL_P2P_DISABLE is set to 1.
As far as I know, if NCCL_P2P_DISABLE is set to 1, communication between GPUs is performed through shared memory instead of P2P/IPC.
I would like to know what potential problems can arise when NCCL_P2P_DISABLE is set to 1 like this. I'm guessing there won't be any problems, right?
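
For readers trying the same workaround, here is a minimal sketch of how the variable could be applied from Python before launching training. The helper name `nccl_env` and the launch command (`torchrun` with a hypothetical `train.py`) are illustrative assumptions, not something from this thread. `NCCL_P2P_DISABLE` and `NCCL_DEBUG` are documented NCCL environment variables; setting `NCCL_DEBUG=INFO` should make NCCL log which transport it selects, which helps confirm that P2P is actually off.

```python
import os


def nccl_env(disable_p2p=True, debug=False, base=None):
    """Return a copy of the environment with NCCL settings applied.

    `base` defaults to the current process environment; pass a dict
    to build the environment from something else.
    """
    env = dict(os.environ if base is None else base)
    if disable_p2p:
        # Same effect as `export NCCL_P2P_DISABLE=1` in the shell.
        env["NCCL_P2P_DISABLE"] = "1"
    if debug:
        # Ask NCCL for informational logs, including transport selection.
        env["NCCL_DEBUG"] = "INFO"
    return env


if __name__ == "__main__":
    # Hypothetical launch command; adjust the script name and GPU count.
    cmd = ["torchrun", "--nproc_per_node=4", "train.py"]
    env = nccl_env(debug=True)
    print("would run:", " ".join(cmd))
    print("with NCCL_P2P_DISABLE =", env["NCCL_P2P_DISABLE"])
    # To actually launch: subprocess.run(cmd, env=env, check=True)
```

The variable has to be in the environment before NCCL initializes, so setting it in the launcher (or exporting it in the shell) is safer than setting it late inside the training script.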

@sjeaugey
Member

sjeaugey commented Feb 4, 2022

There won't be any problem, just reduced performance (how much if any depends on the system).

P2P not being functional is usually tied to ACS being enabled on the system. Disabling VT-d in the BIOS, passing iommu=pt on the Linux kernel command line, or disabling ACS in the PCI switches usually helps.

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-direct
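
As a rough aid for the second suggestion above, the following sketch checks whether `iommu=pt` is present on the running kernel's command line. It assumes a Linux system where `/proc/cmdline` is readable; the helper names `kernel_params` and `iommu_passthrough` are made up for illustration. It does not check ACS or VT-d, which have to be inspected separately.

```python
def kernel_params(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as `quiet` map to None. Quoted values that
    contain spaces are not handled; this is a simple sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def iommu_passthrough(cmdline):
    """True if the command line contains exactly `iommu=pt`."""
    return kernel_params(cmdline).get("iommu") == "pt"


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel booted with.
    with open("/proc/cmdline") as f:
        current = f.read()
    print("iommu=pt active:", iommu_passthrough(current))
```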

@hw-protein
Author

Thank you!

@mirachakshu

Check this issue, NVIDIA/nccl-tests#117, and this thread on the NVIDIA developer forums. As of now, it appears that NVIDIA has reproduced the issue and may be working on a fix.

https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/8

@mirachakshu

mirachakshu commented Feb 21, 2023

As per the latest comment, it appears that they are not going to fix it for 4090s.

https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/16

@Haoang97

Hi @sjeaugey, I also ran into this issue, and it has bothered me for a long time. Thanks very much for your advice. NCCL_P2P_DISABLE=1 doesn't work for me. How do I pass iommu=pt on the Linux kernel command line?
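
The thread itself does not answer this, so the following is only a hedged sketch for GRUB-based distributions. There, kernel parameters are commonly set through the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub`, after which the GRUB configuration is regenerated (for example with `sudo update-grub` on Debian/Ubuntu) and the machine is rebooted. The function below only transforms the text of such a file and writes nothing to disk; its name `add_kernel_param` is an assumption for illustration, and other bootloaders need a different procedure.

```python
def add_kernel_param(grub_text, param="iommu=pt",
                     key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with `param` appended to the `key="..."` line.

    Raises ValueError if no such line exists. The result is unchanged
    if the parameter is already present.
    """
    prefix = key + '="'
    out, found = [], False
    for line in grub_text.splitlines():
        stripped = line.strip()
        if (stripped.startswith(prefix) and stripped.endswith('"')
                and len(stripped) > len(prefix)):
            # Text between the quotes is a space-separated parameter list.
            params = stripped[len(prefix):-1].split()
            if param not in params:
                params.append(param)
            line = prefix + " ".join(params) + '"'
            found = True
        out.append(line)
    if not found:
        raise ValueError("no %s line found" % key)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample), end="")
```

Reviewing the output by hand before copying it into place is advisable, since a malformed kernel command line can affect booting.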

@sjeaugey
Member

If NCCL_P2P_DISABLE=1 doesn't work for you, then you likely have a different issue, and it would be better to open another issue and describe the problem in detail there.

@Haoang97

> If NCCL_P2P_DISABLE=1 doesn't work for you, then you likely have a different issue, and it would be better to open another issue and describe the problem in detail there.

Yeah, I will open a new issue. Thank you!


4 participants