Question about nccl p2p disable #631
There won't be any problem, just reduced performance (how much, if any, depends on the system). P2P not being functional is usually tied to ACS being enabled on the system. Disabling VT-d in the BIOS, passing iommu=pt to the Linux kernel command line, or disabling ACS in the PCI switches usually helps. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-direct
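For reference, one quick way to see whether P2P is usable between GPU pairs on a given machine (before or after the BIOS/kernel changes above) is to query CUDA through PyTorch; this is a minimal sketch, not part of NCCL itself, and on systems where ACS silently breaks P2P it may still report True even though actual transfers fail, so the BIOS/kernel checks above remain the authoritative fix:

```python
import torch

# Minimal sketch: report which GPU pairs advertise CUDA peer-to-peer access.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: P2P {'available' if ok else 'NOT available'}")
```

Running `nvidia-smi topo -m` also shows the PCIe/NVLink topology that NCCL will see between the GPUs.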
Thank you!
Check this issue - NVIDIA/nccl-tests#117 and this one on the NVIDIA website. As of now, it appears that NVIDIA has reproduced the issue and may be working on a fix.
And as per the latest comment, it appears that they are not going to fix it for 4090s.
Hi @sjeaugey, I also ran into this issue and it bothered me for a long time. Thank you very much for your advice. NCCL_P2P_DISABLE=1 doesn't work for me. I wonder how to pass iommu=pt to the Linux kernel command line.
If
Yeah, I will open a new issue. Thank you!
Hi, I don't know much about NCCL.
I want to train a deep learning model with multiple GPU devices within a single node using PyTorch.
I do not know the exact reason, but the model "freezes" (gets stuck) when using 4 or more GPUs. So, while trying various things, I confirmed that the model works by setting the variable NCCL_P2P_DISABLE=1.
As far as I know, if NCCL_P2P_DISABLE is set to 1, communication between GPUs is performed using shared memory instead of P2P/IPC.
I would like to know what potential problems can arise when NCCL_P2P_DISABLE is set to 1 like this. I'm guessing there won't be any problems, right?
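For context, this is roughly the kind of setup being described; a minimal sketch assuming a standard `torchrun` launch (the variable just needs to be visible in the environment before the first NCCL communicator is created, whether exported in the shell or set at the top of the script):

```python
import os

# Must be set before NCCL initializes; forces shared-memory (SHM) transport
# between GPUs on the same node instead of CUDA P2P/IPC.
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal sketch assuming launch via: torchrun --nproc_per_node=4 train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to(local_rank)
model = DDP(model, device_ids=[local_rank])
# ... training loop as usual; only the inter-GPU transport changes.
```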