
Using GPUDirect and NCCL with torch 2.5 (nvidia-nccl-cu12 2.21.5) #414

Open
danielkovtun opened this issue Nov 5, 2024 · 0 comments
Hello,

I am working on upgrading PyTorch to the latest stable release (2.5.1) and have observed NCCL issues that I believe are tied to our use of the GPUDirect-TCPX + NCCL DaemonSets for A3 High VMs on GKE.

For context, I have a working PyTorch 2.1.2 setup that uses the DaemonSet and the other configuration described in the docs, and multi-node communication via NCCL works there. That PyTorch build was installed from a pre-built wheel, which pulled in the nvidia-nccl-cu12==2.18.1 dependency. With NCCL_DEBUG set, the version printed by torch.distributed is 2.18.1. The libraries mounted from the DaemonSet appear to be libnccl.so.2.18.5.
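For reference, this is the small sanity check I run inside the container to confirm which NCCL version torch itself reports (my own script, not something from the docs):

import os
import torch

# Ask NCCL to print its version and transport info when the first communicator is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")

print("torch version:", torch.__version__)
# Version of the NCCL library that this torch build loads, e.g. (2, 18, 1) or (2, 21, 5).
print("NCCL version reported by torch:", torch.cuda.nccl.version())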

For the PyTorch 2.5.1 setup, the base image was pytorch/pytorch:2.5.1-cuda12-cudnn9-devel, which pulls in the nvidia-nccl-cu12==2.21.5 dependency. With this image, the NCCL backend works single-node across all 8 devices, but multi-node runs fail with NCCL errors:

[5]:torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[5]:ncclInternalError: Internal check failed.
[5]:Last error:
[5]:NET/GPUDirectTCPX failed to connect socket
[5]:Exception raised from create at ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317 (most recent call first):

I'm happy to provide a more detailed stack trace if helpful.
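For completeness, the failure reproduces with a bare-bones all_reduce. The script below is my own minimal sketch (the file name and torchrun arguments are placeholders for our actual launch command), run with torchrun on each node:

# repro_allreduce.py -- minimal multi-node NCCL check (hypothetical file name).
# Example launch on each node (placeholder rendezvous endpoint):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<master-host>:29500 repro_allreduce.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")     # RANK/WORLD_SIZE supplied by torchrun
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)                          # this is where the ncclInternalError surfaces on 2.21.5
    print(f"rank {dist.get_rank()}: all_reduce ok, sum={x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()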

I was hoping the maintainers could help with a few questions about the DaemonSet and its version constraints. The docs for installing the GPUDirect binary and configuring NCCL via the DaemonSet state that the DaemonSet installs a specific NCCL library version.

Questions

  1. Does this mean that containers which request GPUs and use hostPath volume mounts to expose the library and binary from the VM's /home/kubernetes/bin/nvidia/lib64 directory will always get the specific NCCL library version hard-coded in the nccl-installer?
  2. Does this mean that applications compiled against a newer NCCL version (2.21.5) cannot run on A3 VMs with GPUDirect-TCPX?
  3. Is there a way to control the NCCL version installed by the DaemonSet? Inspecting the container, I see that the installer entrypoint /scripts/container_entry.sh install --install-nccl says it installs the "NCCL main branch". As far as I can tell, the NCCL installation simply copies a pre-built libnccl.so.2.18.5 from /var/lib/tcpx/lib64/ to /usr/local/nvidia/lib64 (see also the runtime check sketched after the snippet below):
install_nccl() {
  local -r nccltype=$1
  echo -n "Installing NCCL ${nccltype}, "
  if [[ "${nccltype}" == "nvtx" ]]; then
    cp -P /third_party/nccl-netsupport-nvtx/build/lib/libnccl.so* /var/lib/tcpx/lib64/
    cp -P /third_party/nccl-netsupport-nvtx/build/lib/libnccl.so* /var/lib/tcpxo/lib64/
    cp -P /third_party/nccl-netsupport-nvtx/build/lib/libnccl.so* /var/lib/fastrak/lib64/
  else
    cp -P /third_party/nccl-netsupport/build/lib/libnccl.so* /var/lib/tcpx/lib64/
    cp -P /third_party/nccl-netsupport/build/lib/libnccl.so* /var/lib/tcpxo/lib64/
    cp -P /third_party/nccl-netsupport/build/lib/libnccl.so* /var/lib/fastrak/lib64/
  fi
}
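To double-check which libnccl.so a given container actually loads at runtime (the wheel's copy under site-packages vs. the DaemonSet's copy in /usr/local/nvidia/lib64), I have been using a small single-process check like the one below. It is only a diagnostic sketch; the single-rank process group exists solely to force NCCL to initialize.

import os
import torch
import torch.distributed as dist

# Single-rank process group, just to force NCCL communicator creation.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)
dist.all_reduce(torch.ones(1, device="cuda"))

# /proc/self/maps lists every shared object mapped into this process,
# so this shows which libnccl file was actually picked up.
with open("/proc/self/maps") as f:
    nccl_paths = sorted({line.split()[-1] for line in f if "libnccl" in line})
print("libnccl mapped from:", nccl_paths)

dist.destroy_process_group()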