
Test CUDA failure common.cu:941 'invalid device ordinal' when testing two nodes with nvhpc #263

Open
heya5 opened this issue Nov 4, 2024 · 3 comments

Comments

heya5 commented Nov 4, 2024

I compiled nccl-tests with this command:

make MPI=1 MPI_HOME=${NVHPC_ROOT}/comm_libs/12.4/hpcx/hpcx-2.19/ompi NCCL_HOME=${NVHPC_ROOT}/comm_libs/nccl CUDA_HOME=${NVHPC_ROOT}/cuda

Then I ran this command to test all_reduce_perf:

mpirun --host g0010:8,g0016:8 \
-x LD_LIBRARY_PATH=${NVHPC_ROOT}/comm_libs/nccl/lib:${NVHPC_ROOT}/cuda/lib64 \
-x PATH=${NVHPC_ROOT}/comm_libs/mpi/bin:${NVHPC_ROOT}/compilers/bin:$PATH \
-x NCCL_SOCKET_IFNAME=ib0 \
-x NCCL_DEBUG=INFO \
-x NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH \
-x NCCL_PXN_DISABLE=0 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_ALGO=Ring \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 \
/home/clouduser/nccl-tests/build/all_reduce_perf -b 128M -e 2G -f 2 -g 8

And I got this error:

# Using devices
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888386: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888385: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888382: Test failure common.cu:891
g0016: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0016 pid 3645592: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888383: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888387: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888384: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888388: Test failure common.cu:891
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
g0016: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0016 pid 3645590: Test failure common.cu:891
g0016: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0016 pid 3645594: Test failure common.cu:891
g0016: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0016 pid 3645588: Test failure common.cu:891
g0016: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0016 pid 3645593: Test failure common.cu:891
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[21831,1],5]
  Exit code:    2

NOTE: When I directly use openmpi instead of nvhpc, the test runs successfully.

@AddyLaddy
Collaborator

Normally we use mpirun to launch one process per GPU, so you don't need -g 8 on the nccl-tests command line in that case.
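The arithmetic behind the 'invalid device ordinal' can be sketched as follows. This is a hypothetical illustration of how consecutive device ordinals get assigned per rank when -g is combined with multiple ranks per node; the exact indexing in common.cu may differ.

```shell
# Hypothetical sketch: with 8 MPI ranks per node and -g 8, each rank
# asks for 8 consecutive device ordinals, so rank 1 already requests
# devices 8..15 on a node that only has ordinals 0..7 -- hence
# 'invalid device ordinal'. With -g 1, every rank stays within 0..7.
g=8                      # GPUs requested per process (-g 8)
gpus_per_node=8
for rank in 0 1; do
  first=$((rank * g))
  last=$((rank * g + g - 1))
  echo "rank $rank requests devices $first..$last (node has 0..$((gpus_per_node - 1)))"
done
```

With one process per GPU (8 ranks per host in the --host list) and -g 1, each rank uses exactly one valid ordinal.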


heya5 commented Nov 5, 2024

@AddyLaddy Thanks!
After changing -g 8 to -g 1, I got this error:

g0016:3715149:3715349 [0] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.16<48059> with error 4, opcode 0, len 0, vendor err 81 (Send)
g0016:3715149:3715349 [0] NCCL INFO transport/net.cc:1008 -> 6
g0016:3715149:3715349 [0] NCCL INFO proxy.cc:679 -> 6
g0016:3715149:3715349 [0] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

g0016:3715153:3715355 [4] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.16<55431> with error 4, opcode 0, len 0, vendor err 81 (Send)
g0016:3715153:3715355 [4] NCCL INFO transport/net.cc:1008 -> 6
g0016:3715153:3715355 [4] NCCL INFO proxy.cc:679 -> 6
g0016:3715153:3715355 [4] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

I found an issue that may help solve the problem, NVIDIA/nccl#928,
so I added -x NCCL_NET_GDR_LEVEL=0 to the command:

mpirun --host g0010:8,g0016:8 \
-x LD_LIBRARY_PATH=${NVHPC_ROOT}/comm_libs/nccl/lib:${NVHPC_ROOT}/cuda/lib64 \
-x PATH=${NVHPC_ROOT}/comm_libs/mpi/bin:${NVHPC_ROOT}/compilers/bin:$PATH \
-x NCCL_SOCKET_IFNAME=ib0 \
-x NCCL_DEBUG=INFO \
-x NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH \
-x NCCL_PXN_DISABLE=0 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_ALGO=Ring \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 \
-x NCCL_NET_GDR_LEVEL=0 \
/home/clouduser/nccl-tests/build/all_reduce_perf -b 128M -e 2G -f 2 -g 1

And I got a new error:

g0016:3715926:3716128 [3] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.16<54520> with error 12, opcode 0, len 5350, vendor err 129 (Recv)
g0016:3715926:3716128 [3] NCCL INFO transport/net.cc:1134 -> 6
g0016:3715926:3716128 [3] NCCL INFO proxy.cc:679 -> 6
g0016:3715926:3716128 [3] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

g0010:958137:958336 [3] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.10<51872> with error 12, opcode 0, len 5413, vendor err 129 (Recv)
g0010:958137:958336 [3] NCCL INFO transport/net.cc:1134 -> 6
g0010:958137:958336 [3] NCCL INFO proxy.cc:679 -> 6
g0010:958137:958336 [3] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

I also tried adding -x NCCL_IB_GID_INDEX=3, but I still get error 12.

@AddyLaddy
Collaborator

Yeah, the first NET/IB error looks like the typical one when ACS is not disabled and GDRDMA is used.
The second one looks like a typical connection timeout issue when the nodes cannot communicate via the NET/IB device(s).

I'd suggest resolving the ACS issue and also using the perftest suite to check that each node can communicate successfully over the NET/IB devices, using something like ib_write_bw or similar.
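A minimal sketch of the suggested checks, assuming the perftest suite is installed on both nodes and mlx5_0 is one of the HCAs from the NCCL_IB_HCA list above. These commands need the actual cluster hardware, so treat them as an illustration rather than a recipe:

```shell
# 1) Check for enabled ACS on PCIe bridges; a non-zero SrcValid field
#    in ACSCtl means ACS is on, which typically breaks GPUDirect RDMA:
sudo lspci -vvv | grep -i "ACSCtl"

# 2) Raw IB bandwidth between the two nodes, one HCA at a time.
# On g0016 (server side):
ib_write_bw -d mlx5_0
# On g0010 (client side), connecting to the server:
ib_write_bw -d mlx5_0 g0016
```

Repeat the bandwidth test for each HCA pair in the NCCL_IB_HCA list to rule out a single bad link.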

Also be careful with NCCL_IB_HCA=mlx5_1, as that will select all NICs with that prefix, so mlx5_10, mlx5_11, etc. If those exist on this platform, it may not have been your intention for NCCL to select them. You can instead use NCCL_IB_HCA==mlx5_1 etc.
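For example, a sketch of the two matching modes (the leading '=' inside the value switches NCCL from prefix matching to exact matching):

```shell
# Prefix match: also picks up mlx5_10, mlx5_11, ... if they exist
-x NCCL_IB_HCA=mlx5_1
# Exact match (note the extra '='): selects only mlx5_1
-x NCCL_IB_HCA==mlx5_1
# Exact-match form of the full list used in the commands above
-x NCCL_IB_HCA==mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9
```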
