
what is cu:990 error? how to solve this problem? #230

Open
MAKER-park opened this issue Jun 27, 2024 · 5 comments

Comments

@MAKER-park

Thank you for your attention to this problem.
My workstation spec is:
RTX A4000 ×2
WSL2 (Ubuntu 22.04)
cuDNN 8.9
(base) heartlab@DESKTOP-GGBQPHK:~/nccl-tests$ nvidia-smi
Fri Jun 28 05:15:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 551.61 CUDA Version: 12.4 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A4000 On | 00000000:65:00.0 Off | Off |
| 41% 37C P8 6W / 140W | 17MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A4000 On | 00000000:B3:00.0 On | Off |
| 41% 37C P8 7W / 140W | 571MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 31 G /Xwayland N/A |
| 1 N/A N/A 31 G /Xwayland N/A |
+---------------------------------------------------------------------------------------+

And I ran this command after the make step finished:
(base) heartlab@DESKTOP-GGBQPHK:~/nccl-tests$ mpirun -np 2 --allow-run-as-root -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices

nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices

Rank 0 Group 0 Pid 64755 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000

Rank 0 Group 0 Pid 64756 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000

DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO Bootstrap : Using eth0:172.30.81.89<0>
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO Bootstrap : Using eth0:172.30.81.89<0>
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO cudaDriverVersion 12040
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO NCCL version 2.22.3+cuda12.0
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO cudaDriverVersion 12040
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO NCCL version 2.22.3+cuda12.0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO NET/IB : No device found.
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO NET/Socket : Using [0]eth0:172.30.81.89<0> [1]veth9d2d103:fe80::b051:e3ff:febe:607b%veth9d2d103<0>
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Using network Socket
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO NET/IB : No device found.
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO NET/Socket : Using [0]eth0:172.30.81.89<0> [1]veth9d2d103:fe80::b051:e3ff:febe:607b%veth9d2d103<0>
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Using network Socket
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO ncclCommInitRank comm 0x5567a8af3a00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 65000 commId 0x1e41f00635db9132 - Init START
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO comm 0x5567a8af3a00 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 00/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 01/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 02/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 03/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 04/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 05/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 06/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 07/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 08/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 09/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 10/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 11/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 12/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 13/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 14/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 15/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 16/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 17/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 18/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 19/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 20/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 21/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 22/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 23/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 24/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 25/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 26/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 27/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 28/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 29/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 30/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 31/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO P2P Chunksize set to 131072

DESKTOP-GGBQPHK:64756:64771 [0] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO init.cc:544 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO init.cc:1156 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO init.cc:1408 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO group.cc:70 -> 1 [Async thread]
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO group.cc:420 -> 1
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO group.cc:546 -> 1
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO group.cc:101 -> 1
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO init.cc:1761 -> 1
DESKTOP-GGBQPHK: Test NCCL failure common.cu:990 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. DESKTOP-GGBQPHK pid 64756: Test failure common.cu:876
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO ncclCommInitRank comm 0x5567734f9a00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 65000 commId 0xd904a9f238296abf - Init START
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO comm 0x5567734f9a00 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 00/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 01/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 02/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 03/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 04/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 05/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 06/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 07/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 08/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 09/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 10/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 11/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 12/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 13/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 14/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 15/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 16/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 17/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 18/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 19/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 20/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 21/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 22/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 23/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 24/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 25/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 26/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 27/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 28/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 29/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 30/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 31/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO P2P Chunksize set to 131072

DESKTOP-GGBQPHK:64755:64773 [0] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO init.cc:544 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO init.cc:1156 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO init.cc:1408 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO group.cc:70 -> 1 [Async thread]
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO group.cc:420 -> 1
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO group.cc:546 -> 1
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO group.cc:101 -> 1
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO init.cc:1761 -> 1
DESKTOP-GGBQPHK: Test NCCL failure common.cu:990 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. DESKTOP-GGBQPHK pid 64755: Test failure common.cu:876

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[26406,1],1]
Exit code: 3

What is causing this problem? I just want TensorFlow multi-GPU to reduce VRAM pressure...

@AddyLaddy
Collaborator

It looks like you're running a single-process test twice and both processes are using the same device. You need to compile nccl-tests with MPI=1 for this to work.
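A minimal sketch of that rebuild, run from inside an nccl-tests checkout. The MPI_HOME and CUDA_HOME paths below are assumptions for a typical Ubuntu Open MPI install; adjust them to your system:

```shell
# Hypothetical paths -- point MPI_HOME/CUDA_HOME at your own installs.
MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
CUDA_HOME=/usr/local/cuda

# Only attempt the rebuild inside an nccl-tests checkout. With MPI=1, each
# mpirun process learns its rank/size from MPI and joins one shared NCCL
# communicator, instead of every process acting as rank 0 of 1.
if [ -f Makefile ] && [ -d src ]; then
    make clean >/dev/null 2>&1 || true
    make MPI=1 MPI_HOME="$MPI_HOME" CUDA_HOME="$CUDA_HOME"
fi

echo "building with MPI_HOME=$MPI_HOME"
```

Without MPI=1 the binary ignores the mpirun launch entirely, which is why the original run printed "nranks 1" twice.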

@MAKER-park
Author

@AddyLaddy thank you for the reply!

So you mean that instead of 'make' I should use 'make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl'?

But this command does not work on my system...

Here is my result:
(TF) heartlab@DESKTOP-GGBQPHK:~/nccl-tests$ make MPI=1 MPI_HOME=/usr CUDA_HOME=/usr/local/cuda NCCL_HOME=/home/heartlab/anaconda3/envs/TF
make -C src build BUILDDIR=/home/heartlab/nccl-tests/build
make[1]: Entering directory '/home/heartlab/nccl-tests/src'
Compiling timer.cc > /home/heartlab/nccl-tests/build/timer.o
Compiling /home/heartlab/nccl-tests/build/verifiable/verifiable.o
Compiling all_reduce.cu > /home/heartlab/nccl-tests/build/all_reduce.o
In file included from all_reduce.cu:8:
common.h:14:10: fatal error: mpi.h: No such file or directory
14 | #include "mpi.h"
| ^~~~~~~
compilation terminated.
make[1]: *** [Makefile:94: /home/heartlab/nccl-tests/build/all_reduce.o] Error 1
make[1]: Leaving directory '/home/heartlab/nccl-tests/src'
make: *** [Makefile:20: src.build] Error 2

I have clearly installed MPI and set up my .bashrc file:

export PATH=/usr/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/openmpi/lib:$LD_LIBRARY_PATH
export C_INCLUDE_PATH=/usr/lib/x86_64-linux-gnu/openmpi/include:/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi:$C_INCLUDE_PATH

But a plain 'make' works fine. What is happening in my case? Haha...

@MAKER-park
Author

@AddyLaddy
I found the MPI library location:

make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/ CUDA_HOME=/usr/local/cuda NCCL_HOME=/home/heartlab/anaconda3/envs/TF

make -C src build BUILDDIR=/home/heartlab/nccl-tests/build
make[1]: Entering directory '/home/heartlab/nccl-tests/src'
Compiling timer.cc > /home/heartlab/nccl-tests/build/timer.o
Compiling /home/heartlab/nccl-tests/build/verifiable/verifiable.o
Compiling all_reduce.cu > /home/heartlab/nccl-tests/build/all_reduce.o
Compiling common.cu > /home/heartlab/nccl-tests/build/common.o
Linking /home/heartlab/nccl-tests/build/all_reduce.o > /home/heartlab/nccl-tests/build/all_reduce_perf
Compiling all_gather.cu > /home/heartlab/nccl-tests/build/all_gather.o
Linking /home/heartlab/nccl-tests/build/all_gather.o > /home/heartlab/nccl-tests/build/all_gather_perf
Compiling broadcast.cu > /home/heartlab/nccl-tests/build/broadcast.o
Linking /home/heartlab/nccl-tests/build/broadcast.o > /home/heartlab/nccl-tests/build/broadcast_perf
Compiling reduce_scatter.cu > /home/heartlab/nccl-tests/build/reduce_scatter.o
Linking /home/heartlab/nccl-tests/build/reduce_scatter.o > /home/heartlab/nccl-tests/build/reduce_scatter_perf
Compiling reduce.cu > /home/heartlab/nccl-tests/build/reduce.o
Linking /home/heartlab/nccl-tests/build/reduce.o > /home/heartlab/nccl-tests/build/reduce_perf
Compiling alltoall.cu > /home/heartlab/nccl-tests/build/alltoall.o
Linking /home/heartlab/nccl-tests/build/alltoall.o > /home/heartlab/nccl-tests/build/alltoall_perf
Compiling scatter.cu > /home/heartlab/nccl-tests/build/scatter.o
Linking /home/heartlab/nccl-tests/build/scatter.o > /home/heartlab/nccl-tests/build/scatter_perf
Compiling gather.cu > /home/heartlab/nccl-tests/build/gather.o
Linking /home/heartlab/nccl-tests/build/gather.o > /home/heartlab/nccl-tests/build/gather_perf
Compiling sendrecv.cu > /home/heartlab/nccl-tests/build/sendrecv.o
Linking /home/heartlab/nccl-tests/build/sendrecv.o > /home/heartlab/nccl-tests/build/sendrecv_perf
Compiling hypercube.cu > /home/heartlab/nccl-tests/build/hypercube.o
Linking /home/heartlab/nccl-tests/build/hypercube.o > /home/heartlab/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/home/heartlab/nccl-tests/src'
The compile seems to be done.

And I ran this command to test:

mpirun -np 1 ./build/all_reduce_perf -b 8 -e 64M -f 2 -g 2

# nThread 1 nGpus 2 minBytes 8 maxBytes 67108864 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices

Rank 0 Group 0 Pid 137045 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000

Rank 1 Group 0 Pid 137045 on DESKTOP-GGBQPHK device 1 [0xb3] NVIDIA RTX A4000

DESKTOP-GGBQPHK:137045:137045 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
DESKTOP-GGBQPHK:137045:137045 [0] NCCL INFO Bootstrap : Using eth0:172.30.81.89<0>
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO cudaDriverVersion 12040
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO NCCL version 2.22.3+cuda12.0
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NET/Socket : Using [0]eth0:172.30.81.89<0>
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO Using network Socket
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO Using network Socket
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO ncclCommInitRank comm 0x563ca4501b40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 65000 commId 0xc573bf65f65fb713 - Init START
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO ncclCommInitRank comm 0x563ca4540ec0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId b3000 commId 0xc573bf65f65fb713 - Init START
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /dev/null
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /dev/null
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO comm 0x563ca4501b40 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO Channel 00/02 : 0 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO Channel 01/02 : 0 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NCCL_BUFFSIZE set by environment to 1048576.
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO P2P Chunksize set to 131072
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO comm 0x563ca4540ec0 rank 1 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO P2P Chunksize set to 131072

DESKTOP-GGBQPHK:137045:137059 [0] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO init.cc:544 -> 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO init.cc:1156 -> 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO init.cc:1408 -> 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO group.cc:70 -> 1 [Async thread]

DESKTOP-GGBQPHK:137045:137060 [1] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO init.cc:544 -> 1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO init.cc:1156 -> 1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO init.cc:1408 -> 1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO group.cc:70 -> 1 [Async thread]
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO group.cc:420 -> 1
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO group.cc:546 -> 1
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO group.cc:101 -> 1
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO init.cc:1761 -> 1
DESKTOP-GGBQPHK: Test NCCL failure common.cu:990 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. DESKTOP-GGBQPHK pid 137045: Test failure common.cu:876

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[35994,1],0]
Exit code: 3


The same result came out.

Is this a WSL problem? Or is the RTX A4000 just not suited for a multi-GPU training setup?

@AddyLaddy
Collaborator

OK, that looks better, but it's the same CUDA error. I don't know which RTX parts support multi-GPU communications these days, though.
There is also the nvbandwidth tool for checking CUDA P2P transfers.
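A rough sketch of trying nvbandwidth to isolate whether raw CUDA P2P works, independent of NCCL. The repository URL and testcase name are taken from NVIDIA's public nvbandwidth project; the build steps are an assumption, so verify them against its README:

```shell
# Hypothetical walkthrough -- check the nvbandwidth README before running.
NVB_REPO=https://github.com/NVIDIA/nvbandwidth

# Typical CMake-style build (assumption):
# git clone "$NVB_REPO" && cd nvbandwidth
# cmake . && make

# Measure device-to-device copy bandwidth (exercises CUDA P2P directly):
# ./nvbandwidth -t device_to_device_memcpy_read_ce

echo "see $NVB_REPO"
```

If nvbandwidth also fails between the two A4000s, the problem is below NCCL, in the CUDA/WSL2 layer.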

@kiskra-nvidia
Member

NCCL INFO NCCL_P2P_LEVEL set by environment to NVL

The above looks suspect -- as far as I can tell, A4000 does not support NVLink?!

Perhaps the following link is relevant, as it references A4000 and the same error code 999 from cuMemSetAccess: https://forums.developer.nvidia.com/t/rivermax-sdk-example-code-run-failed/255548

Finally, you can probably work around this issue by running with NCCL_CUMEM_ENABLE=0.
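A sketch of applying that workaround. NCCL_CUMEM_ENABLE=0 disables NCCL's cuMem*-based allocator, whose cuMemSetAccess call appears to be what returns error 999 here; the mpirun line is just the earlier test command repeated:

```shell
# Disable NCCL's cuMem allocation path, which relies on cuMemSetAccess --
# the call that seems to hit CUDA error 999 in this WSL2 environment.
export NCCL_CUMEM_ENABLE=0

# Re-run the failing test with the variable forwarded to the ranks, e.g.:
# mpirun -np 1 -x NCCL_CUMEM_ENABLE ./build/all_reduce_perf -b 8 -e 64M -f 2 -g 2

echo "NCCL_CUMEM_ENABLE=$NCCL_CUMEM_ENABLE"
```

With the variable set, NCCL falls back to plain cudaMalloc-style allocation, avoiding the failing driver call entirely.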
