What is the common.cu:990 error, and how do I solve it? #230
Comments
It looks like you're running a single-process test twice, and both processes are using the same device. You need to compile nccl-tests with MPI=1 for this to work.
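One way to spot the symptom described above is to scan the nccl-tests banner lines for a device that is claimed by more than one process. The following is only a minimal sketch: the function name `devices_shared_by_processes` and the regular expression are illustrative assumptions based on the "Rank ... Pid ... device ..." lines in the logs of this thread, not part of nccl-tests.

```python
import re
from collections import defaultdict

# Matches nccl-tests banner lines of the form seen in this thread, e.g.
#   Rank 0 Group 0 Pid 64755 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000
# (the pattern is an assumption derived from those lines)
RANK_LINE = re.compile(
    r"Rank\s+(\d+)\s+Group\s+\d+\s+Pid\s+(\d+)\s+on\s+(\S+)\s+device\s+(\d+)"
)

def devices_shared_by_processes(log_text):
    """Return {(host, device): sorted pids} for devices claimed by more than one PID."""
    owners = defaultdict(set)
    for match in RANK_LINE.finditer(log_text):
        _rank, pid, host, device = match.groups()
        owners[(host, int(device))].add(int(pid))
    return {key: sorted(pids) for key, pids in owners.items() if len(pids) > 1}

sample = """\
Rank 0 Group 0 Pid 64755 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000
Rank 0 Group 0 Pid 64756 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000
"""
print(devices_shared_by_processes(sample))
# -> {('DESKTOP-GGBQPHK', 0): [64755, 64756]}
```

Two different PIDs that both report "Rank 0" on device 0 indicate two independent single-process runs rather than one two-rank job.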
@AddyLaddy Thank you for the reply! So you mean that instead of 'make' I should use 'make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl'? That command does not work on my system. To state my result clearly: I installed MPI and set export PATH=/usr/bin:$PATH in my bashrc file, but only the plain 'make' command works. What is going wrong in my case?
@AddyLaddy I built with:
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/ CUDA_HOME=/usr/local/cuda NCCL_HOME=/home/heartlab/anaconda3/envs/TF
and ran this command as a test:
mpirun -np 1 ./build/all_reduce_perf -b 8 -e 64M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 67108864 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Group 0 Pid 137045 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000
Rank 1 Group 0 Pid 137045 on DESKTOP-GGBQPHK device 1 [0xb3] NVIDIA RTX A4000
DESKTOP-GGBQPHK:137045:137045 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
DESKTOP-GGBQPHK:137045:137059 [0] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:137045:137060 [1] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
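As a side note on the size flags in that command, the banner shows that -e 64M is read as 67108864 bytes and that the step is a factor of 2. The sketch below reproduces that arithmetic. It assumes the sweep starts at the -b value and multiplies by the -f factor until the -e value is exceeded; the helper names `parse_size` and `sweep_sizes` are hypothetical and are not nccl-tests functions.

```python
def parse_size(text):
    """Convert a size such as '8' or '64M' to bytes, using 1024-based suffixes."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)

def sweep_sizes(min_bytes, max_bytes, factor):
    """List the message sizes visited when multiplying by `factor` each step (assumed behavior)."""
    if factor < 2:
        raise ValueError("factor must be at least 2 for a multiplicative sweep")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes

sizes = sweep_sizes(parse_size("8"), parse_size("64M"), 2)
print(len(sizes), sizes[0], sizes[-1])
# -> 24 8 67108864
```

The run above fails during communicator initialization, so none of these sizes is ever measured.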
OK, that looks better, but it is still the same CUDA error. I don't know which RTX parts now support multi-GPU communications.
The above looks suspect -- as far as I can tell, the A4000 does not support NVLink?! Perhaps the following link is relevant, as it references the A4000 and the same error code 999 from
Finally, you can probably work around this issue by running with
Thank you for looking into this problem.
My workstation spec is:
2x RTX A4000
WSL2 Ubuntu-22.04
cuDNN 8.9
(base) heartlab@DESKTOP-GGBQPHK:~/nccl-tests$ nvidia-smi
Fri Jun 28 05:15:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 551.61 CUDA Version: 12.4 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A4000 On | 00000000:65:00.0 Off | Off |
| 41% 37C P8 6W / 140W | 17MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A4000 On | 00000000:B3:00.0 On | Off |
| 41% 37C P8 7W / 140W | 571MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 31 G /Xwayland N/A |
| 1 N/A N/A 31 G /Xwayland N/A |
+---------------------------------------------------------------------------------------+
After the make command finished, I ran this command:
(base) heartlab@DESKTOP-GGBQPHK:~/nccl-tests$ mpirun -np 2 --allow-run-as-root -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Group 0 Pid 64755 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000
Rank 0 Group 0 Pid 64756 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO Bootstrap : Using eth0:172.30.81.89<0>
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO Bootstrap : Using eth0:172.30.81.89<0>
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO cudaDriverVersion 12040
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO NCCL version 2.22.3+cuda12.0
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO cudaDriverVersion 12040
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO NCCL version 2.22.3+cuda12.0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO NET/IB : No device found.
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO NET/Socket : Using [0]eth0:172.30.81.89<0> [1]veth9d2d103:fe80::b051:e3ff:febe:607b%veth9d2d103<0>
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Using network Socket
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO NET/IB : No device found.
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO NET/Socket : Using [0]eth0:172.30.81.89<0> [1]veth9d2d103:fe80::b051:e3ff:febe:607b%veth9d2d103<0>
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Using network Socket
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO ncclCommInitRank comm 0x5567a8af3a00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 65000 commId 0x1e41f00635db9132 - Init START
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO comm 0x5567a8af3a00 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 00/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 01/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 02/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 03/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 04/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 05/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 06/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 07/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 08/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 09/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 10/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 11/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 12/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 13/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 14/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 15/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 16/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 17/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 18/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 19/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 20/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 21/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 22/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 23/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 24/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 25/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 26/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 27/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 28/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 29/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 30/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 31/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO P2P Chunksize set to 131072
DESKTOP-GGBQPHK:64756:64771 [0] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO init.cc:544 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO init.cc:1156 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO init.cc:1408 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO group.cc:70 -> 1 [Async thread]
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO group.cc:420 -> 1
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO group.cc:546 -> 1
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO group.cc:101 -> 1
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO init.cc:1761 -> 1
DESKTOP-GGBQPHK: Test NCCL failure common.cu:990 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. DESKTOP-GGBQPHK pid 64756: Test failure common.cu:876
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO ncclCommInitRank comm 0x5567734f9a00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 65000 commId 0xd904a9f238296abf - Init START
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO comm 0x5567734f9a00 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 00/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 01/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 02/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 03/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 04/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 05/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 06/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 07/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 08/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 09/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 10/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 11/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 12/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 13/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 14/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 15/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 16/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 17/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 18/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 19/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 20/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 21/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 22/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 23/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 24/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 25/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 26/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 27/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 28/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 29/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 30/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 31/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO P2P Chunksize set to 131072
DESKTOP-GGBQPHK:64755:64773 [0] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO init.cc:544 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO init.cc:1156 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO init.cc:1408 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO group.cc:70 -> 1 [Async thread]
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO group.cc:420 -> 1
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO group.cc:546 -> 1
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO group.cc:101 -> 1
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO init.cc:1761 -> 1
DESKTOP-GGBQPHK: Test NCCL failure common.cu:990 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. DESKTOP-GGBQPHK pid 64755: Test failure common.cu:876
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[26406,1],1]
Exit code: 3
What is causing this problem? I just want to use TensorFlow with multiple GPUs to reduce VRAM pressure.
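For anyone reading a log like the one above: most of the lines are routine INFO output. The useful part is usually the first NCCL WARN line for each process, followed by the "file:line -> 1" INFO lines, which appear to be the error propagating back up the call chain. The sketch below pulls out just those lines per PID. The function name `failure_chains` and both patterns are assumptions derived from the log format shown in this post, not an NCCL tool.

```python
import re

# "<host>:<pid>:<tid> [<dev>] <file>:<line> NCCL WARN <message>"
WARN = re.compile(r"^(\S+?):(\d+):\d+ \[\d+\] (\S+:\d+) NCCL WARN (.*)$")
# "<host>:<pid>:<tid> [<dev>] NCCL INFO <file>:<line> -> <code>"
TRACE = re.compile(r"^(\S+?):(\d+):\d+ \[\d+\] NCCL INFO (\S+:\d+) -> (\d+)")

def failure_chains(log_text):
    """Map each PID to its first WARN message and the call-chain locations that follow it."""
    chains = {}
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        warn = WARN.match(line)
        if warn:
            _host, pid, where, message = warn.groups()
            # Keep only the first warning per process; later ones are often consequences.
            chains.setdefault(int(pid), {"warn": f"{where}: {message}", "trace": []})
            continue
        trace = TRACE.match(line)
        if trace:
            _host, pid, where, _code = trace.groups()
            if int(pid) in chains:
                chains[int(pid)]["trace"].append(where)
    return chains

sample_log = """\
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO P2P Chunksize set to 131072
DESKTOP-GGBQPHK:64756:64771 [0] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO init.cc:544 -> 1
"""
for pid, info in failure_chains(sample_log).items():
    print(pid, info["warn"], "<-", " <- ".join(info["trace"]))
```

In this log both processes fail at the same place, include/alloc.h:123, while the communicator is being initialized.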