2-Node NCCL Test Doesn't Work #236

Open
SdEnd opened this issue Jul 21, 2024 · 7 comments
Comments

@SdEnd

SdEnd commented Jul 21, 2024

I have two servers, a Dell and a FusionServer. nccl-tests doesn't work between them, but if all servers are the same model, nccl-tests works fine.

my environment

os: ubuntu 22.04
cuda: 12.4
NV driver: 550

When I run the following command across the two different servers, there is no response even after waiting an hour:

mpirun  --allow-run-as-root -n 16 -N 8 --hostfile host  -x NCCL_DEBUG=INFO   /root/nccl-tests/build/all_reduce_perf -b 128M -e 1g -f 2 -g 1

Then the terminal shows:
[screenshot]

But when I run this, it works:

mpirun  --allow-run-as-root -n 16 -N 8 --hostfile host  -x NCCL_DEBUG=INFO   /root/nccl-tests/build/all_reduce_perf -b 8 -e 128 -f 2 -g 1

[screenshot]

Why does this happen?
Could it be because I'm using different server models?

@kiskra-nvidia
Member

Are you saying that it works for small message sizes (8B-128B) but hangs for larger ones (128B-1GB)? That could very well be; NCCL may choose different algorithm/protocol combinations depending on the message size, and some of them might be working on your systems while others fail.

We'll need a lot more info to diagnose this. In particular, complete outputs of runs with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING (TUNING in particular should show us the algorithm/protocol that NCCL is using, so we should see what works and what does not). The output of nvidia-smi topo -m from both server node types would also be helpful. Finally, how's the interconnect between these servers? Are the NICs uniform across the different server types? Are all the NICs wired and can the servers communicate with each other using each NIC pair?
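For reference, a debug run along the lines suggested above could look like this (a sketch reusing the hostfile and binary path from the command earlier in this issue; run nvidia-smi topo -m separately on one server of each type):

mpirun --allow-run-as-root -n 16 -N 8 --hostfile host -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING /root/nccl-tests/build/all_reduce_perf -b 128M -e 1g -f 2 -g 1

nvidia-smi topo -m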

@SdEnd
Author

SdEnd commented Jul 22, 2024


@kiskra-nvidia
thanks, here are the logs:
8-1G.txt
8-128.txt

@kiskra-nvidia
Member

Huh... It appears to hang during the warmup iterations for buffer size 1GB (if I understand correctly, that happens for any buffer size above 128B?).

Did you verify that the IB network is fully operational between the nodes (using IB-specific benchmarks, not NCCL)?
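For example, a point-to-point bandwidth check between one node of each type could look like this (using ib_write_bw from the perftest package; the device name and hostname below are placeholders, substitute your own):

ib_write_bw -d mlx5_0                        # on the first node (server side)
ib_write_bw -d mlx5_0 <other-node-hostname>  # on the second node (client side)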

@SdEnd
Author

SdEnd commented Jul 24, 2024


@kiskra-nvidia
yes, I verified the IB network; all tests pass

@kiskra-nvidia
Member

I'm out of ideas then. @sjeaugey, @AddyLaddy, any idea why a run with a 128B buffer limit would succeed but a larger (1GB) run hangs (during warmup)? NCCL appears to choose tree/LL up to 128B and tree/SIMPLE for 1GB.
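One thing that might help narrow it down (not something tried above, just using NCCL's standard NCCL_ALGO/NCCL_PROTO environment variables) would be to pin the algorithm/protocol explicitly and see which combination hangs, e.g. force tree/SIMPLE even for the small sizes that currently pass:

mpirun --allow-run-as-root -n 16 -N 8 --hostfile host -x NCCL_DEBUG=INFO -x NCCL_ALGO=Tree -x NCCL_PROTO=Simple /root/nccl-tests/build/all_reduce_perf -b 8 -e 128 -f 2 -g 1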

@jeffreyyjp

@SdEnd How is your IB network configured?

@SdEnd
Author

SdEnd commented Jul 26, 2024


@jeffreyyjp
my cluster has 128 nodes, 8 GPUs per node, using a spine-leaf IB network
