-
Notifications
You must be signed in to change notification settings - Fork 250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2 Node Nccl Test don’t work #236
Comments
Are you saying that it works for small message sizes (8B-128B) but hangs for larger ones (128B-1GB)? That could very well be; NCCL may choose different algorithm/protocol combinations depending on the message size, and some of them might be working on your systems while others fail. We'll need a lot more info to diagnose this. In particular, complete outputs of runs with |
@kiskra-nvidia |
Huh... It appears to hang during the warmup iterations for buffer size 1GB (if I understand correctly, that happens for any buffer size above 128B?). Did you verify that the IB network is fully operational between the nodes (using IB-specific benchmarks, not NCCL)? |
@kiskra-nvidia |
I'm out of ideas then. @sjeaugey, @AddyLaddy, any idea why a run with 128B buffer limit would succeed but larger (1GB) runs hung (during warmup)? NCCL appears to choose tree/LL up to 128B, tree/SIMPLE for 1GB. |
@SdEnd What do you configuare about the IB network |
@jeffreyyjp |
I have two servers, Dell and FusionServer, nccl-test don't work ,but if all servers is same model,the ncct-test can work
my environment
when run command, after wait a hour and no response (different server)
then terminal show :
But when run ,it can work
why is this a problem?
Could it be because I'm using a different model of server?
The text was updated successfully, but these errors were encountered: