Test CUDA failure common.cu:941 'invalid device ordinal' when testing two nodes with nvhpc #263
Comments
Normally we use |
@AddyLaddy Thanks!
I found an issue that may help solve the problem: NVIDIA/nccl#928.
And I got a new error:
I also tried to add |
Yeah, the first NET/IB error looks like the typical one when ACS is not disabled and GDRDMA is used. I'd suggest resolving the ACS issue and also using the perftests suite to check that each node can communicate successfully over the NET/IB devices using something like
Also be careful with |
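For anyone wanting to script the ACS check mentioned above, here is a rough sketch. It assumes you have saved the output of `sudo lspci -vvv` to a string, and that ACS being active shows up as `SrcValid+` on an `ACSCtl:` line. The function name and the sample text are made up for illustration; they are not output from the nodes in this issue.

```python
def acs_enabled_lines(lspci_text):
    """Return the ACSCtl lines that report SrcValid+ (ACS appears enabled)."""
    hits = []
    for line in lspci_text.splitlines():
        stripped = line.strip()
        # Only the control register matters here, not the ACSCap capability line.
        if stripped.startswith("ACSCtl:") and "SrcValid+" in stripped:
            hits.append(stripped)
    return hits


# Hypothetical sample, NOT taken from the machines discussed in this thread.
SAMPLE = (
    "\t\tACSCap:\tSrcValid+ TransBlk+ ReqRedir+ CmpltRedir+\n"
    "\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ CmpltRedir+\n"
    "\t\tACSCtl:\tSrcValid- TransBlk- ReqRedir- CmpltRedir-\n"
)

if __name__ == "__main__":
    for hit in acs_enabled_lines(SAMPLE):
        print("ACS looks enabled on a bridge:", hit)
```

An empty result on every node would suggest ACS is off on the PCI bridges; any hit means it is probably still worth disabling before retrying with GDRDMA.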
I compiled nccl-tests with the command:
Then I ran the following command to test all_reduce_perf:
And I got the error:
NOTE: When I use openmpi directly instead of nvhpc, the test runs successfully.
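One common cause of 'invalid device ordinal' is a process asking for a GPU index that is greater than or equal to the number of visible devices, for example when the launcher places more ranks on a node than that node has GPUs. Whether that is what happened here is not established by this thread. The sketch below only illustrates the idea, assuming each rank picks the GPU matching its local rank on its host. The host names and GPU counts are hypothetical.

```python
from collections import defaultdict


def local_ranks(hostnames):
    """Given the host of each global rank, return each rank's index on its host."""
    seen = defaultdict(int)
    result = []
    for host in hostnames:
        result.append(seen[host])
        seen[host] += 1
    return result


def invalid_ordinals(hostnames, gpus_per_node):
    """List (global_rank, local_rank) pairs whose local rank has no matching GPU."""
    return [
        (rank, lr)
        for rank, lr in enumerate(local_ranks(hostnames))
        if lr >= gpus_per_node
    ]


if __name__ == "__main__":
    # Hypothetical placements: 4 ranks, 2 GPUs per node.
    spread = ["node1", "node1", "node2", "node2"]   # two ranks per node
    packed = ["node1", "node1", "node1", "node1"]   # all ranks on one node
    print(invalid_ordinals(spread, 2))
    print(invalid_ordinals(packed, 2))
```

If the two MPI launchers place ranks differently across the nodes, comparing the rank-to-host mapping each one produces against the GPU count per node is a quick way to see whether this applies.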