btl_openib_component.c:3556:handle_wc #4529
Comments
@jsquyres I don't have a simple reproducer, but I can try to produce one. It seems this could well be a race condition.

I vaguely remember fixing this. I don't know if I upstreamed the fix, though... Will take a look when I get back to the office on Friday.

We just got a relevant system installed. Will try to look at it today.

@lcebaman Could you try testing against one of the 3.0.1 release candidates? There were fixes for multi-threaded RMA in that release.

It seems the issue can also be reproduced with Intel/OPA hardware: openmpi3 segfaults, while the latest openmpi4 does not segfault but hangs.

The openib BTL has been removed. Closing.
Running Open MPI 3.0.0 with RMA (MPI_THREAD_MULTIPLE), I get:

[[27217,0],12][btl_openib_component.c:3556:handle_wc] from nvb27 to: nvb27 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 19059c8 opcode 4 vendor error 136 qp_idx 3

I've noticed that this happens when the number of MPI processes per node is >= 4. Here is some more info that may (or may not) be related to this issue: