MTT: several intercomm tests periodically hang in call to PMIx_Connect #8958
Comments
I have observed the same issue when using COMM_SPAWN.
Add an MCA parameter that can be used to set a timeout on the PMIx_Connect operation used to support MPI_Comm_accept/connect and relatives. Related to open-mpi#8958 Signed-off-by: Howard Pritchard <[email protected]>
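For context, PMIx already defines a standard `PMIX_TIMEOUT` attribute that can be passed to `PMIx_Connect`. Below is a minimal sketch of the mechanism such an MCA parameter would drive; the helper name and the way the timeout value is obtained are illustrative, not the actual PR code:

```c
/* Sketch only: attach a timeout to PMIx_Connect via the standard
 * PMIX_TIMEOUT attribute.  The procs[] array is assumed to already
 * hold the participants (as dpm.c builds it); timeout_sec stands in
 * for whatever value the new MCA parameter would supply. */
#include <pmix.h>

static pmix_status_t connect_with_timeout(const pmix_proc_t *procs,
                                          size_t nprocs,
                                          int timeout_sec)
{
    pmix_info_t info;
    pmix_status_t rc;

    PMIX_INFO_CONSTRUCT(&info);
    PMIX_INFO_LOAD(&info, PMIX_TIMEOUT, &timeout_sec, PMIX_INT);

    /* Blocks until all participants have connected, or returns an
     * error once the timeout expires instead of hanging forever. */
    rc = PMIx_Connect(procs, nprocs, &info, 1);

    PMIX_INFO_DESTRUCT(&info);
    return rc;
}
```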
I have dug into this a bit over in PRRTE - please see openpmix/prrte#964 (comment) for an explanation and suggested fixes. I don't know if that's the issue here or not, but I offer it for reference. I can find no problem in the underlying
Thanks @rhc54! I'll investigate the dpm.c abuse of PMIx_Spawn.
@bosilca do you think the spawn problems you raised today could be related to this issue and the cited prrte issue?
The stack looks very much like the one I was seeing, so it is plausible the two are related. However, I don't follow @rhc54's explanation in openpmix/prrte#964 (comment), because:
Add an MCA parameter that can be used to set a timeout on the PMIx_Connect operation used to support MPI_Comm_accept/connect and relatives. Related to open-mpi#8958 Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit 038291a)
I'm also seeing this type of hang in our MTT now that we are running multi-node with both
All processes (6 in total) are stuck with the same stack:
The hang is here in the test calling
Ref #10318
Should be fixed by openpmix/prrte#1381. The issue is that when prrte performs the allgather it uses a signature, and that signature is generated from the list of processes passed to PMIx_Connect.

For inter-communicators, the order of those processes differs between the participating processes: each rank always puts the processes of its own (local) communicator first. As a result, the two sides hand prrte different signatures, the allgathers never match, and the job hangs. openpmix/prrte#1381 fixed the issue by sorting the process list before generating the signature. I consider this a prrte bug, not an OMPI bug, because nothing obliges OMPI to pass the procs in any particular order.
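A rough illustration of the canonical-ordering idea (this is not the actual prrte or OMPI patch; the helper names are made up): both sides pass the same *set* of procs to PMIx_Connect, just in different orders, so imposing any deterministic order makes the derived signatures identical.

```c
/* Sketch: group A ranks pass [A0, A1, B0, B1] while group B ranks pass
 * [B0, B1, A0, A1].  Sorting by nspace, then rank, collapses both
 * orderings to the same canonical list, so a signature computed from
 * the sorted list matches on every participant. */
#include <stdlib.h>
#include <string.h>
#include <pmix.h>

static int proc_cmp(const void *a, const void *b)
{
    const pmix_proc_t *pa = (const pmix_proc_t *)a;
    const pmix_proc_t *pb = (const pmix_proc_t *)b;
    int c = strncmp(pa->nspace, pb->nspace, PMIX_MAX_NSLEN);
    if (0 != c) {
        return c;
    }
    return (pa->rank > pb->rank) - (pa->rank < pb->rank);
}

static void canonicalize_procs(pmix_proc_t *procs, size_t nprocs)
{
    qsort(procs, nprocs, sizeof(pmix_proc_t), proc_cmp);
}
```

Sorting by nspace and then rank is just one possible canonical order; all that matters is that every participant produces the same sequence before the signature is computed.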
It looks like I was wrong. The PMIx document does require that the order be maintained when calling PMIx_Connect, so OMPI needs to be changed. The PR to ompi is:
The backport was also merged, and MTT is no longer impacted by this issue. Closing.
In triaging some outstanding problems I've been seeing for a while now in MTT on the 5.0.x and master branches, I've noticed on several platforms that tests in ibm/collective/intercomm periodically time out and are marked as failed.
It appears that periodically these tests hang in the MPI_Comm_accept/connect phase with tracebacks like
I'm not seeing these failures on the 4.0.x and 4.1.x branches, but I haven't tried to reproduce with them.
Note that it can take quite a few back-to-back runs before one of these tests hangs; iscatterv_inter seems to be one of the more reliable reproducers.
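For reference, a minimal sketch of the accept/connect pattern that exercises this path (not the actual ibm/collective/intercomm test source): both MPI_Comm_accept and MPI_Comm_connect funnel into PMIx_Connect, which is where the reported hang sits.

```c
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    char port[MPI_MAX_PORT_NAME];

    MPI_Init(&argc, &argv);

    if (argc > 2 && 0 == strcmp(argv[1], "client")) {
        /* the port name is obtained out of band, e.g. printed by the server */
        MPI_Comm_connect(argv[2], MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
    } else {
        MPI_Open_port(MPI_INFO_NULL, port);
        /* this call (and the matching MPI_Comm_connect on the other side)
         * is where the hang is reported: the underlying PMIx_Connect
         * never completes */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
        MPI_Close_port(port);
    }

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```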