
MTT: several intercomm tests periodically hang in call to PMIx_Connect #8958

Closed

hppritcha opened this issue May 13, 2021 · 12 comments

@hppritcha
Member

In triaging some outstanding problems I've been seeing for a while now in MTT on the 5.0.x and master branches, I've noticed on several platforms that tests in the ibm/collective/intercomm suite periodically time out and are marked as failed.

It appears that these tests periodically hang in the MPI_Comm_accept/connect phase, with tracebacks like:

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00002b104e30e193 in PMIx_Connect () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libpmix.so.0
#2  0x00002b104baa42b1 in ompi_dpm_connect_accept () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libmpi.so.0
#3  0x00002b104bae4a32 in PMPI_Comm_accept () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libmpi.so.0
#4  0x000000000040159b in main (argc=1, argv=0x7ffc80db1b08) at iscatterv_inter.c:67

I'm not seeing these failures on the 4.0.x and 4.1.x branches, but I haven't tried to reproduce with them.

Note that it can take quite a few back-to-back runs before one of these tests hangs; iscatterv_inter seems to be one of the more reliable reproducers.
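
For reference, these tests exercise the standard dynamic-process pattern; a rough sketch of that pattern (not the actual ibm/collective/intercomm source, with the port exchange simplified to a broadcast) looks like this:

```c
/* Rough sketch of the accept/connect pattern these tests exercise
 * (not the actual ibm/collective/intercomm source): the job splits
 * MPI_COMM_WORLD in half, one half accepts while the other connects,
 * and the port string is exchanged over the original communicator.
 * Assumes at least two ranks. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm half, inter;
    char port[MPI_MAX_PORT_NAME] = "";
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int accept_side = (rank < size / 2);
    MPI_Comm_split(MPI_COMM_WORLD, accept_side, rank, &half);

    if (accept_side) {
        if (0 == rank) {
            MPI_Open_port(MPI_INFO_NULL, port);
        }
        MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);
        /* ends up in ompi_dpm_connect_accept -> PMIx_Connect */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, half, &inter);
        if (0 == rank) {
            MPI_Close_port(port);
        }
    } else {
        MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);
        /* also ends up in PMIx_Connect, per the tracebacks above */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, half, &inter);
    }

    /* ... intercommunicator collectives (e.g. MPI_Iscatterv) run here ... */

    MPI_Comm_disconnect(&inter);
    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}
```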

@abouteiller
Member

I have observed the same issue when using COMM_SPAWN

hppritcha added a commit to hppritcha/ompi that referenced this issue May 13, 2021
Add an MCA parameter that can be used to set a timeout on the PMIx_Connect
operation used to support MPI_Comm_accept/connect and relatives.

Related to open-mpi#8958

Signed-off-by: Howard Pritchard <[email protected]>
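
For reference, PMIx already exposes a standard PMIX_TIMEOUT info key that a caller can attach to PMIx_Connect; a minimal sketch of that mechanism is below (the helper name is hypothetical, and this only illustrates how such an MCA parameter could be plumbed through, not the commit itself):

```c
/* Sketch only: PMIx defines a PMIX_TIMEOUT info key that can be passed
 * to PMIx_Connect so the call returns PMIX_ERR_TIMEOUT instead of
 * blocking forever. The helper below (hypothetical name) shows how an
 * MCA-controlled timeout could be plumbed through. */
#include <pmix.h>

static pmix_status_t connect_with_timeout(const pmix_proc_t procs[],
                                          size_t nprocs, int timeout_sec)
{
    pmix_info_t info;
    pmix_status_t rc;

    /* PMIX_TIMEOUT takes the number of seconds to wait */
    PMIX_INFO_LOAD(&info, PMIX_TIMEOUT, &timeout_sec, PMIX_INT);

    rc = PMIx_Connect(procs, nprocs, &info, 1);

    PMIX_INFO_DESTRUCT(&info);
    return rc;
}
```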
@hppritcha hppritcha added the RTE Issue likely is in RTE or PMIx areas label May 18, 2021
@rhc54
Contributor

rhc54 commented May 21, 2021

I have dug into this a bit over in PRRTE - please see openpmix/prrte#964 (comment) for an explanation and suggested fixes. I don't know if that's the issue here or not, but I offer it for reference. I can find no problem in the underlying PMIx_Connect function.

@hppritcha hppritcha removed the RTE Issue likely is in RTE or PMIx areas label May 21, 2021
@hppritcha
Member Author

Thanks @rhc54! I'll investigate the dpm.c abuse of PMIx_Spawn.

@hppritcha hppritcha self-assigned this May 25, 2021
@hppritcha
Member Author

@bosilca do you think the spawn problems you raised today could be related to this issue and the cited prrte issue?

@bosilca
Member

bosilca commented Nov 17, 2021

The stack looks very much like the one I was seeing, so it is plausible that the two are related. However, I don't follow @rhc54's explanation in openpmix/prrte#964 (comment), because:

  • some executions complete successfully; the hangs are sporadic
  • I do not see any extra or lingering processes beyond the exact number of processes that the RTE was asked to create.

awlauria pushed a commit to awlauria/ompi that referenced this issue Dec 6, 2021
Add an MCA parameter that can be used to set a timeout on the PMIx_Connect
operation used to support MPI_Comm_accept/connect and relatives.

Related to open-mpi#8958

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit 038291a)
@jjhursey
Member

jjhursey commented May 13, 2022

I'm also seeing this type of hang in our MTT now that we are running multi-node with both main and (more often) v5.0.x. With 3 nodes, running the following in a loop will eventually hang (after about 10 iterations):

mpirun --hostfile /opt/mpi/etc/hostfile --map-by ppr:2:node  --mca pml ucx --mca osc ucx,sm --mca btl ^openib  collective/intercomm/allgather_inter

All processes (6 in total) are stuck with the same stack:

Thread 3 (Thread 0x7fffadfdef60 (LWP 597)):
#0  0x00007fffb093343c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fffb0611a90 in ucs_event_set_wait (event_set=0x1001fe2f0c0, num_events=0x7fffadfde510, timeout_ms=<optimized out>, event_set_handler=0x7fffb05eac80 <ucs_async_thread_ev_handler>, arg=<optimized out>) at sys/event_set.c:198
#2  0x00007fffb05eb088 in ucs_async_thread_func (arg=0x1001fe22260) at async/thread.c:130
#3  0x00007fffb0a28878 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fffb0932f68 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fffaf61ef60 (LWP 593)):
#0  0x00007fffb093343c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fffafe59328 in epoll_dispatch (base=0x1001fd96c00, tv=<optimized out>) at epoll.c:465
#2  0x00007fffafe4a660 in event_base_loop (base=0x1001fd96c00, flags=<optimized out>) at event.c:1992
#3  0x00007fffafadb1e4 in progress_engine (obj=0x1001fd96ad8) at runtime/pmix_progress_threads.c:228
#4  0x00007fffb0a28878 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fffb0932f68 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fffb109df40 (LWP 592)):
#0  0x00007fffb0a3174c in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x00007fffafa55400 in PMIx_Connect (procs=0x1002001b930, nprocs=6, info=0x7fffd9bb1400, ninfo=1) at client/pmix_client_connect.c:99
#2  0x00007fffb0b22990 in ompi_dpm_connect_accept (comm=0x10020005330, root=0, port_string=0x1001fffd800 "2935947265.0:3138776032", send_first=true, newcomm=0x7fffd9bb1ea0) at dpm/dpm.c:381
#3  0x00007fffb0b9e3b0 in PMPI_Comm_connect (port_name=0x1001fffd800 "2935947265.0:3138776032", info=0x7fffb0ffde40 <ompi_mpi_info_null>, root=0, comm=0x10020005330, newcomm=0x7fffd9bb1f48) at comm_connect.c:116
#4  0x0000000010001764 in main (argc=1, argv=0x7fffd9bb2378) at allgather_inter.c:70

The hang occurs where the test calls MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, comm, &intercomm);
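
For what it's worth, the thread-1 frame is consistent with the blocking client call waiting on a condition variable until the server responds. A generic sketch of that wait-for-callback pattern is below (an illustration only, not the actual pmix_client_connect.c code):

```c
/* Generic illustration (not the actual pmix_client_connect.c code) of
 * the blocking pattern visible in thread 1: the caller posts the request
 * to the progress thread and then waits on a condition variable until
 * the server's response arrives. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            active;   /* true while the request is outstanding */
    int             status;
} wait_tracker_t;

/* called from the progress thread when the server replies */
static void request_complete(wait_tracker_t *t, int status)
{
    pthread_mutex_lock(&t->lock);
    t->status = status;
    t->active = false;
    pthread_cond_signal(&t->cond);
    pthread_mutex_unlock(&t->lock);
}

/* what the blocking call effectively does after sending the request;
 * if the server-side connect never completes, this waits forever,
 * which is the pthread_cond_wait frame seen in the backtrace */
static int wait_for_reply(wait_tracker_t *t)
{
    pthread_mutex_lock(&t->lock);
    while (t->active) {
        pthread_cond_wait(&t->cond, &t->lock);
    }
    pthread_mutex_unlock(&t->lock);
    return t->status;
}
```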

@jjhursey
Member

Ref #10318

@wzamazon
Contributor

Should be fixed by openpmix/prrte#1381.

The issue is that prrte performs a grpcomm.allgather among all of its servers when handling PMIx_Connect.

The allgather uses a signature, which is generated from the procs array passed by the client when it calls PMIx_Connect.

procs is an array of pmix_proc_t; each pmix_proc_t identifies one process (namespace and rank).

For inter-communicator operations, the order of the processes in procs differs between ranks: each rank puts the processes of its own communicator into procs first, followed by the processes of the other communicator.

As a result, prrte ends up calling grpcomm.allgather with mismatched signatures.

openpmix/prrte#1381 fixes the issue by sorting procs before using it as the signature.

I consider this a prrte bug, not an OMPI bug, because the PMIx_Connect documentation does not say that clients must call it with procs in the same order.
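
To illustrate the idea behind the fix (a sketch only, not the literal openpmix/prrte#1381 patch, and the helper names are hypothetical): canonically ordering the pmix_proc_t array before deriving the signature makes every daemon compute the same signature regardless of the order each rank passed:

```c
/* Sketch of the idea behind the fix (not the literal prrte patch):
 * canonically order the pmix_proc_t array before deriving the allgather
 * signature from it, so every daemon computes the same signature no
 * matter what order each rank passed to PMIx_Connect. */
#include <pmix.h>
#include <stdlib.h>
#include <string.h>

static int proc_cmp(const void *a, const void *b)
{
    const pmix_proc_t *pa = (const pmix_proc_t *) a;
    const pmix_proc_t *pb = (const pmix_proc_t *) b;
    int rc = strncmp(pa->nspace, pb->nspace, PMIX_MAX_NSLEN);
    if (0 != rc) {
        return rc;
    }
    return (pa->rank > pb->rank) - (pa->rank < pb->rank);
}

/* sort by (namespace, rank) before using procs as the signature */
static void canonicalize_procs(pmix_proc_t *procs, size_t nprocs)
{
    qsort(procs, nprocs, sizeof(pmix_proc_t), proc_cmp);
}
```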

@wzamazon
Contributor

> because the PMIx_Connect documentation does not say that clients must call it with procs in the same order.

It looks like I was wrong: the PMIx document does require that the order be maintained when calling PMIx_Connect, so OMPI needs to change.

The PR for ompi is #10557.

@wzamazon
Contributor

wzamazon commented Jul 14, 2022

#10557 has been merged to main and fixes the hang in the IBM CI (link).

#10564 back-ports it to v5.0.x and is waiting to be merged.

This is not required for the 4.x branches, because orted and prted implement pmix_server_connect_fn differently: prted uses grpcomm.allgather, while orted does not.

@wzamazon
Contributor

The back-port was also merged; MTT is no longer impacted by this issue. Closing.
