
MTT: several intercomm tests periodically hang in call to PMIx_Connect #8958

Closed

hppritcha opened this issue May 13, 2021 · 12 comments

@hppritcha
Member

In triaging some outstanding problems I've been seeing for a while now in MTT on the 5.0.x and master branches, I've noticed on several platforms that tests in the ibm/collective/intercomm suite periodically time out and are marked as failed.

It appears that these tests periodically hang in the MPI_Comm_accept/connect phase, with tracebacks like:

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00002b104e30e193 in PMIx_Connect () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libpmix.so.0
#2  0x00002b104baa42b1 in ompi_dpm_connect_accept () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libmpi.so.0
#3  0x00002b104bae4a32 in PMPI_Comm_accept () from /usr/projects/artab/users/hpp/ompi/install_v5.0.x/lib/libmpi.so.0
#4  0x000000000040159b in main (argc=1, argv=0x7ffc80db1b08) at iscatterv_inter.c:67

I'm not seeing these failures on the 4.0.x and 4.1.x branches, but I haven't tried to reproduce with them.

Note that it can take quite a few back-to-back runs before one of these tests hangs; iscatterv_inter seems to be one of the more reliable reproducers.
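
For reference, these tests exercise the standard dynamic-process pattern; a rough sketch of that pattern (not the actual ibm/collective/intercomm source, with the port exchange simplified to a broadcast) looks like this:

```c
/* Rough sketch of the accept/connect pattern these tests exercise
 * (not the actual ibm/collective/intercomm source): the job splits
 * MPI_COMM_WORLD in half, one half accepts while the other connects,
 * and the port string is exchanged over the original communicator.
 * Assumes at least two ranks. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm half, inter;
    char port[MPI_MAX_PORT_NAME] = "";
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int accept_side = (rank < size / 2);
    MPI_Comm_split(MPI_COMM_WORLD, accept_side, rank, &half);

    if (accept_side) {
        if (0 == rank) {
            MPI_Open_port(MPI_INFO_NULL, port);
        }
        MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);
        /* ends up in ompi_dpm_connect_accept -> PMIx_Connect */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, half, &inter);
        if (0 == rank) {
            MPI_Close_port(port);
        }
    } else {
        MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);
        /* also ends up in PMIx_Connect, per the tracebacks above */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, half, &inter);
    }

    /* ... intercommunicator collectives (e.g. MPI_Iscatterv) run here ... */

    MPI_Comm_disconnect(&inter);
    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}
```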

@abouteiller
Member

I have observed the same issue when using COMM_SPAWN

hppritcha added a commit to hppritcha/ompi that referenced this issue May 13, 2021
Add an MCA parameter that can be used to set a timeout on the PMIx_Connect
operation used to support MPI_Comm_accept/connect and relatives.

Related to open-mpi#8958

Signed-off-by: Howard Pritchard <[email protected]>
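
For reference, PMIx already exposes a standard PMIX_TIMEOUT info key that a caller can attach to PMIx_Connect; a minimal sketch of that mechanism is below (the helper name is hypothetical, and this only illustrates how such an MCA parameter could be plumbed through, not the commit itself):

```c
/* Sketch only: PMIx defines a PMIX_TIMEOUT info key that can be passed
 * to PMIx_Connect so the call returns PMIX_ERR_TIMEOUT instead of
 * blocking forever. The helper below (hypothetical name) shows how an
 * MCA-controlled timeout could be plumbed through. */
#include <pmix.h>

static pmix_status_t connect_with_timeout(const pmix_proc_t procs[],
                                          size_t nprocs, int timeout_sec)
{
    pmix_info_t info;
    pmix_status_t rc;

    /* PMIX_TIMEOUT takes the number of seconds to wait */
    PMIX_INFO_LOAD(&info, PMIX_TIMEOUT, &timeout_sec, PMIX_INT);

    rc = PMIx_Connect(procs, nprocs, &info, 1);

    PMIX_INFO_DESTRUCT(&info);
    return rc;
}
```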
@hppritcha hppritcha added the RTE Issue likely is in RTE or PMIx areas label May 18, 2021
@rhc54
Contributor

rhc54 commented May 21, 2021

I have dug into this a bit over in PRRTE - please see openpmix/prrte#964 (comment) for an explanation and suggested fixes. I don't know if that's the issue here or not, but I offer it for reference. I can find no problem in the underlying PMIx_Connect function.

@hppritcha hppritcha removed the RTE Issue likely is in RTE or PMIx areas label May 21, 2021
@hppritcha
Member Author

Thanks @rhc54! I'll investigate the dpm.c abuse of PMIx_Spawn.

@hppritcha hppritcha self-assigned this May 25, 2021
@hppritcha
Member Author

@bosilca do you think the spawn problems you raised today could be related to this issue and the cited prrte issue?

@bosilca
Member

bosilca commented Nov 17, 2021

The stack looks very much like the one I was seeing, so it is plausible that the two are related. However, I don't follow @rhc54's explanation in openpmix/prrte#964 (comment), because:

  • some executions complete successfully; the hangs are sporadic
  • I do not see any extra or lingering processes beyond the exact number of processes that the RTE was asked to create.

awlauria pushed a commit to awlauria/ompi that referenced this issue Dec 6, 2021
Add an MCA parameter that can be used to set a timeout on the PMIx_Connect
operation used to support MPI_Comm_accept/connect and relatives.

Related to open-mpi#8958

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit 038291a)
@jjhursey
Member

jjhursey commented May 13, 2022

I'm also seeing this type of hang in our MTT now that we are running multi-node with both main and (more often) v5.0.x. With 3 nodes, running the following in a loop will eventually hang (after about 10 iterations):

mpirun --hostfile /opt/mpi/etc/hostfile --map-by ppr:2:node  --mca pml ucx --mca osc ucx,sm --mca btl ^openib  collective/intercomm/allgather_inter

All processes (6 in total) are stuck with the same stack:

Thread 3 (Thread 0x7fffadfdef60 (LWP 597)):
#0  0x00007fffb093343c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fffb0611a90 in ucs_event_set_wait (event_set=0x1001fe2f0c0, num_events=0x7fffadfde510, timeout_ms=<optimized out>, event_set_handler=0x7fffb05eac80 <ucs_async_thread_ev_handler>, arg=<optimized out>) at sys/event_set.c:198
#2  0x00007fffb05eb088 in ucs_async_thread_func (arg=0x1001fe22260) at async/thread.c:130
#3  0x00007fffb0a28878 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fffb0932f68 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fffaf61ef60 (LWP 593)):
#0  0x00007fffb093343c in epoll_wait () from /lib64/libc.so.6
#1  0x00007fffafe59328 in epoll_dispatch (base=0x1001fd96c00, tv=<optimized out>) at epoll.c:465
#2  0x00007fffafe4a660 in event_base_loop (base=0x1001fd96c00, flags=<optimized out>) at event.c:1992
#3  0x00007fffafadb1e4 in progress_engine (obj=0x1001fd96ad8) at runtime/pmix_progress_threads.c:228
#4  0x00007fffb0a28878 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fffb0932f68 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fffb109df40 (LWP 592)):
#0  0x00007fffb0a3174c in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x00007fffafa55400 in PMIx_Connect (procs=0x1002001b930, nprocs=6, info=0x7fffd9bb1400, ninfo=1) at client/pmix_client_connect.c:99
#2  0x00007fffb0b22990 in ompi_dpm_connect_accept (comm=0x10020005330, root=0, port_string=0x1001fffd800 "2935947265.0:3138776032", send_first=true, newcomm=0x7fffd9bb1ea0) at dpm/dpm.c:381
#3  0x00007fffb0b9e3b0 in PMPI_Comm_connect (port_name=0x1001fffd800 "2935947265.0:3138776032", info=0x7fffb0ffde40 <ompi_mpi_info_null>, root=0, comm=0x10020005330, newcomm=0x7fffd9bb1f48) at comm_connect.c:116
#4  0x0000000010001764 in main (argc=1, argv=0x7fffd9bb2378) at allgather_inter.c:70

The hang occurs where the test calls MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, comm, &intercomm);
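
For what it's worth, the thread-1 frame is consistent with the blocking client call waiting on a condition variable until the server responds. A generic sketch of that wait-for-callback pattern is below (an illustration only, not the actual pmix_client_connect.c code):

```c
/* Generic illustration (not the actual pmix_client_connect.c code) of
 * the blocking pattern visible in thread 1: the caller posts the request
 * to the progress thread and then waits on a condition variable until
 * the server's response arrives. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            active;   /* true while the request is outstanding */
    int             status;
} wait_tracker_t;

/* called from the progress thread when the server replies */
static void request_complete(wait_tracker_t *t, int status)
{
    pthread_mutex_lock(&t->lock);
    t->status = status;
    t->active = false;
    pthread_cond_signal(&t->cond);
    pthread_mutex_unlock(&t->lock);
}

/* what the blocking call effectively does after sending the request;
 * if the server-side connect never completes, this waits forever,
 * which is the pthread_cond_wait frame seen in the backtrace */
static int wait_for_reply(wait_tracker_t *t)
{
    pthread_mutex_lock(&t->lock);
    while (t->active) {
        pthread_cond_wait(&t->cond, &t->lock);
    }
    pthread_mutex_unlock(&t->lock);
    return t->status;
}
```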

@jjhursey
Member

Ref #10318

@wzamazon
Contributor

Should be fixed by openpmix/prrte#1381.

The issue is that prrte performs a grpcomm.allgather among all of its servers when handling PMIx_Connect.

The allgather uses a signature, which is generated from the procs array passed by the client when it calls PMIx_Connect.

procs is an array of pmix_proc_t; each pmix_proc_t identifies one process (namespace and rank).

For inter-communicator operations, the order of the processes in procs differs between ranks: each rank puts the processes of its own communicator into procs first, followed by the processes of the other communicator.

As a result, prrte ends up calling grpcomm.allgather with mismatched signatures.

openpmix/prrte#1381 fixes the issue by sorting procs before using it as the signature.

I consider this a prrte bug, not an OMPI bug, because the PMIx_Connect documentation does not say that clients must call it with procs in the same order.
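
To illustrate the idea behind the fix (a sketch only, not the literal openpmix/prrte#1381 patch, and the helper names are hypothetical): canonically ordering the pmix_proc_t array before deriving the signature makes every daemon compute the same signature regardless of the order each rank passed:

```c
/* Sketch of the idea behind the fix (not the literal prrte patch):
 * canonically order the pmix_proc_t array before deriving the allgather
 * signature from it, so every daemon computes the same signature no
 * matter what order each rank passed to PMIx_Connect. */
#include <pmix.h>
#include <stdlib.h>
#include <string.h>

static int proc_cmp(const void *a, const void *b)
{
    const pmix_proc_t *pa = (const pmix_proc_t *) a;
    const pmix_proc_t *pb = (const pmix_proc_t *) b;
    int rc = strncmp(pa->nspace, pb->nspace, PMIX_MAX_NSLEN);
    if (0 != rc) {
        return rc;
    }
    return (pa->rank > pb->rank) - (pa->rank < pb->rank);
}

/* sort by (namespace, rank) before using procs as the signature */
static void canonicalize_procs(pmix_proc_t *procs, size_t nprocs)
{
    qsort(procs, nprocs, sizeof(pmix_proc_t), proc_cmp);
}
```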

@wzamazon
Contributor

> because the PMIx_Connect documentation does not say that clients must call it with procs in the same order.

It looks like I was wrong: the PMIx document does require that the order be maintained when calling PMIx_Connect, so OMPI needs to change.

The PR for ompi is #10557.

@wzamazon
Contributor

wzamazon commented Jul 14, 2022

#10557 has been merged to main and fixes the hang in the IBM CI (link).

#10564 back-ports it to v5.0.x and is waiting to be merged.

This is not required for the 4.x branches, because orted and prted implement pmix_server_connect_fn differently: prted uses grpcomm.allgather, while orted does not.

@wzamazon
Contributor

The back-port was also merged; MTT is no longer impacted by this issue. Closing.
