OpenMPI MTL PSM2 hangs under flux-core 0.11.1 #2173

Open
SteVwonder opened this issue May 30, 2019 · 21 comments

Comments

@SteVwonder
Member

A multi-task mpi-hello-world program run under a single-node flux instance, using the openmpi 3.0.1 installed in /usr/tce, hangs until PSM2 initialization times out.

Backtrace from pid 17986{2,4}:

(gdb) bt
#0  0x00002aaaaafdf6fd in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x00002aaaad8d0550 in read (__nbytes=1, __buf=0x66cc40, __fd=15) at /usr/include/bits/unistd.h:44
#2  dgetline (fd=15,
    buf=0x66cc40 "cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTM1AHBtaXguaG5hbWUAMDMAMDAwOABvcGFsMTg2AE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLm9wZW5pYi4zLjAAMTQAMDAyNgABAAAAAAAAgP7CAAAABQAAAHURAADwJAAAAAEDPw"..., len=1216)
    at dgetline.c:24
#3  0x00002aaaad8cc65a in pmi_simple_client_barrier (impl=0x665ff0) at simple_client.c:217
#4  0x00002aaaad8cdb63 in PMI_Barrier () at pmi.c:259
#5  0x00002aaaad6c4eb3 in flux_fence () from /usr/tce/packages/openmpi/openmpi-3.0.1-gcc-4.9.3/lib/openmpi/mca_pmix_flux.so
#6  0x00002aaaaad1962a in ompi_mpi_init () from /usr/tce/packages/openmpi/openmpi-3.0.1-gcc-4.9.3/lib/libmpi.so.40
#7  0x00002aaaaad3f708 in PMPI_Init () from /usr/tce/packages/openmpi/openmpi-3.0.1-gcc-4.9.3/lib/libmpi.so.40
#8  0x00000000004007c8 in main ()

Backtrace from pid 179863:

(gdb) bt
#0  0x00002aaaab2cfd47 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1  0x00002aaabb4c8955 in amsh_ep_connreq_wrap () from /usr/lib64/libpsm2.so.2
#2  0x00002aaabb4c92d4 in amsh_ep_connect () from /usr/lib64/libpsm2.so.2
#3  0x00002aaabb4d1965 in psm2_ep_connect () from /usr/lib64/libpsm2.so.2
#4  0x00002aaabb2bb7b9 in ompi_mtl_psm2_add_procs ()
   from /usr/tce/packages/openmpi/openmpi-3.0.1-gcc-4.9.3/lib/openmpi/mca_mtl_psm2.so
#5  0x00002aaaaad1934d in ompi_mpi_init () from /usr/tce/packages/openmpi/openmpi-3.0.1-gcc-4.9.3/lib/libmpi.so.40
#6  0x00002aaaaad3f708 in PMPI_Init () from /usr/tce/packages/openmpi/openmpi-3.0.1-gcc-4.9.3/lib/libmpi.so.40
#7  0x00000000004007c8 in main ()

Output after timeout:

[opal186:179863] PSM2 returned unhandled/unknown connect error: Operation timed out
[opal186:179863] PSM2 EP connect error (unknown connect error):
[opal186:179863]  opal186
[opal186:179863]  opal186
[opal186:179863]
[opal186:179863] PSM2 EP connect error (unknown connect error):
[opal186:179863]  opal186
[opal186:179863]
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[opal186:179863] *** An error occurred in MPI_Init
[opal186:179863] *** reported by process [1,1]
[opal186:179863] *** on a NULL communicator
[opal186:179863] *** Unknown error
[opal186:179863] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[opal186:179863] ***    and potentially your MPI job)

It works if you set OMPI_MCA_mtl=psm or OMPI_MCA_mtl=^psm before running. Maybe we should add an OpenMPI "personality" just like we have for Spectrum.
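For example, a minimal sketch of the workaround (assuming the 0.11-era flux wreckrun launcher and a placeholder hello-world binary; adjust for your launcher):

export OMPI_MCA_mtl=^psm        # or OMPI_MCA_mtl=psm, per above
flux wreckrun -n 4 ./mpi_hello_world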

@dongahn
Member

dongahn commented May 30, 2019

BTW, this would be a good test case to try our upcoming debugging tools support.

@SteVwonder
Member Author

SteVwonder commented May 30, 2019

I just tried PSM2 on multiple nodes, and it worked. I believe this is related to this OpenMPI bug: open-mpi/ompi#1559

@dongahn
Member

dongahn commented May 30, 2019

@SteVwonder: could this be related to the problems @damora is having at all?

@SteVwonder
Member Author

I am not sure whether it is related to #2170, but once #2170 is fixed, I'm sure @damora and others will run into this bug too (depending on the version of OpenMPI they are using).

@dongahn
Member

dongahn commented May 30, 2019

Sigh. Good to know. Glad that you discovered this before others do.

@briadam

briadam commented Jan 12, 2022

(Separate discussion with @garlick led me here...) I see what I believe to be a similar issue running on DOE CTS-1 with OpenMPI 4.x applications, notably 4.1.1.

Summary:

  • Assume an MPI application run that requires cores <= cores_per_node. Flux (master from last week) daemons are started with srun -N 2 -n 2, that is, one on each of two nodes (it doesn't matter whether --mpi=none or the defaults are used, per @dongahn's suggestion).
  • If I flux mini run or submit on 1 node (or an unspecified number of nodes), e.g., flux mini run -N 1 -n 4 mpi_hello_world, then MPI hangs with stack traces similar to @SteVwonder's above.
  • If I flux mini run or submit forcing 2 nodes, e.g., flux mini run -N 2 -n 4 mpi_hello_world, or the MPI application requires cores > cores_per_node, life is good. (A condensed repro sketch follows this list.)
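A condensed repro sketch, under the assumptions above (paths and the hello-world binary are placeholders):

# start one flux broker on each of two Slurm-allocated nodes
salloc --nodes=2
srun -N 2 -n 2 --pty /path/to/bin/flux start

# inside the flux instance:
flux mini run -N 1 -n 4 ./mpi_hello_world    # hangs in PSM2 during MPI_Init
flux mini run -N 2 -n 4 ./mpi_hello_world    # completes normally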

This is a blocker as I'm attempting to use flux mini submit to dispatch an ensemble of heterogeneous openmpi-4.x application jobs, e.g., on n=1,4,36,48 cores, most of which fit within a node. I can work around this by forcing multi-core jobs to use at least two nodes, likely incurring more interconnect overhead.

Also, in case it's helpful, neutering PSM2 by disabling the cm PML (export OMPI_MCA_pml=^cm) also seems to work around the issue, but I still need to dig in and figure out whether my MPI jobs are falling back to a reasonably performant transport.

Are there any other known remedies or workarounds? Should I expand on my environment and reproducing in this issue, in another issue/discussion, or is that not needed to help debug?

@garlick
Member

garlick commented Jan 12, 2022

Thanks @briadam! Yeah, let's track the details in this issue. Here are a couple of details you sent by email that I think are helpful.

When a 1n4p task fails, all four tasks are still running, with three of the four having entered the PMI1 barrier and the fourth stuck in PSM2:

#0  0x00002aaaab2e4807 in sched_yield ()
    at ../sysdeps/unix/syscall-template.S:81
#1  0x00002aaab596cec5 in amsh_ep_connreq_wrap () from /lib64/libpsm2.so.2
#2  0x00002aaab596d834 in amsh_ep_connect () from /lib64/libpsm2.so.2
#3  0x00002aaab59783ba in psm2_ep_connect () from /lib64/libpsm2.so.2
#4  0x00002aaab575ecb4 in ompi_mtl_psm2_add_procs (mtl=<optimized out>,
    nprocs=<optimized out>, procs=0x74d490) at mtl_psm2.c:301
#5  0x00002aaaaad9a185 in ompi_mpi_init (argc=1, argv=0x7fffffffc258,
    requested=0, provided=0x7fffffffbee4, reinit_ok=<optimized out>)
    at runtime/ompi_mpi_init.c:854
#6  0x00002aaaaad4737b in PMPI_Init (argc=0x7fffffffbf3c, argv=0x7fffffffbf30)
    at pinit.c:67
#7  0x0000000000400991 in main ()

The OMPI_ environment for the failing case is

OMPI_MCA_mtl=psm2
OMPI_MCA_btl_openib_allow_ib=true
OMPI_MCA_pml=cm
OMPI_MCA_btl_openib_ib_retry_count=7
OMPI_MCA_btl_openib_ib_timeout=21
OMPI_MCA_btl=^openib

The ompi_mtl_psm2_add_procs function is defined here

If you have any other insights or data please append to this issue. Thanks!

@garlick
Member

garlick commented Jan 12, 2022

When you run the 1n4p case under slurm, and it works, what options do you need to use?

Would it be possible to get an environment dump of OMPI_* and PMI* from slurm for comparison? Maybe we've got something wrong in the environment set up by flux.

@briadam

briadam commented Jan 12, 2022

Nothing special seems needed to run. Here are some abbreviated notes from a clean salloc/direct launch, where the mpirun variant defaults to running the tests on one node, and both that and the srun launch ran fine.

salloc --nodes=2
srun -N 1 -n 4 `pwd`/mpi_hello_debug
mpirun -n4 `pwd`/mpi_hello_debug

$ env | egrep 'OMPI_|PMI'
OMPI_MCA_mtl=psm2
OMPI_MCA_btl_openib_allow_ib=true
OMPI_MCA_pml=cm
OMPI_MCA_btl_openib_ib_retry_count=7
OMPI_MCA_btl_openib_ib_timeout=21
OMPI_MCA_btl=^openib

Here's the environment after I first start the flux daemons, where the only difference is the addition of the PMI library path:

srun -N2 -n2 --mpi=none --pty /path/to/bin/flux start

$ env | egrep 'OMPI_|PMI'
OMPI_MCA_mtl=psm2
OMPI_MCA_btl_openib_allow_ib=true
OMPI_MCA_pml=cm
OMPI_MCA_btl_openib_ib_retry_count=7
FLUX_PMI_LIBRARY_PATH=/path/to/lib/flux/libpmi.so
OMPI_MCA_btl_openib_ib_timeout=21
OMPI_MCA_btl=^openib

From within that Flux pty session, I did this to see what's set when running a job:

$ flux mini run -N 1 -n 2 env | egrep 'OMPI_|PMI' | sort -u
FLUX_PMI_LIBRARY_PATH=/path/to/lib/flux/libpmi.so
OMPI_MCA_btl=^openib
OMPI_MCA_btl_openib_allow_ib=true
OMPI_MCA_btl_openib_ib_retry_count=7
OMPI_MCA_btl_openib_ib_timeout=21
OMPI_MCA_btl_vader_backing_directory=/tmp/flux-0UyKF8/jobtmp-0-ƒ2p32wpiT
OMPI_MCA_mtl=psm2
OMPI_MCA_pmix=flux
OMPI_MCA_pml=cm
OMPI_MCA_schizo=flux
PMI_FD=22
PMI_FD=26
PMI_RANK=0
PMI_RANK=1
PMI_SIZE=2

@garlick
Member

garlick commented Jan 12, 2022

Thanks - could you run env from srun the way you did with flux mini run above? I'm just wondering if slurm is doing anything to the environment that we can learn from, and also which PMI your slurm is providing in the default case, since maybe it's something MPI gets from PMI that makes the difference.

@briadam

briadam commented Jan 12, 2022

Good call. Nothing additional with srun:

$ srun -N 1 -n 2 env | egrep 'OMPI_|PMI' | sort -u
OMPI_MCA_btl=^openib
OMPI_MCA_btl_openib_allow_ib=true
OMPI_MCA_btl_openib_ib_retry_count=7
OMPI_MCA_btl_openib_ib_timeout=21
OMPI_MCA_mtl=psm2
OMPI_MCA_pml=cm
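For a quick side-by-side, the two dumps can be diffed directly (a sketch; file names are placeholders):

flux mini run -N 1 -n 2 env | egrep 'OMPI_|PMI' | sort -u > env.flux
srun -N 1 -n 2 env | egrep 'OMPI_|PMI' | sort -u > env.slurm
diff env.flux env.slurm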

I also did the same with raw mpirun in case it's useful, with apologies for the verbose output. I replaced some names and numbers with <...>:

$ mpirun -n 2 env | egrep 'OMPI_|PMI' | sort -u
OMPI_APP_CTX_NUM_PROCS=2
OMPI_COMMAND=env
OMPI_COMM_WORLD_LOCAL_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=1
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_NODE_RANK=0
OMPI_COMM_WORLD_NODE_RANK=1
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_SIZE=2
OMPI_FILE_LOCATION=/tmp/ompi.<...>.<...>/jf.14210/0/1
OMPI_FIRST_RANKS=0
OMPI_MCA_btl=^openib
OMPI_MCA_btl_openib_allow_ib=true
OMPI_MCA_btl_openib_ib_retry_count=7
OMPI_MCA_btl_openib_ib_timeout=21
OMPI_MCA_ess_base_jobid=931266561
OMPI_MCA_ess_base_num_procs=3
OMPI_MCA_ess_base_vpid=0
OMPI_MCA_ess_base_vpid=1
OMPI_MCA_ess=^singleton
OMPI_MCA_initial_wdir=/home/<...>
OMPI_MCA_mpi_oversubscribe=0
OMPI_MCA_mtl=psm2
OMPI_MCA_orte_app_num=0
OMPI_MCA_orte_bound_at_launch=1
OMPI_MCA_orte_ess_node_rank=0
OMPI_MCA_orte_ess_node_rank=1
OMPI_MCA_orte_ess_num_procs=2
OMPI_MCA_orte_hnp_uri=931266560.0;tcp://<...>
OMPI_MCA_orte_jobfam_session_dir=/tmp/ompi.<...>.<...>/jf.14210
OMPI_MCA_orte_launch=1
OMPI_MCA_orte_local_daemon_uri=931266560.1;tcp://<...>
OMPI_MCA_orte_node_regex=<...>
OMPI_MCA_orte_num_nodes=1
OMPI_MCA_orte_precondition_transports=8e9942de8a20963a-0100e5d230539058
OMPI_MCA_orte_tmpdir_base=/tmp
OMPI_MCA_orte_top_session_dir=/tmp/ompi.<...>.<...>
OMPI_MCA_pmix=^s1,s2,cray,isolated
OMPI_MCA_pml=cm
OMPI_MCA_shmem_RUNTIME_QUERY_hint=mmap
OMPI_NUM_APP_CTX=1
OMPI_UNIVERSE_SIZE=72
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_DSTORE_21_BASE_PATH=/tmp/ompi.<...>.<...>/jf.14210/pmix_dstor_ds21_228991
PMIX_DSTORE_ESH_BASE_PATH=/tmp/ompi.<...>.<...>/jf.14210/pmix_dstor_ds12_228991
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_HOSTNAME=<...>
PMIX_ID=931266561.0
PMIX_ID=931266561.1
PMIX_MCA_mca_base_component_show_load_errors=1
PMIX_NAMESPACE=931266561
PMIX_PTL_MODULE=tcp,usock
PMIX_RANK=0
PMIX_RANK=1
PMIX_SECURITY_MODE=native
PMIX_SERVER_TMPDIR=/tmp/ompi.<...>.<...>/jf.14210
PMIX_SERVER_URI21=931266560.1;tcp4://127.0.0.1:41331
PMIX_SERVER_URI2=931266560.1;tcp4://127.0.0.1:41331
PMIX_SERVER_URI3=931266560.1;tcp4://127.0.0.1:41331
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_VERSION=3.2.3

@garlick
Member

garlick commented Jan 13, 2022

The team at LLNL kindly installed openmpi-4.1.2 on our opal CTS-1 system for us, and it seems to work out of the box with flux. However, it is using the openib btl and was not built with psm2 enabled.

$ flux mini run -N2 -n4 ./hello
f921Tje7: completed MPI_Init in 0.379s.  There are 4 tasks
f921Tje7: completed first barrier in 0.030s
f921Tje7: completed MPI_Finalize in 0.312s
$ flux mini run -N1 -n4 ./hello
fD3npV3u: completed MPI_Init in 0.286s.  There are 4 tasks
fD3npV3u: completed first barrier in 0.000s
fD3npV3u: completed MPI_Finalize in 0.270s

The only environment vars we have set for OMPI are the ones set by flux (captured by running flux mini run -n1 printenv | grep OMPI_):

OMPI_MCA_btl_vader_backing_directory=/var/tmp/garlick/flux-VOqqL3/jobtmp-0-f3w2WPmZq
OMPI_MCA_pmix=flux
OMPI_MCA_schizo=flux

I think we tried that in your environment and it failed, correct?

Back to the drawing board - maybe time to build 4.1.1 --with-psm2.
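Roughly what I have in mind (a sketch only; the prefix is a placeholder, and --with-psm2 / --with-pmi are the stock Open MPI configure switches for PSM2 and Slurm PMI support):

./configure --prefix=$HOME/ompi-4.1.1 --with-psm2 --with-pmi
make -j && make install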

@briadam

briadam commented Jan 13, 2022

Sorry, tried which in our environment? I haven't tried wiping out the system module-set OMPI_* variables in our environment, but if that's what you mean, can give it a try.

Depending on that, I can also see if it works for me with openmpi-4.1.2 if I build it (though I probably don't have a prayer of building it exactly the way the sys admins did...).

In case relevant to your experiments, the only interesting parts of the system-installed openmpi-4.1.1 configuration (from "Configure command line" in ompi_info) seem to be

'--program-prefix='
'--disable-dependency-tracking'
'--with-io-romio-flags=--with-file-system=ufs+nfs+lustre'
'--with-cuda=/opt/cudatoolkit/10.2/include'
'--with-pmi'
'--enable-mpi-thread-multiple'

I'm happy to provide any other configuration info that's relevant.

@garlick
Member

garlick commented Jan 13, 2022

Sorry, tried which in our environment? I haven't tried wiping out the system module-set OMPI_* variables in our environment, but if that's what you mean, can give it a try.

Sorry, that wasn't clear. Right, that's what I meant. I'm just wondering if openib works for you like it appears to for us. Though we still probably want to know why psm2 works under slurm and not flux.

I forgot that ompi_info could provide the configure options. Nice tip!

@briadam

briadam commented Jan 13, 2022

wiping out the system module-set OMPI_* variables in our environment

Didn't change any behavior; one rank still hangs when run with -N 1. (I unset the OMPI_* variables both before flux start and again before flux mini run, as they were re-created in the sub-shell.)

FWIW for your debugging, the target application is using openmpi-4.0.5, but I'm happy to test whatever version/options are helpful. Whatever the difference between our environments is, it seems tricky to chase down.

@garlick
Member

garlick commented Jan 13, 2022

Thanks. To clarify, 4.0.5 just now, 4.1.1 before?

By chance do you have 4.1.2 available? That is the only version I have that works on this system right now. (Self-built ompi thus far is not going great...)

@briadam

briadam commented Jan 13, 2022

Now I've caused confusion... All my experiments reported in this issue are with 4.1.1.

My team member who asked me to demo this is ultimately aiming to use Dakota + Flux with an application built with intel-20.x and openmpi-4.0.5. I realize they may need to rebuild the application depending on what we find to be the cause of this hang and whether it's OpenMPI-version related or an issue with the runtime environment, hardware, etc.

@briadam

briadam commented Jan 14, 2022

Progress! It occurred to me that with my previous experiment clearing the OMPI_* env vars, the runtime could still fall back to its default best transports and maybe pick cm/psm2. If I explicitly select openib (no idea if this is a valid way to do it, as I'm in way over my head here):

unset OMPI_MCA_mtl
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=openib

yielding

OMPI_MCA_btl_openib_allow_ib=true
OMPI_MCA_pml=ob1
OMPI_MCA_btl_openib_ib_retry_count=7
OMPI_MCA_btl_openib_ib_timeout=21
OMPI_MCA_btl=openib

the -N 1 test works. It also works if I clear my environment and only set these two variables:

OMPI_MCA_mtl=^psm2
OMPI_MCA_pml=^cm
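Applied inline, that looks roughly like this (a sketch; the env dumps above suggest flux mini run forwards the submitting shell's environment to the job, so exporting before launch should be enough):

export OMPI_MCA_mtl=^psm2
export OMPI_MCA_pml=^cm
flux mini run -N 1 -n 4 ./mpi_hello_world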

So this seems to further support that the issue relates to PSM2.

@garlick
Member

garlick commented Jan 14, 2022

Excellent. Well, we should still root out the psm2 issue, but it's good to know something works with a high-speed interconnect!

@garlick
Member

garlick commented Jan 14, 2022

Not making a lot of progress here, but just to add some data points: I got 4.1.1 built with psm2 working on our system, and it seems to work out of the box with flux.

I note that when I run with -N1 -n4 (single node), it selects shared memory, not the interconnect, for communications, which is what I would expect, but apparently not what is happening at Sandia.

I found I was able to force psm2 to be used (rather than shared memory, tcp, etc.) with the following:

OMPI_MCA_btl=ofi
OMPI_MCA_mtl=psm2
OMPI_MCA_pml=cm

but still no joy recreating the problem.

Just for the record, I was able to see what modules are actually being used by setting the following debug variables:

OMPI_MCA_btl_base_verbose=100
OMPI_MCA_mtl_base_verbose=100
OMPI_MCA_mtl_ofi_verbose=100
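For example, one way to capture the selection output on the failing case (illustrative; the grep pattern is just a guess at the interesting lines):

OMPI_MCA_mtl_base_verbose=100 OMPI_MCA_btl_base_verbose=100 \
    flux mini run -N 1 -n 4 ./hello 2>&1 | grep -i select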

@briadam

briadam commented Jan 14, 2022

Just to verify, I ran my -N1 test case with that increased verbosity, both in the default openmpi-4.1.1 environment and in a clean one where only those verbosity controls were set. I believe the attached logs confirm that psm2 is selected, as we thought.
briadam_mca_debug_N1_n4_cleanenv.log
briadam_mca_debug_N1_n4_defaults.log
