
MPI_Init segfaults when using OpenMPI 3.1.3 with flux-core 0.11.0 #2170

Closed
damora opened this issue May 29, 2019 · 46 comments

damora commented May 29, 2019

Built the following:
flux-core-0.11.0
flux-sched-0.7.0
OpenMPI 3.1.3
PMIX 2.1 (using libpmi)
can launch flux with mpirun without any errors
basic flux informational commands seem to work, but attempting to run an MPI application I get the following segfault:
[c699c056:139471:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x00000000000745c4 mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:641
3 0x0000000000074d04 mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:616
4 0x000000000000fe38 _dl_lookup_symbol_x() :0
5 0x0000000000182c98 do_sym() dl-sym.c:0
6 0x000000000000139c dlsym_doit() dlsym.c:0
7 0x00000000000170d0 _dl_catch_error() :0
8 0x0000000000001c18 _dlerror_run() :0
9 0x0000000000001448 __dlsym() :0
10 0x00000000000026dc PMI_KVS_Put() /home/damora/openmpi-3.1.3/build/opal/mca/pmix/flux/../../../../../opal/mca/pmix/flux/pmix_flux.c:245
11 0x0000000000002b84 kvs_put() /home/damora/openmpi-3.1.3/build/opal/mca/pmix/flux/../../../../../opal/mca/pmix/flux/pmix_flux.c:294
12 0x0000000000108190 opal_pmix_base_partial_commit_packed() /home/damora/openmpi-3.1.3/build/opal/mca/pmix/../../../../opal/mca/pmix/base/pmix_base_fns.c:384
13 0x0000000000003fe8 flux_put() /home/damora/openmpi-3.1.3/build/opal/mca/pmix/flux/../../../../../opal/mca/pmix/flux/pmix_flux.c:654
14 0x0000000000005084 mca_pml_ucx_send_worker_address() /home/damora/openmpi-3.1.3/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:95
15 0x0000000000005898 mca_pml_ucx_init() /home/damora/openmpi-3.1.3/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx.c:230
16 0x000000000000b988 mca_pml_ucx_component_init() /home/damora/openmpi-3.1.3/build/ompi/mca/pml/ucx/../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:134
17 0x000000000018003c mca_pml_base_select() /home/damora/openmpi-3.1.3/build/ompi/mca/pml/../../../../ompi/mca/pml/base/pml_base_select.c:126
18 0x0000000000071a4c ompi_mpi_init() /home/damora/openmpi-3.1.3/build/ompi/../../ompi/runtime/ompi_mpi_init.c:640
19 0x00000000000e34d4 PMPI_Init() /home/damora/openmpi-3.1.3/build/ompi/mpi/c/profile/pinit.c:66
20 0x000000001000114c main() ??:0
21 0x0000000000025100 generic_start_main.isra.0() libc-start.c:0
22 0x00000000000252f4 __libc_start_main() ??:0


garlick commented May 29, 2019

The code inside the openmpi Flux plugin in opal/mca/pmix/flux/pmix_flux.c looks like this:

static int PMI_KVS_Put (const char *kvsname, const char *key, const char *value)
{
    int (*f)(const char *, const char *, const char *);
    *(void **)(&f) = dso ? dlsym (dso, "PMI_KVS_Put") : NULL;
    return f ? f (kvsname, key, value) : PMI_FAIL;
}

and dlsym() is internally segfaulting, with the SIGSEGV being caught by openmpi and reported. I'm not sure what could be causing dlsym() to segfault here. Importantly, it's not segfaulting in the call into the Flux libpmi.so, it's segfaulting in dlsym() itself.

Also, what is with that cast? It seems like

f = dso ? dlsym (dso, "PMI_KVS_Put") : NULL

would have worked just fine. Did I write that? Ugh.
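
For what it's worth, the cast is the idiom POSIX suggests for moving the void * returned by dlsym() into a function pointer without a strict ISO C diagnostic (ISO C does not define conversions between object and function pointers); the plain assignment compiles on common compilers but may warn under -pedantic. A minimal sketch of the two forms:

#include <dlfcn.h>

typedef int (*kvs_put_f)(const char *, const char *, const char *);

static kvs_put_f lookup_put (void *dso)
{
    kvs_put_f f = NULL;
    if (dso) {
        /* POSIX-suggested form: copy the pointer value through an object
         * pointer to sidestep the object/function pointer conversion */
        *(void **)(&f) = dlsym (dso, "PMI_KVS_Put");
        /* the plain form, f = dlsym (dso, "PMI_KVS_Put"), also works on
         * common compilers but may draw a -pedantic warning */
    }
    return f;
}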

Plan of attack. Well, it's hard to know what might be going on, although our other efforts to abuse dlopen() / dlsym() in spectrum MPI have been...perilous...due to spectrum's internal LD_PRELOADs, I hear. I am sorry that I do not understand shared library semantics well enough to be of much help.


dongahn commented May 29, 2019

It seems OpenMPI dies trying to dlopen/dlsym a symbol from a PMI library. Is flux's libpmi in the library search path?

Is there an easy way to check which libpmi your OpenMPI dlopen'ed?
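
Something like this would show it (a sketch; the test binary name is illustrative):

flux wreckrun -N1 -n1 env LD_DEBUG=libs ./mpi-hello 2>&1 | grep -i libpmi
# or, for a rank that is already running:
grep libpmi /proc/<pid>/maps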


dongahn commented May 29, 2019

@damora: You are not using Spectrum MPI at all because of the recent PMIx bug?


dongahn commented May 29, 2019

Final thought before taking my kids to school: for Spectrum MPI we had to nuke some environment variables to make it work, and we captured that logic in our wreck plugin. We might need a similar trick to redirect OpenMPI's requests from its PMIx to Flux.

This is in case you are really using OpenMPI.

This whole business of PMIx, PMI, and bootstrapping is becoming increasingly complex and convoluted, and I do hope the standardization effort fixes these interoperability issues for everyone. Sigh.


damora commented May 29, 2019

@dongahn not using Spectrum MPI. I checked out PMIX 2.1 and built it. I then checked out OpenMPI 3.1.2 and configured it with --with-pmix=/pmix2.1
I set PMI_LIBRARY=/pmix2.1/lib/libpmi.so when launching flux with openmpi. I used mpirun ... -x PMI_LIBRARY=/pmix2.1/lib/libpmi.so to make sure it was picked up by my flux exec.


dongahn commented May 29, 2019

@damora: yes, that should be good for launching flux with mpirun/openmpi. But I am not sure it is good enough for flux to bootstrap an MPI app that was built with an openmpi configured this way.

Perhaps you can build another OpenMPI with a non-PMIx option that is more compatible with flux. I forget, was it --with-flux? @garlick?

If you build your MPI applications with that version of OpenMPI, you would have a better chance of launching them under flux, IMHO.


garlick commented May 29, 2019

not using Spectrum MPI.

Ah that is useful info.

I set PMI_LIBRARY=/pmix2.1/lib/libpmi.so when launching flux with openmpi. I used mpirun ... -x PMI_LIBRARY=/pmix2.1/lib/libpmi.so to make sure it was picked up my flux exec

Your trace appears to be from an application linked against openmpi that is trying to bootstrap under Flux and failing in MPI_Init(). So whatever you are doing to start Flux (mpirun) is working?


garlick commented May 29, 2019

I think the flux mca plugin gets built by default. The trace shows it is being used. I guess the question is whether building openmpi without pmix results in an mpirun that can start flux. I haven't tried that.


damora commented May 29, 2019

@garlick yes, using (open) mpirun seems to start flux correctly...at least I see no errors in the logs. I can also do flux hwloc info and it returns the correct core counts. After I launch the jobs, I can use flux wreck commands to query and look at output. I don't actually see any errors in flux.log

Should I have built openmpi without specifying pmix? Should I configure openmpi to use --with-flux-pmi and --with-flux-pmi-library?


dongahn commented May 29, 2019

@damora:

Before going after a big-hammer solution like building a differently configured OpenMPI, could you quickly check which libpmi.so is being dlopen'ed here?

I am wondering if another libpmi (or similar) has been dlopen'ed, and that led to this error.

If you set your LD_LIBRARY_PATH so that flux's libpmi.so appears first in the library search path, what happens to the error?
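
For example (a sketch; the install prefix is illustrative):

export LD_LIBRARY_PATH=/usr/lib64/flux:$LD_LIBRARY_PATH
flux wreckrun -N1 -n1 ./mpi-hello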


damora commented May 29, 2019

@dongahn same error. I can try rebuilding openmpi using --with-flux-pmi-lib


damora commented May 29, 2019

flux builds its own libpmi.so. When openmpi is configured with --with-flux-pmi=yes (which is the default), will it automatically use the libpmi.so that got built with flux? If so, I wonder if that is the problem? If the default is --with-flux-pmi=yes, is mpirun trying to use the libpmi.so built with flux rather than the one I provided via pmix 2.1?


dongahn commented May 29, 2019

flux builds its own libpmi.so. When openmpi is configured with --with-flux-pmi=yes (which is the default), will it automatically use the libpmi.so that got built with flux? If so, I wonder if that is the problem? If the default is --with-flux-pmi=yes, is mpirun trying to use the libpmi.so built with flux rather than the one I provided via pmix 2.1?

So here is the convoluted nature of this problem.

  • When you use OpenMPI/mpirun to launch flux itself, you want to make sure the backward-compatible libpmi library built against PMIx is used.

  • When you want to launch an OpenMPI application under flux, you want to make sure the libpmi built with flux is used instead.

That's why I suggested building one more OpenMPI configuration (with --with-flux) and using that to build your OpenMPI application.

Does this make sense?


damora commented May 29, 2019

@dongahn yes, I tried this combination, but it still did not work.
I'll double-check tonight. It is clearly an MPI issue, because I have CUDA calls before I execute MPI_Init and they work fine.
I think I need to confirm which libpmi I'm opening and when.


dongahn commented May 29, 2019

@damora: Thank you for the report. Some of us may need to look at whether flux can still bootstrap OpenMPI.

I think I need to confirm which libpmi I'm opening and when.

Yes. Great thanks.


damora commented May 30, 2019

@dongahn I was reviewing how I configured flux to work in containers, where I can launch flux in containers, exec into a container, and run an MPI application across multiple containers using flux wreckrun, and I noticed a couple of things:

  1. I was using OpenMPI 3.0.0 and configuring with the external pmix flag as well as the --with-flux flag.
  2. I was using flux-core v0.10.0, which was configured using a --with-pmix flag. I think that flag has been deprecated in v0.11.0. I know this works in containers.


dongahn commented May 30, 2019

@damora: at this point, I propose we hold a call to try to resolve your issue.


damora commented May 30, 2019

@dongahn how about I set up a Webex so that I can show you the steps I'm taking?


dongahn commented May 30, 2019

Sounds good. Please send that info to me and @SteVwonder. Thanks

@SteVwonder

I have been able to reproduce this (or at least create a very similar scenario) on an LC system (opal) using the OpenMPI 4.0.0 installed in /usr/tce. So it appears that this is a more general problem with OpenMPI than we first anticipated.

Backtrace from my segfault:

(gdb) #0  _dl_lookup_symbol_x (undef_name=0x2aaaad6e331c "PMI_KVS_Put",
    undef_map=<optimized out>, ref=0x7fffffff7ec0, symbol_scope=0x667ac8,
    version=<optimized out>, type_class=<optimized out>, flags=2, skip_map=0x0)
    at dl-lookup.c:783
#1  0x00002aaaab33fd49 in do_sym (handle=0x667740,
    name=0x2aaaad6e331c "PMI_KVS_Put", who=0x2aaaad6e2119 <kvs_put+41>,
    vers=vers@entry=0x0, flags=flags@entry=2) at dl-sym.c:178
#2  0x00002aaaab3402ad in _dl_sym (handle=<optimized out>,
    name=<optimized out>, who=<optimized out>) at dl-sym.c:283
#3  0x00002aaaabb99004 in dlsym_doit (a=a@entry=0x7fffffff80c0) at dlsym.c:50
#4  0x00002aaaaaaba704 in _dl_catch_error (objname=0x64b700,
    errstring=0x64b708, mallocedp=0x64b6f8,
    operate=0x2aaaabb98ff0 <dlsym_doit>, args=0x7fffffff80c0) at dl-error.c:177
#5  0x00002aaaabb994ed in _dlerror_run (
    operate=operate@entry=0x2aaaabb98ff0 <dlsym_doit>,
    args=args@entry=0x7fffffff80c0) at dlerror.c:163
#6  0x00002aaaabb99058 in __dlsym (handle=<optimized out>,
    name=<optimized out>) at dlsym.c:70
#7  0x00002aaaad6e2119 in kvs_put ()
   from /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_flux.so
#8  0x00002aaaab952ae0 in opal_pmix_base_commit_packed ()
   from /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40
#9  0x00002aaaad6e1fac in flux_commit ()
   from /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_flux.so
#10 0x00002aaaaad1bd78 in ompi_mpi_init ()
   from /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libmpi.so.40
#11 0x00002aaaaad4972b in PMPI_Init ()
   from /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libmpi.so.40
#12 0x00000000004007c8 in main ()


dongahn commented May 30, 2019

_dl_lookup_symbol_x (undef_name=0x2aaaad6e331c "PMI_KVS_Put",
undef_map=<optimized out>, ref=0x7fffffff7ec0, symbol_scope=0x667ac8,
version=<optimized out>, type_class=<optimized out>, flags=2, skip_map=0x0)
at dl-lookup.c:783

Does it mean PMI_KVS_Put is not defined? As we discussed on our call earlier, I'm wondering what LD_DEBUG says about the dlsym crash...

@SteVwonder

PMI_Finalize is called before the segfault occurs (confirmed by setting a breakpoint with gdb). So I wonder if the PMI shared library is closed during finalize, and then dlsym is called on an invalid handle.
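
Roughly, the breakpoint check (a sketch, assuming the task is run or attached to under gdb):

(gdb) break PMI_Finalize
(gdb) run
# the breakpoint fires first; continuing from it then segfaults inside dlsym()
(gdb) continue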

→ flux wreckrun -N1 -n1 env FLUX_PMI_DEBUG=1 ./mpi-hello-opal-ompi4
PMI_Init: PMI_FD is set, selecting simple_client
0: PMI_Init rc=0
0: PMI_KVS_Get_value_length_max rc=0
0: PMI_KVS_Get_name_length_max rc=0
0: PMI_KVS_Get_key_length_max rc=0
0: PMI_Get_rank rc=0
0: PMI_KVS_Get_my_name ("lwj.0.0.14.pmi") rc=0
0: PMI_Get_rank rc=0
0: PMI_Get_size rc=0
0: PMI_KVS_Get_name_length_max rc=0
0: PMI_KVS_Get_value_length_max rc=0
0: PMI_KVS_Get_my_name ("lwj.0.0.14.pmi") rc=0
0: PMI_KVS_Get ("lwj.0.0.14.pmi", "PMI_process_mapping", "(vector,(0,1,1))") rc=0
0: PMI_Get_clique_size rc=0
0: PMI_Get_rank rc=0
0: PMI_Get_size rc=0
0: PMI_KVS_Get_name_length_max rc=0
0: PMI_KVS_Get_value_length_max rc=0
0: PMI_KVS_Get_my_name ("lwj.0.0.14.pmi") rc=0
0: PMI_KVS_Get ("lwj.0.0.14.pmi", "PMI_process_mapping", "(vector,(0,1,1))") rc=0
0: PMI_Get_clique_ranks rc=0
0: PMI_Get_universe_size rc=0
0: PMI_Get_size rc=0
0: PMI_Get_appnum rc=0
0: PMI_KVS_Get ("lwj.0.0.14.pmi", "14-0-key0") rc=4 invalid key argument
0: PMI_Finalize rc=0
[opal82:174246] *** Process received signal ***
[opal82:174246] Signal: Segmentation fault (11)


dongahn commented May 30, 2019

I am confused. One is dying in MPI_Init() and the other died after MPI_Finalize(). This is as if MPI_Init() was called twice.

@SteVwonder

PMI_Finalize, not MPI_Finalize 😄 Aren't acronyms with an edit distance of 1 great?


dongahn commented May 30, 2019

Ah... then your theory makes sense to me. LD_DEBUG might tell you whether dlclose had occurred on the libpmi DSO. Still, the question is why OpenMPI does this...

@SteVwonder

I don't know if this helps, but here is a subset of the output (most of the bindings removed, the init/fini calls kept) from running with LD_DEBUG=bindings:

    176970: calling init: /usr/lib64/flux/libpmi.so
    176970:
PMI_Init: PMI_FD is set, selecting simple_client
0: PMI_Init rc=0
0: PMI_KVS_Get_value_length_max rc=0
0: PMI_KVS_Get_name_length_max rc=0
0: PMI_KVS_Get_key_length_max rc=0
0: PMI_Get_rank rc=0
0: PMI_KVS_Get_my_name ("lwj.0.0.23.pmi") rc=0
0: PMI_Get_rank rc=0
0: PMI_Get_size rc=0
0: PMI_KVS_Get_name_length_max rc=0
0: PMI_KVS_Get_value_length_max rc=0
0: PMI_KVS_Get_my_name ("lwj.0.0.23.pmi") rc=0
0: PMI_KVS_Get ("lwj.0.0.23.pmi", "PMI_process_mapping", "(vector,(0,1,1))") rc=0
0: PMI_Get_clique_size rc=0
0: PMI_Get_rank rc=0
0: PMI_Get_size rc=0
0: PMI_KVS_Get_name_length_max rc=0
0: PMI_KVS_Get_value_length_max rc=0
0: PMI_KVS_Get_my_name ("lwj.0.0.23.pmi") rc=0
0: PMI_KVS_Get ("lwj.0.0.23.pmi", "PMI_process_mapping", "(vector,(0,1,1))") rc=0
0: PMI_Get_clique_ranks rc=0
0: PMI_Get_universe_size rc=0
0: PMI_Get_size rc=0
0: PMI_Get_appnum rc=0

<snip a bunch of OpenMPI mca shared libraries>

0: PMI_KVS_Get ("lwj.0.0.23.pmi", "23-0-key0") rc=4 invalid key argument
0: PMI_Finalize rc=0

<snip a bunch of opal_hwloc binding calls>

    176970:
    176970: calling fini: /usr/lib64/flux/libpmi.so [0]
    176970:
    176970:
 calling fini: /usr/lib64/libczmq.so.3 [0]
    176970:
    176970:
    176970: calling fini: /usr/lib64/libzmq.so.5 [0]
    176970:
    176970:
    176970: calling fini: /usr/lib64/libsodium.so.23 [0]
    176970:
    176970:
    176970: calling fini: /usr/lib64/libpgm-5.2.so.0 [0]
    176970:
    176970:

<snip more system and openmpi shared libraries> 

    176970: binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libmpi.so.40 [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libmpi.so.40 [0]: normal symbol `mca_pml_base_pml_selected'
    176970: binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libmpi.so.40 [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `mca_base_component_to_string'
    176970: binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_flux.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_pmix_base_commit_packed'
[opal82:176970] *** Process received signal ***
    176970: binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0] to /usr/lib64/libc.so.6 [0]: normal symbol `strsignal' [GLIBC_2.2.5]
[opal82:176970] Signal: Segmentation fault (11)
[opal82:176970] Signal code: Address not mapped (1)
[opal82:176970] Failing at address: 0x129
    176970: binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_backtrace_print'
    176970: binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0] to /usr/lib64/libc.so.6 [0]: normal symbol `backtrace' [GLIBC_2.2.5]
    176970: binding file /usr/lib64/libgcc_s.so.1 [0] to /usr/lib64/libgcc_s.so.1 [0]: normal symbol `_Unwind_Backtrace'
    176970: binding file /usr/lib64/libgcc_s.so.1 [0] to /usr/lib64/libgcc_s.so.1 [0]: normal symbol `_Unwind_GetIP'
    176970: binding file /usr/lib64/libgcc_s.so.1 [0] to /usr/lib64/libgcc_s.so.1 [0]: normal symbol `_Unwind_GetCFA'
[opal82:176970] [ 0]     176970:    binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0] to /usr/lib64/libc.so.6 [0]: normal symbol `backtrace_symbols_fd' [GLIBC_2.2.5]
/usr/lib64/libpthread.so.0(+0xf5d0)[0x2aaaaaff65d0]
[opal82:176970] [ 1] /lib64/ld-linux-x86-64.so.2(+0x95f7)[0x2aaaaaab45f7]
[opal82:176970] [ 2] /lib64/ld-linux-x86-64.so.2(+0x9fcf)[0x2aaaaaab4fcf]
[opal82:176970] [ 3] /usr/lib64/libc.so.6(+0x13cd49)[0x2aaaab33fd49]
[opal82:176970] [ 4] /usr/lib64/libdl.so.2(+0x1004)[0x2aaaabb99004]
[opal82:176970] [ 5] /lib64/ld-linux-x86-64.so.2(+0xf704)[0x2aaaaaaba704]
[opal82:176970] [ 6] /usr/lib64/libdl.so.2(+0x14ed)[0x2aaaabb994ed]
[opal82:176970] [ 7] /usr/lib64/libdl.so.2(dlsym+0x48)[0x2aaaabb99058]
[opal82:176970] [ 8] /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_flux.so(+0x2119)[0x2aaaad6e2119]
[opal82:176970] [ 9] /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40(opal_pmix_base_commit_packed+0x2c0)[0x2aaaab952ae0]
[opal82:176970] [10] /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_flux.so(+0x1fac)[0x2aaaad6e1fac]
[opal82:176970] [11] /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libmpi.so.40(ompi_mpi_init+0x6a8)[0x2aaaaad1bd78]
[opal82:176970] [12] /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libmpi.so.40(MPI_Init+0x5b)[0x2aaaaad4972b]
[opal82:176970] [13] ./mpi-hello-opal-ompi4[0x4007c8]
[opal82:176970] [14] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaab2253d5]
[opal82:176970] [15] ./mpi-hello-opal-ompi4[0x4006d9]
[opal82:176970] *** End of error message ***
    176970: binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_delay_abort'
    176970: binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0] to /usr/lib64/libc.so.6 [0]: normal symbol `signal' [GLIBC_2.2.5]
    176970: binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0] to /usr/lib64/libpthread.so.0 [0]: normal symbol `raise' [GLIBC_2.2.5]
wreckrun: task 0: exited with signal 11

It is worth noting that there is another set of PMI_Init and Finalize called earlier on in the trace:

   177371:     calling init: /usr/lib64/libpmi.so.0
    177371:
    177371:
    177371:     calling init: /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s1.so
    177371:
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s1.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s1.so [0]: normal symbol `mca_pmix_s1_component'
    177371:     binding file /usr/lib64/libpmi2.so.0 [0] to /usr/lib64/libc.so.6 [0]: normal symbol `__cxa_finalize' [GLIBC_2.2.5]
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_pmix_base_register_handler'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_pmix_base_deregister_handler'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0]: normal symbol `mca_pmix_s2_component'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_hwloc_topology'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_jobid_print'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_process_info'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_pmix_base'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_process_name_print'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_class_init_epoch'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_show_help'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_pmix_base_framework'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0]: normal symbol `opal_pmix_s2_module'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_uses_threads'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_value_t_class'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/lib64/libc.so.6 [0]: normal symbol `__cxa_finalize' [GLIBC_2.2.5]
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_object_t_class'
    177371:
    177371:     calling init: /usr/lib64/libpmi2.so.0
    177371:
    177371:
    177371:     calling init: /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so
    177371:
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0]: normal symbol `mca_pmix_s2_component'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_flux.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `mca_base_component_var_register'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_pmix3x.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `mca_base_component_var_register'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_pmix3x.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_pmix3x.so [0]: normal symbol `OPAL_MCA_PMIX3X_PMIx_Get_version'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_pmix3x.so [0] to /usr/lib64/libc.so.6 [0]: normal symbol `asprintf' [GLIBC_2.2.5]
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_pmix3x.so [0] to /usr/lib64/libc.so.6 [0]: normal symbol `free' [GLIBC_2.2.5]
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s1.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `mca_base_component_var_register'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `mca_base_component_var_register'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_ess_pmi.so [0] to /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/libopen-pal.so.40 [0]: normal symbol `opal_pmix_base_select'
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_flux.so [0] to /usr/lib64/libc.so.6 [0]: normal symbol `getenv' [GLIBC_2.2.5]
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_pmix3x.so [0] to /usr/lib64/libc.so.6 [0]: normal symbol `getenv' [GLIBC_2.2.5]
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s1.so [0] to /usr/lib64/libc.so.6 [0]: normal symbol `getenv' [GLIBC_2.2.5]
    177371:     binding file /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s2.so [0] to /usr/lib64/libc.so.6 [0]: normal symbol `getenv' [GLIBC_2.2.5]
    177371:
    177371:     calling fini: /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_isolated.so [0]
    177371:
    177371:
    177371:     calling fini: /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_pmix3x.so [0]
    177371:
    177371:
    177371:     calling fini: /usr/tce/packages/openmpi/openmpi-4.0.0-gcc-4.9.3/lib/openmpi/mca_pmix_s1.so [0]
    177371:
    177371:
    177371:     calling fini: /usr/lib64/libpmi.so.0 [0]


dongahn commented May 31, 2019

    176970: calling fini: /usr/lib64/flux/libpmi.so [0]

If the outputs are printed in chronological order, it appears the finalizer function of our libpmi.so was called before the crash. That seems to corroborate our theory: OpenMPI performs an illegal dlsym on the finalized, and thus invalid, DSO.

It is worth noting that there is another set of PMI_Init and Finalize called earlier on in the trace:

Maybe OpenMPI dlopens a series of PMIs (whatever is available in the library search path), registers them in its DSO table, and activates whichever is required given environment variables and such.

To debug this further, I would think we may need to install a debug version of OpenMPI and step through the source code.

My current debug support branch may be of help here.


dongahn commented May 31, 2019

@SteVwonder: we can try to debug this using the new debugger support branch tomorrow if you have some time: flux-framework/flux-core-v0.11#12 (comment). It is unlikely I will be able to spend time on this next week, as I will be on travel for a week.


damora commented May 31, 2019

@SteVwonder @dongahn I tried setting LD_DEBUG=all and then searched for PMI_KVS_Put:
grep PMI_KVS_Put ld_debug*
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=./hello [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/local/cuda-10.1/targets/ppc64le-linux/lib/libcudart.so.10.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/libmpi.so.40 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/libopen-rte.so.40 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/libopen-pal.so.40 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/librt.so.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libpthread.so.0 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libdl.so.2 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libstdc++.so.6 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libm.so.6 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libgcc_s.so.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libc.so.6 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/lib64/ld64.so.2 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libnuma.so.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libudev.so.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libutil.so.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libz.so.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libevent-2.0.so.5 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libevent_pthreads-2.0.so.5 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libcap.so.2 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libdw.so.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libattr.so.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libelf.so.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/liblzma.so.5 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/usr/lib64/libbz2.so.1 [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/openmpi/mca_shmem_mmap.so [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/openmpi/mca_reachable_weighted.so [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/openmpi/mca_schizo_flux.so [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/openmpi/mca_schizo_ompi.so [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/openmpi/mca_schizo_orte.so [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/openmpi/mca_schizo_slurm.so [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/openmpi/mca_ess_pmi.so [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/openmpi-3.1.3/gnu/lib/openmpi/mca_pmix_flux.so [0]
ld_debug.txt.180186: 180186: symbol=PMI_KVS_Put; lookup in file=/shared/flux/lib/flux/libpmi.so [0]
ld_debug.txt.180186: 180186: binding file /shared/flux/lib/flux/libpmi.so [0] to /shared/flux/lib/flux/libpmi.so [0]: normal symbol `PMI_KVS_Put'

It seems like it is looking in the right library


dongahn commented Jun 1, 2019

@damora: @SteVwonder can give more details, but he and I had a debugging session with the new parallel debugger support I added. We determined that this was caused by what appears to be a bug within OpenMPI. Essentially, they seemed to introduce a bug in an upper layer with respect to PMI init/fini reference counting such that Flux's PMI library was prematurely finalized and closed with dlclose(3). So the segfault within dlsym(3) is the side effect of passing an invalid DSO object into that function.
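
In dlopen/dlsym terms, the failure mode is roughly this (a sketch, not the actual OpenMPI code):

#include <dlfcn.h>
#include <stdio.h>

int main (void)
{
    void *dso = dlopen ("libpmi.so", RTLD_NOW | RTLD_GLOBAL);
    /* ... PMI_Init, KVS puts/gets ... */
    dlclose (dso);                          /* refcount bug: closed too early */
    void *sym = dlsym (dso, "PMI_KVS_Put"); /* stale handle: undefined behavior,
                                               here it crashes inside ld.so */
    printf ("%p\n", sym);
    return 0;
}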

The real trouble is that there are a number of OpenMPI releases around 4.0.0 that break Flux this way. So your workaround should be either to avoid these versions or to patch your local flux source code to make its PMI_Finalize() a no-op.

@SteVwonder has a good idea about which OpenMPI commit introduced this bug, so he may file an issue ticket with the OpenMPI repo.


damora commented Jun 1, 2019

@dongahn @SteVwonder I saw Steve's GitHub issue regarding this, but it seemed like he was using OpenMPI 4.x, whereas the error I'm seeing is with OpenMPI 3.1.3. I can easily patch my local flux source code, though, to see if that fixes it.


dongahn commented Jun 1, 2019

@damora: I don't remember in which version this bug was introduced, but I wouldn't be surprised if it goes all the way back to 3.1.3. Yes, make PMI_Finalize a no-op and see if that works around it.

@SteVwonder

don't remember in which version this bug was introduced, but I wouldn't be surprised if it goes all the way back to 3.1.3

Yeah. We believe this commit is the one that introduced the problem, which means OpenMPI 3.1.0+ are all affected.

@SteVwonder

Just posted an issue on the OpenMPI GitHub issue tracker. I also think I found the source of the reference counting error. After fixing it, OpenMPI now segfaults when finalizing. So it is at least progress towards a complete solution.


garlick commented Jun 2, 2019

Is there anything we need to fix in our pmi-1 library, or maybe something we can do to make it more robust with respect to how OpenMPI is using it?


dongahn commented Jun 2, 2019

@garlick: It wasn't clear if there was anything we could do in flux. As a workaround, I suggested @damora make our PMI_Finalize a no-op (i.e., comment out the inside of the function for now). I know, bad... but at least this may allow him to run MuMMI with the OpenMPI version he is using.

One of the things our PMI_Finalize does is close the PMI file descriptor. So if there is anything the server side does that assumes a close(PMI_FD) from the client, we could make it a bit more robust by not assuming that, for a hack like this. I don't think that is the case, though.


garlick commented Jun 2, 2019

That will leak some state on the server side (in wrexecd in 0.11). Not ideal but not the end of the world either. Maybe we could introduce an environment variable that turns PMI_Finalize() into a no-op and then add an "opt-in" MPI personality for ompi that sets it?
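
Roughly (a sketch only; the variable name and the body are hypothetical, not the actual libpmi code):

#include <stdlib.h>
#include "pmi.h"   /* for PMI_SUCCESS */

int PMI_Finalize (void)
{
    /* hypothetical opt-in guard, set e.g. by an ompi MPI personality */
    if (getenv ("FLUX_PMI_FINALIZE_NOOP"))
        return PMI_SUCCESS;
    /* ... normal teardown: close PMI_FD, free client state ... */
    return PMI_SUCCESS;
}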


dongahn commented Jun 2, 2019

Yes, I like that idea a lot @garlick! I would say, though, let's wait for @damora's testing results to make sure this is a valid workaround for a complex code like the one he is using.


damora commented Jun 3, 2019

@dongahn I actually reverted to ompi 3.0.4. I can run a single-task CUDA MPI app:

flux submit ./hello
rank:0 gpu:3 Hello World!
Job 1 status: complete
task0: exited with exit code 0

It segfaults for 2 MPI tasks, i.e. flux submit -n2 ./hello
[c699c011:47150:0] Caught signal 11 (Segmentation fault)
[c699c011:47149:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x00000000000745c4 mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:641
 3 0x0000000000074d04 mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:616
 4 0x000000000001731c _dl_init_internal()  :0
==== backtrace ====
 2 0x00000000000745c4 mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:641
 3 0x0000000000074d04 mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:616
 4 0x000000000001731c _dl_init_internal()  :0
 5 0x000000000001d83c dl_open_worker()  dl-open.c:0
 6 0x00000000000170d0 _dl_catch_error()  :0
 7 0x000000000001ca0c _dl_open()  :0
 8 0x0000000000001138 dlopen_doit()  dlopen.c:0
 9 0x00000000000170d0 _dl_catch_error()  :0
10 0x0000000000001c18 _dlerror_run()  :0
11 0x0000000000001238 __dlopen_check()  :0
12 0x0000000000029290 vm_open()  dlopen.c:0
13 0x000000000001f650 tryall_dlopen()  ltdl.c:0
14 0x0000000000022c84 try_dlopen()  ltdl.c:0
15 0x0000000000026258 lt_dlopenext()  :0
16 0x000000000000f044 open_component()  mca_base_component_find.c:0
17 0x0000000000010720 ocoms_mca_base_component_find()  ??:0
18 0x0000000000011b50 ocoms_mca_base_framework_components_open()  ??:0
19 0x00000000000c19fc hmca_rcache_base_framework_open()  rcache_base.c:0
20 0x000000000001dd58 ocoms_mca_base_framework_open()  ??:0
21 0x00000000000c1b54 hmca_rcache_base_open()  ??:0
22 0x00000000000560d4 hcoll_ml_open()  ??:0
23 0x00000000000b8e10 hcoll_init_with_opts()  ??:0
24 0x0000000000006f08 mca_coll_hcoll_comm_query()  /home/damora/ompi/build/ompi/mca/coll/hcoll/../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:302
25 0x000000000013c2bc query_2_0_0()  /home/damora/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_comm_select.c:407
26 0x000000000013c240 query()  /home/damora/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_comm_select.c:390
27 0x000000000013c0d4 check_one_component()  /home/damora/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_comm_select.c:352
28 0x000000000013beb0 check_components()  /home/damora/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_comm_select.c:302
29 0x000000000013892c mca_coll_base_comm_select()  /home/damora/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_comm_select.c:125
30 0x0000000000070024 ompi_mpi_init()  /home/damora/ompi/build/ompi/../../ompi/runtime/ompi_mpi_init.c:903
31 0x00000000000c7d58 PMPI_Init()  /home/damora/ompi/build/ompi/mpi/c/profile/pinit.c:66
32 0x000000001000114c main()  ??:0
33 0x0000000000025100 generic_start_main.isra.0()  libc-start.c:0
34 0x00000000000252f4 __libc_start_main()  ??:0
===================
 5 0x000000000001d83c dl_open_worker()  dl-open.c:0
 6 0x00000000000170d0 _dl_catch_error()  :0
 7 0x000000000001ca0c _dl_open()  :0
 8 0x0000000000001138 dlopen_doit()  dlopen.c:0
 9 0x00000000000170d0 _dl_catch_error()  :0
10 0x0000000000001c18 _dlerror_run()  :0
11 0x0000000000001238 __dlopen_check()  :0
12 0x0000000000029290 vm_open()  dlopen.c:0
13 0x000000000001f650 tryall_dlopen()  ltdl.c:0
14 0x0000000000022c84 try_dlopen()  ltdl.c:0
15 0x0000000000026258 lt_dlopenext()  :0
16 0x000000000000f044 open_component()  mca_base_component_find.c:0
17 0x0000000000010720 ocoms_mca_base_component_find()  ??:0
18 0x0000000000011b50 ocoms_mca_base_framework_components_open()  ??:0
19 0x00000000000c19fc hmca_rcache_base_framework_open()  rcache_base.c:0
20 0x000000000001dd58 ocoms_mca_base_framework_open()  ??:0
21 0x00000000000c1b54 hmca_rcache_base_open()  ??:0
22 0x00000000000560d4 hcoll_ml_open()  ??:0
23 0x00000000000b8e10 hcoll_init_with_opts()  ??:0
24 0x0000000000006f08 mca_coll_hcoll_comm_query()  /home/damora/ompi/build/ompi/mca/coll/hcoll/../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:302
25 0x000000000013c2bc query_2_0_0()  /home/damora/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_comm_select.c:407
26 0x000000000013c240 query()  /home/damora/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_comm_select.c:390
27 0x000000000013c0d4 check_one_component()  /home/damora/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_comm_select.c:352
28 0x000000000013beb0 check_components()  /home/damora/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_comm_select.c:302
29 0x000000000013892c mca_coll_base_comm_select()  /home/damora/ompi/build/ompi/mca/coll/../../../../ompi/mca/coll/base/coll_base_comm_select.c:125
30 0x0000000000070024 ompi_mpi_init()  /home/damora/ompi/build/ompi/../../ompi/runtime/ompi_mpi_init.c:903
31 0x00000000000c7d58 PMPI_Init()  /home/damora/ompi/build/ompi/mpi/c/profile/pinit.c:66
32 0x000000001000114c main()  ??:0
33 0x0000000000025100 generic_start_main.isra.0()  libc-start.c:0
34 0x00000000000252f4 __libc_start_main()  ??:0
===================
Job 2 status: complete
task[0-1]: exited with signal 11


dongahn commented Jun 3, 2019

@dongahn I actually reverted to ompi 3.0.4. I can run a single-task CUDA MPI app:

When @SteVwonder and I debugged this bug, we actually used a 1-process hello world. Ugh... great, another bug... Would this have anything to do with CUDA-aware MPI? How did you compile your code?

7 0x000000000001ca0c _dl_open() :0

Since it is dying within _dl_open(), could it be a DSO not being found? We may need another debugging session. I am on travel and it will be pretty difficult to do this debugging remotely, though not impossible. It would be great if we could collect as much evidence as possible.

BTW, if you still have 3.1.3 around, it may be good to try replicating the issue with Flux's PMI_Finalize() made a no-op.


dongahn commented Jun 3, 2019

I am on travel and it will be pretty difficult to do this debugging remotely, though not impossible. It would be great if we could collect as much evidence as possible.

Now that I think about it, this might be a good opportunity to evaluate whether our new tool support can handle remote debugging in any capacity. I don't think TotalView is there to enable this, but I can certainly evaluate it. Run flux proxy locally from my docker container and connect it to the flux instance running on an LC Linux system, then flux job-debug locally with TotalView... Should be fun.


dongahn commented Jun 11, 2019

#2170 (comment)

@damora: sorry. I never found a good time to look into this. I will give this priority now that I am back in my office.


dongahn commented Jun 11, 2019

@damora: OK, a couple of things. With your OMPI 3.0.4, could you set the following environment variable to see if it works around the issue: OMPI_MCA_coll_hcoll_enable=0. If that doesn't work, set the following variables as well:

OMPI_MCA_osc = "pt2pt"
OMPI_MCA_pml = "yalla"
OMPI_MCA_btl = "self"
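
For example, following the env pattern used above (a sketch; the binary name is illustrative):

flux wreckrun -n2 env OMPI_MCA_coll_hcoll_enable=0 OMPI_MCA_osc=pt2pt OMPI_MCA_pml=yalla OMPI_MCA_btl=self ./hello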

Now, for OpenMPI 3.1.3 and higher, it turns out the suggested workaround of making our PMI_Finalize() a no-op is not going to do the trick. (This has to be fixed at the OpenMPI level, unfortunately.)

So if you want to use these newer versions of OpenMPI, you will need the patches posted at open-mpi/ompi#6730 (both commits).

Let me know how this works.


damora commented Jun 14, 2019

I checked with these env vars set and still get the same error when running more than 1 MPI task


dongahn commented Jun 14, 2019

I checked with these env vars set and still get the same error when running more than 1 MPI task

OK. Time to do real debugging on our open Sierra systems then...


grondo commented Dec 14, 2021

This is against the older execution system, so closing.

@grondo grondo closed this as completed Dec 14, 2021