
flux PMI will not init spectrum MPI #1382

Closed · trws opened this issue Mar 23, 2018 · 54 comments

@trws (Member) commented Mar 23, 2018

See below:

scogland at sierra4358 in ~/expariments  (FLUX:local:///var/tmp/flux-PWuPz2)
$ flux wreckrun -n 4 ./a.out
2018-03-23T16:35:24.298395Z sched.err[0]: job 3 bad state transition from reserved to starting
2018-03-23T16:35:24.298414Z sched.err[0]: callback returns an error
2018-03-23T16:35:24.298431Z sched.err[0]: job_state_cb: failed to invoke callbacks
2018-03-23T16:35:24.360057Z sched.err[0]: job 3 bad state transition from reserved to running
2018-03-23T16:35:24.360075Z sched.err[0]: callback returns an error
2018-03-23T16:35:24.360092Z sched.err[0]: job_state_cb: failed to invoke callbacks
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    opal_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-opal-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    opal_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-opal-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-orte-runtime: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sierra4358:79265] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    opal_init:startup:internal-failure
2018-03-23T16:35:24.441870Z sched.err[0]: job 3 bad state transition from reserved to complete
2018-03-23T16:35:24.441888Z sched.err[0]: callback returns an error
2018-03-23T16:35:24.441904Z sched.err[0]: job_state_cb: failed to invoke callbacks
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-opal-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-orte-runtime: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sierra4358:79266] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    opal_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-opal-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-orte-runtime: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sierra4358:79267] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-orte-runtime: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:startup:internal-failure
But I couldn't open the help file:
    /__SMPI_build_dir____________________________________________/exports/optimized/share/spectrum_mpi/help-mpi-runtime.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sierra4358:79264] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
wreckrun: tasks [0-3]: exited with exit code 1

@grondo (Contributor) commented Mar 23, 2018

Not sure if this will help, but running with -o trace-pmi-server might give more information about what the flux PMI server is seeing (if anything).

@trws (Member, Author) commented Mar 23, 2018 via email

@dongahn (Member) commented Mar 23, 2018

2018-03-23T16:35:24.298395Z sched.err[0]: job 3 bad state transition from reserved to starting
2018-03-23T16:35:24.298414Z sched.err[0]: callback returns an error
2018-03-23T16:35:24.298431Z sched.err[0]: job_state_cb: failed to invoke callbacks
2018-03-23T16:35:24.360057Z sched.err[0]: job 3 bad state transition from reserved to running
2018-03-23T16:35:24.360075Z sched.err[0]: callback returns an error
2018-03-23T16:35:24.360092Z sched.err[0]: job_state_cb: failed to invoke callbacks

Just a side note: the state-transition handling needs to be beefed up for abnormal transitions like this.

@dongahn (Member) commented Mar 23, 2018

@trws: if you have specific questions, I can talk to the Spectrum MPI folks at IBM. I guess the main question is:

Does Spectrum MPI use PMI (or PMIx)? And do they have a recipe for making Spectrum MPI talk to another bootstrapper, like flux, that implements plain PMI?

@trws (Member, Author) commented Mar 23, 2018 via email

@dongahn (Member) commented Mar 23, 2018

Also, in theory we should be able to use OpenMPI; we can use a version of OMPI for which we've tested flux's support. I know we installed those on the EA systems, but I'm not sure whether we have them on the Sierra systems. Let me ask.

@dongahn (Member) commented Mar 23, 2018

Talked to Adam Moody; will start a separate email discussion thread.

@garlick (Member) commented Mar 23, 2018

It uses PMIX, which should work with us if they’re using a recent
enough version IIRC.

What does work is that Flux can be launched with PMIX's backwards-compatibility support for PMI-1.

If the MPI wants only PMIX, then Flux can't launch it. I think what we added to OMPI was support for the PMI-1 wire protocol, which we offer and which can be used without having to relink MPI against our PMI library. If this is a variant of OMPI, maybe that could be backported?

ompi_info output might be helpful, e.g.

$ /opt/openmpi/2.x-dev/bin/ompi_info|grep pmi
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v3.0.0)
                MCA pmix: pmix2x (MCA v2.1.0, API v2.0.0, Component v3.0.0)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v3.0.0)
                MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v3.0.0)
$ /opt/openmpi/2.x-dev/bin/ompi_info|grep flux
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v3.0.0)
                MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v3.0.0)

@garlick (Member) commented Mar 23, 2018

See #923 for more detailed info on OMPI's "flux support". I made pretty detailed commit comments in OMPI when this was added, in case anyone needs to dig into this.

Uh, looks like ralph squashed my whole PR down to one commit in the merge: open-mpi/ompi@215d629 and concatenated all my comments, so it reads a bit like a run-on.

@dongahn (Member) commented Mar 23, 2018

sierra4359{dahn}36: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2018.02.05/bin/ompi_info | grep pmi
                MCA pmix: ext2x (MCA v2.1.0, API v2.0.0, Component v10.2.0)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v10.2.0)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v10.2.0)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v10.2.0)

Maybe there is a runtime MCA option to turn on flux support... If not, building OpenMPI ourselves would be the path of least resistance. I don't believe we build Spectrum MPI on our own.
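
One way to check whether that flux component exposes a runtime knob (just a sketch; the exact ompi_info options vary by release):

$ /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2018.02.05/bin/ompi_info --param pmix flux --level 9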

@dongahn (Member) commented Mar 23, 2018

FYI -- From Roy Mussleman:

Hi Dong,

No, I have not ported OpenMPI to sierra.

The CORAL EA Power8 build process is on rzmanta at /usr/tcetmp/packages/openmpi/openmpi-2.0.2-gcc-4.8.5/src

Looks like Chris Earl was playing with openmpi-2.0.0-clang-3.9.1

Bob Walkup gave me this process ( patches and configure ) over a year ago before he had spectrum_mpi to work with.
He recently indicated there is a problem on sierra with compatibility between OpenMPI and jsrun.
"Spectrum MPI does not expose the pmix software that would be required to make openmpi work with the spectrum mpi jsrun"
So you may need to go the mpirun path in an interactive session. Maybe similar to what Adam provided for mvapich.

For Power9, you'll need to change the reference to power8 in the build.sh script.
It uses the mellanox collectives (mxm)

Would spack have a semi-official version?

@garlick (Member) commented Mar 23, 2018

Not sure if this is pertinent, but we did run into this problem with openmpi built on TOSS 3:

TOSS-3153 openmpi should not set rpath for libpmi.so

@trws (Member, Author) commented Mar 23, 2018

Here's a new fun detail: setting FLUX_JOB_SIZE and FLUX_JOB_NNODES kills the Spectrum MPI mpirun... ><

@trws (Member, Author) commented Mar 23, 2018

Looks like if FLUX_JOB_ID is set, it tries to do something that causes it to segfault.

@garlick (Member) commented Mar 23, 2018

heh: >< - good tomoticon!

@dongahn (Member) commented Mar 23, 2018

@trws: What happens if you select the flux component for the pmix framework?

mpirun -n 4 --mca pmix flux hello_world

I know ultimately you want to use flux wreckrun, but if pmix=flux actually activates flux support within Spectrum MPI, we can simply pass this MCA key-value pair... Maybe we are already doing this, though...
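
If that works, the same selection could presumably be passed through the environment under wreckrun, since OMPI also reads MCA parameters from OMPI_MCA_* variables (untested sketch):

$ OMPI_MCA_pmix=flux flux wreckrun -n 4 ./a.out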

@trws (Member, Author) commented Mar 23, 2018

I'll try that; it would be a less nasty solution.

@dongahn (Member) commented Mar 25, 2018

@trws: I know some of us will be busy with SC18 submissions and spring break for the next two weeks, so I think it would be good to summarize the issues we need to unblock and the "good to haves" for the splash effort.

I think I may be able to fit in emitting trimmed R for affinity and optimizing rdesc fetching using @grondo's experimental wreck. Anything else?

@trws (Member, Author) commented Mar 25, 2018 via email

@garlick (Member) commented Mar 30, 2018

Some notes on how ompi under flux is supposed to work, and a quick test on my desktop to ensure we haven't regressed anything.

First ensure that a hello world mpi program can be compiled with ompi and run under flux (yup):

$ /opt/openmpi/2.x-dev/ompi_info  --version
Open MPI v3.0.0a1
$ /opt/openmpi/2.x-dev/bin/mpicc -o hello.ompi hello.c
$ flux wreckrun -n 2 ./hello.ompi
0: completed MPI_Init in 0.058s.  There are 2 tasks
0: completed first barrier in 0.008s
0: completed MPI_Finalize in 0.007s

Exercise PMI client side debug (prove that ompi flux support opened flux PMI library)

$ FLUX_PMI_DEBUG=1 flux wreckrun  -n 2 ./hello.ompi
PMI_Init: PMI_FD is set, selecting simple_client
PMI_Init: PMI_FD is set, selecting simple_client
1: PMI_Init rc=0 
1: PMI_KVS_Get_value_length_max rc=0 
...
1: PMI_Barrier rc=0 
1: PMI_Finalize rc=0 
0: PMI_Barrier rc=0 
0: PMI_Finalize rc=0 
0: completed MPI_Finalize in 0.014s

Exercise PMI server side debug (prove that the flux PMI library connected to the PMI_FD provided by wrexecd):

$ flux wreckrun -o trace_pmi_server -n2 ./hello.ompi
1: C: cmd=init pmi_version=1 pmi_subversion=1
1: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
1: C: cmd=get_maxes
1: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
...
1: S: cmd=finalize_ack rc=0
0: C: cmd=finalize
0: S: cmd=finalize_ack rc=0
0: completed MPI_Finalize in 0.009s

The two ompi flux modules are: mca_pmix_flux.so and mca_schizo_flux.so.

mca_pmix_flux.so

mca_pmix_flux.so requires the following environment variables to be set.
Both are set for all tasks launched through wreckrun:

FLUX_PMI_LIBRARY_PATH
FLUX_JOB_ID

The module dlopens the flux PMI library named by FLUX_PMI_LIBRARY_PATH, then translates OMPI's generic PMI-ish calls into PMI-1 API calls supplied by our PMI library.
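
A quick sanity check that both are present in the task environment (just a sketch; the values will differ per instance and job):

$ flux wreckrun -n1 printenv FLUX_PMI_LIBRARY_PATH FLUX_JOB_ID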

lib/flux/libpmi.so

The flux PMI library tries the following, in descending priority:

  1. if PMI_FD environment var is set, talk PMI-1 wire protocol to wrexecd on it (set by wrexecd - this is what we want!)
  2. if PMIX_SERVER_URI is set, dlopen libpmix.so and redirect PMI calls to PMI-1 API there
  3. if PMI_LIBRARY is set, dlopen that library and redirect PMI calls to PMI-1 API there

Regardless of what the flux PMI library chooses to do here, FLUX_PMI_DEBUG should tell you that the flux PMI library was called, and if it is dlopening another PMI library, what it passed to dlopen.

mca_schizo_flux.so

I can't make heads or tails of the mca_schizo_flux.so component. It seems to be all boilerplate (provided by Ralph I think, and since he squashed all my and his commits together it's impossible to tell if I'm just forgetting) and is part of orte.

I seem to recall it needs to be there to ensure the other module runs, but no idea how.

@garlick (Member) commented Mar 30, 2018

@trws: I said I would find the runes for getting an ompi-linked MPI program to emit some debug.
This looks like it might be promising:

$ OMPI_MCA_pmix_base_verbose=255 flux wreckrun -n2 ./hello.ompi
[jimbo:05553] mca: base: components_register: registering framework pmix components
[jimbo:05553] mca: base: components_register: found loaded component isolated
[jimbo:05553] mca: base: components_register: component isolated has no register or open function
[jimbo:05553] mca: base: components_register: found loaded component pmix2x
[jimbo:05553] mca: base: components_register: component pmix2x register function successful
[jimbo:05553] mca: base: components_register: found loaded component flux
[jimbo:05553] mca: base: components_register: component flux register function successful
[jimbo:05553] mca: base: components_open: opening pmix components
[jimbo:05553] mca: base: components_open: found loaded component isolated
[jimbo:05553] mca: base: components_open: component isolated open function successful
[jimbo:05553] mca: base: components_open: found loaded component pmix2x
[jimbo:05553] mca: base: components_open: component pmix2x open function successful
[jimbo:05553] mca: base: components_open: found loaded component flux
[jimbo:05553] mca:base:select: Auto-selecting pmix components
[jimbo:05553] mca:base:select:( pmix) Querying component [isolated]
[jimbo:05553] mca:base:select:( pmix) Query of component [isolated] set priority to 0
[jimbo:05553] mca:base:select:( pmix) Querying component [pmix2x]
[jimbo:05553] mca:base:select:( pmix) Query of component [pmix2x] set priority to 5
[jimbo:05553] mca:base:select:( pmix) Querying component [flux]
[jimbo:05553] mca:base:select:( pmix) Query of component [flux] set priority to 20
[jimbo:05553] mca:base:select:( pmix) Selected component [flux]
[jimbo:05553] mca: base: close: component isolated closed
[jimbo:05553] mca: base: close: unloading component isolated
[jimbo:05553] mca: base: close: component pmix2x closed
[jimbo:05553] mca: base: close: unloading component pmix2x
[jimbo:05553] [[0,45],1] pmix:flux: assigned tmp name
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.lrank
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.lrank
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.nrank
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.nrank
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.max.size
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.max.size
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.job.size
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.job.size
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.appnum
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.appnum
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.local.size
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.local.size
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.num.nodes
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.tmpdir
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.nsdir
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.pdir
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.tdir.rmclean
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.ltopo
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],WILDCARD]
[jimbo:05553] [[0,45],1] pmix:flux put for key pmix.cpuset
[jimbo:05553] [[0,45],1] pmix:flux put for key opal.puri
[jimbo:05553] [[0,45],1] pmix:flux put for key pmix.hname
[jimbo:05553] [[0,45],1] pmix:flux put for key MPI_THREAD_LEVEL
[jimbo:05553] [[0,45],1] pmix:flux put for key btl.tcp.3.0
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.loc
# snip - output for other rank
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.loc
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.loc
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],0]
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.loc
[jimbo:05553] [[0,45],1] pmix:flux called get for key pmix.hname
[jimbo:05553] [[0,45],1] pmix:flux got key pmix.hname
[jimbo:05553] [[0,45],1] pmix:flux called get for key btl.tcp.3.0
[jimbo:05553] [[0,45],1] pmix:flux got key btl.tcp.3.0
[jimbo:05553] [[0,45],1] pmix:flux called get for key btl.tcp.3.0
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:hash:store storing data for proc [[0,45],1]
[jimbo:05553] [[0,45],1] pmix:flux got key btl.tcp.3.0
[jimbo:05553] mca: base: close: unloading component flux
0: completed MPI_Init in 0.099s.  There are 2 tasks
0: completed first barrier in 0.024s
0: completed MPI_Finalize in 0.022s

@trws (Member, Author) commented Apr 8, 2018

A new bit of info here: it's also true the other way around. The Spectrum MPI mpirun, orterun, and jsrun will not bootstrap flux either. I'm not sure whether this is an issue with something that changed in a newer PMIx or something particular to IBM's implementation of PMIx, but it may be worth looking into at some point, or at least it's a good reason to write something that can launch flux under LSF without requiring mpich...

@garlick (Member) commented Apr 8, 2018

When we do look at this, a good thing to try would be to set FLUX_PMI_DEBUG=1 in the environment and try to run a small job with the native launch tool(s). This will cause trace information from our PMI client (used by the broker) to go to stderr. (See example in earlier comment)
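
For example, something along these lines, assuming jsrun is the native launcher there (the exact jsrun resource-set arguments will differ; flux getattr size just confirms the instance wired up):

$ FLUX_PMI_DEBUG=1 jsrun -n 2 flux start flux getattr size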

So, can we launch flux as a batch job directly under LSF, or are there no native options for launching flux on that machine?

@trws (Member, Author) commented Apr 9, 2018

I'm not sure what blaunch would do on manta/ray, but on sierra jsrun is the official LSF-sanctioned launcher, so there is currently no native option for launching flux there. I think we could launch it with a config file using the native tools, but no PMI wireup can be expected at the moment.

garlick self-assigned this Apr 9, 2018

@garlick (Member) commented Apr 9, 2018

Just requested butte/sierra access and will try to debug issues with Flux launching spectrum MPI apps directly.

@garlick (Member) commented Apr 10, 2018

I'm on, and flux-core master builds fine (--disable-jobspec needed).

I did hit these make check failures

in t2000-wreck.t:

expecting success: 
	run_timeout 15 flux wreckrun -v -n$(($(nproc)*${SIZE}+1)) /bin/true

wreckrun: 0.011s: Registered jobid 21
wreckrun: 0.012s: State = reserved
wreckrun: 0.013s: job.submit: Function not implemented
wreckrun: Allocating 513 tasks across 4 available nodes..
wreckrun: tasks per node: node[0-2]: 129, node3: 126
wreckrun: 0.019s: Sending run event
2018-04-10T00:16:54.760719Z connector-local.err[2]: send kvs.lookup response to client 605C5: Broken pipe
2018-04-10T00:16:54.760325Z connector-local.err[1]: send kvs.lookup response to client E93A5: Broken pipe
wreckrun: Killed by SIGALRM: state = reserved
not ok 20 - wreckrun: oversubscription of tasks

In t0001-basic.t:

expecting success: 
    size=$(test_size_large)  &&
    test -n "$size" &&
    size=$(FLUX_TEST_SIZE_MAX=2 test_size_large) &&
    test "$size" = "2" &&
    size=$(FLUX_TEST_SIZE_MIN=123 FLUX_TEST_SIZE_MAX=1000 test_size_large) &&
    test "$size" = "123"

not ok 49 - builtin test_size_large () works
#	
#	    size=$(test_size_large)  &&
#	    test -n "$size" &&
#	    size=$(FLUX_TEST_SIZE_MAX=2 test_size_large) &&
#	    test "$size" = "2" &&
#	    size=$(FLUX_TEST_SIZE_MIN=123 FLUX_TEST_SIZE_MAX=1000 test_size_large) &&
#	    test "$size" = "123"
#	

Heading out, just wanted to document where I was in this investigation.

@trws (Member, Author) commented Apr 10, 2018

Spectrum mpirun and jsrun (IBM's launcher to go with LSF on these machines) are the native options. I managed to get flux to successfully launch a multi-node MPI job with Spectrum just now, but only by turning PAMI off. It looks like we'll have to enlist IBM to actually get a fix for this:

(sierrapysplash) splash:flux$ OMPI_MCA_osc=pt2pt OMPI_MCA_pml=yalla OMPI_MCA_btl=self MPI_ROOT=/opt/ibm/spectrum_mpi OPAL_LIBDIR=/opt/ibm/spectrum_mpi/lib flux wreckrun -N 4 env LD_LIBRARY_PATH=/opt/ibm/spectrum_mpi/lib/pami_port:/opt/ibm/spectrum_mpi/lib:/opt/ibm/spectrum_mpi/lib:/opt/mellanox/hcoll/lib OMPI_MCA_coll_hcoll_enable=0 bash -c 'ulimit -s 10240 ; env LD_PRELOAD=/opt/ibm/spectrum_mpi/lib/libpami_cudahook.so ~/flux-base/mpitest-spectrum '
Hello world from processor sierra1414, rank 0 out of 4 processors
Hello world from processor sierra3369, rank 1 out of 4 processors
Hello world from processor sierra1415, rank 2 out of 4 processors
Hello world from processor sierra1416, rank 3 out of 4 processors

@dongahn (Member) commented Apr 10, 2018

Spectrum mpirun and jsrun (IBM's launcher to go with LSF on these machines) are the native options. I managed to get flux to successfully launch a multi-node MPI job with Spectrum just now, but only by turning PAMI off. It looks like we'll have to enlist IBM to actually get a fix for this:

Great!

We need to involve Roy Mussleman to get this fixed by IBM ASAP. Do you want to come up to the 4th floor for a quick chat to strategize with Roy? I will give him a quick heads-up as well.

@trws (Member, Author) commented Apr 10, 2018 via email

@trws (Member, Author) commented Apr 10, 2018

The potentially deeper worry here is that flux doesn't seem to work for OpenMPI builds that rely on these kinds of paths in the environment. I'm not sure if there's anything we can, or should, do about that from our end though. @rhc54 is there anything we can tie into that would make the flux-end handling for OpenMPI environment (prefix/libdir/mpi_root) requirements a little more robust?

@dongahn (Member) commented Apr 10, 2018

I would like to, but I’m in Santa Clara at the moment. Will you be
around tomorrow?

I will be.

I just talked to Mussleman. He said the fastest route to get IBM's response would be to describe the problem in an email and send it to MPI/PAMI developers directly. And he has a couple of names. If there is a way to work around this in time, they are the ones who can provide the info or who can forward our inquiry.

@trws: can you send an email to Mussleman and copy me on? His email is [email protected].

@rhc54 commented Apr 10, 2018

@garlick Sorry the squash caused confusion. There has been some argument in the OMPI world about having a lot of "in-between" commits. Schizo just checks for markers of a particular environment (flux, in your case) and sets things up to ensure the right components get selected (in your case, the flux PMI one).

@trws I'm not sure there is a great solution for the problem. OMPI by itself seems to be okay in that regard, but Spectrum does some nasty things with the environment: the timing of the "schizo" framework's development didn't dovetail with their initial efforts, and so mpirun is now a wrapper that fiddles with things before calling the real mpirun. This is what causes the fragility, as far as we've heard from folks.

Jim's flux work should be just fine - I confess we don't track/test it, but nothing has changed in those areas of the code. I can try to advise as you run into things, if that would help.

@garlick (Member) commented Apr 10, 2018

Thanks for that clarification @rhc54!

Poking around in /opt/ibm/spectrum_mpi it does appear that bin/mpirun is a wrapper for bin/stock/mpirun. Maybe we can learn something about how this works from the wrapper code. Good hint!

@dongahn (Member) commented Apr 10, 2018

@garlick: if you have specific questions, feel free to involve me and Mussleman. We have contact info for some Spectrum MPI developers.

@trws (Member, Author) commented Apr 11, 2018

Thanks @rhc54; it looks like I had a bad build of OpenMPI that was making me think we needed a more general fix. Should we warn people to build with any particular options to make sure they get the right prefix by default, or is that all handled in schizo?

@rhc54 commented Apr 11, 2018

Assuming IBM doesn't interfere, you can configure OMPI with --enable-orterun-prefix-by-default and that should ensure things are always set.
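
For example, a configure line along these lines (the install prefix is just illustrative):

$ ./configure --prefix=/opt/openmpi/3.0.0 --enable-orterun-prefix-by-default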

@trws (Member, Author) commented Apr 11, 2018 via email

@dongahn (Member) commented Jun 26, 2018

OK, two things came out of the call with the IBM folks.

  1. It turned out the PAMI layer relies on PMIx, so flux (which supports only PMI) won't be able to bootstrap a Spectrum MPI job. IBM's recommendation was to support PMIx from within the flux instance.

  2. There are other environment variables that a Spectrum MPI job depends on. The easiest way to export all of them is to execute alias.pl, which is currently part of the JSM/Spectrum MPI installation. (The warning was that this can change in the future, so it is not fully future-proof.)

@dongahn (Member) commented Jun 27, 2018

Related issue #1555.

@garlick (Member) commented Jun 27, 2018

In #1555 @rhc54 said:

Given the problems, you may find it simpler to just add PMIx support to flux. MPICH now supports PMIx, so I'm not sure what you gain by sticking with the older libraries, and it would allow you to smoothly move between JSM and flux.

It's interesting that this PAMI layer (library?) is, I guess, independently bootstrapping itself through PMIx, as opposed to being implemented as a plugin to OpenMPI, where it would have access to OpenMPI's internal PMI-ish interfaces that work with multiple resource managers, including Flux. Probably we're not going to be able to change that, though.

I wonder if we can offer PMIX support in Flux by simply exporting a libpmix.so that implements the API, or if we'll have to implement PMIX's wire protocol, security, etc? I guess it depends on how libpami uses PMIX?

@dongahn (Member) commented Jun 27, 2018

@garlick:

my current thinking is:

  1. Using the libpmix.so that is bundled with Spectrum MPI will allow us to bootstrap a flux instance with jsrun. I think one can play with writing a minimalistic PMI wrapper for feasibility and to help the current MLSI push.

  2. Implementing our own PMIx support within flux will allow that flux instance to run Spectrum MPI jobs.

We probably don't want to rely on the PMIx server running on the node (the one used to launch flux) when launching MPI jobs within the flux instance, though.

@garlick (Member) commented Jun 27, 2018

There are certainly PAMI bits implemented as OpenMPI plugins, and none of them are using PMIx_ symbols directly.

$ find /opt/ibm/spectrum_mpi  -name \*pami\*.so
/opt/ibm/spectrum_mpi/profilesupport/lib/libmca_common_pamiopal.so
/opt/ibm/spectrum_mpi/profilesupport/lib/libmca_common_pami.so
/opt/ibm/spectrum_mpi/profilesupport/lib/libpami_cudahook.so
/opt/ibm/spectrum_mpi/lib/spectrum_mpi/mca_osc_pami.so
/opt/ibm/spectrum_mpi/lib/spectrum_mpi/mca_pml_pami.so
/opt/ibm/spectrum_mpi/lib/libmca_common_pamiopal.so
/opt/ibm/spectrum_mpi/lib/pami_433/libpami.so
/opt/ibm/spectrum_mpi/lib/pami_port_dt/libpami.so
/opt/ibm/spectrum_mpi/lib/pami_port_ftdt/libpami.so
/opt/ibm/spectrum_mpi/lib/libmca_common_pami.so
/opt/ibm/spectrum_mpi/lib/pami_port_ft/libpami.so
/opt/ibm/spectrum_mpi/lib/pami_port/libpami.so
/opt/ibm/spectrum_mpi/lib/pami_noib/libpami.so
/opt/ibm/spectrum_mpi/lib/mpicoll/libpami.so
/opt/ibm/spectrum_mpi/lib/libpami_cudahook.so
$ nm `find /opt/ibm/spectrum_mpi -name \*pami\*.so` |grep PMIx_
$

Not sure if that is the extent of the pami code though. Maybe another library is lurking somewhere else?

@dongahn (Member) commented Jun 27, 2018

Sent an email directly to Josh Hursey @IBM and cc'ed you.

@rhc54 commented Jun 27, 2018

Josh can help you better than I given his direct knowledge of the PAMI code. My understanding is that PAMI pulls all the PMIx data out of the local JSM daemon that hosts the PMIx server library, but I don't know what interfaces they use to do it. They might dlopen it, which is why it wouldn't show in a dependency listing.

One clarification just to ensure we are on the same page: there is no separate PMIx server running on the node. JSM's daemon acts as the PMIx server on each node (i.e., it calls PMIx server_init). However, I do agree that if you launch the flux instance, you would certainly want flux to handle the MPI wireup.

If there are concerns blocking your direct use of PMIx, we'd love to understand them and see if we can't resolve them. Ideally, we'd like to see flux hosting a PMIx server as there are increasingly more things being provided thru the PMIx library (e.g., comm cost matrix for scheduling, fabric topology, and storage directives).

@garlick (Member) commented Jun 27, 2018

If there are concerns blocking your direct use of PMIx, we'd love to understand them and see if we can't resolve them. Ideally, we'd like to see flux hosting a PMIx server as there are increasingly more things being provided thru the PMIx library (e.g., comm cost matrix for scheduling, fabric topology, and storage directives).

This was one hangup that made integrating the "reference server" code difficult for us: openpmix/openpmix#102

If the wire protocol is now nailed down and documented, we could maybe implement our own server.

@garlick (Member) commented Jul 11, 2018

Dropping the "in progress" label since I am not actively working on this.

Is there anything we should add to 0.10.0 to make this easier? We do have these lua scripts that provide some environment settings needed by various MPIs, but they are all loaded unconditionally. Would it make sense to provide a way to conditionally set them, e.g. so you could launch --with-mpi=spectrum or similar?
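
For example, usage could look something like this (the option name is only a sketch of the idea):

$ flux wreckrun -o mpi-spectrum -N 4 ./a.out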

@trws (Member, Author) commented Jul 11, 2018 via email

SteVwonder added a commit to SteVwonder/flux-core that referenced this issue Jul 20, 2018
Add mpi "personality" for IBM spectrum MPI, enabled by user with
wreckrun -o mpi-spectrum.  Note that this plugin assumes MPI is
installed in /opt/ibm/spectrum_mpi.  It also disables PAMI, the
spectrum enhanced collectives due to their dependency on the RM
providing a PMIx server. See flux-framework#1382 for further details.  It also sets
the soft stack limit to a value the MPI runtime seems to require.

See flux-framework#1382 for more details.

Fixes flux-framework#1584
SteVwonder added a commit to SteVwonder/flux-core that referenced this issue Jul 20, 2018
SteVwonder added a commit to SteVwonder/flux-core that referenced this issue Jul 23, 2018