MPAI broken with OpenMPI 3.0.1, pmix-1.2.5 and intel compiler #5260

Closed
knjmooney opened this issue Jun 11, 2018 · 12 comments

@knjmooney

knjmooney commented Jun 11, 2018

I've compiled OpenMPI v3.0.1 with the following configure line:

./configure CC=icc CXX=icpc FC=ifort --prefix=/path/to/mpi-dir --with-pmix=/path/to/pmix --with-libevent=external --with-hwloc=/path/to/hwloc/1.11.10

I've reduced the problem to the following:

 -> mpicc hello.c -o hello -g
 -> gdb --args mpirun -n 1 hello
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-51.el7
(gdb-starter) start
(gdb-starter) set MPIR_being_debugged=1
(gdb-starter) continue

With pmix-1.2.5, the process hangs and I see the following message: ORTE_ERROR_LOG: Not supported in file orted/pmix/pmix_server_gen.c at line 362. With pmix-2.1.1, the program runs to completion.
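
To tell the hang apart from a normal exit without watching the terminal, the run can be wrapped in a time limit. The following is a minimal sketch, not part of the original report: classify_run is a hypothetical helper name, and it assumes the coreutils timeout utility is available, which exits with status 124 when the limit is reached. The gdb invocation in the trailing comment is likewise an untested assumption about how the interactive session above might be scripted.

```shell
#!/bin/sh
# classify_run LIMIT CMD [ARGS...]
# Hypothetical helper: prints "hang" if CMD is still running after LIMIT
# seconds (coreutils timeout exits with 124 in that case), otherwise "exited".
classify_run() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo hang
    else
        echo exited
    fi
}

# Self-contained demonstration with stand-in commands:
classify_run 1 sleep 3   # still running after 1 s
classify_run 5 true      # returns immediately

# Assumed usage for the reproduction above (not run here):
#   classify_run 60 gdb -batch -ex start -ex 'set MPIR_being_debugged=1' \
#       -ex continue --args mpirun -n 1 hello
```

Note that "exited" covers both success and failure; the helper only separates a run that finishes from one that does not.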

I've attached bug_info.tar.gz, which contains the output of ompi_info --all and cat config.log for both builds of OpenMPI 3.0.1 (with pmix-1.2.5 and with pmix-2.1.1). It also contains hello.c.

I don't know whether the issue is specific to the Intel compiler, as I was unable to configure with an external PMIx using the GNU compiler.

@ggouaillardet
Contributor

Are you saying PMIx 1.2.5 does not work but PMIx 2.1.1 does?

@ggouaillardet
Contributor

Note that Open MPI 3.0.2 has been released, so I encourage you to upgrade to it in order to get the latest bug fixes.

@knjmooney
Author

knjmooney commented Jun 12, 2018 via email

@knjmooney
Author

knjmooney commented Jun 12, 2018

I've reproduced the above issue with OpenMPI 3.0.2 and 3.1.0 using PMIx 1.2.5.

I'd also like to add that I'm using Intel Compiler 18.2.

@rhc54
Contributor

rhc54 commented Sep 11, 2018

Related to #5501

@jsquyres
Member

Hey @rhc54 -- is this fixed in Open MPI 3.1.x HEAD (which has PMIx 2.1.3)?

@rhc54
Contributor

rhc54 commented Sep 14, 2018

I suspect it will work as the user indicated that OMPI 3.x works fine with PMIx v2.1.x - it is the older PMIx v1.2.x that isn't compatible. I doubt that anyone is going to fix that problem.

@jsquyres
Member

A few questions:

  1. Just curious: what is MPAI?
  2. @rhc54 Is the problem solely in PMIx? Put differently: is it possible for OMPI to abort/exit in this known-bad scenario?

Additionally: how do we document this kind of stuff for the user? (i.e., OMPI / external PMIx compatibility -- does that reduce to https://docs.google.com/spreadsheets/d/1SwkUEzbFb1TvKuwHzOnPgjkW6OhmGmt1yrH3YN2IUQw/edit#gid=497420864, and does that need to be communicated to the user community somehow?)

@knjmooney
Author

I can answer 1. MPAI is the MPIR Process Acquisition Interface. It allows tools to locate jobs by debugging the starter process. I think most people refer to it as MPIR, but at the time I was reading https://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf, which refers to it as MPAI.

@rhc54
Contributor

rhc54 commented Sep 20, 2018

For the second question: I suspect the MPIR connection is a red herring. PMIx has nothing to do with MPIR. If it is generally true that OMPI 3.x is broken with PMIx v1.2.5, then the only suggestion I can make is to simply change the configure logic to reject the older version - or for someone to fix the integration issue 😄
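
The "reject the older version" idea can be sketched as a simple version gate. This is only an illustration, not Open MPI's actual configure logic: pmix_version_ok is a hypothetical name, and the 2.x threshold is an assumption based solely on the reports in this thread (1.2.5 hangs, 2.1.1 runs to completion).

```shell
#!/bin/sh
# pmix_version_ok VERSION
# Hypothetical helper: returns 0 if the major version is 2 or newer,
# 1 if it is older, and 2 if the string does not start with a number.
# The threshold of 2 is an assumption taken from this thread only.
pmix_version_ok() {
    major=${1%%.*}                   # "1.2.5" -> "1"
    case "$major" in
        ''|*[!0-9]*) return 2 ;;     # empty or non-numeric: cannot decide
    esac
    [ "$major" -ge 2 ]
}

# Versions mentioned in this thread:
for v in 1.2.5 2.1.1 2.1.3; do
    if pmix_version_ok "$v"; then
        echo "PMIx $v: accepted"
    else
        echo "PMIx $v: rejected"
    fi
done
```

A real check would more likely inspect the installed PMIx headers at configure time than parse a string; the sketch only shows the accept/reject decision.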

@jsquyres
Member

jsquyres commented Oct 1, 2018

FWIW, DDT attaching to MPI processes seems to work fine for me at the OMPI git head, the v3.0.x HEAD, and the v4.0.x HEAD -- all using the embedded PMIx, that is. That's not quite the same as what @knjmooney is testing, but I figured I'd at least provide those data points.

@jsquyres
Member

This issue is stale. AFAIK, this has been fixed on the head of the v3.0.x, v3.1.x, and v4.0.x branches.
