MPI Connect/accept broken except when from within a single mpirun #3458

Closed · rhc54 opened this issue May 5, 2017 · 27 comments

rhc54 (Contributor) commented May 5, 2017

Background information

Multiple users have reported that MPI connect/accept no longer works when executed between two applications started by separate command lines. This includes both passing the "port" on the command line and using ompi-server as the go-between.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Sadly, this goes back to the 2.x series and continues through 3.x to master.

Details of the problem

When we switched to PMIx for our wireup, the "port" no longer represents a typical TCP URI. Instead, it contains the info PMIx needs for a publish/lookup rendezvous. Fixing the problem requires a little thought, as application procs no longer have access to the OOB, and we'd rather not revert to giving them that access.

@rhc54 rhc54 added this to the v3.0.0 milestone May 5, 2017
@rhc54 rhc54 self-assigned this May 5, 2017
rhc54 (Contributor, Author) commented May 29, 2017

Fixed in master by checking for ompi-server presence (if launched by mpirun), or availability of publish/lookup support if direct launched, and outputting a friendly show-help message if not. Operation of ompi-server was also repaired for v3.0.

Backports to the 2.x series are not planned.

@rhc54 rhc54 closed this as completed May 29, 2017
tjb900 commented Aug 17, 2017

(apologies in advance if I should have opened a new issue instead)

@rhc54 Thanks very much for looking into this - I was one of the ones hoping to use this feature. Unfortunately it still seems to be giving an error (though a different one this time):

[host:20393] [[15787,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file runtime/orte_data_server.c at line 433
[host:20406] [[15789,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file dpm/dpm.c at line 401
[host:20393] [[15787,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file runtime/orte_data_server.c at line 433
[host:20417] [[15800,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file dpm/dpm.c at line 401

I've attached a relatively simple reproducer, which for me gives the above errors on today's master (1f799af).

test3.zip

A different test, using connect/accept within a single mpirun instance, still works fine.
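
For reference, since the attachment itself isn't reproduced here, below is a minimal sketch of the cross-mpirun connect/accept pattern such a reproducer exercises. The file names and run commands are illustrative and not taken from test3.zip: one executable opens and prints a port, the other is started by a separate mpirun and receives that port string on its command line.

/* accept_side.c -- run as: mpirun -n 1 ./accept_side */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);   /* with PMIx, this string is no longer a plain TCP URI */
    printf("%s\n", port);                 /* hand this string to the other mpirun */
    fflush(stdout);
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);  /* blocks until the other job connects */
    MPI_Comm_disconnect(&inter);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

/* connect_side.c -- run as: mpirun -n 1 ./connect_side "<port string>" */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm inter;
    MPI_Init(&argc, &argv);
    MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}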

rhc54 (Contributor, Author) commented Aug 17, 2017

I've gone back and looked at where this stands, and found that I had fixed ompi-server, but there was still some work left to resolve the cross-mpirun connect issue. I've taken it as far as I have time for right now and will commit those changes. However, it won't fully fix the problem, and so we won't port it to a release branch.

There remains an issue over how the callbacks are flowing at the end of the connect operation. An object is apparently being released at an incorrect time.

I'm not sure who will be picking this up. Sorry I can't be of more help.

@rhc54 rhc54 reopened this Aug 17, 2017
@bwbarrett bwbarrett modified the milestones: v3.0.1, v3.0.0 Sep 12, 2017
@rhc54 rhc54 removed their assignment Dec 12, 2017
@bwbarrett bwbarrett modified the milestones: v3.0.1, v3.0.2 Mar 1, 2018
rhc54 (Contributor, Author) commented May 22, 2018

@hppritcha Just a reminder - this is still hanging around.

rhc54 (Contributor, Author) commented May 22, 2018

FWIW: the connect/disconnect support was never implemented in ORTE for the v2.x series

@bwbarrett bwbarrett modified the milestones: v3.0.2, v2.1.4, v3.0.3 Jun 1, 2018
@rhc54 rhc54 added the RTE Issue likely is in RTE or PMIx areas label Jun 26, 2018
@bwbarrett bwbarrett removed this from the v3.0.3 milestone Sep 18, 2018
derangedhk417 commented Oct 1, 2018

Perhaps this is a stupid question, as I am not very familiar with GitHub. Is this actively being worked on? I am running into this problem as well.

rhc54 (Contributor, Author) commented Oct 2, 2018

Not at the moment. It is considered a low priority, I'm afraid, and we don't have anyone focused on it.

derangedhk417 commented Oct 8, 2018

Thanks for the response. If anyone reading this is interested, I have written a reusable workaround for this problem. If anyone shows interest, I'll clean it up and put it in a public repo. (It's not a source modification; it's a separate .h file.)

Summerdave commented
Yes, I am interested. We had to disable some functionality when running on OpenMPI because of this. How did you get around it?

derangedhk417 commented Oct 11, 2018

I have the code up here: https://github.com/derangedhk417/mpi_controller. It's just a basic wrapper around some POSIX shared memory functions, using semaphores to handle synchronization between the controller and the child. I haven't exactly made this super user friendly, but it should do the trick. I'll try to add some documentation in the next few hours.

Notes:

  • You need mpirun to be in your PATH for it to work.
  • All communication is blocking. There is no message queuing.
  • Feel free to modify this and make it better as you see fit.
  • I strongly recommend that you read through and understand the code before you use it. This was written hastily and probably has bugs.
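
For readers who just want the general shape of such a workaround, here is a minimal sketch of the controller side: a POSIX shared memory segment for the payload plus two named semaphores for a blocking handoff. This is an illustration only, with made-up names and sizes, and is not code from the mpi_controller repository.

/* shm_controller.c -- illustrative controller side only; the child (started
 * separately, e.g. via mpirun) would open the same names, sem_wait(req),
 * read the buffer, then sem_post(ack). Build with: cc shm_controller.c -lrt -lpthread */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SHM_NAME "/mpi_ctrl_shm"   /* made-up names, not from mpi_controller  */
#define SEM_REQ  "/mpi_ctrl_req"   /* controller -> child: "message ready"    */
#define SEM_ACK  "/mpi_ctrl_ack"   /* child -> controller: "message consumed" */
#define MSG_SIZE 4096

int main(void)
{
    /* Create the shared segment and map it. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, MSG_SIZE) != 0) return 1;
    char *buf = mmap(NULL, MSG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) return 1;

    /* Two named semaphores provide the blocking synchronization described above. */
    sem_t *req = sem_open(SEM_REQ, O_CREAT, 0600, 0);
    sem_t *ack = sem_open(SEM_ACK, O_CREAT, 0600, 0);
    if (req == SEM_FAILED || ack == SEM_FAILED) return 1;

    snprintf(buf, MSG_SIZE, "hello from the controller");
    sem_post(req);   /* wake the child                     */
    sem_wait(ack);   /* block until the child has read it  */

    munmap(buf, MSG_SIZE);
    close(fd);
    sem_close(req); sem_close(ack);
    shm_unlink(SHM_NAME); sem_unlink(SEM_REQ); sem_unlink(SEM_ACK);
    return 0;
}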

rhc54 (Contributor, Author) commented Oct 18, 2018

Another option was brought to my attention today. If you know that one of the mpirun executions will always be running, then you can point the other mpiruns to it as the "ompi-server", like this:

$ mpirun -n 3 --report-uri myuri.txt myapp &
$ mpirun -n 2 --ompi-server file:myuri.txt myotherapp

This makes the first mpirun act as the global server. I'm not sure it will solve the problem, but it might be worth trying.

nelsonspbr commented
@rhc54 This hasn't worked for me, unfortunately :(

Have there been any updates on this?

rhc54 (Contributor, Author) commented Feb 7, 2019

Not really - the developer community judged it not worth fixing, and so it has sat idle. Based on current plans, it will be fixed in this year's v5.0 release, but not likely before then.

Note that you can optionally execute your OMPI job against the PMIx Reference RTE (PRRTE). I believe this is working in that environment. See https://pmix.org/support/how-to/running-apps-under-psrvr/ for info.

jrhemstad commented Feb 25, 2019

@rhc54 I wanted to let you know that support for these APIs is important to us in Dask. See dask/dask-mpi#25

Our use case is that we need a way to create MPI processes from already-existing processes (without launching a new process) and build up a communicator among those processes.

maddyscientist commented
This issue is also a blocker for our use of OpenMPI with our MPI job manager (mpi_jm), which we use to increase job utilization on large supercomputers for sub-nuclear physics simulations (https://arxiv.org/pdf/1810.01609.pdf). This has forced us to use MVAPICH, which results in reduced performance compared to OpenMPI (or Spectrum MPI), but correctness is godliness in comparison.

(We here being CalLat, a collaboration of physicists centred at LLNL and LBNL, using Summit, Sierra, Titan, etc.)

rhc54 (Contributor, Author) commented Feb 25, 2019

Okay, you've convinced me - I'll free up some time this week and fix it. Not sure when it will be released, however, so please be patient.

rhc54 (Contributor, Author) commented Feb 27, 2019

Okay, you guys - the fix is here: #6439

Once it gets through CI I'll post a PR to backport it to the release branches.

datametrician commented, quoting rhc54's earlier comment:

> Okay, you've convinced me - I'll free up some time this week and fix it. Not sure when it will be released, however, so please be patient.

I can't thank you enough for this! Thank you thank you thank you!

gpaulsen (Member) commented

@rhc54 Can this issue be closed?

q2luo commented Oct 10, 2019

MPI_Comm_connect/MPI_Comm_accept in 4.0.2 still do not work except from within a single mpirun. We're stuck at 1.6.5 and cannot upgrade to any of the latest Open MPI releases. Please help fix this.

The error message from the slave process using 4.0.2 MPI_Comm_connect:


The user has called an operation involving MPI_Comm_connect and/or MPI_Accept
that spans multiple invocations of mpirun. This requires the support of
the ompi-server tool, which must be executing somewhere that can be
accessed by all participants.

Please ensure the tool is running, and provide each mpirun with the MCA
parameter "pmix_server_uri" pointing to it.


Your application has invoked an MPI function that is not supported in
this environment.

MPI function: MPI_Comm_connect
Reason: Underlying runtime environment does not support accept/connect functionality

[sjoq49:426944] *** An error occurred in MPI_Comm_connect
[sjoq49:426944] *** reported by process [3149791233,47184510713856]
[sjoq49:426944] *** on communicator MPI_COMM_WORLD
[sjoq49:426944] *** MPI_ERR_INTERN: internal error
[sjoq49:426944] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sjoq49:426944] *** and potentially your MPI job)

The error message from the master using 4.0.2 MPI_Comm_accept:


A request has timed out and will therefore fail:

Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.

[sjoq64:88026] *** An error occurred in MPI_Comm_accept
[sjoq64:88026] *** reported by process [1219035137,0]
[sjoq64:88026] *** on communicator MPI_COMM_WORLD
[sjoq64:88026] *** MPI_ERR_UNKNOWN: unknown error
[sjoq64:88026] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sjoq64:88026] *** and potentially your MPI job)

rhc54 (Contributor, Author) commented Nov 14, 2019

@q2luo I'm not sure how to respond to your request. The error message you show indicates that the mpirun starting the slave process was not given the URI of the ompi-server. Cross-mpirun operations require the support of ompi-server as a rendezvous point.

You might want to try it again, ensuring you follow the required steps. If that doesn't work, please post exactly what you did to encounter the problem.
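
For reference, the basic recipe is to run a standalone ompi-server that every host can reach over TCP, have it write its URI to a shared file, and point every mpirun at that file. The path and executable names below are placeholders, and option spellings may vary by version, so check ompi-server --help and mpirun --help on your install:

$ ompi-server --report-uri /shared/ompi-server.uri &
$ mpirun -n 1 --ompi-server file:/shared/ompi-server.uri ./accept_side
$ mpirun -n 1 --ompi-server file:/shared/ompi-server.uri ./connect_side "<port string>"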

q2luo commented Nov 14, 2019

@rhc54 I have been paying attention to the threads related to this issue since 2015. I have tried many different versions of Open MPI releases; the last working release is 1.6.5, and all releases 1.7.1 or higher have the same problem. I also tried pointing to ompi-server with the URI, but without success.

Your May 5, 2017 description at the beginning of this thread describes the issue very well. In fact, the Open MPI release "list of changes" file also documents it as a known issue in the 3.0 section:

" -MPI_Connect/accept between applications started by different mpirun
commands will fail, even if ompi-server is running."

We use OpenMPI in the following way; the example below assumes using 8 hosts from LSF:

  1. Issue 8 individual LSF "bsub" commands to acquire 8 hosts with a specified amount of resources; each host runs our program (Linux based) with OpenMPI enabled.

  2. The program on each host calls MPI_Comm_connect() and MPI_Comm_accept(), then MPI_Intercomm_merge() and MPI_Comm_rank() after the accept succeeds. The goal is to connect all 8 MPI applications into 1 MPI world.

The above 2 steps realize the same goal as "mpirun -n 8". "mpirun -n 8" works fine for all OpenMPI releases, but the semiconductor industry doesn't allow this usage due to IT policies.
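
As an illustration of step 2 only (this is a rough sketch, not the actual application, and it assumes the port string from the accepting side is passed on the command line):

/* merge_demo.c -- a single accept/connect/merge round; the real application
 * would repeat this to fold each new arrival into the growing communicator. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter, merged;
    int rank;
    int is_acceptor = (argc > 1 && strcmp(argv[1], "accept") == 0);

    MPI_Init(&argc, &argv);

    if (is_acceptor) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("%s\n", port);             /* give this string to the connecting job */
        fflush(stdout);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    } else {
        MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    /* Collapse the intercommunicator into one intracommunicator; the second
     * argument only controls the ordering of the two groups. */
    MPI_Intercomm_merge(inter, is_acceptor ? 0 : 1, &merged);
    MPI_Comm_rank(merged, &rank);
    printf("rank %d in merged communicator\n", rank);

    MPI_Comm_disconnect(&inter);
    MPI_Comm_free(&merged);
    MPI_Finalize();
    return 0;
}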

Thanks and regards.

rhc54 (Contributor, Author) commented Nov 14, 2019

Look, I'm happy to help, but you have to provide enough information so I can do so. I need to know how you are actually starting all these programs. Do you have ompi-server running somewhere that all the hosts can reach over TCP? What was the command line used to start the programs on each host?

I don't know who you mean by "semiconductor industry", but I know of at least one company in that collective that doesn't have this issue 😄 This appears to be a pretty extreme use case, so it isn't surprising that it might uncover some problems.

q2luo commented Nov 14, 2019

Each application is started with "mpirun -n 1" on a host acquired by LSF. I tried in-house starting an ompi-server and having each individual mpirun point to it, but connect/accept still fails. On the other hand, even if it worked, it would be impractical to use because it requires company IT to start and maintain a central ompi-server.

Yes, all hosts can reach each other over TCP, because the SSH-based approach via "mpirun -n 8" works, and 1 LSF bsub command with "mpirun -n 8" also works.

rhc54 (Contributor, Author) commented Nov 14, 2019

So let me summarize. You tried doing this with a central ompi-server in a lab setup and it didn't work. Regardless, you cannot use a central ompi-server in your production environment.

I can take a look to ensure that ompi-server is still working in v4.0.2. However, without ompi-server, there is no way this configuration can work on your production system. The very old v1.6 series certainly would work, but it involves a runtime that doesn't scale to today's cluster sizes, so going back to that approach isn't an option.

On the positive side, you might get IBM to add PMIx integration to LSF, in which case you won't need ompi-server any more. Might be your best bet.

q2luo commented Nov 14, 2019
q2luo commented Nov 14, 2019

@rhc54 Thanks for your explanation. Even if adding PMIx to the LSF/MPI hook works, the same problem will still be faced with RTDA, SGE/UGE, and other grids.

The v1.6 series has some serious issues, such as memory corruption and network interface card recognition. All those issues are fixed in the latest 3.x and 4.x releases based on my testing. Our application normally needs up to 256 hosts, each with physical memory of at least 512GB and up to 3TB. It's normally impossible to acquire 64 such big-memory machines instantly, so "mpirun -n 64" will almost never succeed (unless IT sets aside 64 hosts dedicated for one job/person to use). Instead, 64 hosts are normally obtained sequentially by 64 independent grid commands, and the time to acquire all these machines can span from minutes to hours.

I wonder how connect/accept works in the default "mpirun -n 32" mode: is ompi-server not used, or is the public connect/accept API not used in this default mode?

Thanks and regards.

rhc54 (Contributor, Author) commented Nov 15, 2019

The closest MPI comes to really supporting your "rolling start" use case is the MPI Sessions work proposed for v4 of the standard. In the meantime, what I would do is:

  • Submit an initial job request for just one node. When I get that node, I would start ompi-server on it plus my initial mpirun for that node. I would have ompi-server either report its contact info to a file on a network location, or capture its URI from stdout.
  • I would then have the script submit the request for the remaining hosts and include the ompi-server URI information in the cmd line to be executed on those hosts.

This will allow proper wireup of your connect/accept logic. From your description of your scheduler, it shouldn't cause you any additional delays in getting the desired resources. You might even get your IT folks to set up a "high priority" queue for the secondary submission.
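
A rough sketch of that two-step submission (queue options, paths, file names, and the application name are placeholders; adjust for your LSF setup):

# Step 1: one node -- start ompi-server (writing its URI to a shared filesystem)
# and the first mpirun pointed at it.
$ bsub -n 1 "ompi-server --report-uri /shared/ompi.uri & mpirun -n 1 --ompi-server file:/shared/ompi.uri ./myapp"

# Step 2: once step 1 is running, submit the remaining hosts, each pointing its
# own mpirun at the same URI file.
$ for i in $(seq 2 8); do bsub -n 1 "mpirun -n 1 --ompi-server file:/shared/ompi.uri ./myapp"; done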
