MPI Connect/accept broken except when from within a single mpirun #3458

Closed · rhc54 opened this issue May 5, 2017 · 27 comments

rhc54 (Contributor) commented May 5, 2017

Background information

Multiple users have reported that MPI connect/accept no longer works when executed between two applications started by separate command lines. This includes both passing the "port" on the command line and using ompi-server as the go-between.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Sadly, this goes back to the 2.x series and continues through 3.x to master.

Details of the problem

When we switched to PMIx for our wireup, the "port" no longer represents a typical TCP URI. Instead, it contains the info PMIx needs for a publish/lookup rendezvous. Fixing the problem requires a little thought, as application procs no longer have access to the OOB, and we'd rather not revert to giving them that access.

@rhc54 rhc54 added this to the v3.0.0 milestone May 5, 2017
@rhc54 rhc54 self-assigned this May 5, 2017
rhc54 (Contributor, Author) commented May 29, 2017

Fixed in master by checking for ompi-server presence (if launched by mpirun), or availability of publish/lookup support if direct launched, and outputting a friendly show-help message if not. Operation of ompi-server was also repaired for v3.0.

Backports to the 2.x series are not planned.

@rhc54 rhc54 closed this as completed May 29, 2017
tjb900 commented Aug 17, 2017

(apologies in advance if I should have opened a new issue instead)

@rhc54 Thanks very much for looking into this - I was one of the ones hoping to use this feature. Unfortunately it still seems to be giving an error (though a different one this time):

[host:20393] [[15787,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file runtime/orte_data_server.c at line 433
[host:20406] [[15789,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file dpm/dpm.c at line 401
[host:20393] [[15787,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file runtime/orte_data_server.c at line 433
[host:20417] [[15800,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file dpm/dpm.c at line 401

I've attached a relatively simple reproducer, which for me gives the above errors on today's master (1f799af).

test3.zip

A different test, using connect/accept within a single mpirun instance, still works fine.
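
For reference, since the attachment itself isn't reproduced here, below is a minimal sketch of the cross-mpirun connect/accept pattern such a reproducer exercises. The file names and run commands are illustrative and not taken from test3.zip: one executable opens and prints a port, the other is started by a separate mpirun and receives that port string on its command line.

/* accept_side.c -- run as: mpirun -n 1 ./accept_side */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);   /* with PMIx, this string is no longer a plain TCP URI */
    printf("%s\n", port);                 /* hand this string to the other mpirun */
    fflush(stdout);
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);  /* blocks until the other job connects */
    MPI_Comm_disconnect(&inter);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

/* connect_side.c -- run as: mpirun -n 1 ./connect_side "<port string>" */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm inter;
    MPI_Init(&argc, &argv);
    MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}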

rhc54 (Contributor, Author) commented Aug 17, 2017

I've gone back and looked at where this stands, and found that I had fixed ompi-server, but there was still some work left to resolve the cross-mpirun connect issue. I've taken it as far as I have time for right now and will commit those changes. However, it won't fully fix the problem, and so we won't port it to a release branch.

There remains an issue over how the callbacks are flowing at the end of the connect operation. An object is apparently being released at an incorrect time.

I'm not sure who will be picking this up. Sorry I can't be of more help.

@rhc54 rhc54 reopened this Aug 17, 2017
@bwbarrett bwbarrett modified the milestones: v3.0.1, v3.0.0 Sep 12, 2017
@rhc54 rhc54 removed their assignment Dec 12, 2017
@bwbarrett bwbarrett modified the milestones: v3.0.1, v3.0.2 Mar 1, 2018
rhc54 (Contributor, Author) commented May 22, 2018

@hppritcha Just a reminder - this is still hanging around.

rhc54 (Contributor, Author) commented May 22, 2018

FWIW: the connect/disconnect support was never implemented in ORTE for the v2.x series

@bwbarrett bwbarrett modified the milestones: v3.0.2, v2.1.4, v3.0.3 Jun 1, 2018
@rhc54 rhc54 added the RTE Issue likely is in RTE or PMIx areas label Jun 26, 2018
@bwbarrett bwbarrett removed this from the v3.0.3 milestone Sep 18, 2018
derangedhk417 commented Oct 1, 2018

Perhaps this is a stupid question, as I am not very familiar with GitHub. Is this actively being worked on? I am running into this problem as well.

rhc54 (Contributor, Author) commented Oct 2, 2018

Not at the moment. It is considered a low priority, I'm afraid, and we don't have anyone focused on it.

derangedhk417 commented Oct 8, 2018

Thanks for the response. If anyone reading this is interested, I have written a reusable workaround for this problem. If anyone shows interest, I'll clean it up and put it in a public repo. (It's not a source modification; it's a separate .h file.)

Summerdave commented
Yes, I am interested. We had to disable some functionality when running on OpenMPI because of this. How did you get around it?

derangedhk417 commented Oct 11, 2018

I have the code up here: https://github.com/derangedhk417/mpi_controller. It's just a basic wrapper around some POSIX shared memory functions, using semaphores to handle synchronization between the controller and the child. I haven't exactly made this super user friendly, but it should do the trick. I'll try to add some documentation in the next few hours.

Notes:

  • You need mpirun to be in your PATH for it to work.
  • All communication is blocking. There is no message queuing.
  • Feel free to modify this and make it better as you see fit.
  • I strongly recommend that you read through and understand the code before you use it. This was written hastily and probably has bugs.
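
For readers who just want the general shape of such a workaround, here is a minimal sketch of the controller side: a POSIX shared memory segment for the payload plus two named semaphores for a blocking handoff. This is an illustration only, with made-up names and sizes, and is not code from the mpi_controller repository.

/* shm_controller.c -- illustrative controller side only; the child (started
 * separately, e.g. via mpirun) would open the same names, sem_wait(req),
 * read the buffer, then sem_post(ack). Build with: cc shm_controller.c -lrt -lpthread */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SHM_NAME "/mpi_ctrl_shm"   /* made-up names, not from mpi_controller  */
#define SEM_REQ  "/mpi_ctrl_req"   /* controller -> child: "message ready"    */
#define SEM_ACK  "/mpi_ctrl_ack"   /* child -> controller: "message consumed" */
#define MSG_SIZE 4096

int main(void)
{
    /* Create the shared segment and map it. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, MSG_SIZE) != 0) return 1;
    char *buf = mmap(NULL, MSG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) return 1;

    /* Two named semaphores provide the blocking synchronization described above. */
    sem_t *req = sem_open(SEM_REQ, O_CREAT, 0600, 0);
    sem_t *ack = sem_open(SEM_ACK, O_CREAT, 0600, 0);
    if (req == SEM_FAILED || ack == SEM_FAILED) return 1;

    snprintf(buf, MSG_SIZE, "hello from the controller");
    sem_post(req);   /* wake the child                     */
    sem_wait(ack);   /* block until the child has read it  */

    munmap(buf, MSG_SIZE);
    close(fd);
    sem_close(req); sem_close(ack);
    shm_unlink(SHM_NAME); sem_unlink(SEM_REQ); sem_unlink(SEM_ACK);
    return 0;
}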

rhc54 (Contributor, Author) commented Oct 18, 2018

Another option was brought to my attention today. If you know that one of the mpirun executions will always be running, then you can point the other mpiruns to it as the "ompi-server", like this:

$ mpirun -n 3 --report-uri myuri.txt myapp &
$ mpirun -n 2 --ompi-server file:myuri.txt myotherapp

This makes the first mpirun act as the global server. I'm not sure it will solve the problem, but it might be worth trying.

nelsonspbr commented
@rhc54 This hasn't worked for me, unfortunately :(

Have there been any updates on this?

rhc54 (Contributor, Author) commented Feb 7, 2019

Not really - the developer community judged it not worth fixing, and so it has sat idle. Based on current plans, it will be fixed in this year's v5.0 release, but not likely before then.

Note that you can optionally execute your OMPI job against the PMIx Reference RTE (PRRTE). I believe this is working in that environment. See https://pmix.org/support/how-to/running-apps-under-psrvr/ for info.

jrhemstad commented Feb 25, 2019

@rhc54 I wanted to let you know that support for these APIs is important to us in Dask. See dask/dask-mpi#25

Our use case is that we need a way to create MPI processes from already-existing processes (without launching a new process) and build up a communicator among those processes.

maddyscientist commented
This issue is also a blocker for our use of OpenMPI with our MPI job manager (mpi_jm), which we use to increase job utilization on large supercomputers for sub-nuclear physics simulations (https://arxiv.org/pdf/1810.01609.pdf). This has forced us to use MVAPICH, which results in reduced performance compared to OpenMPI (or Spectrum MPI), but correctness is godliness in comparison.

(We here being CalLat, a collaboration of physicists centred at LLNL and LBNL, using Summit, Sierra, Titan, etc.)

rhc54 (Contributor, Author) commented Feb 25, 2019

Okay, you've convinced me - I'll free up some time this week and fix it. Not sure when it will be released, however, so please be patient.

rhc54 (Contributor, Author) commented Feb 27, 2019

Okay, you guys - the fix is here: #6439

Once it gets through CI I'll post a PR to backport it to the release branches.

datametrician commented, quoting rhc54's earlier comment:

> Okay, you've convinced me - I'll free up some time this week and fix it. Not sure when it will be released, however, so please be patient.

I can't thank you enough for this! Thank you thank you thank you!

gpaulsen (Member) commented

@rhc54 Can this issue be closed?

q2luo commented Oct 10, 2019

MPI_Comm_connect/MPI_Comm_accept in 4.0.2 still do not work except from within a single mpirun. We're stuck at 1.6.5 and cannot upgrade to any of the latest Open MPI releases. Please help fix this.

The error message from the slave process using 4.0.2 MPI_Comm_connect:


The user has called an operation involving MPI_Comm_connect and/or MPI_Accept
that spans multiple invocations of mpirun. This requires the support of
the ompi-server tool, which must be executing somewhere that can be
accessed by all participants.

Please ensure the tool is running, and provide each mpirun with the MCA
parameter "pmix_server_uri" pointing to it.


Your application has invoked an MPI function that is not supported in
this environment.

MPI function: MPI_Comm_connect
Reason: Underlying runtime environment does not support accept/connect functionality

[sjoq49:426944] *** An error occurred in MPI_Comm_connect
[sjoq49:426944] *** reported by process [3149791233,47184510713856]
[sjoq49:426944] *** on communicator MPI_COMM_WORLD
[sjoq49:426944] *** MPI_ERR_INTERN: internal error
[sjoq49:426944] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sjoq49:426944] *** and potentially your MPI job)

The error message from the master using 4.0.2 MPI_Comm_accept:


A request has timed out and will therefore fail:

Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.

[sjoq64:88026] *** An error occurred in MPI_Comm_accept
[sjoq64:88026] *** reported by process [1219035137,0]
[sjoq64:88026] *** on communicator MPI_COMM_WORLD
[sjoq64:88026] *** MPI_ERR_UNKNOWN: unknown error
[sjoq64:88026] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sjoq64:88026] *** and potentially your MPI job)

rhc54 (Contributor, Author) commented Nov 14, 2019

@q2luo I'm not sure how to respond to your request. The error message you show indicates that the mpirun starting the slave process was not given the URI of the ompi-server. Cross-mpirun operations require the support of ompi-server as a rendezvous point.

You might want to try it again, ensuring you follow the required steps. If that doesn't work, please post exactly what you did to encounter the problem.
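
For reference, the basic recipe is to run a standalone ompi-server that every host can reach over TCP, have it write its URI to a shared file, and point every mpirun at that file. The path and executable names below are placeholders, and option spellings may vary by version, so check ompi-server --help and mpirun --help on your install:

$ ompi-server --report-uri /shared/ompi-server.uri &
$ mpirun -n 1 --ompi-server file:/shared/ompi-server.uri ./accept_side
$ mpirun -n 1 --ompi-server file:/shared/ompi-server.uri ./connect_side "<port string>"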

q2luo commented Nov 14, 2019

@rhc54 I have been paying attention to the threads related to this issue since 2015. I have tried many different versions of Open MPI releases; the last working release is 1.6.5, and all releases 1.7.1 or higher have the same problem. I also tried pointing to ompi-server with the URI, but without success.

Your May 5, 2017 description at the beginning of this thread describes the issue very well. In fact, the Open MPI release "list of changes" file also documents it as a known issue in the 3.0 section:

" -MPI_Connect/accept between applications started by different mpirun
commands will fail, even if ompi-server is running."

We use OpenMPI in the following way; the example below assumes using 8 hosts from LSF:

  1. Issue 8 individual LSF "bsub" commands to acquire 8 hosts with a specified amount of resources; each host runs our program (Linux based) with OpenMPI enabled.

  2. The program on each host calls MPI_Comm_connect() and MPI_Comm_accept(), then MPI_Intercomm_merge() and MPI_Comm_rank() after the accept succeeds. The goal is to connect all 8 MPI applications into 1 MPI world.

The above 2 steps realize the same goal as "mpirun -n 8". "mpirun -n 8" works fine for all OpenMPI releases, but the semiconductor industry doesn't allow this usage due to IT policies.
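
As an illustration of step 2 only (this is a rough sketch, not the actual application, and it assumes the port string from the accepting side is passed on the command line):

/* merge_demo.c -- a single accept/connect/merge round; the real application
 * would repeat this to fold each new arrival into the growing communicator. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter, merged;
    int rank;
    int is_acceptor = (argc > 1 && strcmp(argv[1], "accept") == 0);

    MPI_Init(&argc, &argv);

    if (is_acceptor) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("%s\n", port);             /* give this string to the connecting job */
        fflush(stdout);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    } else {
        MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    /* Collapse the intercommunicator into one intracommunicator; the second
     * argument only controls the ordering of the two groups. */
    MPI_Intercomm_merge(inter, is_acceptor ? 0 : 1, &merged);
    MPI_Comm_rank(merged, &rank);
    printf("rank %d in merged communicator\n", rank);

    MPI_Comm_disconnect(&inter);
    MPI_Comm_free(&merged);
    MPI_Finalize();
    return 0;
}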

Thanks and regards.

rhc54 (Contributor, Author) commented Nov 14, 2019

Look, I'm happy to help, but you have to provide enough information so I can do so. I need to know how you are actually starting all these programs. Do you have ompi-server running somewhere that all the hosts can reach over TCP? What was the command line used to start the programs on each host?

I don't know who you mean by "semiconductor industry", but I know of at least one company in that collective that doesn't have this issue 😄 This appears to be a pretty extreme use case, so it isn't surprising that it might uncover some problems.

q2luo commented Nov 14, 2019

Each application is started with "mpirun -n 1" on a host acquired by LSF. I tried in-house starting an ompi-server and having each individual mpirun point to it, but connect/accept still fails. On the other hand, even if it worked, it would be impractical to use because it requires company IT to start and maintain a central ompi-server.

Yes, all hosts can reach each other over TCP, because the SSH-based approach via "mpirun -n 8" works, and 1 LSF bsub command with "mpirun -n 8" also works.

rhc54 (Contributor, Author) commented Nov 14, 2019

So let me summarize. You tried doing this with a central ompi-server in a lab setup and it didn't work. Regardless, you cannot use a central ompi-server in your production environment.

I can take a look to ensure that ompi-server is still working in v4.0.2. However, without ompi-server, there is no way this configuration can work on your production system. The very old v1.6 series certainly would work, but it involves a runtime that doesn't scale to today's cluster sizes, so going back to that approach isn't an option.

On the positive side, you might get IBM to add PMIx integration to LSF, in which case you won't need ompi-server any more. Might be your best bet.

q2luo commented Nov 14, 2019
q2luo commented Nov 14, 2019

@rhc54 Thanks for your explanation. Even if adding PMIx to the LSF/MPI hook works, the same problem will still be faced with RTDA, SGE/UGE, and other grids.

The v1.6 series has some serious issues, such as memory corruption and network interface card recognition. All those issues are fixed in the latest 3.x and 4.x releases based on my testing. Our application normally needs up to 256 hosts, each with physical memory of at least 512GB and up to 3TB. It's normally impossible to acquire 64 such big-memory machines instantly, so "mpirun -n 64" will almost never succeed (unless IT sets aside 64 hosts dedicated for one job/person to use). Instead, 64 hosts are normally obtained sequentially by 64 independent grid commands, and the time to acquire all these machines can span from minutes to hours.

I wonder how connect/accept works in the default "mpirun -n 32" mode: is ompi-server not used, or is the public connect/accept API not used in this default mode?

Thanks and regards.

rhc54 (Contributor, Author) commented Nov 15, 2019

The closest MPI comes to really supporting your "rolling start" use case is the MPI Sessions work proposed for v4 of the standard. In the meantime, what I would do is:

  • Submit an initial job request for just one node. When I get that node, I would start ompi-server on it plus my initial mpirun for that node. I would have ompi-server either report its contact info to a file on a network location, or capture its URI from stdout.
  • I would then have the script submit the request for the remaining hosts and include the ompi-server URI information in the cmd line to be executed on those hosts.

This will allow proper wireup of your connect/accept logic. From your description of your scheduler, it shouldn't cause you any additional delays in getting the desired resources. You might even get your IT folks to set up a "high priority" queue for the secondary submission.
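
A rough sketch of that two-step submission (queue options, paths, file names, and the application name are placeholders; adjust for your LSF setup):

# Step 1: one node -- start ompi-server (writing its URI to a shared filesystem)
# and the first mpirun pointed at it.
$ bsub -n 1 "ompi-server --report-uri /shared/ompi.uri & mpirun -n 1 --ompi-server file:/shared/ompi.uri ./myapp"

# Step 2: once step 1 is running, submit the remaining hosts, each pointing its
# own mpirun at the same URI file.
$ for i in $(seq 2 8); do bsub -n 1 "mpirun -n 1 --ompi-server file:/shared/ompi.uri ./myapp"; done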
