orte: Fix MPI_Spawn #2925
Register the namespace even if there are no node-local processes that belong to it. We need this for the MPI_Spawn case.
Addresses open-mpi#2920.
The problem was introduced in be3ef77.
Signed-off-by: Artem Polyakov <[email protected]>
The v2.0.x and v2.x branches are unaffected.
Note: I sometimes see the following error once the app has finished:
I'm running a cluster of Linux containers on my laptop, and in my environment delays are more visible than on a real cluster, so this may be some sort of race condition surfacing.
@jsquyres I'm not sure if Ralph will have the time to review, so it's up to you who should review instead.
I think this looks good. I'm OK with merging it in and letting MTT have a try this evening.
Ok, thanks!
I'm not convinced this change is what we really want. The "connect" code is supposed to register any missing nspaces, so it sounds instead like we aren't correctly waiting for connect to complete. The error reported on #2920 occurs when dstore attempts to store modex data. There are no comments in the code, but it appears to be trying to do this before ORTE has registered the nspace. Arbitrarily registering every nspace, even when there are no local procs, causes problems for DVM users because it consumes space on nodes that have no involvement in a job. We should instead fix the real source of the problem.
Dstore was failing while trying to store jobinfo for a namespace that wasn't registered. Registering the namespace solved the problem, so I believe that dstore is working correctly.
Sorry if I wasn't clear - I agree that dstore is working correctly. My point was that we aren't calling the functions in the correct order. I'll take a look at where you left off. Thanks!
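To make the ordering concern concrete, here is a minimal sketch in C. It is a toy model, not ORTE or PMIx code: `nspace_registry`, `registry_register`, and `store_job_info` are hypothetical names invented for illustration. It only shows why a store that arrives before registration fails, and why the same store succeeds once the namespace has been registered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NSPACES 8
#define NSPACE_LEN  32

/* Hypothetical per-node table of registered namespaces.
 * This is an illustration only, not the PMIx server API. */
typedef struct {
    char   names[MAX_NSPACES][NSPACE_LEN];
    size_t count;
} nspace_registry;

static void registry_init(nspace_registry *reg)
{
    assert(reg != NULL);
    reg->count = 0;
}

static bool registry_has(const nspace_registry *reg, const char *ns)
{
    for (size_t i = 0; i < reg->count; i++) {
        if (strcmp(reg->names[i], ns) == 0) {
            return true;
        }
    }
    return false;
}

/* Register a namespace. Returns 0 on success or if it is already
 * registered, -1 if the table is full or the name does not fit. */
static int registry_register(nspace_registry *reg, const char *ns)
{
    if (registry_has(reg, ns)) {
        return 0;
    }
    if (reg->count >= MAX_NSPACES || strlen(ns) >= NSPACE_LEN) {
        return -1;
    }
    strcpy(reg->names[reg->count++], ns);  /* length checked above */
    return 0;
}

/* Stand-in for the store step: it refuses to store anything for a
 * namespace that has not been registered yet. Returns 0 or -1. */
static int store_job_info(const nspace_registry *reg, const char *ns)
{
    return registry_has(reg, ns) ? 0 : -1;
}
```

In this model, calling `store_job_info` for a spawned job's namespace before `registry_register` returns -1, and the same call after registration returns 0. Both approaches discussed in this thread make the store succeed: registering every namespace up front, or registering only when a node actually becomes involved and waiting for that to complete. The second keeps uninvolved nodes from holding entries they never use.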
thanks,
got it - thanks! Will fix