
orte: Fix MPI_Spawn #2925

Merged: 1 commit merged into open-mpi:master on Feb 7, 2017
Conversation

@artpol84 (Contributor) commented Feb 4, 2017

Register the namespace even if there are no node-local processes that belong to it. We need this for the MPI_Spawn case.

Addressing #2920.
Was introduced in be3ef77.

Signed-off-by: Artem Polyakov <[email protected]>
@artpol84 (Contributor, Author) commented Feb 4, 2017

The v2.0.x and v2.x branches are not affected.

@artpol84 (Contributor, Author) commented Feb 4, 2017

@siegmargross @jsquyres

@artpol84 (Contributor, Author) commented Feb 4, 2017

$ ./mpirun_my.sh 1 ./spawn  
Parent process 0: I create 3 slave processes



Child process 0 running on cn1
    MPI_COMM_WORLD ntasks:              3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        1

Child process 2 running on cn1
    MPI_COMM_WORLD ntasks:              3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        3
Parent process 0 running on cn1
    MPI_COMM_WORLD ntasks:              1
    COMM_CHILD_PROCESSES ntasks_local:  1
    COMM_CHILD_PROCESSES ntasks_remote: 3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        0
Child process 1 running on cn2
    MPI_COMM_WORLD ntasks:              3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        2
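
For reference, a spawn test of this shape could look like the following minimal sketch (illustrative only, not necessarily the actual test from #2920; the communicator names mirror the labels in the output above):

```c
/* Minimal sketch of a spawn test producing output of the shape above.
 * Run with a single parent rank, e.g.: mpirun -np 1 ./spawn */
#include <stdio.h>
#include <mpi.h>

#define NUM_SLAVES 3

int main(int argc, char *argv[])
{
    MPI_Comm intercomm, parent, all_procs;  /* COMM_CHILD_PROCESSES, COMM_ALL_PROCESSES */
    int ntasks, mytid, nlocal, nremote, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    MPI_Get_processor_name(host, &len);

    if (parent == MPI_COMM_NULL) {
        /* Parent: spawn the slaves, then merge into one intracommunicator. */
        printf("Parent process 0: I create %d slave processes\n", NUM_SLAVES);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NUM_SLAVES, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
        MPI_Comm_size(intercomm, &nlocal);         /* local group of the intercomm  */
        MPI_Comm_remote_size(intercomm, &nremote); /* remote group (the children)   */
        MPI_Intercomm_merge(intercomm, 0, &all_procs); /* parent ranks first */
        printf("Parent process 0 running on %s\n", host);
        printf("    MPI_COMM_WORLD ntasks:              %d\n", ntasks);
        printf("    COMM_CHILD_PROCESSES ntasks_local:  %d\n", nlocal);
        printf("    COMM_CHILD_PROCESSES ntasks_remote: %d\n", nremote);
    } else {
        /* Child: merge with the parent side ("high" so children rank last). */
        MPI_Comm_rank(MPI_COMM_WORLD, &mytid);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
        printf("Child process %d running on %s\n", mytid, host);
        printf("    MPI_COMM_WORLD ntasks:              %d\n", ntasks);
        MPI_Intercomm_merge(parent, 1, &all_procs);
    }

    /* Both sides report their place in the merged communicator. */
    MPI_Comm_size(all_procs, &ntasks);
    MPI_Comm_rank(all_procs, &mytid);
    printf("    COMM_ALL_PROCESSES ntasks:          %d\n", ntasks);
    printf("    mytid in COMM_ALL_PROCESSES:        %d\n", mytid);

    MPI_Comm_free(&all_procs);
    MPI_Finalize();
    return 0;
}
```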

@artpol84 (Contributor, Author) commented Feb 4, 2017

Note: I sometimes see the following error once the app has finished:

$ ./mpirun_my.sh 1 ./spawn  
Parent process 0: I create 3 slave processes

Child process 0 running on cn1
    MPI_COMM_WORLD ntasks:              3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        1


Child process 1 running on cn2
    MPI_COMM_WORLD ntasks:              3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        2

Child process 2 running on cn1
Parent process 0 running on cn1
    MPI_COMM_WORLD ntasks:              1
    MPI_COMM_WORLD ntasks:              3
    COMM_CHILD_PROCESSES ntasks_local:  1
    COMM_CHILD_PROCESSES ntasks_remote: 3
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        0
    COMM_ALL_PROCESSES ntasks:          4
    mytid in COMM_ALL_PROCESSES:        3
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 2086 on node cn1 exited on signal 13 (Broken pipe).
--------------------------------------------------------------------------

I'm running a cluster of Linux containers on my laptop, and in my environment delays are more visible than on a real cluster, so this may be some sort of race condition surfacing.

@artpol84 (Contributor, Author) commented Feb 4, 2017

@jsquyres I'm not sure Ralph will have time to review, so it's up to you to decide who should review instead.

artpol84 assigned jjhursey and unassigned rhc54 on Feb 7, 2017
@jjhursey (Member) commented Feb 7, 2017

I think this looks good. I'm ok with merging it in and letting MTT have a try this evening.

@artpol84 (Contributor, Author) commented Feb 7, 2017

Ok, thanks!

artpol84 merged commit 4018409 into open-mpi:master on Feb 7, 2017
@rhc54 (Contributor) commented Feb 14, 2017

I'm not convinced this change is what we really want - the "connect" code is supposed to register any missing nspaces, so it sounds instead like we aren't correctly waiting for connect to complete. The error reported in #2920 occurs when dstore is attempting to store modex data. There are no comments in the code, but it appears to be trying to do this before ORTE has registered the nspace.

Arbitrarily registering every nspace even when there are no local procs causes problems for DVM users, as it consumes space on nodes that have no involvement in the job. We should instead fix the real source of the problem.

@artpol84 (Contributor, Author):

Dstore was failing while trying to store jobinfo for a namespace that wasn't registered. Registering the namespace solved the problem, so I believe dstore is working correctly.
As far as I recall, I got to this place from the connect code path. There is no way to distinguish the originator of the namespace-register call.
So I guess the right solution is not to call register-namespace when it is not desirable, instead of blindly deciding inside orte_pmix_server_register_nspace.

@rhc54 (Contributor) commented Feb 14, 2017

Sorry if I wasn't clear - I agree that dstore is working correctly. My point was that we aren't calling the functions in the correct order. I'll take a look at where you left off.

Thanks!

@artpol84 (Contributor, Author):

Thanks! To clarify as well: I'm suggesting that the decision whether to register a namespace be made at the higher level, which has the information about what is going on, rather than at the PMIx level, which should just do what it is told.
Another option would be an additional argument to orte_pmix_server_register_nspace that indicates whether registering this namespace is mandatory.
As I noted, this function was called from the connect code, so there was no race condition leaving the namespace unregistered; registration was simply "aborted" right away because of the removed statement.
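
For the second option, something along these lines (a hypothetical sketch; the flag name and semantics are illustrative, not actual ORTE code):

```c
/* Hypothetical variant of the registration entry point: callers that know
 * registration is required even without node-local procs (e.g. the
 * connect/MPI_Comm_spawn path) would pass force = true, while launch paths
 * that want to avoid consuming space on uninvolved nodes (the DVM case)
 * would pass force = false. */
int orte_pmix_server_register_nspace(orte_job_t *jdata, bool force);
```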

@rhc54 (Contributor) commented Feb 14, 2017

Got it - thanks! Will fix.
