
MPI_Comm_split_type() fails or hangs when run with processes across node #9010

Closed
awlauria opened this issue May 25, 2021 · 3 comments
awlauria commented May 25, 2021

This is a regression from the v3.0.x series to v4/master/v5.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* OMPI_COMM_TYPE_NODE is an Open MPI-specific split type that
     * groups ranks by node. */
    MPI_Comm comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NODE, 0,
                        MPI_INFO_NULL, &comm);

    int lsize;
    MPI_Comm_size(comm, &lsize);
    fprintf(stderr, "local_size = %d\n", lsize);

    MPI_Finalize();
    return 0;
}

v3.0.6, run from the tarball:

$ ./exports/bin/mpirun -prefix `pwd`/exports -mca pml ob1 -np 4 -host hostA:2,hostB:2 ./split
local_size = 2
local_size = 2
local_size = 2
local_size = 2

v4.x and master/v5 either hang or error out with pml/ob1; for example:

$ ./exports/bin/mpirun --prefix `pwd`/exports --np 4 --hostfile ./hostfile --mca pml ob1 ./split
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: hostB
  PID:        160658
  Message:    connect() to X:1025 failed
  Error:      No route to host (113)
--------------------------------------------------------------------------
malloc debug: pml:ob1: mca_pml_ob1_match_completion_free: operation failed with code -12
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: hostB
  PID:        160657
  Message:    connect() to X:1024 failed
  Error:      No route to host (113)
--------------------------------------------------------------------------
malloc debug: pml:ob1: mca_pml_ob1_match_completion_free: operation failed with code -12
*** An error occurred in Socket closed
*** reported by process [3261005825,3]
*** on a NULL communicator
*** Unknown error
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and MPI will try to terminate your MPI job as well)
*** An error occurred in Socket closed
*** reported by process [3261005825,2]
*** on a NULL communicator
*** Unknown error
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and MPI will try to terminate your MPI job as well)

This broke sometime in the v4 timeframe, as v4.0.0 from the tarball seems to work.

Testing with UCX, it fails as well, at least with UCX v1.7. I would need to try a more recent release to verify it is still an issue.

@awlauria awlauria changed the title MPI_Comm_split_type() hangs when run with processes across node MPI_Comm_split_type() fails or hangs when run with processes across node May 25, 2021
hjelmn (Member) commented May 25, 2021

Ooof. That is a pretty fundamental function. Not sure how this was missed by MTT (or was it-- haven't looked).

awlauria (Contributor, Author) commented
Retesting today, I don't think this is in v4, but will confirm later. Re-labeling as a master/v5 problem.

awlauria (Contributor, Author) commented
Retested this on v5.0.x and master, and can no longer reproduce. Spoke to @rhc54, and we believe this was probably fixed by the hostname-resolution changes.

Also re-verified that I can't reproduce on v4.0.x.

Closing.
