This is a regression from the v3.0.x series to v4/master/v5. A v3.0.6 run from the tarball works, while v4.* and master/v5 either hang or error out with pml/ob1, for example:
$ ./exports/bin/mpirun --prefix `pwd`/exports --np 4 --hostfile ./hostfile --mca pml ob1 ./split
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: hostB
PID: 160658
Message: connect() to X:1025 failed
Error: No route to host (113)
--------------------------------------------------------------------------
malloc debug: pml:ob1: mca_pml_ob1_match_completion_free: operation failed with code -12
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: hostB
PID: 160657
Message: connect() to X:1024 failed
Error: No route to host (113)
--------------------------------------------------------------------------
malloc debug: pml:ob1: mca_pml_ob1_match_completion_free: operation failed with code -12
*** An error occurred in Socket closed
*** reported by process [3261005825,3]
*** on a NULL communicator
*** Unknown error
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and MPI will try to terminate your MPI job as well)
*** An error occurred in Socket closed
*** reported by process [3261005825,2]
*** on a NULL communicator
*** Unknown error
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and MPI will try to terminate your MPI job as well)
This broke sometime in the v4 timeframe, as v4.0.0 from the tarball seems to work.
Testing this with UCX, it fails as well, at least with UCX v1.7. I would need to try a more recent release to verify it is still an issue.
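For reference, here is a minimal sketch of what the ./split reproducer presumably looks like (the actual test source is not attached to this issue): it just splits MPI_COMM_WORLD by shared-memory domain, which is the MPI_Comm_split_type() call that hangs or fails when the ranks span nodes.

```c
/* Hypothetical reproducer sketch; the real ./split source is not shown
 * in this issue. Splits MPI_COMM_WORLD into per-node communicators. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, node_rank, node_size;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* One communicator per shared-memory domain (i.e. per node). */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    printf("world rank %d -> node rank %d of %d\n",
           world_rank, node_rank, node_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```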
awlauria changed the title from "MPI_Comm_split_type() hangs when run with processes across node" to "MPI_Comm_split_type() fails or hangs when run with processes across node" on May 25, 2021.
Retested this on v5.0.x and master, and can no longer reproduce. Spoke to @rhc54 and we believe this was probably fixed in the hostname resolution changes.
Also re-verified that I can't reproduce on v4.0.x.