Below is a scenario in which an MPI spawn program hangs.
The mpirun command starts several "master" processes:
$MPI_HOME/bin/mpirun -oversubscribe -H red9906025 -x LD_LIBRARY_PATH -np 7 mpi_master
mpi_master.c
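#include <stdlib.h>
#include <string.h>   /* strcpy */
#include <unistd.h>   /* gethostname, usleep */
#include <mpi.h>
#include <stdio.h>
int main() {
    char slavejobtospawn[500];
    strcpy(slavejobtospawn, "mpi_slave");
    MPI_Comm wcomm_;
    MPI_Info minfo;
    int mpistat,myrank;
    char localhost[40];
    gethostname(localhost, 40);   /* fill in the host name printed below */
    MPI_Init(NULL,NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("master %s started\n",localhost);
    fflush(stdout);
    if(myrank == 0) {
        /* rank 0 calls MPI_Comm_spawn here to start the slaves */
    }
    usleep(100000000);   /* 100 s */
    mpistat = MPI_Finalize();
    return 0;
}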
Rank 0 spawns a number of slaves.
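The rank-0 branch of mpi_master.c is empty in the listing above, but the master stack trace shows the hang inside MPI_Comm_spawn. A minimal sketch of what that branch might contain is given here. The slave count (NUM_SLAVES), passing lamhosts_spawn through Open MPI's "hostfile" info key, and spawning over MPI_COMM_SELF are all assumptions. None of them is stated in the report.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical value: the report does not say how many slaves are spawned. */
#define NUM_SLAVES 7

int main(void) {
    int myrank;
    MPI_Comm wcomm_ = MPI_COMM_NULL;
    MPI_Info minfo;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        /* Assumption: the host list is handed to the spawn through the
           "hostfile" info key that Open MPI recognizes for MPI_Comm_spawn. */
        MPI_Info_create(&minfo);
        MPI_Info_set(minfo, "hostfile", "lamhosts_spawn");

        /* Assumption: only rank 0 takes part, so the spawn is collective
           over MPI_COMM_SELF with root 0. */
        MPI_Comm_spawn("mpi_slave", MPI_ARGV_NULL, NUM_SLAVES, minfo,
                       0, MPI_COMM_SELF, &wcomm_, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&minfo);

        printf("master rank 0: spawn returned\n");
        fflush(stdout);
    }

    MPI_Finalize();
    return 0;
}
```

If the hang reproduces with a branch like this, the "spawn returned" line never prints on rank 0, which matches a master blocked in PMIx_Connect underneath MPI_Comm_spawn.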
mpi_slave.c
#include <stdlib.h>
#include <unistd.h>   /* gethostname */
#include <mpi.h>
#include <stdio.h>
int main() {
    int rank_;
    MPI_Comm slave_Comm_;
    int mpistat;
    // == init MPI
    char localhost[40];
    mpistat = gethostname(localhost, 40);
    MPI_Init(NULL,NULL);
    mpistat = MPI_Comm_get_parent(&slave_Comm_);
    printf("slave %s connected to parent\n",localhost);
    fflush(stdout);
    mpistat = MPI_Finalize();
    printf("slave %s shutting down\n",localhost);
    fflush(stdout);
    //cout << "SLAVE " << localhost << " SHUTTING DOWN" <<endl;
    return 0;
}
The lamhosts_spawn file looks like this:
red9906025
red9906026
red9906026
red9906027
red9906027
red9906028
red9906028
The application seems to hang with this stack on master:
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x0000000000799783 in OPAL_MCA_PMIX3X_PMIx_Connect ()
#2 0x000000000078b504 in pmix3x_connect ()
#3 0x0000000000445ff6 in ompi_dpm_connect_accept ()
#4 0x0000000000462a95 in PMPI_Comm_spawn ()
#5 0x000000000043c6c3 in main ()
And on the spawned slaves:
#0 0x00002aaaabc6f6b3 in *__GI___poll (fds=, nfds=, timeout=0) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00000000006b6336 in poll_dispatch ()
#2 0x00000000006ac23d in opal_libevent2022_event_base_loop ()
#3 0x0000000000661080 in opal_progress ()
#4 0x00000000004423fd in ompi_request_wait_completion ()
#5 0x00000000004446bc in ompi_comm_nextcid ()
#6 0x0000000000446389 in ompi_dpm_connect_accept ()
#7 0x000000000044a53a in ompi_dpm_dyn_init ()
#8 0x000000000045a890 in ompi_mpi_init ()
#9 0x000000000043c77d in PMPI_Init ()
#10 0x000000000043c5dc in main ()