TCP BTL progress thread segv's #5902
Not really sure this is related to this issue ... The inline patch below can be used to evidence a race condition:

```diff
diff --git a/ompi/runtime/ompi_mpi_finalize.c b/ompi/runtime/ompi_mpi_finalize.c
index b636ddf..de6fed1 100644
--- a/ompi/runtime/ompi_mpi_finalize.c
+++ b/ompi/runtime/ompi_mpi_finalize.c
@@ -265,6 +265,12 @@ int ompi_mpi_finalize(void)
             active = false;
         }
         OMPI_LAZY_WAIT_FOR_COMPLETION(active);
+        if (0 == OPAL_PROC_MY_NAME.vpid) {
+            for (int i=0; i<100; i++) {
+                usleep(10000);
+                opal_progress();
+            }
+        }
     } else {
         /* However, we cannot guarantee that the provided PMIx has
          * fence_nb. If it doesn't, then do the best we can: an MPI
```
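For reference, `OMPI_LAZY_WAIT_FOR_COMPLETION()` keeps driving the progress engine while it waits (roughly the sketch below, paraphrased from memory rather than the literal macro body), which is why the naive fix further down has to replace it as well:

```c
/* Rough paraphrase (an assumption, not the literal OMPI macro) of what
 * OMPI_LAZY_WAIT_FOR_COMPLETION(flag) does: spin until the completion
 * callback clears the flag, calling opal_progress() on every iteration. */
#define LAZY_WAIT_FOR_COMPLETION_SKETCH(flag)   \
    do {                                        \
        while ((flag)) {                        \
            opal_progress();                    \
        }                                       \
    } while (0)
```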
The error message occurs if one task keeps calling `opal_progress()` after the other tasks have exited. Note the progress thread is not involved here, and a naive fix is simply not to call `opal_progress()` once the fence has completed:

```diff
diff --git a/ompi/runtime/ompi_mpi_finalize.c b/ompi/runtime/ompi_mpi_finalize.c
index b636ddf..67914b1 100644
--- a/ompi/runtime/ompi_mpi_finalize.c
+++ b/ompi/runtime/ompi_mpi_finalize.c
@@ -264,7 +264,12 @@ int ompi_mpi_finalize(void)
              * completion when the fence was failed. */
             active = false;
         }
-        OMPI_LAZY_WAIT_FOR_COMPLETION(active);
+        while (active) usleep(1000);
+        if (0 == OPAL_PROC_MY_NAME.vpid) {
+            for (int i=0; i<100; i++) {
+                usleep(10000);
+            }
+        }
     } else {
         /* However, we cannot guarantee that the provided PMIx has
          * fence_nb. If it doesn't, then do the best we can: an MPI
```

With this patch, the error only occurs when the progress thread is used. @jsquyres @bosilca @bwbarrett, can you please advise on how to move forward?
Refs #5849
In doing some manual testing, I'm seeing segv's in about 60% of my runs on master when run on 2 nodes, ppn=16, with `mca=tcp,self` (no vader), and with `btl_tcp_progress_thread=1`.
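The test program itself doesn't really matter here; anything that reaches `MPI_Finalize()` should do. I'm assuming something as trivial as the sketch below, since the actual program isn't attached:

```c
/* Trivial MPI program (a sketch; the actual test program used for the runs
 * described above is not shown in this issue).  Anything that reaches
 * MPI_Finalize() should exercise the finalize path in question. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d about to finalize\n", rank);

    /* The reported segv's occur in / under ompi_mpi_finalize(). */
    MPI_Finalize();
    return 0;
}
```

Launched along the lines of `mpirun -np 32 --map-by ppr:16:node --mca btl tcp,self --mca btl_tcp_progress_thread 1 ./a.out` (my guess at the exact command line for the configuration above).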
The core file stack traces are varied, but they all have a few things in common: they are all in `ompi_mpi_finalize()`, and the `bt` (not reproduced here) includes `__divtf3`.

`__divtf3` looks to be a gcc-internal division function (https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html). Is it possible that the TCP progress thread has not been shut down properly / is still running, and the TCP BTL component got `dlclose()`ed? That could lead to Badness like this.

This is happening on master; I have not checked any release branches to see if/where else it is happening.
FYI: @bosilca @bwbarrett
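To illustrate the class of failure I'm hypothesizing (a thread still executing code that lives in a `dlclose()`d shared object), here is a small standalone sketch that has nothing to do with the OMPI sources; the file and symbol names (`libspin.so`, `spin`) are made up for the example:

```c
/* spin.c: hypothetical example library, build with
 *   cc -shared -fPIC -o libspin.so spin.c
 * The thread function below lives in the shared object's text segment. */
#include <unistd.h>

void *spin(void *arg)
{
    (void) arg;
    for (;;) {
        usleep(1000);   /* keep returning into libspin.so's code */
    }
    return NULL;        /* not reached */
}
```

```c
/* main.c: hypothetical driver, build with
 *   cc -o dlclose_race main.c -ldl -lpthread
 * Starts a thread whose code comes from the dlopen()ed library, then
 * dlclose()s the library while the thread is still running.  Once the
 * library's text is unmapped, the thread's next return from usleep()
 * usually dies with SIGSEGV. */
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *handle = dlopen("./libspin.so", RTLD_NOW);
    if (NULL == handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    void *(*fn)(void *) = (void *(*)(void *)) dlsym(handle, "spin");
    if (NULL == fn) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        return 1;
    }

    pthread_t tid;
    pthread_create(&tid, NULL, fn, NULL);

    sleep(1);           /* let the thread run inside the library for a bit */
    dlclose(handle);    /* unmap the code the thread is executing */
    sleep(1);           /* typically never completes cleanly: SIGSEGV */
    return 0;
}
```

If the TCP progress thread is still alive when the BTL component's DSO is unloaded, this is the kind of Badness I'd expect.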