TCP BTL progress thread segv's #5902

Open
jsquyres opened this issue Oct 11, 2018 · 3 comments
@jsquyres
Member

In doing some manual testing, I'm seeing segv's in about 60% of my runs on master when running on 2 nodes, ppn=16, with --mca btl tcp,self (no vader), and with btl_tcp_progress_thread=1.
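Something along these lines should reproduce the setup (hostnames and the test binary are placeholders, not the exact command I ran):

$ mpirun --host node01:16,node02:16 -np 32 \
      --mca btl tcp,self --mca btl_tcp_progress_thread 1 \
      ./some_mpi_test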

The core file stack traces are varied, but they all have a few things in common:

  1. Thread 4 is somewhere in ompi_mpi_finalize().
    • Sometimes it's deep inside the loop of closing frameworks/components in MPI_FINALIZE.
    • Other times it's closing down MPI_Info.
    • ...etc.
  2. Threads 2 and 3 look like ORTE/PMIx/whatever progress threads.
  3. Thread 1 looks to be the thread that caused the segv; it always has a bt something like this:
(gdb) bt
#0  0x00002aaab9fcec5b in __divtf3 (a=<invalid float value>, b=-nan(0xffffffffffffffff))
    at ../../../libgcc/soft-fp/divtf3.c:47
#1  0x00002aaab9fc5d60 in ?? ()
#2  0x0000000000000000 in ?? ()

__divtf3 looks to be a gcc-internal division function (https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html).

Is it possible that the TCP progress thread has not been shut down properly / is still running, and the TCP BTL component got dlclose()ed? That could lead to Badness like this.
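If that is what's happening, the failure mode would look roughly like the standalone sketch below (not Open MPI code; "plugin.so" and "plugin_progress_loop" are made-up names): a thread is still executing code from a shared object when the main thread dlclose()s it, so the thread's next instruction fetch lands in unmapped memory and gdb can only show garbage frames, much like the bt above.

#include <dlfcn.h>
#include <pthread.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical component; names are made up */
    void *handle = dlopen("./plugin.so", RTLD_NOW);
    void *(*loop)(void *) = (void *(*)(void *)) dlsym(handle, "plugin_progress_loop");

    pthread_t tid;
    pthread_create(&tid, NULL, loop, NULL);   /* "progress thread" runs inside the plugin */

    sleep(1);
    dlclose(handle);   /* plugin text gets unmapped while the thread is still running in it;
                          its next instruction fetch faults and the backtrace is nonsense */
    sleep(1);          /* the segv happens here, in the orphaned thread */
    return 0;
}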

This is happening on master; I have not checked any release branches to see if/where else it is happening.

FYI: @bosilca @bwbarrett

@jsquyres jsquyres added the bug label Oct 11, 2018
@ggouaillardet
Contributor

Not really sure this is related to this issue ...

The inline patch below can be used to demonstrate a race condition:

diff --git a/ompi/runtime/ompi_mpi_finalize.c b/ompi/runtime/ompi_mpi_finalize.c
index b636ddf..de6fed1 100644
--- a/ompi/runtime/ompi_mpi_finalize.c
+++ b/ompi/runtime/ompi_mpi_finalize.c
@@ -265,6 +265,12 @@ int ompi_mpi_finalize(void)
                 active = false;
             }
             OMPI_LAZY_WAIT_FOR_COMPLETION(active);
+            if (0 == OPAL_PROC_MY_NAME.vpid) {
+                for (int i=0; i<100; i++) {
+                    usleep(10000);
+                    opal_progress();
+                }
+            }
         } else {
             /* However, we cannot guarantee that the provided PMIx has
              * fence_nb.  If it doesn't, then do the best we can: an MPI
$ mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./ring_c 
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 1 exiting
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
[c7:00720] Socket closed

The error message occurs if one task calls opal_progress() via OMPI_LAZY_WAIT_FOR_COMPLETION() after another task has closed the TCP socket.
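For context, OMPI_LAZY_WAIT_FOR_COMPLETION() is essentially a polling loop that drives opal_progress(); a rough sketch of its behavior (not the exact definition in the tree) is:

/* rough sketch, not the exact OMPI definition: spin until the flag is
   cleared, driving opal_progress() on every iteration -- which is what
   polls the btl/tcp sockets and hits the "Socket closed" path once the
   peer has already torn down its side */
#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)  \
    do {                                    \
        while ((flg)) {                     \
            opal_progress();                \
            usleep(100);                    \
        }                                   \
    } while (0)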

Note that the progress thread is not involved here; a naive fix is simply not to call opal_progress() while waiting for completion of the PMIx fence (see the following inline patch).

diff --git a/ompi/runtime/ompi_mpi_finalize.c b/ompi/runtime/ompi_mpi_finalize.c
index b636ddf..67914b1 100644
--- a/ompi/runtime/ompi_mpi_finalize.c
+++ b/ompi/runtime/ompi_mpi_finalize.c
@@ -264,7 +264,12 @@ int ompi_mpi_finalize(void)
                  * completion when the fence was failed. */
                 active = false;
             }
-            OMPI_LAZY_WAIT_FOR_COMPLETION(active);
+            while (active) usleep(1000);
+            if (0 == OPAL_PROC_MY_NAME.vpid) {
+                for (int i=0; i<100; i++) {
+                    usleep(10000);
+                }
+            }
         } else {
             /* However, we cannot guarantee that the provided PMIx has
              * fence_nb.  If it doesn't, then do the best we can: an MPI

With this patch, the error only occurs when the progress thread is used.

@jsquyres @bosilca @bwbarrett can you please advise on how to move forward?

  • Do we need to opal_progress() while waiting for the PMIx fence completion?
  • Should we add an extra callback so the progress thread can be stopped before the PMIx fence?
  • Could we simply add an extra global variable so the btl/tcp component ignores such errors before entering the PMIx fence? (A rough sketch follows below.)
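For the last option, a rough sketch with made-up names (these are not the actual btl/tcp symbols):

#include <stdbool.h>

/* hypothetical global flag, default false */
static volatile bool mca_btl_tcp_finalizing = false;

/* would be called from ompi_mpi_finalize() right before entering the PMIx fence */
void mca_btl_tcp_mark_finalizing(void)
{
    mca_btl_tcp_finalizing = true;
}

/* inside the btl/tcp socket error path */
static void mca_btl_tcp_handle_socket_error(int fd)
{
    (void) fd;
    if (mca_btl_tcp_finalizing) {
        /* the peer is already tearing down its endpoints:
           silently drop the "Socket closed" error */
        return;
    }
    /* ... existing error reporting ... */
}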

@ggouaillardet
Contributor

Refs #5849

@bosilca
Member

bosilca commented Oct 18, 2018

@thananon and @bosilca please take a look.
