TCP BTL progress thread segv's #5902

Open
jsquyres opened this issue Oct 11, 2018 · 3 comments
@jsquyres
Member

In doing some manual testing, I'm seeing segv's in about 60% of my runs on master when running on 2 nodes, ppn=16, with --mca btl tcp,self (no vader), and with btl_tcp_progress_thread=1.
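Something along these lines should reproduce the setup (hostnames and the test binary are placeholders, not the exact command I ran):

$ mpirun --host node01:16,node02:16 -np 32 \
      --mca btl tcp,self --mca btl_tcp_progress_thread 1 \
      ./some_mpi_test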

The core file stack traces are varied, but they all have a few things in common:

  1. Thread 4 is somewhere in ompi_mpi_finalize().
    • Sometimes it's deep inside the loop of closing frameworks/components in MPI_FINALIZE.
    • Other times it's closing down MPI_Info.
    • ...etc.
  2. Threads 2 and 3 look like ORTE/PMIx/whatever progress threads.
  3. Thread 1 looks to be the thread that caused the segv; it always has a bt something like this:
(gdb) bt
#0  0x00002aaab9fcec5b in __divtf3 (a=<invalid float value>, b=-nan(0xffffffffffffffff))
    at ../../../libgcc/soft-fp/divtf3.c:47
#1  0x00002aaab9fc5d60 in ?? ()
#2  0x0000000000000000 in ?? ()

__divtf3 looks to be a gcc-internal division function (https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html).

Is it possible that the TCP progress thread has not been shut down properly / is still running, and the TCP BTL component got dlclose()ed? That could lead to Badness like this.
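If that is what's happening, the failure mode would look roughly like the standalone sketch below (not Open MPI code; "plugin.so" and "plugin_progress_loop" are made-up names): a thread is still executing code from a shared object when the main thread dlclose()s it, so the thread's next instruction fetch lands in unmapped memory and gdb can only show garbage frames, much like the bt above.

#include <dlfcn.h>
#include <pthread.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical component; names are made up */
    void *handle = dlopen("./plugin.so", RTLD_NOW);
    void *(*loop)(void *) = (void *(*)(void *)) dlsym(handle, "plugin_progress_loop");

    pthread_t tid;
    pthread_create(&tid, NULL, loop, NULL);   /* "progress thread" runs inside the plugin */

    sleep(1);
    dlclose(handle);   /* plugin text gets unmapped while the thread is still running in it;
                          its next instruction fetch faults and the backtrace is nonsense */
    sleep(1);          /* the segv happens here, in the orphaned thread */
    return 0;
}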

This is happening on master; I have not checked any release branches to see if/where else it is happening.

FYI: @bosilca @bwbarrett

@jsquyres jsquyres added the bug label Oct 11, 2018
@ggouaillardet
Contributor

Not really sure this is related to this issue ...

The inline patch below can be used to demonstrate a race condition:

diff --git a/ompi/runtime/ompi_mpi_finalize.c b/ompi/runtime/ompi_mpi_finalize.c
index b636ddf..de6fed1 100644
--- a/ompi/runtime/ompi_mpi_finalize.c
+++ b/ompi/runtime/ompi_mpi_finalize.c
@@ -265,6 +265,12 @@ int ompi_mpi_finalize(void)
                 active = false;
             }
             OMPI_LAZY_WAIT_FOR_COMPLETION(active);
+            if (0 == OPAL_PROC_MY_NAME.vpid) {
+                for (int i=0; i<100; i++) {
+                    usleep(10000);
+                    opal_progress();
+                }
+            }
         } else {
             /* However, we cannot guarantee that the provided PMIx has
              * fence_nb.  If it doesn't, then do the best we can: an MPI
$ mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./ring_c 
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 1 exiting
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
[c7:00720] Socket closed

The error message occurs if one task calls opal_progress() via OMPI_LAZY_WAIT_FOR_COMPLETION() after another task has closed the TCP socket.
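For context, OMPI_LAZY_WAIT_FOR_COMPLETION() is essentially a polling loop that drives opal_progress(); a rough sketch of its behavior (not the exact definition in the tree) is:

/* rough sketch, not the exact OMPI definition: spin until the flag is
   cleared, driving opal_progress() on every iteration -- which is what
   polls the btl/tcp sockets and hits the "Socket closed" path once the
   peer has already torn down its side */
#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)  \
    do {                                    \
        while ((flg)) {                     \
            opal_progress();                \
            usleep(100);                    \
        }                                   \
    } while (0)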

Note that the progress thread is not involved here; a naive fix is simply not to call opal_progress() while waiting for completion of the PMIx fence (see the following inline patch).

diff --git a/ompi/runtime/ompi_mpi_finalize.c b/ompi/runtime/ompi_mpi_finalize.c
index b636ddf..67914b1 100644
--- a/ompi/runtime/ompi_mpi_finalize.c
+++ b/ompi/runtime/ompi_mpi_finalize.c
@@ -264,7 +264,12 @@ int ompi_mpi_finalize(void)
                  * completion when the fence was failed. */
                 active = false;
             }
-            OMPI_LAZY_WAIT_FOR_COMPLETION(active);
+            while (active) usleep(1000);
+            if (0 == OPAL_PROC_MY_NAME.vpid) {
+                for (int i=0; i<100; i++) {
+                    usleep(10000);
+                }
+            }
         } else {
             /* However, we cannot guarantee that the provided PMIx has
              * fence_nb.  If it doesn't, then do the best we can: an MPI

With this patch, the error only occurs when the progress thread is used.

@jsquyres @bosilca @bwbarrett can you please advise on how to move forward?

  • Do we need to opal_progress() while waiting for the PMIx fence completion?
  • Should we add an extra callback so the progress thread can be stopped before the PMIx fence?
  • Could we simply add an extra global variable so the btl/tcp component ignores such errors before entering the PMIx fence? (A rough sketch follows below.)
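For the last option, a rough sketch with made-up names (these are not the actual btl/tcp symbols):

#include <stdbool.h>

/* hypothetical global flag, default false */
static volatile bool mca_btl_tcp_finalizing = false;

/* would be called from ompi_mpi_finalize() right before entering the PMIx fence */
void mca_btl_tcp_mark_finalizing(void)
{
    mca_btl_tcp_finalizing = true;
}

/* inside the btl/tcp socket error path */
static void mca_btl_tcp_handle_socket_error(int fd)
{
    (void) fd;
    if (mca_btl_tcp_finalizing) {
        /* the peer is already tearing down its endpoints:
           silently drop the "Socket closed" error */
        return;
    }
    /* ... existing error reporting ... */
}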

@ggouaillardet
Contributor

Refs #5849

@bosilca
Member

bosilca commented Oct 18, 2018

@thananon and @bosilca please take a look.
