Hang on fork/exec - subprocess blocked in __tsan_func_exit hook #945

toddlipcon · 2018-04-27T22:34:01Z

Under TSAN we are seeing a small fraction (about 2/10000) of process starts deadlock. The forked process is stuck int he following trace:

1 Thread 0x7f1feebdb440 (LWP 9453) "disk_failure-it" __sanitizer::internal_sched_yield () at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_linux.cc:414
(gdb) bt
#0 __sanitizer::internal_sched_yield () at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_linux.cc:414
Upgrade the LLVM code to avoid creating AVX instructions instead of SSE2 instructions #1 0x00000000004c39db in Do (this=) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_mutex.cc:195
Build failed on RHEL 5.6 #2 __tsan::Mutex::Lock (this=this@entry=0x600001000000) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_mutex.cc:235
globals are broken if PIC and nonPIC objects are mixed: #3 0x00000000004c72ed in GenericScopedLock (mu=0x600001000000, this=) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_mutex.h:189
Clang changes behaviour of MemIntrinsic functions before we instrument their arguments #4 TraceSwitch (thr=) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_rtl.cc:552
ASAN fails to link programs with -O0 #5 __tsan::__tsan_trace_switch () at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_rtl.cc:581
clang on CentOS #6 0x00000000004da3df in __tsan_trace_switch_thunk () at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_rtl_amd64.S:53
No redzones for weak symbols #7 0x00000000004cfabc in TraceAddEvent (thr=, addr=0, typ=__tsan::EventTypeFuncExit, fs=...) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_rtl.h:845
implement adaptive redzones #8 FuncExit (thr=) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_rtl.cc:997
Asan logo is no longer relevant #9 __tsan_func_exit () at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_inl.h:108
Some programs hang because of the replaced CFAllocator #10 0x00007f1fe06a3ee5 in safe_strtou32_base (str=, value=, base=) at ../../src/kudu/gutil/strings/numbers.cc:717

This is with LLVM 6.0.0. It seems like the process forked while some TSAN-internal mutex was held, and then the forked process is attempting to acquire it.

toddlipcon · 2018-04-28T01:05:23Z

I spent some time trying to reproduce this with simple programs that do forks and call functions with entry/exit instrumentation in between fork and exec but wasn't able to. It seems that it's some relatively rare race where some other thread is holding the main thread's Trace::mtx member when the main thread initiates the fork. The only case I could find where one thread holds another thread's mutex is in the reporting code, but the pre/post-fork hooks should already ensure that no report is going on during the fork process.

Any ideas? Hoping some more experienced TSAN developer may know of a second place where a thread's trace mutex is held by a "remote" thread, that might be concurrent with a fork.

toddlipcon · 2018-04-28T06:23:42Z

I managed to catch this under a 'perf record -e '\mem:0x600001000000'' and sure enough found a different thread taking the forking thread's Trace->mtx in read mode. I looked further into the tsan_rtl_report.cc code and it appears that the report_mtx is only acquired after calling RestoreStack() against a thread which may have forked. So, I believe the race is the following:

Thread 1 writes some variable
Thread 2 racily accesses that some variable
Thread 2 is in RestoreStack(T1) and acquires Thread 1's Trace::mtx
Thread 1 forks
the forked process calls a function which has TSAN func-entry or func-exit instrumentation
the thread's current trace position has exactly completed a trace "part" and therefore needs to re-acquire the Trace::mtx lock

It seems like there are a few possible fixes:

add some other lock to ensure that no thread is in RestoreStack during fork
change TraceSwitch so that in the case that we are in a post-multirheaded-fork child, we just ignore the TraceSwitch call and overwrite some existing history. That messes up history but assumedly we're about to exec anyway, so no big deal?

Not sure the appropriate route of action. I'll see if, now that I fully understand the race, I can write a repro some time next week.

dvyukov · 2018-04-29T06:58:45Z

Hi @toddlipcon

Thanks for the investigation.

Since we already doing this:

    // We've just forked a multi-threaded process. We cannot reasonably function
    // after that (some mutexes may be locked before fork). So just enable
    // ignores for everything in the hope that we will exec soon.
    ctx->after_multithreaded_fork = true;
    thr->ignore_interceptors++;
    ThreadIgnoreBegin(thr, pc);
    ThreadIgnoreSyncBegin(thr, pc);

I think the reasonable thing to do is to return from TraceSwitch early if after_multithreaded_fork is set. The intention is that we won't/can't report races anymore.

As far as I remember the reason to move RestoreStack out of any locks is that some programs suppress huge amount of races, so scalability of that part of code matters. We probably need a comment there.

dvyukov · 2018-04-30T07:34:03Z

Should be fixed with:
http://llvm.org/viewvc/llvm-project?view=revision&revision=331163

Two things you can do on your side:

Fix the data race.
Disable instrumentation of all code executed between fork and exec (with function attributes, or not instrument whole file). That code should be very special already (no mutexes, no malloc).

This pulls in an upstream patch from LLVM to fix an occasional hang when starting processes in TSAN builds[1] [1] google/sanitizers#945 Change-Id: I9426b28c8a925b6d59ec4241aa12007a07f6ec70

The problem is reported in: google/sanitizers#945 We already disable as much as possible after multithreaded fork, trace switching is last place that can hang due to basic operations (memory accesses, function calls). Disable it too. llvm-svn=331163

dvyukov closed this as completed Apr 30, 2018

javeme mentioned this issue May 31, 2024

support swapcontext #189

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hang on fork/exec - subprocess blocked in __tsan_func_exit hook #945

Hang on fork/exec - subprocess blocked in __tsan_func_exit hook #945

toddlipcon commented Apr 27, 2018

toddlipcon commented Apr 28, 2018

toddlipcon commented Apr 28, 2018

dvyukov commented Apr 29, 2018

dvyukov commented Apr 30, 2018

Hang on fork/exec - subprocess blocked in __tsan_func_exit hook #945

Hang on fork/exec - subprocess blocked in __tsan_func_exit hook #945

Comments

toddlipcon commented Apr 27, 2018

toddlipcon commented Apr 28, 2018

toddlipcon commented Apr 28, 2018

dvyukov commented Apr 29, 2018

dvyukov commented Apr 30, 2018