-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hang on fork/exec - subprocess blocked in __tsan_func_exit hook #945
Comments
I spent some time trying to reproduce this with simple programs that do forks and call functions with entry/exit instrumentation in between fork and exec but wasn't able to. It seems that it's some relatively rare race where some other thread is holding the main thread's Trace::mtx member when the main thread initiates the fork. The only case I could find where one thread holds another thread's mutex is in the reporting code, but the pre/post-fork hooks should already ensure that no report is going on during the fork process. Any ideas? Hoping some more experienced TSAN developer may know of a second place where a thread's trace mutex is held by a "remote" thread, that might be concurrent with a fork. |
I managed to catch this under a 'perf record -e '\mem:0x600001000000'' and sure enough found a different thread taking the forking thread's Trace->mtx in read mode. I looked further into the tsan_rtl_report.cc code and it appears that the report_mtx is only acquired after calling RestoreStack() against a thread which may have forked. So, I believe the race is the following:
It seems like there are a few possible fixes:
Not sure the appropriate route of action. I'll see if, now that I fully understand the race, I can write a repro some time next week. |
Hi @toddlipcon Thanks for the investigation. Since we already doing this:
I think the reasonable thing to do is to return from TraceSwitch early if after_multithreaded_fork is set. The intention is that we won't/can't report races anymore. As far as I remember the reason to move RestoreStack out of any locks is that some programs suppress huge amount of races, so scalability of that part of code matters. We probably need a comment there. |
Should be fixed with: Two things you can do on your side:
|
This pulls in an upstream patch from LLVM to fix an occasional hang when starting processes in TSAN builds[1] [1] google/sanitizers#945 Change-Id: I9426b28c8a925b6d59ec4241aa12007a07f6ec70
The problem is reported in: google/sanitizers#945 We already disable as much as possible after multithreaded fork, trace switching is last place that can hang due to basic operations (memory accesses, function calls). Disable it too. llvm-svn=331163
Under TSAN we are seeing a small fraction (about 2/10000) of process starts deadlock. The forked process is stuck int he following trace:
(gdb) bt
#0 __sanitizer::internal_sched_yield () at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_linux.cc:414
Upgrade the LLVM code to avoid creating AVX instructions instead of SSE2 instructions #1 0x00000000004c39db in Do (this=) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_mutex.cc:195
Build failed on RHEL 5.6 #2 __tsan::Mutex::Lock (this=this@entry=0x600001000000) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_mutex.cc:235
globals are broken if PIC and nonPIC objects are mixed: #3 0x00000000004c72ed in GenericScopedLock (mu=0x600001000000, this=) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_mutex.h:189
Clang changes behaviour of MemIntrinsic functions before we instrument their arguments #4 TraceSwitch (thr=) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_rtl.cc:552
ASAN fails to link programs with -O0 #5 __tsan::__tsan_trace_switch () at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_rtl.cc:581
clang on CentOS #6 0x00000000004da3df in __tsan_trace_switch_thunk () at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_rtl_amd64.S:53
No redzones for weak symbols #7 0x00000000004cfabc in TraceAddEvent (thr=, addr=0, typ=__tsan::EventTypeFuncExit, fs=...) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_rtl.h:845
implement adaptive redzones #8 FuncExit (thr=) at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_rtl.cc:997
Asan logo is no longer relevant #9 __tsan_func_exit () at /data/1/todd/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interface_inl.h:108
Some programs hang because of the replaced CFAllocator #10 0x00007f1fe06a3ee5 in safe_strtou32_base (str=, value=, base=) at ../../src/kudu/gutil/strings/numbers.cc:717
This is with LLVM 6.0.0. It seems like the process forked while some TSAN-internal mutex was held, and then the forked process is attempting to acquire it.
The text was updated successfully, but these errors were encountered: