Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang on fork/exec - subprocess blocked in __tsan_func_exit hook #945

Closed
toddlipcon opened this issue Apr 27, 2018 · 4 comments
Closed

Hang on fork/exec - subprocess blocked in __tsan_func_exit hook #945

toddlipcon opened this issue Apr 27, 2018 · 4 comments

Comments

@toddlipcon
Copy link

Under TSAN we are seeing a small fraction (about 2/10000) of process starts deadlock. The forked process is stuck int he following trace:

This is with LLVM 6.0.0. It seems like the process forked while some TSAN-internal mutex was held, and then the forked process is attempting to acquire it.

@toddlipcon
Copy link
Author

I spent some time trying to reproduce this with simple programs that do forks and call functions with entry/exit instrumentation in between fork and exec but wasn't able to. It seems that it's some relatively rare race where some other thread is holding the main thread's Trace::mtx member when the main thread initiates the fork. The only case I could find where one thread holds another thread's mutex is in the reporting code, but the pre/post-fork hooks should already ensure that no report is going on during the fork process.

Any ideas? Hoping some more experienced TSAN developer may know of a second place where a thread's trace mutex is held by a "remote" thread, that might be concurrent with a fork.

@toddlipcon
Copy link
Author

I managed to catch this under a 'perf record -e '\mem:0x600001000000'' and sure enough found a different thread taking the forking thread's Trace->mtx in read mode. I looked further into the tsan_rtl_report.cc code and it appears that the report_mtx is only acquired after calling RestoreStack() against a thread which may have forked. So, I believe the race is the following:

  • Thread 1 writes some variable
  • Thread 2 racily accesses that some variable
  • Thread 2 is in RestoreStack(T1) and acquires Thread 1's Trace::mtx
  • Thread 1 forks
  • the forked process calls a function which has TSAN func-entry or func-exit instrumentation
  • the thread's current trace position has exactly completed a trace "part" and therefore needs to re-acquire the Trace::mtx lock

It seems like there are a few possible fixes:

  1. add some other lock to ensure that no thread is in RestoreStack during fork
  2. change TraceSwitch so that in the case that we are in a post-multirheaded-fork child, we just ignore the TraceSwitch call and overwrite some existing history. That messes up history but assumedly we're about to exec anyway, so no big deal?

Not sure the appropriate route of action. I'll see if, now that I fully understand the race, I can write a repro some time next week.

@dvyukov
Copy link
Contributor

dvyukov commented Apr 29, 2018

Hi @toddlipcon

Thanks for the investigation.

Since we already doing this:

    // We've just forked a multi-threaded process. We cannot reasonably function
    // after that (some mutexes may be locked before fork). So just enable
    // ignores for everything in the hope that we will exec soon.
    ctx->after_multithreaded_fork = true;
    thr->ignore_interceptors++;
    ThreadIgnoreBegin(thr, pc);
    ThreadIgnoreSyncBegin(thr, pc);

I think the reasonable thing to do is to return from TraceSwitch early if after_multithreaded_fork is set. The intention is that we won't/can't report races anymore.

As far as I remember the reason to move RestoreStack out of any locks is that some programs suppress huge amount of races, so scalability of that part of code matters. We probably need a comment there.

@dvyukov
Copy link
Contributor

dvyukov commented Apr 30, 2018

Should be fixed with:
http://llvm.org/viewvc/llvm-project?view=revision&revision=331163

Two things you can do on your side:

  1. Fix the data race.
  2. Disable instrumentation of all code executed between fork and exec (with function attributes, or not instrument whole file). That code should be very special already (no mutexes, no malloc).

@dvyukov dvyukov closed this as completed Apr 30, 2018
toddlipcon added a commit to toddlipcon/kudu that referenced this issue Apr 30, 2018
This pulls in an upstream patch from LLVM to fix an occasional hang when
starting processes in TSAN builds[1]

[1] google/sanitizers#945

Change-Id: I9426b28c8a925b6d59ec4241aa12007a07f6ec70
jyknight pushed a commit to jyknight/llvm-monorepo that referenced this issue May 9, 2018
The problem is reported in:
google/sanitizers#945

We already disable as much as possible after multithreaded fork,
trace switching is last place that can hang due to basic
operations (memory accesses, function calls).
Disable it too.

llvm-svn=331163
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants