safe_read magic unstable during detach from threaded app. #3535

hgreving2304 · 2019-04-11T20:13:31Z

Following happens with a small reproducer that we produced off Google code.

DR is attached to app running threaded code. The app is sending PROF signals (but type of signal doesn't matter, as long as it is asynchronous and process-directed).
The app starts a detach via start/stop API (with_cleanup() path).
Detach syncs and all threads go native. At this point, the thread's TLS (DynamoRIO's) has been removed.
Main thread is not done cleaning up, signal handlers are still DR's.
PROF signal comes in, child thread receives it. Signal handler calls safe_read_tls_magic().
<=>
Corner case: main thread removes signal handlers and restores the app's handlers.
safe_read_tls_magic() segfaults because no TLS.
SIGSEGV is delivered to app.

hgreving2304 · 2019-04-12T16:04:11Z

Still verifying if the SIGSEGV was really caused by a safe_read_tls_magic in the signal handler, or from somewhere else..

Removes two races that happen if asynchronous signals are delivered during detach, and the app's signal handler is restored before the safe_read_tls_magic's SIGSEGV is delivered. There are still possible detach related races left, if asynchronous process-directed signals are delivered after detach is done, or the app has already started a re-attach. Issue: #3535

…h. (#3539) Removes two races that happen if asynchronous signals are delivered during detach, and the app's signal handler is restored before the safe_read_tls_magic's SIGSEGV is delivered. There are still possible detach related races left, if asynchronous process-directed signals are delivered after detach is done, or the app has already started a re-attach. Tested with local reproducer. Issue: #3535

hgreving2304 · 2019-04-15T19:43:06Z

xref #26

…h. (#3539) Removes two races that happen if asynchronous signals are delivered during detach, and the app's signal handler is restored before the safe_read_tls_magic's SIGSEGV is delivered. There are still possible detach related races left, if asynchronous process-directed signals are delivered after detach is done, or the app has already started a re-attach. Tested with local reproducer. Issue: #3535

derekbruening · 2020-12-29T05:16:38Z

I am hitting this while adding signal mask testing to the new api.detach_signal test. The mask testing distinguishes the app handler from DR's handler b/c DR's blocks more signals.

Eliminates a race on detach where the detaching thread removes DR's SIGSEGV handler while a now-native detached thread is in the middle of having a signal delivered (natively) and invokes a TLS magic field safe read, whose SIGSEGV then goes to the application. The detaching thread is the one doing all the real cleanup, so we simply avoid any safe reads or TLS for detaching threads by recording the detacher's ID when we start the detach process. This var is not cleared until re-init, so we have no race with the end of detach. Tested on api.detach_signal with the forthcoming signal mask checks, which trigger when the handler is invoked for a DR signal instead of an app-generated signal. Without this fix, the test fails easily: about 1 in 5 runs in debug build. With this fix, it succeeds 200x in a loop. I still see one type of crash in debug build, a rare race where d_r_stats is set to NULL in between the check and use of a LOG(), but that is limited to debug and is beyond the scope of this fix and is much lower priority: I filed it as #4641. Fixes #3535

Adds proper signal mask setting for native signals, which requires adding handler-masked signals to the DR frame's mask instead of setting the mask in the kernel. Adds tests to api.detach_signal. These mask tests also detect when a DR-caused crash from a DR bug during detach shows up in the app's handler post-detach due to detach races, including from #3535. Issue: #1921, #3535

Puts in place 6 fixes for handling signals during DR initialization, typically in a start/stop setup where other threads are alive. 1) Copy the app handler at init time for delivering native signals during init. 2) Reorder signal_arch_init(), which obtains the signal frame size, to run before DR installs its handler. 3) Obtains the app handler before installing DR's handler, eliminating a (narrow) race window. 4) Until DR starts executing the app, continues delivering native signals and using the globally recorded app handler, to match how DR init works. 5) Set detacher_tid between init and setup to avoid races like in DR's handler at the end of init that were under #3535 6) Handle a race where the init thread has set the global try_except, causing master_signal_handler_C to think an app thread's signal is DR's. We add a global_try_tid and check the thread id to solve this. Augments api.detach_signal with signals sent during init. Also tested on a large proprietary application. Issue: #1921, #3535

Puts in place 6 fixes for handling signals during DR initialization, typically in a start/stop setup where other threads are alive. 1) Copy the app handler at init time for delivering native signals during init. 2) Reorder signal_arch_init(), which obtains the signal frame size, to run before DR installs its handler. 3) Obtains the app handler before installing DR's handler, eliminating a (narrow) race window. 4) Until DR starts executing the app, continues delivering native signals and using the globally recorded app handler, to match how DR init works. 5) Set detacher_tid between init and setup to avoid races like in DR's handler at the end of init that were under #3535 6) Handle a race where the init thread has set the global try_except, causing master_signal_handler_C to think an app thread's signal is DR's. We add a global_try_tid and check the thread id to solve this. Augments api.detach_signal with signals sent during init. This causes the test to hit a DR bug in the vsyscall hook (#4664) on 32-bit Linux. For now we disable the test for 32-bit. Also tested on a large proprietary application. Issue: #1921, #3535, #4664

hgreving2304 added OpSys-Linux Bug-AppFail labels Apr 11, 2019

hgreving2304 self-assigned this Apr 11, 2019

This was referenced Apr 12, 2019

i#3535 detach races: Mitigates asynchronous signal races during detach. #3539

Merged

handle pre-thread-init and post-exit signals #26

Open

derekbruening assigned derekbruening and unassigned hgreving2304 Dec 29, 2020

derekbruening mentioned this issue Dec 29, 2020

CRASH from race on detach in LOG debug-build code #4641

Open

derekbruening mentioned this issue Dec 29, 2020

i#3535: Solve detach handler race #4642

Merged

derekbruening closed this as completed in #4642 Dec 30, 2020

derekbruening mentioned this issue Dec 30, 2020

i#1921 native sig: Properly set native signal mask #4644

Merged

derekbruening mentioned this issue Jan 7, 2021

i#1921 native sig: Handle signals during DR initialization #4662

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

safe_read magic unstable during detach from threaded app. #3535

safe_read magic unstable during detach from threaded app. #3535

hgreving2304 commented Apr 11, 2019

hgreving2304 commented Apr 12, 2019

hgreving2304 commented Apr 15, 2019

derekbruening commented Dec 29, 2020

safe_read magic unstable during detach from threaded app. #3535

safe_read magic unstable during detach from threaded app. #3535

Comments

hgreving2304 commented Apr 11, 2019

hgreving2304 commented Apr 12, 2019

hgreving2304 commented Apr 15, 2019

derekbruening commented Dec 29, 2020