Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

safe_read magic unstable during detach from threaded app. #3535

Closed
hgreving2304 opened this issue Apr 11, 2019 · 3 comments · Fixed by #4642
Closed

safe_read magic unstable during detach from threaded app. #3535

hgreving2304 opened this issue Apr 11, 2019 · 3 comments · Fixed by #4642

Comments

@hgreving2304
Copy link

Following happens with a small reproducer that we produced off Google code.

  1. DR is attached to app running threaded code. The app is sending PROF signals (but type of signal doesn't matter, as long as it is asynchronous and process-directed).
  2. The app starts a detach via start/stop API (with_cleanup() path).
  3. Detach syncs and all threads go native. At this point, the thread's TLS (DynamoRIO's) has been removed.
  4. Main thread is not done cleaning up, signal handlers are still DR's.
  5. PROF signal comes in, child thread receives it. Signal handler calls safe_read_tls_magic().
    <=>
  6. Corner case: main thread removes signal handlers and restores the app's handlers.
  7. safe_read_tls_magic() segfaults because no TLS.
  8. SIGSEGV is delivered to app.
@hgreving2304
Copy link
Author

Still verifying if the SIGSEGV was really caused by a safe_read_tls_magic in the signal handler, or from somewhere else..

hgreving2304 pushed a commit that referenced this issue Apr 12, 2019
Removes two races that happen if asynchronous signals are delivered during detach, and the
app's signal handler is restored before the safe_read_tls_magic's SIGSEGV is delivered.

There are still possible detach related races left, if asynchronous process-directed
signals are delivered after detach is done, or the app has already started a re-attach.

Issue: #3535
hgreving2304 pushed a commit that referenced this issue Apr 15, 2019
…h. (#3539)

Removes two races that happen if asynchronous signals are delivered during detach, and the
app's signal handler is restored before the safe_read_tls_magic's SIGSEGV is delivered.

There are still possible detach related races left, if asynchronous process-directed
signals are delivered after detach is done, or the app has already started a re-attach.

Tested with local reproducer.

Issue: #3535
@hgreving2304
Copy link
Author

xref #26

hgreving2304 pushed a commit that referenced this issue Apr 22, 2019
…h. (#3539)

Removes two races that happen if asynchronous signals are delivered during detach, and the
app's signal handler is restored before the safe_read_tls_magic's SIGSEGV is delivered.

There are still possible detach related races left, if asynchronous process-directed
signals are delivered after detach is done, or the app has already started a re-attach.

Tested with local reproducer.

Issue: #3535
@derekbruening
Copy link
Contributor

I am hitting this while adding signal mask testing to the new api.detach_signal test. The mask testing distinguishes the app handler from DR's handler b/c DR's blocks more signals.

derekbruening added a commit that referenced this issue Dec 29, 2020
Eliminates a race on detach where the detaching thread removes DR's
SIGSEGV handler while a now-native detached thread is in the middle of
having a signal delivered (natively) and invokes a TLS magic field
safe read, whose SIGSEGV then goes to the application.

The detaching thread is the one doing all the real cleanup, so we
simply avoid any safe reads or TLS for detaching threads by recording
the detacher's ID when we start the detach process.  This var is not
cleared until re-init, so we have no race with the end of detach.

Tested on api.detach_signal with the forthcoming signal mask checks,
which trigger when the handler is invoked for a DR signal instead of
an app-generated signal.  Without this fix, the test fails easily:
about 1 in 5 runs in debug build.  With this fix, it succeeds 200x in
a loop.  I still see one type of crash in debug build, a rare race
where d_r_stats is set to NULL in between the check and use of a
LOG(), but that is limited to debug and is beyond the scope of this
fix and is much lower priority: I filed it as #4641.

Fixes #3535
derekbruening added a commit that referenced this issue Dec 30, 2020
Eliminates a race on detach where the detaching thread removes DR's
SIGSEGV handler while a now-native detached thread is in the middle of
having a signal delivered (natively) and invokes a TLS magic field
safe read, whose SIGSEGV then goes to the application.

The detaching thread is the one doing all the real cleanup, so we
simply avoid any safe reads or TLS for detaching threads by recording
the detacher's ID when we start the detach process.  This var is not
cleared until re-init, so we have no race with the end of detach.

Tested on api.detach_signal with the forthcoming signal mask checks,
which trigger when the handler is invoked for a DR signal instead of
an app-generated signal.  Without this fix, the test fails easily:
about 1 in 5 runs in debug build.  With this fix, it succeeds 200x in
a loop.  I still see one type of crash in debug build, a rare race
where d_r_stats is set to NULL in between the check and use of a
LOG(), but that is limited to debug and is beyond the scope of this
fix and is much lower priority: I filed it as #4641.

Fixes #3535
derekbruening added a commit that referenced this issue Dec 30, 2020
Adds proper signal mask setting for native signals, which requires
adding handler-masked signals to the DR frame's mask instead of
setting the mask in the kernel.

Adds tests to api.detach_signal.  These mask tests also detect when a
DR-caused crash from a DR bug during detach shows up in the app's
handler post-detach due to detach races, including from #3535.

Issue: #1921, #3535
derekbruening added a commit that referenced this issue Dec 30, 2020
Adds proper signal mask setting for native signals, which requires
adding handler-masked signals to the DR frame's mask instead of
setting the mask in the kernel.

Adds tests to api.detach_signal.  These mask tests also detect when a
DR-caused crash from a DR bug during detach shows up in the app's
handler post-detach due to detach races, including from #3535.

Issue: #1921, #3535
derekbruening added a commit that referenced this issue Jan 7, 2021
Puts in place 6 fixes for handling signals during DR initialization,
typically in a start/stop setup where other threads are alive.

1) Copy the app handler at init time for delivering native signals
during init.

2) Reorder signal_arch_init(), which obtains the signal frame size, to
run before DR installs its handler.

3) Obtains the app handler before installing DR's handler, eliminating
a (narrow) race window.

4) Until DR starts executing the app, continues delivering native
signals and using the globally recorded app handler, to match how DR
init works.

5) Set detacher_tid between init and setup to avoid races like in
DR's handler at the end of init that were under #3535

6) Handle a race where the init thread has set the global try_except,
causing master_signal_handler_C to think an app thread's signal is
DR's.  We add a global_try_tid and check the thread id to solve this.

Augments api.detach_signal with signals sent during init.

Also tested on a large proprietary application.

Issue: #1921, #3535
derekbruening added a commit that referenced this issue Jan 9, 2021
Puts in place 6 fixes for handling signals during DR initialization,
typically in a start/stop setup where other threads are alive.

1) Copy the app handler at init time for delivering native signals
during init.

2) Reorder signal_arch_init(), which obtains the signal frame size, to
run before DR installs its handler.

3) Obtains the app handler before installing DR's handler, eliminating
a (narrow) race window.

4) Until DR starts executing the app, continues delivering native
signals and using the globally recorded app handler, to match how DR
init works.

5) Set detacher_tid between init and setup to avoid races like in
DR's handler at the end of init that were under #3535

6) Handle a race where the init thread has set the global try_except,
causing master_signal_handler_C to think an app thread's signal is
DR's.  We add a global_try_tid and check the thread id to solve this.

Augments api.detach_signal with signals sent during init.
This causes the test to hit a DR bug in the vsyscall hook (#4664) on
32-bit Linux.  For now we disable the test for 32-bit.

Also tested on a large proprietary application.

Issue: #1921, #3535, #4664
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants