Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRASH from race on detach in LOG debug-build code #4641

Open
derekbruening opened this issue Dec 29, 2020 · 1 comment
Open

CRASH from race on detach in LOG debug-build code #4641

derekbruening opened this issue Dec 29, 2020 · 1 comment

Comments

@derekbruening
Copy link
Contributor

While testing a fix for #3535, running debug build I hit this crash maybe once in 100 runs:

#4  0x00007fe6e8f590a5 in getchar () at getchar.c:37
#5  0x00005640f8054db3 in handle_signal (signal=11, siginfo=0x7fe6e56cff30, ucxt=0x7fe6e56cfe00) at /home/bruening/dr/git/src/suite/tests/api/detach_signal.cpp:105
#6  <signal handler called>
#7  0x00007fe6e93d1c72 in master_signal_handler_C (sig=7, siginfo=0x7fe6e56d0770, ucxt=0x7fe6e56d0640, xsp=0x7fe6e56d0638 ";\016\071\351\346\177") at /home/bruening/dr/git/src/core/unix/signal.c:5107

   0x00007fe6e93d1c59 <+1004>:	lea    0x161388(%rip),%rax        # 0x7fe6e9532fe8 <d_r_stats>
   0x00007fe6e93d1c60 <+1011>:	mov    (%rax),%rax
   0x00007fe6e93d1c63 <+1014>:	test   %rax,%rax
   0x00007fe6e93d1c66 <+1017>:	je     0x7fe6e93d1ccc <master_signal_handler_C+1119>
   0x00007fe6e93d1c68 <+1019>:	lea    0x161379(%rip),%rax        # 0x7fe6e9532fe8 <d_r_stats>
   0x00007fe6e93d1c6f <+1026>:	mov    (%rax),%rax
==>
   0x00007fe6e93d1c72 <+1029>:	mov    0x218(%rax),%eax
   0x00007fe6e93d1c78 <+1035>:	test   %eax,%eax
   0x00007fe6e93d1c7a <+1037>:	je     0x7fe6e93d1ccc <master_signal_handler_C+1119>
   0x00007fe6e93d1c7c <+1039>:	lea    0x161365(%rip),%rax        # 0x7fe6e9532fe8 <d_r_stats>
   0x00007fe6e93d1c83 <+1046>:	mov    (%rax),%rax
   0x00007fe6e93d1c86 <+1049>:	mov    0x214(%rax),%eax
   0x00007fe6e93d1c8c <+1055>:	and    $0x10,%eax
   0x00007fe6e93d1c8f <+1058>:	test   %eax,%eax
   0x00007fe6e93d1c91 <+1060>:	je     0x7fe6e93d1ccc <master_signal_handler_C+1119>
   0x00007fe6e93d1c93 <+1062>:	callq  0x7fe6e93ae109 <get_sys_thread_id>

(gdb) p &((dr_statistics_t*)0)->loglevel
$2 = (uint *) 0x218
(gdb) p d_r_stats
$3 = (dr_statistics_t *) 0x0

Must be a race where d_r_stats is set to NULL in between the check for NULL and
the de-ref of loglevel.

I was not planning to fix it: b/c it would require eliminating all LOG calls on detach paths, which are useful for debugging, and b/c it is limited to debug build.

derekbruening added a commit that referenced this issue Dec 29, 2020
Eliminates a race on detach where the detaching thread removes DR's
SIGSEGV handler while a now-native detached thread is in the middle of
having a signal delivered (natively) and invokes a TLS magic field
safe read, whose SIGSEGV then goes to the application.

The detaching thread is the one doing all the real cleanup, so we
simply avoid any safe reads or TLS for detaching threads by recording
the detacher's ID when we start the detach process.  This var is not
cleared until re-init, so we have no race with the end of detach.

Tested on api.detach_signal with the forthcoming signal mask checks,
which trigger when the handler is invoked for a DR signal instead of
an app-generated signal.  Without this fix, the test fails easily:
about 1 in 5 runs in debug build.  With this fix, it succeeds 200x in
a loop.  I still see one type of crash in debug build, a rare race
where d_r_stats is set to NULL in between the check and use of a
LOG(), but that is limited to debug and is beyond the scope of this
fix and is much lower priority: I filed it as #4641.

Fixes #3535
derekbruening added a commit that referenced this issue Dec 30, 2020
Eliminates a race on detach where the detaching thread removes DR's
SIGSEGV handler while a now-native detached thread is in the middle of
having a signal delivered (natively) and invokes a TLS magic field
safe read, whose SIGSEGV then goes to the application.

The detaching thread is the one doing all the real cleanup, so we
simply avoid any safe reads or TLS for detaching threads by recording
the detacher's ID when we start the detach process.  This var is not
cleared until re-init, so we have no race with the end of detach.

Tested on api.detach_signal with the forthcoming signal mask checks,
which trigger when the handler is invoked for a DR signal instead of
an app-generated signal.  Without this fix, the test fails easily:
about 1 in 5 runs in debug build.  With this fix, it succeeds 200x in
a loop.  I still see one type of crash in debug build, a rare race
where d_r_stats is set to NULL in between the check and use of a
LOG(), but that is limited to debug and is beyond the scope of this
fix and is much lower priority: I filed it as #4641.

Fixes #3535
@derekbruening
Copy link
Contributor Author

The best way to solve this may be to leave the useful LOG calls but move
loglevel to a static global struct that's not freed (i.e., make d_r_stats
like the options structs), or even its own special global var. IIRC it was
placed in d_r_stats back when we had a UI to tweak options via shared
memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant