-
Notifications
You must be signed in to change notification settings - Fork 566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add proper signal support while threads are native #1921
Comments
I'm considering this issue to also cover signal support for start/stop where all threads are either native or under DR. Because handling signals for just one thread being native is difficult (we aren't tracking syscalls so we'll have to query for the proper handler and mask) the plan for start/stop whole-process is to remove our handler while the app is native. |
The api.detach_spawn test I'm trying to add for #2601 is revealing a bunch of races, some of which overlap with this issue. Today there are race windows where we haven't removed our handler completely: e.g., during detach we remove the alarm handlers after the synchall before we let any thread go native, but we leave other signal handlers. All of this has corner cases with no guarantees. For alarms we can be reasonably confident that the app's handler will come right back but that may not be true for other signals. This affects us whether we keep our handler and implement the feature covered here, sending native control to the app's handler, or whether we drop our handler and let the kernel send native control to the app's handler. The only way to get guarantees is to delay all signals which gets complicated when trying to clean up and detach. For doing native delivery, the kernel mask should be native, so we should be able to just undo the master_signal_handler_C frame and jump to the app's handler if it exists: if not we'll have to decide what to do. If the native action is to terminate and we're during detach it'd be best to wait so that any client can run exit time code. One option is to not let any thread go native until we've cleaned them all up. Then we remove our handler and then we let them all go at once. Performance will be worse though with 600 threads stacking up for much longer than before. |
xref #2941 |
A particular instance of the detach race described above is the following scenario: The application generates synchronous signals which it handles during normal execution (maybe SIGFPE, or a safe-read SIGSEGV, or something). We detach and send a thread native, but before we remove our handler, the thread raises one of its handled signals. Today our handler will panic and not cope well. This is similar to the asynchronous timer signal races, but has no (hacky?) fix like we do for alarms where we ignore them until we remove our handler. It should be easy to write a test for this case, and use it to drive developing the native-delivery mechanism. |
Adds a new execute_native_handler() feature where DR delivers a signal to a currently-native thread. For a native thread, a signal frame is set up on the appropriate stack, blocked signals are set natively rather than for emulation, and control is directly sent to the application's handler. Adds handling of asynchronous signals by not recording them for delivery later but delivering immediately. Adds a test that uses the client interface DR_EMIT_GO_NATIVE to send a thread native. The app then generates a synchronous and an asynchronous signal. Initial attempts are made to also handle a pure-native thread with no dcontex, such as late in detach, but that portion is not fully finished and will be completed separately. Handling deault-action signals is also not fully tested and likely still has some issues to be worked out. Fixes several small bugs in the go-native path hit by the test: + Moves dynamo_thread_not_under_dynamo() to dispatch_enter_native() to avoid problems unlocking bb_building_lock on the path back through dispatch. + Relaxes a TRY_EXCEPT assert to support a currently-native thread with no dcontext in TLS. + Suppresses a curiosity from a still-temporarily-native thread at exit. Issue: #1921
…4603) Adds a new execute_native_handler() feature where DR delivers a signal to a currently-native thread. For a native thread, a signal frame is set up on the appropriate stack, blocked signals are set natively rather than for emulation, and control is directly sent to the application's handler. Adds handling of asynchronous signals by not recording them for delivery later but delivering immediately. Adds a test that uses the client interface DR_EMIT_GO_NATIVE to send a thread native. The app then generates a synchronous and an asynchronous signal. Initial attempts are made to also handle a pure-native thread with no dcontex, such as late in detach, but that portion is not fully finished and will be completed separately. Handling deault-action signals is also not fully tested and likely still has some issues to be worked out. Fixes several small bugs in the go-native path hit by the test: + Moves dynamo_thread_not_under_dynamo() to dispatch_enter_native() to avoid problems unlocking bb_building_lock on the path back through dispatch. + Relaxes a TRY_EXCEPT assert to support a currently-native thread with no dcontext in TLS. + Suppresses a curiosity from a still-temporarily-native thread at exit. Issue: #1921
When a signal arrives in an completely unmanaged thread with no dcontext, typically because DR is detaching, we now deliver the signal if the application has a handler for it. This requires adding support for no dcontext to several parts of the frame setup code even beyond what was added in PR #4603 for temporarily-native threads. We have to save the app's handler when we detach a thread so we know where to send a native signal. Full support is complex when we're cleaning up and have no dynamic storage, so we use a single global handler per signal. We detect whether multiple handlers are in operation in this single DR instance (quite rare: only custom non-pthread clones have this behavior) and in that case we abort like before on a native signal. Adds ATOMIC_READ_1BYTE() to complement the existing atomic operations for a cleaner read of the new multi-handler flag. Delivering the frame often overlaps with DR's frame and even DR's stack usage while delivering, if the app has no sigaltstack. We add logic to detect this overlap and avoid clobbering the stack memory. Alarm signals are still dropped, since they can arrive mid-thread-init when it is even harder to deliver. Adds a new test api.detach_signal which creates 10 threads who all sit in a loop sending 4 different alternating signals (SIGSEGV, SIGBUS, SIGURG, SIGALRM) while the main thread attaches and then detaches. When run in debug build, many many signals arrive in post-detach threads, since detach takes a while to do debug cleanup, exercising the new code. Adds a new RSTAT for native signals so we can identify when this happens in release build. Exports the stat to the API and uses it to check that at least some signals were delivered natively in the new test. Removes the fatal error on a signal arriving with no dcontext. But, non-ignore default signal actions when no handler is present are still not fully handled, along with multi-sighand-processes as mentioned, and the fatal error remains in those cases. For default actions, since the process is going to terminate anyway, the only shortcoming of this is whether a core is generated and whether the proper process exit code is raised. Issue: #1921
When a signal arrives in an completely unmanaged thread with no dcontext, typically because DR is detaching, we now deliver the signal if the application has a handler for it. This requires adding support for no dcontext to several parts of the frame setup code even beyond what was added in PR #4603 for temporarily-native threads. We have to save the app's handler when we detach a thread so we know where to send a native signal. Full support is complex when we're cleaning up and have no dynamic storage, so we use a single global handler per signal. We detect whether multiple handlers are in operation in this single DR instance (quite rare: only custom non-pthread clones have this behavior) and in that case we abort like before on a native signal. Adds ATOMIC_READ_1BYTE() to complement the existing atomic operations for a cleaner read of the new multi-handler flag. Delivering the frame often overlaps with DR's frame and even DR's stack usage while delivering, if the app has no sigaltstack. We add logic to detect this overlap and avoid clobbering the stack memory. Alarm signals are still dropped, since they can arrive mid-thread-init when it is even harder to deliver. Adds a new test api.detach_signal which creates 10 threads who all sit in a loop sending 4 different alternating signals (SIGSEGV, SIGBUS, SIGURG, SIGALRM) while the main thread attaches and then detaches. When run in debug build, many many signals arrive in post-detach threads, since detach takes a while to do debug cleanup, exercising the new code. Adds a new RSTAT for native signals so we can identify when this happens in release build. Exports the stat to the API and uses it to check that at least some signals were delivered natively in the new test. Removes the fatal error on a signal arriving with no dcontext. But, non-ignore default signal actions when no handler is present are still not fully handled, along with multi-sighand-processes as mentioned, and the fatal error remains in those cases. For default actions, since the process is going to terminate anyway, the only shortcoming of this is whether a core is generated and whether the proper process exit code is raised. Issue: #1921
Adds proper signal mask setting for native signals, which requires adding handler-masked signals to the DR frame's mask instead of setting the mask in the kernel. Adds tests to api.detach_signal. These mask tests also detect when a DR-caused crash from a DR bug during detach shows up in the app's handler post-detach due to detach races, including from #3535. Issue: #1921, #3535
Adds proper signal mask setting for native signals, which requires adding handler-masked signals to the DR frame's mask instead of setting the mask in the kernel. Adds tests to api.detach_signal. These mask tests also detect when a DR-caused crash from a DR bug during detach shows up in the app's handler post-detach due to detach races, including from #3535. Issue: #1921, #3535
The native frame adjustments to handle the DR handler being on the native stack previously failed to handle an application with a sigaltstack. We correct that here. Adds sigaltstack testing to the api.detach_signal test. With a smaller stack, this exercises the stack-too-small error message path from DR. Issue: #1921
…#4645) The native frame adjustments to handle the DR handler being on the native stack previously failed to handle an application with a sigaltstack. We correct that here. Adds sigaltstack testing to the api.detach_signal test. With a smaller stack, this exercises the stack-too-small error message path from DR. Issue: #1921
Puts in place 6 fixes for handling signals during DR initialization, typically in a start/stop setup where other threads are alive. 1) Copy the app handler at init time for delivering native signals during init. 2) Reorder signal_arch_init(), which obtains the signal frame size, to run before DR installs its handler. 3) Obtains the app handler before installing DR's handler, eliminating a (narrow) race window. 4) Until DR starts executing the app, continues delivering native signals and using the globally recorded app handler, to match how DR init works. 5) Set detacher_tid between init and setup to avoid races like in DR's handler at the end of init that were under #3535 6) Handle a race where the init thread has set the global try_except, causing master_signal_handler_C to think an app thread's signal is DR's. We add a global_try_tid and check the thread id to solve this. Augments api.detach_signal with signals sent during init. Also tested on a large proprietary application. Issue: #1921, #3535
Puts in place 6 fixes for handling signals during DR initialization, typically in a start/stop setup where other threads are alive. 1) Copy the app handler at init time for delivering native signals during init. 2) Reorder signal_arch_init(), which obtains the signal frame size, to run before DR installs its handler. 3) Obtains the app handler before installing DR's handler, eliminating a (narrow) race window. 4) Until DR starts executing the app, continues delivering native signals and using the globally recorded app handler, to match how DR init works. 5) Set detacher_tid between init and setup to avoid races like in DR's handler at the end of init that were under #3535 6) Handle a race where the init thread has set the global try_except, causing master_signal_handler_C to think an app thread's signal is DR's. We add a global_try_tid and check the thread id to solve this. Augments api.detach_signal with signals sent during init. This causes the test to hit a DR bug in the vsyscall hook (#4664) on 32-bit Linux. For now we disable the test for 32-bit. Also tested on a large proprietary application. Issue: #1921, #3535, #4664
A fault in the executable that has a statically-linked client and DR is currently reported as a client crash, even though we have no way to distinguish client code from app code in such a situation. We change that here to invoke execute_native_handler(), which solves problems where the application has a fault handler in place and a tool or other mechanism that generates faults during DR init or other points. Updates the has-dcontext-but-during-init case from PR #4662 (issue #1921) to use the stored app sigstack and not query the kernel and find DR's sigstack. The api.static_crash exercises this path and we updated its template here, but it uses -unsafe_crash_process to trigger it. I tried to reproduce an incoming signal trigger by making a static version of api.detach_signal but it requires a fault-generating allocator replacement (hit by droption and other places that use the system allocator) or something similar, which is non-trivial to replicate in a small test. I did test on the original proprietary application. Fixes #4640
I think it's time to close this. Nearly everything is covered. There are a few cases that are not, such as too-small sigaltstack for the extra space we take to copy the frame, or default actions: but those are going to exit anyway. Let's file separate issues on anything that becomes important enough to spend time on; I think we can certainly live with the current shortcomings. |
xref #38
DR_EMIT_GO_NATIVE is considered experimental on Linux and one feature missing is:
The text was updated successfully, but these errors were encountered: