Crashes when unwinding the stack from a signal handler interrupting deallocation #189

mbautin · 2023-06-16T23:00:20Z

After we upgraded the YugabyteDB codebase from Gperftools tcmalloc to this version of tcmalloc, we encountered the following types of crashes:

(lldb) target create "tests-util/debug-util-test" --core "core.92253"
Core file '/home/mbautin/code/yugabyte-db4/build/latest/core.92253' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'debug-util-test', stop reason = signal SIGSEGV
  * frame #0: 0x00007fd65a231acf libgcc_s.so.1`uw_frame_state_for + 1055
    frame #1: 0x00007fd65a233758 libgcc_s.so.1`_Unwind_Backtrace + 104
    frame #2: 0x00007fd65a577c56 libc.so.6`__backtrace + 102
    frame #3: 0x00007fd65c07c5a5 libyb_util.so`yb::StackTrace::Collect(this=0x00007fd653da4120, skip_frames=2) at debug-util.cc:433:17
    frame #4: 0x00007fd65c274385 libyb_util.so`yb::(anonymous namespace)::HandleStackTraceSignal(signum=12) at stack_trace.cc:183:15
    frame #5: 0x00007fd65a48ab20 libc.so.6`__restore_rt
    frame #6: 0x000055894739b5c8 debug-util-test`TcmallocSlab_Internal_PopBatch_trampoline

(lldb) bt
* thread #1, name = 'debug-util-test', stop reason = signal SIGSEGV
  * frame #0: 0x00007f8ce502aacf libgcc_s.so.1`uw_frame_state_for + 1055
    frame #1: 0x00007f8ce502c758 libgcc_s.so.1`_Unwind_Backtrace + 104
    frame #2: 0x00007f8ce5370c56 libc.so.6`__backtrace + 102
    frame #3: 0x00007f8ce6e755a5 libyb_util.so`yb::StackTrace::Collect(this=0x00007f8cdfb9f0e0, skip_frames=2) at debug-util.cc:433:17
    frame #4: 0x00007f8ce706d385 libyb_util.so`yb::(anonymous namespace)::HandleStackTraceSignal(signum=12) at stack_trace.cc:183:15
    frame #5: 0x00007f8ce5283b20 libc.so.6`__restore_rt
    frame #6: 0x000055f6fb79a40d debug-util-test`tcmalloc_internal_tls_fetch_pic + 77
    frame #7: 0x000055f6fb75639c debug-util-test`tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<tcmalloc::tcmalloc_internal::cpu_cache_internal::StaticForwarder>::Overflow(void*, unsigned long, int) + 252
    frame #8: 0x000055f6fb737c62 debug-util-test`operator delete(void*) + 1122
    frame #9: 0x000055f6fb6e712c debug-util-test`yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody(this=0x000024fd7eaa3650)::Entry::~Entry() at debug-util-test.cc:345:9
    frame #10: 0x000055f6fb6ebf75 debug-util-test`yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody(this=0x000024fd7fcab590)::$_0::operator()() const at debug-util-test.cc:385:11
    frame #11: 0x000055f6fb6ebd1f debug-util-test`void yb::TestThreadHolder::AddThreadFunctor<yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0>(this=0x000024fd7fcab588)::$_0 const&)::'lambda'()::operator()() const at test_thread_holder.h:62:7
    frame #12: 0x000055f6fb6ebcb5 debug-util-test`decltype(__f=0x000024fd7fcab588)::$_0>()()) std::__1::__invoke[abi:v160003]<void yb::TestThreadHolder::AddThreadFunctor<yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0>(yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0 const&)::'lambda'()>(yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0&&) at invoke.h:394:23
    frame #13: 0x000055f6fb6ebc8d debug-util-test`void std::__1::__thread_execute[abi:v160003]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void yb::TestThreadHolder::AddThreadFunctor<yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0>(yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0 const&)::'lambda'()>(__t=0x000024fd7fcab580, (null)=__tuple_indices<> @ 0x00007f8cdfba05a8)::$_0, void yb::TestThreadHolder::AddThreadFunctor<yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0>(yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0 const&)::'lambda'()>&, std::__1::__tuple_indices<>) at thread:282:5
    frame #14: 0x000055f6fb6ebab2 debug-util-test`void* std::__1::__thread_proxy[abi:v160003]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void yb::TestThreadHolder::AddThreadFunctor<yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0>(yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0 const&)::'lambda'()>>(__vp=0x000024fd7fcab580) at thread:293:5
    frame #15: 0x00007f8ce56021cf libpthread.so.0`start_thread + 239
    frame #16: 0x00007f8ce526edd3 libc.so.6`__clone + 67

We have a stack trace dump facility that sends a signal to threads, causing them to capture their own stacks. This is done using the Linux backtrace function, which uses libunwind internally. We did not have any problems with this approach with Gperftools tcmalloc, but with this tcmalloc we get segmentation faults when tcmalloc code is interrupted in functions such as tcmalloc_internal_tls_fetch_pic or TcmallocSlab_Internal_PopBatch_trampoline. We have a unit test that reliably reproduces this situation by creating a few threads that allocate objects and pass them to other threads for deallocation, while the main thread repeatedly tries to dump the stacks of those worker threads.
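For readers unfamiliar with this pattern, here is a minimal sketch of a signal-driven stack capture of the kind described above. It is not the YugabyteDB implementation: the names InstallStackTraceHandler and CaptureStackOf, the single global buffer, and the choice of SIGUSR2 (signal 12 on x86_64 Linux, matching signum=12 in the traces) are assumptions made for illustration.

```cpp
#include <execinfo.h>  // backtrace
#include <pthread.h>   // pthread_kill
#include <sched.h>     // sched_yield
#include <signal.h>    // sigaction, SIGUSR2

#include <atomic>
#include <cassert>
#include <cstring>

namespace {

constexpr int kMaxFrames = 64;

// Hypothetical shared buffer: one capture at a time, to keep the sketch short.
void* g_frames[kMaxFrames];
std::atomic<int> g_num_frames{-1};

// Runs on the interrupted thread. The unwinder has to walk through whatever
// code was interrupted, which is where the crashes in this issue occur.
void HandleStackTraceSignal(int /*signum*/) {
  const int n = backtrace(g_frames, kMaxFrames);
  g_num_frames.store(n, std::memory_order_release);
}

}  // namespace

// Installs the handler. Returns true on success.
bool InstallStackTraceHandler() {
  // Call backtrace once up front so that glibc loads libgcc now rather than
  // inside the signal handler, where the load could call malloc.
  void* warmup[1];
  backtrace(warmup, 1);

  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = HandleStackTraceSignal;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  return sigaction(SIGUSR2, &sa, nullptr) == 0;
}

// Asks `thread` to capture its stack and waits for the result.
// Returns the number of frames captured, or -1 if the signal was not sent.
int CaptureStackOf(pthread_t thread) {
  g_num_frames.store(-1, std::memory_order_relaxed);
  if (pthread_kill(thread, SIGUSR2) != 0) return -1;
  int n;
  while ((n = g_num_frames.load(std::memory_order_acquire)) < 0) {
    sched_yield();  // Wait until the handler has published its frame count.
  }
  assert(n <= kMaxFrames);
  return n;
}
```

A real facility would need per-request buffers and a timeout on the wait; both are omitted here to keep the control flow visible.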

As far as I know, the libunwind backtrace facility is async-signal-safe and suitable for use in a signal handler. We are currently using LLVM 15's version of libunwind.

Has anyone else encountered this issue and is there a known workaround?


mbautin · 2023-06-16T23:01:02Z

We are currently using this fork of tcmalloc: https://github.com/yugabyte/tcmalloc/tree/e116a66-yb (based on commit e116a66 with some build-related changes).

Summary: When trying to capture a stack trace with a signal handler, if a memory allocation/deallocation is happening in the thread receiving the signal, the process could crash. Google TCMalloc issue: google/tcmalloc#189.

In this diff, we are using the IsCurThreadInAllocDealloc malloc extension API we added in yugabyte/tcmalloc@677ba2d to skip capturing the stack trace in case the signal interrupted a thread that is currently allocating or deallocating memory. In such cases, we produce an empty stack trace which is later omitted from the overall threads dump. #17889 is a follow-up issue for retrying obtaining stack traces in such cases.

Another change contained in the TCMalloc version that we are upgrading to is yugabyte/tcmalloc@d1b0e69 (adding an option to not seed lifetime profiler with live allocations). We are now setting seed_with_live_allocs to false when capturing an allocation profile.

Test Plan: Jenkins
Reviewers: asrivastava
Reviewed By: asrivastava
Subscribers: ybase, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D26349
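The guard described in this summary can be sketched roughly as follows. The real check is the IsCurThreadInAllocDealloc extension in the yugabyte/tcmalloc fork, whose exact signature is not shown in this thread. The thread-local counter, ScopedAllocMarker, IsCurThreadInAllocDeallocSketch, and CollectStackIfSafe below are hypothetical stand-ins meant only to illustrate the idea of returning an empty trace when the interrupted thread is inside the allocator.

```cpp
#include <execinfo.h>  // backtrace

#include <cassert>
#include <csignal>  // std::sig_atomic_t

namespace sketch {

// Hypothetical per-thread marker that an allocator would maintain around its
// allocation and deallocation paths.
thread_local volatile std::sig_atomic_t tl_alloc_depth = 0;

// RAII helper that marks the current thread as being inside the allocator.
struct ScopedAllocMarker {
  ScopedAllocMarker() { tl_alloc_depth = tl_alloc_depth + 1; }
  ~ScopedAllocMarker() {
    assert(tl_alloc_depth > 0);
    tl_alloc_depth = tl_alloc_depth - 1;
  }
};

// Stand-in for the fork's IsCurThreadInAllocDealloc check (an assumption,
// not the actual API).
bool IsCurThreadInAllocDeallocSketch() { return tl_alloc_depth > 0; }

// Intended to be called from the signal handler. Returns the number of frames
// captured, or 0 (an empty trace) when unwinding is skipped.
int CollectStackIfSafe(void** frames, int max_frames) {
  if (IsCurThreadInAllocDeallocSketch()) {
    return 0;  // Interrupted inside the allocator: do not unwind.
  }
  return backtrace(frames, max_frames);
}

}  // namespace sketch
```

The caller then treats a zero-length trace as "skipped" and leaves that thread out of the dump, which matches the behavior the summary describes. Retrying later, as the follow-up issue proposes, would sit on top of this check.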

mbautin mentioned this issue Jun 21, 2023

[DocDB] google tcmalloc can crash with a SEGV on stack trace dump yugabyte/yugabyte-db#17875

Closed

1 task
