We have a stack trace dump facility that sends a signal to each thread, causing it to capture its own stack in the signal handler. The capture uses the Linux backtrace function, which uses libunwind internally. We had no problems with this approach under Gperftools tcmalloc, but with this tcmalloc we get segmentation faults when the signal interrupts tcmalloc code in functions such as tcmalloc_internal_tls_fetch_pic or TcmallocSlab_Internal_PopBatch_trampoline. We have a unit test that reliably reproduces this: a few threads allocate objects and pass them to other threads for deallocation, while the main thread repeatedly tries to dump the stacks of those worker threads.
As far as I know, the libunwind backtrace facility is async-signal-safe and suitable for use in a signal handler. We are currently using the libunwind version from LLVM 15.
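For readers who want to picture the mechanism, here is a minimal sketch of a signal-driven stack capture of the kind described above. It is not the actual YugabyteDB facility: the names DumpHandler, InstallDumpHandler, CaptureStackOf, the choice of SIGUSR2, and the single shared frame buffer are all assumptions made for illustration.

```cpp
#include <execinfo.h>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

#include <atomic>

constexpr int kMaxFrames = 64;
constexpr int kDumpSignal = SIGUSR2;  // Assumption: any otherwise unused signal.

// One shared buffer, so this sketch supports only one capture at a time.
void* g_frames[kMaxFrames];
std::atomic<int> g_num_frames{-1};  // -1 means "no result yet".

// Runs on whichever thread received the signal and records that thread's stack.
void DumpHandler(int /*signo*/) {
  int n = backtrace(g_frames, kMaxFrames);
  g_num_frames.store(n, std::memory_order_release);
}

bool InstallDumpHandler() {
  // Call backtrace once outside the handler so any lazy initialization it
  // needs happens here rather than inside the signal handler.
  void* warmup[1];
  backtrace(warmup, 1);

  struct sigaction sa = {};
  sa.sa_handler = DumpHandler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  return sigaction(kDumpSignal, &sa, nullptr) == 0;
}

// Asks `target` to capture its stack; returns the frame count, or -1 if the
// signal could not be sent or no result arrived within about one second.
int CaptureStackOf(pthread_t target) {
  g_num_frames.store(-1, std::memory_order_release);
  if (pthread_kill(target, kDumpSignal) != 0) return -1;
  for (int i = 0; i < 1000; ++i) {
    int n = g_num_frames.load(std::memory_order_acquire);
    if (n >= 0) return n;
    usleep(1000);
  }
  return -1;
}
```

In a reproducer shaped like the unit test above, the main thread would call InstallDumpHandler() once and then call CaptureStackOf() in a loop with the pthread_t of each worker that is busy allocating and freeing memory.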
Has anyone else encountered this issue and is there a known workaround?
Summary:
When trying to capture a stack trace with a signal handler, if a memory allocation/deallocation is happening in the thread receiving the signal, the process could crash. Google TCMalloc issue: google/tcmalloc#189.
In this diff, we are using the IsCurThreadInAllocDealloc malloc extension API we added in yugabyte/tcmalloc@677ba2d to skip capturing the stack trace in case the signal interrupted a thread that is currently allocating or deallocating memory. In such cases, we produce an empty stack trace which is later omitted from the overall threads dump. #17889 is a follow-up issue for retrying obtaining stack traces in such cases.
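The guard described above can be sketched roughly as follows. The exact signature of IsCurThreadInAllocDealloc is not shown here, so the sketch substitutes a hypothetical thread-local flag (tl_in_alloc_dealloc) for it; CapturedStack, CaptureUnlessAllocating, and NonEmpty are likewise illustrative names, not code from the diff.

```cpp
#include <execinfo.h>
#include <signal.h>

#include <vector>

// Hypothetical stand-in for the allocator query. In the real change, the
// check would go through the IsCurThreadInAllocDealloc malloc extension API.
thread_local volatile sig_atomic_t tl_in_alloc_dealloc = 0;

bool InAllocDeallocForSketch() { return tl_in_alloc_dealloc != 0; }

struct CapturedStack {
  static constexpr int kMaxFrames = 64;
  void* frames[kMaxFrames] = {};
  int num_frames = 0;  // 0 means "empty trace".
};

// Intended to be called from the signal handler on the interrupted thread.
// If the thread was inside an allocation or deallocation, skip unwinding
// and leave the trace empty instead of risking a crash.
void CaptureUnlessAllocating(CapturedStack* out) {
  if (InAllocDeallocForSketch()) {
    out->num_frames = 0;
    return;
  }
  out->num_frames = backtrace(out->frames, CapturedStack::kMaxFrames);
}

// Builds the list of traces to report, omitting the empty ones.
std::vector<const CapturedStack*> NonEmpty(
    const std::vector<CapturedStack>& all) {
  std::vector<const CapturedStack*> result;
  for (const CapturedStack& s : all) {
    if (s.num_frames > 0) result.push_back(&s);
  }
  return result;
}
```

The trade-off in this design is that a thread caught mid-allocation simply disappears from that dump, which is why a retry makes sense as a follow-up.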
Another change contained in the TCMalloc version that we are upgrading to is yugabyte/tcmalloc@d1b0e69 (adding an option to not seed lifetime profiler with live allocations). We are now setting seed_with_live_allocs to false when capturing an allocation profile.
Test Plan: Jenkins
Reviewers: asrivastava
Reviewed By: asrivastava
Subscribers: ybase, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D26349
These crashes began after we upgraded the YugabyteDB codebase from Gperftools tcmalloc to this version of TCMalloc.