Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TiFlash get stuck occasionally when mem-profiling is actived #3236

Closed
JaySon-Huang opened this issue Oct 15, 2021 · 12 comments · Fixed by #9240
Closed

TiFlash get stuck occasionally when mem-profiling is actived #3236

JaySon-Huang opened this issue Oct 15, 2021 · 12 comments · Fixed by #9240
Assignees
Labels
affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. component/storage help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. severity/minor type/bug The issue is confirmed as a bug.

Comments

@JaySon-Huang
Copy link
Contributor

Branch: https://github.com/JaySon-Huang/tics/tree/add_mem_prof

Start a TiFlash compiled with -DENABLE_JEMALLOC_PROF=1, and start tiflash with env export MALLOC_CONF="prof:true,prof_active:true,lg_prof_interval:33", TiFlash get stuck occasionally. The stacks are as below:
stuck-stacks-log.tar.gz

@JaySon-Huang JaySon-Huang added the type/bug The issue is confirmed as a bug. label Oct 15, 2021
@JaySon-Huang
Copy link
Contributor Author

Seems that it is the same issue as libunwind/libunwind#16

Thread 22 (Thread 0x7f99383ff700 (LWP 1943)):
#0  0x00007f996b76b4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f996b766de6 in _L_lock_941 () from /lib64/libpthread.so.0
#2  0x00007f996b766cdf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f996ad5a36f in dl_iterate_phdr () from /lib64/libc.so.6
#4  0x0000000007dca95e in _ULx86_64_dwarf_find_proc_info ()
#5  0x0000000007dc763f in _ULx86_64_dwarf_step ()
#6  0x0000000007dc4793 in _ULx86_64_step ()
#7  0x0000000007dc5668 in _ULx86_64_tdep_trace ()
#8  0x0000000007dc3a17 in unw_backtrace ()
#9  0x0000000007d78dc9 in prof_backtrace ()
#10 0x0000000007d5f3aa in calloc ()
#11 0x0000000003697947 in Allocator<true>::alloc(unsigned long, unsigned long) ()
#12 0x00000000077daabf in DB::DataTypeEnum<short>::DataTypeEnum(std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, short>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, short> > > const&) ()
#13 0x00000000077d21a6 in std::shared_ptr<DB::IDataType const> DB::create<DB::DataTypeEnum<short> >(std::shared_ptr<DB::IAST> const&) ()
#14 0x0000000006b6ec14 in std::_Function_handler<std::shared_ptr<DB::IDataType const> (std::shared_ptr<DB::IAST> const&), std::shared_ptr<DB::IDataType const> (*)(std::shared_ptr<DB::IAST> const&)>::_M_invoke(std::_Any_data const&, std::shared_ptr<DB::IAST> const&) ()
#15 0x0000000006b8864a in DB::DataTypeFactory::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<DB::IAST> const&) const ()
#16 0x0000000006b887e7 in DB::DataTypeFactory::get(std::shared_ptr<DB::IAST> const&) const ()
#17 0x000000000728b707 in DB::parseColumns(DB::ASTExpressionList const&, DB::Context const&) ()
#18 0x000000000728d0e1 in DB::InterpreterCreateQuery::getColumnsDescription(DB::ASTExpressionList const&, DB::Context const&) ()
#19 0x0000000006bd6d94 in DB::createTableFromDefinition(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DB::Context&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#20 0x0000000006bd8dc0 in DB::DatabaseLoading::loadTable(DB::Context&, DB::IDatabase&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
#21 0x0000000006bcd346 in DB::DatabaseTiFlash::loadTables(DB::Context&, ThreadPool*, bool)::{lambda(__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, __gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >)#1}::operator()(__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, __gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >) const ()
#22 0x0000000007d3a4b7 in ThreadPool::worker() ()
#23 0x000000000884a89f in std::execute_native_thread_routine (__p=0x7f99663b5cd0) at ../../../.././libstdc++-v3/src/c++11/thread.cc:83
#24 0x00007f996b764dd5 in start_thread () from /lib64/libpthread.so.0
#25 0x00007f996ad1bead in clone () from /lib64/libc.so.6

@SchrodingerZhu
Copy link
Contributor

@SchrodingerZhu
Copy link
Contributor

Is it possible then to use llvm's libunwind?

@SchrodingerZhu
Copy link
Contributor

The problem is that we need to vendor libunwind on ourselves. I think no one could like to make llvm-project as a sub project. xD

@JaySon-Huang
Copy link
Contributor Author

JaySon-Huang commented Oct 18, 2021

Maybe we can use https://github.com/ClickHouse-Extras/libunwind? Seems it is a fork from llvm's libunwind

I'll have a try to use ClickHouse-Extras/libunwind

@JaySon-Huang
Copy link
Contributor Author

Failed to build tiflash with ClickHouse-Extras/libunwind

Consolidate compiler generated dependencies of target tiflash
[100%] Linking CXX executable tiflash
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libgcc_eh.a(unwind-dw2.o): In function `_Unwind_GetGR':
/root/flow/gcc-7.3.0/x86_64-pc-linux-gnu/libgcc/../.././libgcc/unwind-dw2.c:233: multiple definition of `_Unwind_GetGR'
../../../contrib/libunwind-cmake/libunwindd.a(UnwindLevel1.c.o):/data1/jaysonhuang/tics-master/contrib/libunwind/src/UnwindLevel1.c:475: first defined here
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libgcc_eh.a(unwind-dw2.o): In function `_Unwind_SetGR':
/root/flow/gcc-7.3.0/x86_64-pc-linux-gnu/libgcc/../.././libgcc/unwind-dw2.c:273: multiple definition of `_Unwind_SetGR'
../../../contrib/libunwind-cmake/libunwindd.a(UnwindLevel1.c.o):/data1/jaysonhuang/tics-master/contrib/libunwind/src/UnwindLevel1.c:486: first defined here
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libgcc_eh.a(unwind-dw2.o): In function `_Unwind_GetIP':
/root/flow/gcc-7.3.0/x86_64-pc-linux-gnu/libgcc/../.././libgcc/unwind-dw2.c:345: multiple definition of `_Unwind_GetIP'
../../../contrib/libunwind-cmake/libunwindd.a(UnwindLevel1.c.o):/data1/jaysonhuang/tics-master/contrib/libunwind/src/UnwindLevel1.c:495: first defined here
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libgcc_eh.a(unwind-dw2.o): In function `_Unwind_SetIP':
/root/flow/gcc-7.3.0/x86_64-pc-linux-gnu/libgcc/../.././libgcc/unwind-dw2.c:363: multiple definition of `_Unwind_SetIP'
../../../contrib/libunwind-cmake/libunwindd.a(UnwindLevel1.c.o):/data1/jaysonhuang/tics-master/contrib/libunwind/src/UnwindLevel1.c:508: first defined here
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libgcc_eh.a(unwind-dw2.o): In function `_Unwind_GetLanguageSpecificData':
/root/flow/gcc-7.3.0/x86_64-pc-linux-gnu/libgcc/../.././libgcc/unwind-dw2.c:369: multiple definition of `_Unwind_GetLanguageSpecificData'
../../../contrib/libunwind-cmake/libunwindd.a(UnwindLevel1.c.o):/data1/jaysonhuang/tics-master/contrib/libunwind/src/UnwindLevel1.c:428: first defined here
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libgcc_eh.a(unwind-dw2.o): In function `_Unwind_GetRegionStart':
/root/flow/gcc-7.3.0/x86_64-pc-linux-gnu/libgcc/../.././libgcc/unwind-dw2.c:375: multiple definition of `_Unwind_GetRegionStart'
../../../contrib/libunwind-cmake/libunwindd.a(UnwindLevel1.c.o):/data1/jaysonhuang/tics-master/contrib/libunwind/src/UnwindLevel1.c:449: first defined here
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libgcc_eh.a(unwind-dw2.o): In function `_Unwind_RaiseException':
/root/flow/gcc-7.3.0/x86_64-pc-linux-gnu/libgcc/../.././libgcc/unwind.inc:83: multiple definition of `_Unwind_RaiseException'
../../../contrib/libunwind-cmake/libunwindd.a(UnwindLevel1.c.o):/data1/jaysonhuang/tics-master/contrib/libunwind/src/UnwindLevel1.c:350: first defined here
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libgcc_eh.a(unwind-dw2.o): In function `_Unwind_ForcedUnwind':
/root/flow/gcc-7.3.0/x86_64-pc-linux-gnu/libgcc/../.././libgcc/unwind.inc:197: multiple definition of `_Unwind_ForcedUnwind'
../../../contrib/libunwind-cmake/libunwindd.a(UnwindLevel1.c.o):/data1/jaysonhuang/tics-master/contrib/libunwind/src/UnwindLevel1.c:409: first defined here
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libgcc_eh.a(unwind-dw2.o): In function `_Unwind_Resume':
/root/flow/gcc-7.3.0/x86_64-pc-linux-gnu/libgcc/../.././libgcc/unwind.inc:220: multiple definition of `_Unwind_Resume'
../../../contrib/libunwind-cmake/libunwindd.a(UnwindLevel1.c.o):/data1/jaysonhuang/tics-master/contrib/libunwind/src/UnwindLevel1.c:385: first defined here
/usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/libgcc_eh.a(unwind-dw2.o): In function `_Unwind_DeleteException':
/root/flow/gcc-7.3.0/x86_64-pc-linux-gnu/libgcc/../.././libgcc/unwind.inc:270: multiple definition of `_Unwind_DeleteException'
../../../contrib/libunwind-cmake/libunwindd.a(UnwindLevel1.c.o):/data1/jaysonhuang/tics-master/contrib/libunwind/src/UnwindLevel1.c:465: first defined here
collect2: error: ld returned 1 exit status
make[3]: *** [dbms/src/Server/tiflash] Error 1
make[2]: *** [dbms/src/Server/CMakeFiles/tiflash.dir/all] Error 2
make[1]: *** [dbms/src/Server/CMakeFiles/tiflash.dir/rule] Error 2
make: *** [tiflash] Error 2

@JaySon-Huang
Copy link
Contributor Author

Find that we can not static link libgcc && libstdc++ when using llvm/libunwind or ClickHouse-Extras/libunwind

@JaySon-Huang JaySon-Huang self-assigned this Oct 19, 2021
@JaySon-Huang
Copy link
Contributor Author

JaySon-Huang commented Oct 20, 2021

If we want to use llvm/libcxx && llvm/libcxx-abi, then we need all libraries built with them instead of libgcc && libstdc++

Or we may end-up with errors like

/usr/local/lib/libgrpc++.a(client_channel.cc.o): In function `grpc_core::(anonymous namespace)::ChannelData::SubchannelWrapper::CancelConnectivityStateWatch(grpc_core::SubchannelInterface::ConnectivityStateWatcherInterface*)':
client_channel.cc:(.text+0xe10): undefined reference to `std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)'
/usr/local/lib/libgrpc++.a(client_channel.cc.o): In function `grpc_core::(anonymous namespace)::CallData::RecvTrailingMetadataReadyForLoadBalancingPolicy(void*, grpc_error*)':
client_channel.cc:(.text+0x10df): undefined reference to `std::__throw_bad_function_call()'

@JaySon-Huang
Copy link
Contributor Author

JaySon-Huang commented Oct 26, 2021

https://github.com/JaySon-Huang/tics/tree/add_mem_prof
OPENSSL_ROOT_DIR=/usr/local/opt/openssl cmake -DCMAKE_BUILD_TYPE=Release -Wno-dev -DENABLE_TESTS=ON -DENABLE_JEMALLOC=ON -DENABLE_JEMALLOC_PROF=ON -DENABLE_UNWIND=ON -DNO_WERROR=OFF -DUSE_INTERNAL_LIBCXX_LIBRARY=OFF -DUSE_LIBCXX=ON ..

Compiled using clang 13.0.0, finally, tiflash get crash with stack like this:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe94ff700 (LWP 74968)]
0x000000000b4b6e06 in unw_get_reg ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libstdc++-4.8.5-36.el7.x86_64
(gdb) bt
#0  0x000000000b4b6e06 in unw_get_reg ()
#1  0x000000000b4ba676 in _Unwind_GetIP ()
#2  0x00007ffff51140be in backtrace::capture::Backtrace::create::_$u7b$$u7b$closure$u7d$$u7d$::h4b56b17a68ea9224 () from /data1/jaysonhuang/tiup-deploy/deploy/tiflash-5018/bin/tiflash/libtiflash_proxy.so
#3  0x00007ffff51139de in backtrace::backtrace::libunwind::trace::trace_fn::h16d5f3a13067d326 () from /data1/jaysonhuang/tiup-deploy/deploy/tiflash-5018/bin/tiflash/libtiflash_proxy.so
#4  0x00007ffff3893f99 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#5  0x00007ffff5113f7e in backtrace::capture::Backtrace::new::ha8e5a6922c70022e () from /data1/jaysonhuang/tiup-deploy/deploy/tiflash-5018/bin/tiflash/libtiflash_proxy.so
#6  0x00007ffff68bed2a in std::sys_common::backtrace::__rust_begin_short_backtrace::h08238d9c33726a64 () from /data1/jaysonhuang/tiup-deploy/deploy/tiflash-5018/bin/tiflash/libtiflash_proxy.so
#7  0x00007ffff68c759f in core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4355c43fef11373a () from /data1/jaysonhuang/tiup-deploy/deploy/tiflash-5018/bin/tiflash/libtiflash_proxy.so
#8  0x00007ffff61a957b in call_once<(),FnOnce<()>,alloc::alloc::Global> () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546
#9  call_once<(),alloc::boxed::Box<FnOnce<()>, alloc::alloc::Global>,alloc::alloc::Global> () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546
#10 std::sys::unix::thread::Thread::new::thread_start::hfbe13ead469fd0bc () at library/std/src/sys/unix/thread.rs:71
#11 0x00007ffff4175dd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ffff3e9eead in clone () from /lib64/libc.so.6

@JaySon-Huang
Copy link
Contributor Author

JaySon-Huang commented Oct 28, 2021

Maybe we can fix it with a similar approach as tikv/pprof-rs#85, need further investigation.
Related document of mem-profiling from tidb insight team. https://docs.google.com/document/d/1gjoMQFqiDoxw0QX65sCyqDWXYEGLLV4Fd3fw2bFNzzM/edit#

@JaySon-Huang JaySon-Huang added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Nov 23, 2021
@VelocityLight VelocityLight added the affects-6.1 This bug affects the 6.1.x(LTS) versions. label May 20, 2022
@flowbehappy
Copy link
Contributor

Change severity to minor since we don't do mem profile currently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-6.0 affects-6.1 This bug affects the 6.1.x(LTS) versions. component/storage help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. severity/minor type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants