Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TiFlash crash when preHandleSSTsToDTFiles in CI fullstack test #6616

Closed
hehechen opened this issue Jan 11, 2023 · 17 comments · Fixed by #6630
Closed

TiFlash crash when preHandleSSTsToDTFiles in CI fullstack test #6616

hehechen opened this issue Jan 11, 2023 · 17 comments · Fixed by #6630
Assignees
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. component/storage severity/minor type/bug The issue is confirmed as a bug.

Comments

@hehechen
Copy link
Contributor

hehechen commented Jan 11, 2023

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Set TiFlash replica. https://ci.pingcap.net/blue/rest/organizations/jenkins/pipelines/tiflash-ghpr-integration-tests/runs/10529/nodes/182/steps/305/log/?start=0

2. What did you expect to see? (Required)

Don't crash.

3. What did you see instead (Required)

TiFlash crashed.

[2023-01-10T03:20:36.216Z] [2023/01/10 11:10:23.602 +08:00] [ERROR] [Exception.cpp:89] ["Code: 49, e.displayText() = DB::Exception: RWLock::getLock(): RWLock is already locked in exclusive mode: physical_table_id=-1: (while preHandleSnapshot region_id=2650, index=7, term=6), e.what() = DB::Exception, Stack trace:\n\n\n 0x37f675e\tStackTrace::StackTrace() [tiflash+58681182]\n 0x37e5fb2\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+58613682]\n 0xc3f0480\tDB::RWLock::getLock(DB::RWLock::Type, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> > const&) [tiflash+205456512]\n 0xc3ed1d5\tDB::IStorage::tryLockTimed(std::__1::shared_ptr<DB::RWLock> const&, DB::RWLock::Type, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> > const&) const [tiflash+205443541]\n 0xc3ed7ae\tDB::IStorage::lockStructureForShare(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> > const&) [tiflash+205445038]\n 0xc897ff0\tDB::AtomicGetStorageSchema(std::__1::shared_ptr<DB::Region> const&, DB::TMTContext&)::$_6::operator()(bool) const [tiflash+210337776]\n 0xc897ac8\tDB::AtomicGetStorageSchema(std::__1::shared_ptr<DB::Region> const&, DB::TMTContext&) [tiflash+210336456]\n 0xc7eb592\tDB::KVStore::preHandleSSTsToDTFiles(std::__1::shared_ptr<DB::Region>, DB::SSTViewVec, unsigned long, unsigned long, DB::DM::FileConvertJobType, DB::TMTContext&) [tiflash+209630610]\n 0xc7eb108\tDB::KVStore::preHandleSnapshotToFiles(std::__1::shared_ptr<DB::Region>, DB::SSTViewVec, unsigned long, unsigned long, DB::TMTContext&) [tiflash+209629448]\n 0xc8a6484\tPreHandleSnapshot [tiflash+210396292]\n 0x7f04d4432930\tengine_store_ffi::_$LT$impl$u20$engine_store_ffi..interfaces..root..DB..EngineStoreServerHelper$GT$::pre_handle_snapshot::h13dfe909d7f6a3ea [libtiflash_proxy.so+76658992]\n 0x7f04d444508d\tengine_store_ffi::observer::pre_handle_snapshot_impl::hccb27a62a7e3c70f [libtiflash_proxy.so+76734605]\n 0x7f04d445a063\t_$LT$engine_store_ffi..observer..TiFlashObserver$u20$as$u20$raftstore..coprocessor..ApplySnapshotObserver$GT$::pre_apply_snapshot::_$u7b$$u7b$closure$u7d$$u7d$::h0f781b1b777e36fa [libtiflash_proxy.so+76820579]\n 0x7f04d44624e1\t_$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h7b86279f362c63a3 [libtiflash_proxy.so+76854497]\n 0x7f04d4438b79\tyatp::task::future::RawTask$LT$F$GT$::poll::hef947552eb87fdf3 [libtiflash_proxy.so+76684153]\n 0x7f04d7141ede\tyatp::task::future::TaskCell::poll::h7c797921d1a996a2 [libtiflash_proxy.so+123907806]\n 0x7f04d7142c94\t_$LT$yatp..task..future..Runner$u20$as$u20$yatp..pool..runner..Runner$GT$::handle::h4d1272e83351072c [libtiflash_proxy.so+123911316]\n 0x7f04d7141d43\tyatp::pool::worker::WorkerThread$LT$T$C$R$GT$::run::h7506ebb2efa488d0 [libtiflash_proxy.so+123907395]\n 0x7f04d7128ccd\tyatp::pool::builder::LazyBuilder$LT$T$GT$::build::_$u7b$$u7b$closure$u7d$$u7d$::h4ac5170a13db1e09 [libtiflash_proxy.so+123804877]\n 0x7f04d711c64c\tstd::sys_common::backtrace::__rust_begin_short_backtrace::hcf649c42df4fed43 [libtiflash_proxy.so+123754060]\n 0x7f04d714d1ed\tstd::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h64a73749dd71c9f8 [libtiflash_proxy.so+123953645]\n 0x7f04d711a641\t_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h85b1ff4258c2deaf [libtiflash_proxy.so+123745857]\n 0x7f04d714d3b6\tstd::panicking::try::do_call::he09d687beb31ad5b [libtiflash_proxy.so+123954102]\n 0x7f04d715fadb\t__rust_try [libtiflash_proxy.so+124029659]\n 0x7f04d714d2df\tstd::panicking::try::hec245749eb3c66e0 [libtiflash_proxy.so+123953887]\n 0x7f04d714ae51\tstd::panic::catch_unwind::h6f4be4704c94ef3b [libtiflash_proxy.so+123944529]\n 0x7f04d714cfba\tstd::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h2c29f8e8e8dfb511 [libtiflash_proxy.so+123953082]\n 0x7f04d714e627\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h5eb0e81e2f12c3b9 [libtiflash_proxy.so+123958823]\n 0x7f04d84fc873\tstd::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+144595059]\n \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108\n 0x7f04cf3f8ea5\tstart_thread [libpthread.so.0+32421]\n 0x7f04ced0796d\t__clone [libc.so.6+1042797]"] [source="DB::RawCppPtr DB::PreHandleSnapshot(DB::EngineStoreServerWrap *, DB::BaseBuffView, uint64_t, DB::SSTViewVec, uint64_t, uint64_t)"] [thread_id=50]

The thread_names created by TiFlash Proxy are RaftStoreProxy.
image

4. What is your TiFlash version? (Required)

f9167f3 in master branch

@hehechen hehechen added the type/bug The issue is confirmed as a bug. label Jan 11, 2023
@hehechen
Copy link
Contributor Author

hehechen commented Jan 11, 2023

image

It seems that the pthread_setname_np doesn't work as expected.

@CalvinNeo
Copy link
Member

v6.4.0 works fine
image

@hehechen
Copy link
Contributor Author

hehechen commented Jan 11, 2023

v6.4.0 works fine image

v6.5.0 also works fine.
image

@CalvinNeo
Copy link
Member

commit 445dc72
image

@CalvinNeo
Copy link
Member

d8c369c has this problem
image

@CalvinNeo
Copy link
Member

CalvinNeo commented Jan 11, 2023

e255112 has this problem
image

So,
image

@CalvinNeo
Copy link
Member

3b3048e
image

@CalvinNeo
Copy link
Member

41c08db which is exactly tiup's version of release-6.5
image

@CalvinNeo
Copy link
Member

image

It seems that the pthread_setname_np doesn't work as expected.

Seems TiFlash uses prctl

void setThreadName(const char * tname)
{
    constexpr auto max_len = 15; // thread name will be tname[:MAX_LEN]
    // TODO: Replace strlen for safety
    if (std::strlen(tname) > max_len)
        std::cerr << "set thread name " << tname << " is too long and will be truncated by system\n";

#if defined(__FreeBSD__)
    pthread_set_name_np(pthread_self(), tname);
    return;

#elif defined(__APPLE__)
    if (0 != pthread_setname_np(tname))
#else
    if (0 != prctl(PR_SET_NAME, tname, 0, 0, 0))
#endif
    DB::throwFromErrno("Cannot set thread name " + std::string(tname), DB::ErrorCodes::PTHREAD_ERROR);
}

@CalvinNeo
Copy link
Member

CalvinNeo commented Jan 11, 2023

In rust, Thread::set_name always lead to pthread_setname_np on Linux platform.

The problem is possibly due to pthread_setname_np is missing in some glibc, so it will fallback to libglibc-compatibility/glibc-compatibility.c, which is an empty impl.

Will the problem be solved if we impl this in libglibc-compatibility/glibc-compatibility.c, or just delete the fallback, or make it weak symbol? Or we actually need to avoid calling pthread_setname_np?
@hehechen

int pthread_setname_np(pthread_t thread, const char *name) { return 0; }
int pthread_getname_np(pthread_t thread, char *name, size_t len) { name[0] = '\0'; return 0; };

@CalvinNeo
Copy link
Member

By removing the fallback codes, it works
image

@hehechen
Copy link
Contributor Author

I tried the release version and debug version built in CI:
release version:
wget http://fileserver.pingcap.net/download/builds/pingcap/tiflash/release/master/4a7f463aa6f48932befd3a2119b7214f45ff8a32/centos7/tiflash.tar.gz
debug version:
wget https://ci.pingcap.net/job/tiflash-build-common/17775/artifact/tiflash.tar.gz.
The release version use pthread_setname_np from /lib64/libpthread.so.0, so it works fine.
image
The debug version use pthread_setname_np from glibc-compatibility.c, so the thread names are all RaftStoreProxy.
image

@hehechen
Copy link
Contributor Author

hehechen commented Jan 11, 2023

In rust, Thread::set_name always lead to pthread_setname_np on Linux platform.

The problem is possibly due to pthread_setname_np is missing in some glibc, so it will fallback to libglibc-compatibility/glibc-compatibility.c, which is an empty impl.

Will the problem be solved if we impl this in libglibc-compatibility/glibc-compatibility.c, or just delete the fallback, or make it weak symbol? Or we actually need to avoid calling pthread_setname_np? @hehechen

int pthread_setname_np(pthread_t thread, const char *name) { return 0; }
int pthread_getname_np(pthread_t thread, char *name, size_t len) { name[0] = '\0'; return 0; };

https://docs.pingcap.com/tidb/stable/hardware-and-software-requirements
image
The glibc version required for running TiDB (including TiFlash) is 2.28-151.el8, and pthread_setname_np function has been added in glibc 2.12, so I think we can just delete the fallback.

@CalvinNeo
Copy link
Member

I tried the release version and debug version built in CI: release version: wget http://fileserver.pingcap.net/download/builds/pingcap/tiflash/release/master/4a7f463aa6f48932befd3a2119b7214f45ff8a32/centos7/tiflash.tar.gz debug version: wget https://ci.pingcap.net/job/tiflash-build-common/17775/artifact/tiflash.tar.gz. The release version use pthread_setname_np from /lib64/libpthread.so.0, so it works fine. image The debug version use pthread_setname_np from glibc-compatibility.c, so the thread names are all RaftStoreProxy. image

If we nm both the debug and release version, will all pthread_setname_np appear T?

@hehechen
Copy link
Contributor Author

I tried the release version and debug version built in CI: release version: wget http://fileserver.pingcap.net/download/builds/pingcap/tiflash/release/master/4a7f463aa6f48932befd3a2119b7214f45ff8a32/centos7/tiflash.tar.gz debug version: wget https://ci.pingcap.net/job/tiflash-build-common/17775/artifact/tiflash.tar.gz. The release version use pthread_setname_np from /lib64/libpthread.so.0, so it works fine. image The debug version use pthread_setname_np from glibc-compatibility.c, so the thread names are all RaftStoreProxy. image

If we nm both the debug and release version, will all pthread_setname_np appear T?

Release:
image
Debug:
image
The pthread_setname_np symbol is in local text in release version because there is -fvisibility=hidden flag in release version.

@CalvinNeo
Copy link
Member

See also #6456

@hehechen hehechen mentioned this issue Jan 12, 2023
12 tasks
@hehechen hehechen changed the title TiFlash crash when preHandleSSTsToDTFiles TiFlash crash when preHandleSSTsToDTFiles in CI fullstack test Jan 12, 2023
@hehechen hehechen self-assigned this Jan 12, 2023
@CalvinNeo
Copy link
Member

Note 6.1 proxy + 6.1 tiflash works fine since
4a68935b-3702-4025-a025-c43e3478491f

6.5 proxy + 6.1 tiflash works not find since
6d5235a5-e683-41bd-bdbf-c84abaf7393f

@JaySon-Huang JaySon-Huang added the affects-6.5 This bug affects the 6.5.x(LTS) versions. label Apr 25, 2023
ti-chi-bot bot pushed a commit that referenced this issue Apr 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. component/storage severity/minor type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants