Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TiFlash can not start when restart multiple times #6486

Closed
hongyunyan opened this issue Dec 14, 2022 · 5 comments · Fixed by #6558
Closed

TiFlash can not start when restart multiple times #6486

hongyunyan opened this issue Dec 14, 2022 · 5 comments · Fixed by #6558
Assignees
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. component/storage severity/critical type/bug The issue is confirmed as a bug.

Comments

@hongyunyan
Copy link
Contributor

hongyunyan commented Dec 14, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Use nightly tiflash, and use tiup cluster restart multiple times, then tiflash can not start anymore.

2. What did you expect to see? (Required)

restart normally

3. What did you see instead (Required)

[2022/12/13 21:43:19.347 +08:00] [ERROR] [Exception.cpp:89] ["Code: 0, e.displayText() = DB::Exception: Can not find path for DMFile, file_id=94893, e.what() = DB::Exception, Stack trace:\n\n\n       0x1708073\tStackTrace::StackTrace() [tiflash+24150131]\n       0x1705e76\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+24141430]\n       0x68be70c\tDB::StableDiskDelegator::getDTFilePath(unsigned long, bool) const [tiflash+109831948]\n       0x6a0df89\tDB::DM::StableValueSpace::restore(DB::DM::DMContext&, unsigned long) [tiflash+111206281]\n       0x69a6849\tDB::DM::Segment::restoreSegment(std::__1::shared_ptr<DB::Logger> const&, DB::DM::DMContext&, unsigned long) [tiflash+110782537]\n       0x6949090\tDB::DM::DeltaMergeStore::DeltaMergeStore(DB::Context&, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, long, bool, std::__1::vector<DB::DM::ColumnDefine, std::__1::allocator<DB::DM::ColumnDefine> > const&, DB::DM::ColumnDefine const&, bool, unsigned long, DB::DM::DeltaMergeStore::Settings const&) [tiflash+110399632]\n       0x68d56f1\tDB::StorageDeltaMerge::getAndMaybeInitStore() [tiflash+109926129]\n       0x68d96c8\tDB::StorageDeltaMerge::flushCache(DB::Context const&, DB::DM::RowKeyRange const&, bool) [tiflash+109942472]\n       0x6b3b35a\tDB::KVStore::tryFlushRegionCacheInStorage(DB::TMTContext&, DB::Region const&, std::__1::shared_ptr<DB::Logger> const&, bool) [tiflash+112440154]\n       0x6b3e64f\tDB::KVStore::forceFlushRegionDataImpl(DB::Region&, bool, DB::TMTContext&, DB::RegionTaskLock const&, unsigned long, unsigned long) [tiflash+112453199]\n       0x6b3df68\tDB::KVStore::canFlushRegionDataImpl(std::__1::shared_ptr<DB::Region> const&, unsigned char, bool, DB::TMTContext&, DB::RegionTaskLock const&, unsigned long, unsigned long) [tiflash+112451432]\n       0x6b3e3c1\tDB::KVStore::tryFlushRegionData(unsigned long, bool, bool, DB::TMTContext&, unsigned long, unsigned long) [tiflash+112452545]\n       0x6b5b807\tTryFlushData [tiflash+112572423]\n  0x7f4e55f62a32\t_$LT$engine_store_ffi..observer..TiFlashObserver$u20$as$u20$raftstore..coprocessor..AdminObserver$GT$::pre_exec_admin::h39bffcdc6224fe8d [libtiflash_proxy.so+17558066]\n  0x7f4e56e726dd\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::apply_raft_cmd::h455e45aa599d2a64 [libtiflash_proxy.so+33351389]\n  0x7f4e56e8b955\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::process_raft_cmd::hb3499b22373e583d [libtiflash_proxy.so+33454421]\n  0x7f4e56e90e24\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::handle_raft_committed_entries::h0e79a570c1792a09 [libtiflash_proxy.so+33476132]\n  0x7f4e56e6640c\traftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_apply::h7ef360f580b34c40 [libtiflash_proxy.so+33301516]\n  0x7f4e56e69a38\traftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_tasks::hbdf512f0cd58e33c [libtiflash_proxy.so+33315384]\n  0x7f4e56508494\t_$LT$raftstore..store..fsm..apply..ApplyPoller$LT$EK$GT$$u20$as$u20$batch_system..batch..PollHandler$LT$raftstore..store..fsm..apply..ApplyFsm$LT$EK$GT$$C$raftstore..store..fsm..apply..ControlFsm$GT$$GT$::handle_normal::h1a10c2992601e20d [libtiflash_proxy.so+23479444]\n  0x7f4e564a38c7\tbatch_system::batch::Poller$LT$N$C$C$C$Handler$GT$::poll::h1e6b33b2fb50cbab [libtiflash_proxy.so+23066823]\n  0x7f4e56559d82\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h176c5187879ecf12 [libtiflash_proxy.so+23813506]\n  0x7f4e565999be\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4eea01b9a3c057b7 [libtiflash_proxy.so+24074686]\n  0x7f4e5753f735\tstd::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+40482613]\n                \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108\n  0x7f4e54a87ea5\tstart_thread [libpthread.so.0+32421]\n  0x7f4e54093b0d\tclone [libc.so.6+1043213]"] [source="DB::DM::DeltaMergeStore::DeltaMergeStore(DB::Context &, bool, const DB::String &, const DB::String &, DB::TableID, bool, const DB::DM::ColumnDefines &, const DB::DM::ColumnDefine &, bool, size_t, const DB::DM::DeltaMergeStore::Settings &)"] [thread_id=50]
[2022/12/13 21:43:19.394 +08:00] [INFO] [DeltaMergeStore.cpp:209] ["Restore DeltaMerge Store start"] [source="table_id=93"] [thread_id=51]
[2022/12/13 21:43:19.394 +08:00] [DEBUG] [DeltaMergeStore.cpp:1483] ["Loading dt files"] [source="table_id=93"] [thread_id=51]
[2022/12/13 21:43:19.394 +08:00] [ERROR] [Exception.cpp:89] ["Code: 0, e.displayText() = DB::Exception: Can not find path for DMFile, file_id=94893, e.what() = DB::Exception, Stack trace:\n\n\n       0x1708073\tStackTrace::StackTrace() [tiflash+24150131]\n       0x1705e76\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+24141430]\n       0x68be70c\tDB::StableDiskDelegator::getDTFilePath(unsigned long, bool) const [tiflash+109831948]\n       0x6a0df89\tDB::DM::StableValueSpace::restore(DB::DM::DMContext&, unsigned long) [tiflash+111206281]\n       0x69a6849\tDB::DM::Segment::restoreSegment(std::__1::shared_ptr<DB::Logger> const&, DB::DM::DMContext&, unsigned long) [tiflash+110782537]\n       0x6949090\tDB::DM::DeltaMergeStore::DeltaMergeStore(DB::Context&, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, long, bool, std::__1::vector<DB::DM::ColumnDefine, std::__1::allocator<DB::DM::ColumnDefine> > const&, DB::DM::ColumnDefine const&, bool, unsigned long, DB::DM::DeltaMergeStore::Settings const&) [tiflash+110399632]\n       0x68d56f1\tDB::StorageDeltaMerge::getAndMaybeInitStore() [tiflash+109926129]\n       0x68d96c8\tDB::StorageDeltaMerge::flushCache(DB::Context const&, DB::DM::RowKeyRange const&, bool) [tiflash+109942472]\n       0x6b3b35a\tDB::KVStore::tryFlushRegionCacheInStorage(DB::TMTContext&, DB::Region const&, std::__1::shared_ptr<DB::Logger> const&, bool) [tiflash+112440154]\n       0x6b3e64f\tDB::KVStore::forceFlushRegionDataImpl(DB::Region&, bool, DB::TMTContext&, DB::RegionTaskLock const&, unsigned long, unsigned long) [tiflash+112453199]\n       0x6b3df68\tDB::KVStore::canFlushRegionDataImpl(std::__1::shared_ptr<DB::Region> const&, unsigned char, bool, DB::TMTContext&, DB::RegionTaskLock const&, unsigned long, unsigned long) [tiflash+112451432]\n       0x6b3e3c1\tDB::KVStore::tryFlushRegionData(unsigned long, bool, bool, DB::TMTContext&, unsigned long, unsigned long) [tiflash+112452545]\n       0x6b5b807\tTryFlushData [tiflash+112572423]\n  0x7f4e55f62a32\t_$LT$engine_store_ffi..observer..TiFlashObserver$u20$as$u20$raftstore..coprocessor..AdminObserver$GT$::pre_exec_admin::h39bffcdc6224fe8d [libtiflash_proxy.so+17558066]\n  0x7f4e56e726dd\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::apply_raft_cmd::h455e45aa599d2a64 [libtiflash_proxy.so+33351389]\n  0x7f4e56e8b955\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::process_raft_cmd::hb3499b22373e583d [libtiflash_proxy.so+33454421]\n  0x7f4e56e90e24\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::handle_raft_committed_entries::h0e79a570c1792a09 [libtiflash_proxy.so+33476132]\n  0x7f4e56e6640c\traftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_apply::h7ef360f580b34c40 [libtiflash_proxy.so+33301516]\n  0x7f4e56e69a38\traftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_tasks::hbdf512f0cd58e33c [libtiflash_proxy.so+33315384]\n  0x7f4e56508494\t_$LT$raftstore..store..fsm..apply..ApplyPoller$LT$EK$GT$$u20$as$u20$batch_system..batch..PollHandler$LT$raftstore..store..fsm..apply..ApplyFsm$LT$EK$GT$$C$raftstore..store..fsm..apply..ControlFsm$GT$$GT$::handle_normal::h1a10c2992601e20d [libtiflash_proxy.so+23479444]\n  0x7f4e564a38c7\tbatch_system::batch::Poller$LT$N$C$C$C$Handler$GT$::poll::h1e6b33b2fb50cbab [libtiflash_proxy.so+23066823]\n  0x7f4e56559d82\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h176c5187879ecf12 [libtiflash_proxy.so+23813506]\n  0x7f4e565999be\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4eea01b9a3c057b7 [libtiflash_proxy.so+24074686]\n  0x7f4e5753f735\tstd::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+40482613]\n                \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108\n  0x7f4e54a87ea5\tstart_thread [libpthread.so.0+32421]\n  0x7f4e54093b0d\tclone [libc.so.6+1043213]"] [source="uint8_t DB::TryFlushData(DB::EngineStoreServerWrap *, uint64_t, uint8_t, uint64_t, uint64_t)"] [thread_id=50]

wVGYmx9npu

log is in 172-16-5-81 /data2/hongyunyan/tiflash-41350/log

When I restart that time, there is no workload.

4. What is your TiFlash version? (Required)

6.5.0(nightly)

@hongyunyan hongyunyan added type/bug The issue is confirmed as a bug. component/storage labels Dec 14, 2022
@ywqzzy
Copy link
Contributor

ywqzzy commented Dec 14, 2022

@lilinghai
Is it necessary to add a test which run tiup cluster restart multiple times in endless?

@lilinghai
Copy link

@lilinghai Is it necessary to add a test which run tiup cluster restart multiple times in endless?

Great job ! TiFlash has chaos such as restart cases (https://tcms.pingcap.net/dashboard/executions/plan/1326596), but is doesn't find this bug. @ywqzzy can you add a similar tiup cluster restart case if it is convenient ?

@ywqzzy
Copy link
Contributor

ywqzzy commented Dec 15, 2022

@lilinghai Is it necessary to add a test which run tiup cluster restart multiple times in endless?

Great job ! TiFlash has chaos such as restart cases (https://tcms.pingcap.net/dashboard/executions/plan/1326596), but is doesn't find this bug. @ywqzzy can you add a similar tiup cluster restart case if it is convenient ?

Yes, I will add one

@ywqzzy
Copy link
Contributor

ywqzzy commented Dec 20, 2022

@lilinghai Is it necessary to add a test which run tiup cluster restart multiple times in endless?

Great job ! TiFlash has chaos such as restart cases (https://tcms.pingcap.net/dashboard/executions/plan/1326596), but is doesn't find this bug. @ywqzzy can you add a similar tiup cluster restart case if it is convenient ?

Yes, I will add one

Added

@JaySon-Huang
Copy link
Contributor

confirm that this bug is introduced by PageStorage V3, so v5.x.x, v6.1.x does not affected

@JaySon-Huang JaySon-Huang removed affects-5.0 affects-5.1 affects-5.2 affects-5.3 affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.3 affects-6.4 labels Jun 27, 2023
@ti-chi-bot ti-chi-bot added affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-5.3 affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. and removed affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-5.3 affects-5.4 This bug affects the 5.4.x(LTS) versions. labels Nov 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. component/storage severity/critical type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants