Summary:
Started digging into some TSAN failures in QLTransactionTest.RemoteBootstrap.
Found and fixed a number of issues:
- [#3002] Data race on `yb::log::Log::active_segment_sequence_number()`
Seems this field is protected by a read lock for reads, but was not protected on writes. Turned it
into an atomic.
- [#3007] Race condition between TabletPeer `Init` and `Shutdown`
> std::__1::shared_ptr<yb::tablet::enterprise::Tablet>::get()
> std::__1::shared_ptr<yb::consensus::RaftConsensus>::get()
> src/yb/tablet/tablet_peer.cc:385:17 in yb::tablet::TabletPeer::StartShutdown()
Seems like `Shutdown` did not take the appropriate locks to access either `tablet_` or `consensus_`.
- [#3008] Race condition in thread pool `Worker` Shutdown path:
> #12 yb::rpc::ThreadPool::Impl::Shutdown() /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/rpc/thread_pool.cc:224 (libyrpc.so+0x20c73f)
> #3 yb::rpc::(anonymous namespace)::Worker::Notify() /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/rpc/thread_pool.cc:75 (libyrpc.so+0x20ad6e)
Essentially, we're destroying the vector of workers, but it's possible we still end up trying to notify
them afterwards. Moved some of the code around and exposed an explicit `Join`. Logic should stay
basically the same. Also moved to shared_ptr instead of raw pointers.
- [#3009] Race condition in Master async RPC task vs CatalogManager reading the task description
> #0 yb::master::PickLeaderReplica::PickReplica(yb::master::TSDescriptor**) /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/master/async_rpc_tasks.cc:95:12 (libmaster.so+0x2450b8)
> #2 yb::master::CatalogManager::SendAddServerRequest(scoped_refptr<yb::master::TabletInfo> const&, yb::consensus::RaftPeerPB_MemberType, yb::consensus::ConsensusStatePB const&, std::__1::basic_string<char, std::__1::char_traits <char>, std::__1::allocator<char> > const&) /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/master/catalog_manager.cc:5046:54
Just removed the log line.
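As a rough illustration of the #3002 fix above, here is a minimal sketch of turning a racy counter into an atomic. `LogSketch`, `RollOver` and `RunAtomicDemo` are hypothetical names made up for this example and are not the real `yb::log::Log` code; only the field name is borrowed from the description above.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical stand-in for the log class; NOT the real yb::log::Log.
class LogSketch {
 public:
  // Readers observe the current value without taking any lock.
  int64_t active_segment_sequence_number() const {
    return active_segment_sequence_number_.load(std::memory_order_acquire);
  }

  // Hypothetical writer path: bump the number when a new segment starts.
  void RollOver() {
    active_segment_sequence_number_.fetch_add(1, std::memory_order_acq_rel);
  }

 private:
  std::atomic<int64_t> active_segment_sequence_number_{0};
};

// Runs one writer against several concurrent readers and returns the final
// sequence number. Under TSAN this should not report a data race.
int64_t RunAtomicDemo(int rollovers) {
  LogSketch log;
  std::vector<std::thread> readers;
  for (int i = 0; i < 4; ++i) {
    readers.emplace_back([&log, rollovers] {
      int64_t last = 0;
      // Spin until the writer is done; observed values must never go back.
      while (last < rollovers) {
        const int64_t now = log.active_segment_sequence_number();
        if (now < last) std::abort();
        last = now;
      }
    });
  }
  std::thread writer([&log, rollovers] {
    for (int i = 0; i < rollovers; ++i) log.RollOver();
  });
  writer.join();
  for (auto& reader : readers) reader.join();
  return log.active_segment_sequence_number();
}
```

Calling `RunAtomicDemo(1000)` drives 1000 rollovers while four readers poll the value. An atomic is enough here only because the field is a single independent value; state that must change together with other fields would still need a lock.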
There are a couple more issues I am still seeing:
- [#3010] Another `Long wait for safe op id`, but this one seems like a bootstrap bug
Currently, doing a remote bootstrap triggers an inline OpenTablet in TSTabletManager, unlike the
normal ones, which are scheduled through a thread pool. That causes issues, because on shutdown,
we wait for the threadpool tasks to finish / get aborted. However, when done inline, this exposes
race conditions between Init and Shutdown paths for TabletPeer, RaftConsensus, Log, etc.
- [#3011] SEGV during `DisableFailureDetector`, during raft shutdown
Caused by the same race between Start (which creates the timer) and shutdown, which aborts it.
- [#3012] Log Close failures not flipping the state to closed
> F20191106 04:56:48 ../../src/yb/consensus/log_util.cc:874] Check failed: !IsFooterWritten()
Caused by the same race above. If TSTabletManager starts a remote bootstrap, we open a log. If we
shut down the tablet manager before finishing a bootstrap, we wipe the data, but then when we
close the log, we error out as files do not exist anymore.
- [#3013] Race condition in Master async RPC tasks state transitions
> F20191106 05:10:05 ../../src/yb/master/async_rpc_tasks.cc:126] Check failed: task_state == MonitoredTaskState::kWaiting State: kScheduling
Seems like there is a race here: we schedule the task to run on the reactor thread and only AFTER
that flip the state from kScheduling, so the task can run while the state is still kScheduling.
This can be a standalone investigation.
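One possible way to close the #3013 race, sketched under assumptions and not necessarily how it will end up being fixed: flip the state out of kScheduling with a compare-and-swap BEFORE handing the callback to the reactor, and roll forward to a failed state if scheduling is rejected. `TaskSketch`, `Scheduler`, `kRunning`, `kFailed` and `RunInlineSchedulerDemo` are hypothetical names for this example; only kScheduling and kWaiting come from the check failure above.

```cpp
#include <atomic>
#include <functional>

// kRunning and kFailed are assumptions added for this sketch.
enum class TaskState { kScheduling, kWaiting, kRunning, kFailed };

// Hypothetical scheduler hook: returns false if the callback was rejected.
using Scheduler = std::function<bool(std::function<void()>)>;

class TaskSketch {
 public:
  TaskState state() const { return state_.load(); }

  bool Run(const Scheduler& schedule_on_reactor) {
    // Publish kWaiting first, so the callback can never observe kScheduling,
    // even if the reactor runs it immediately.
    if (!Transition(TaskState::kScheduling, TaskState::kWaiting)) return false;
    if (!schedule_on_reactor([this] { OnReactorThread(); })) {
      // Scheduling was rejected; nobody else will move the task forward.
      Transition(TaskState::kWaiting, TaskState::kFailed);
      return false;
    }
    return true;
  }

 private:
  // Mirrors the failed check: the task expects to start from kWaiting.
  void OnReactorThread() {
    Transition(TaskState::kWaiting, TaskState::kRunning);
  }

  // Moves to `desired` only if the state is still `expected`.
  bool Transition(TaskState expected, TaskState desired) {
    return state_.compare_exchange_strong(expected, desired);
  }

  std::atomic<TaskState> state_{TaskState::kScheduling};
};

// Worst case for the race: a scheduler that runs the callback inline.
TaskState RunInlineSchedulerDemo() {
  TaskSketch task;
  Scheduler inline_scheduler = [](std::function<void()> callback) {
    callback();
    return true;
  };
  task.Run(inline_scheduler);
  return task.state();
}
```

With the inline scheduler the callback runs before `Run` returns, which is exactly the interleaving that trips the check when the state flip happens after scheduling.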
Test Plan: `ybd tsan --cxx-test client_ql-transaction-test --gtest_filter QLTransactionTest.RemoteBootstrap -n 100 --tp 4`
Reviewers: mikhail, sergei
Reviewed By: sergei
Subscribers: hector, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D7529