[DocDB] YQL system.partitions cache lock contention leads to master unresponsiveness #12950
Labels: area/docdb, kind/bug, priority/medium
Jira Link: DB-2683
Description
During load balancing with a large number of tablets, we see lock contention caused by the YQL system.partitions cache.
This contention appears to be caused by the partitions vtable refresh thread taking a read lock at the same time that the tablet reports attempt to get a write lock, as in the stack traces below.
11 threads processing tablet reports, waiting on the write lock:
11 __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::InboundCall::InboundCallTask::Run();yb::rpc::ServicePoolImpl::Handle();yb::master::MasterHeartbeatIf::Handle();_ZNSt17_Function_handlerIFvSt10shared_ptrIN2yb3rpc11InboundCallEEEZNS1_6master17MasterHeartbeatIf11InitMethodsERK13scoped_refptrINS1_12MetricEntityEEEUlS4_E_E9_M_invokeERKSt9_Any_dataOS4_;_ZN2yb3rpc10HandleCallINS0_19RpcCallPBParamsImplINS_6master20TSHeartbeatRequestPBENS3_21TSHeartbeatResponsePBEEEZZNS3_17MasterHeartbeatIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENKUlSt10shared_ptrINS0_11InboundCallEEE_clESF_EUlPKS4_PS5_NS0_10RpcContextEE_EEDaSF_T0_;yb::master::CatalogManager::ProcessTabletReport();yb::master::CatalogManager::ProcessTabletReportBatch();yb::master::YQLPartitionsVTable::ProcessMutatedTablets();__pthread_rwlock_wrlock_slow;(unknown)
173 reads to the system.partitions table blocked by the write lock:
173 yb::rpc::InboundCall::InboundCallTask::Run();yb::rpc::ServicePoolImpl::Handle();yb::tserver::TabletServerServiceIf::Handle();_ZNSt17_Function_handlerIFvSt10shared_ptrIN2yb3rpc11InboundCallEEEZNS1_7tserver21TabletServerServiceIf11InitMethodsERK13scoped_refptrINS1_12MetricEntityEEEUlS4_E0_E9_M_invokeERKSt9_Any_dataOS4_;_ZN2yb3rpc10HandleCallINS0_19RpcCallPBParamsImplINS_7tserver13ReadRequestPBENS3_14ReadResponsePBEEEZZNS3_21TabletServerServiceIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENKUlSt10shared_ptrINS0_11InboundCallEEE0_clESF_EUlPKS4_PS5_NS0_10RpcContextEE_EEDaSF_T0_;yb::tserver::TabletServiceImpl::Read();yb::tserver::TabletServiceImpl::CompleteRead();yb::tserver::TabletServiceImpl::DoRead();yb::tserver::TabletServiceImpl::DoReadImpl();yb::master::SystemTablet::HandleQLReadRequest();yb::tablet::AbstractTablet::HandleQLReadRequest();yb::docdb::QLReadOperation::Execute();yb::master::YQLVirtualTable::GetIterator();yb::master::YQLPartitionsVTable::RetrieveData();__pthread_rwlock_rdlock_slow;(unknown)
We need to re-evaluate the changes made in 7f65b9d (GitHub issue #8978) and reduce lock contention, either by limiting where we grab the lock or by introducing a separate lock.
Note: this is with generate_partitions_vtable_on_changes = true and partitions_vtable_cache_refresh_secs = 0.
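To make the "limit where we grab the lock / separate lock" idea concrete, here is a minimal sketch of one possible shape for it. It is not the actual YQLPartitionsVTable code. The class name PartitionsCacheSketch, the PartitionRows type, and both method names are hypothetical. The assumption is that readers can work from an immutable snapshot, so the shared lock is held only long enough to copy a pointer, while a separate writer mutex serializes tablet-report updates and the exclusive lock covers only the pointer swap.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the cached system.partitions rows:
// tablet id -> replica addresses.
using PartitionRows = std::map<std::string, std::vector<std::string>>;

class PartitionsCacheSketch {
 public:
  // Read path (analogous to RetrieveData in the trace): hold the shared
  // lock only while copying the pointer, then read without any lock.
  std::shared_ptr<const PartitionRows> Snapshot() const {
    std::shared_lock<std::shared_mutex> l(snapshot_mutex_);
    return snapshot_;
  }

  // Write path (analogous to ProcessMutatedTablets in the trace): build
  // the next version without blocking readers, then swap it in.
  void ApplyMutations(const PartitionRows& mutated) {
    // Separate lock: serializes writers so concurrent updates are not lost.
    std::lock_guard<std::mutex> writer(writer_mutex_);

    // Copy the current rows and apply the changes outside the exclusive lock.
    auto next = std::make_shared<PartitionRows>(*Snapshot());
    for (const auto& [tablet_id, replicas] : mutated) {
      (*next)[tablet_id] = replicas;
    }

    // The exclusive lock covers only the pointer swap.
    std::unique_lock<std::shared_mutex> l(snapshot_mutex_);
    snapshot_ = std::move(next);
  }

 private:
  mutable std::shared_mutex snapshot_mutex_;  // guards snapshot_ only
  std::mutex writer_mutex_;                   // serializes ApplyMutations
  std::shared_ptr<const PartitionRows> snapshot_ =
      std::make_shared<PartitionRows>();
};
```

The trade-off in this sketch is that every update copies the whole map, so it only pays off if reads greatly outnumber writes or if tablet-report mutations are batched before calling ApplyMutations.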