[docdb] Improve Master Reliability for Large Clusters. #6305

nspiegelberg · 2020-11-06T21:22:27Z

Jira Link: DB-2023
We have been having issues handling master and tserver failover on clusters with high tablet counts. This is a high-level task to track progress on addressing that issue. Currently anticipated work:

Handle partial processing of TServer tablet reports on the Master when it is heavily loaded and full processing would exceed the RPC timeout.
Add a Master-side, lock-free cache for GetTableLocations / GetTabletLocations.
Coordinate large tablet processing with load balancing. Currently, on our larger clusters, we commonly disable the load balancer (LB) manually during rolling restarts and trigger it manually.
Figure out why the main Master UI views are slow on large cluster setups.

Summary: In preparation for changes to the Heartbeat path, add compile-time thread safety annotations to TabletManager. This required adding capability annotations to RWMutex as well.
Test Plan: Jenkins
Reviewers: timur, zyu, rahuldesirazu, rsami
Reviewed By: zyu, rahuldesirazu, rsami
Subscribers: rsami, ybase, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D9865

Summary: Currently, we send all tablets that changed on the TServer to the Master within a single heartbeat. This can get extremely large and processing it on the server can exceed the Heartbeat timeout on large clusters, causing server restarts and further instability. We want to limit heartbeat timeouts to more severe problems, like resource issues and software deadlocks. To be more flexible:
1: The TS negotiates a limit on the max number of tablets included in a single heartbeat. With a fixed size, we can now bound the amount of processing in a single RPC.
2: Honor deadline on Master. After every batch of tablet heartbeat processing, the Master checks to see if it is close to the deadline and exits early if so. The heartbeat response only includes the processed tablets so the TS knows what information to send on the next heartbeat.
Test Plan:
MultiHeartbeat/CreateMultiHBTableStressTest.CreateAndDeleteBigTable/1
MultiHeartbeat/CreateMultiHBTableStressTest.RestartServersAfterCreation/1
CreateSmallHBTableStressTest.TestRestartMasterDuringFullHeartbeat
CreateTableStressTest.TestHeartbeatDeadline
TsTabletManagerTest.TestTabletReportLimit
TsTabletManagerTest.TestTabletReports
ClientTest.Capability
Reviewers: timur, amitanand, rahuldesirazu, bogdan
Reviewed By: bogdan
Subscribers: kannan, sergei, ybase, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D9974
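The two mechanisms in this summary (a per-heartbeat tablet limit on the TS, and a deadline check after each batch on the Master) can be illustrated with a small Python model. This is only a sketch of the idea, not the YugabyteDB C++ implementation: the class names, `report_limit`, `batch_size`, `safety_margin_s`, and `drain` are all hypothetical, and the real code may differ in how it negotiates the limit and tracks state.

```python
from __future__ import annotations

import time
from dataclasses import dataclass


@dataclass
class TabletServer:
    """Toy TS: tracks tablets whose state changed and still must be reported."""
    dirty: list[str]
    report_limit: int  # hypothetical name for the negotiated per-heartbeat cap

    def next_report(self) -> list[str]:
        # Bound the heartbeat size: send at most report_limit changed tablets.
        return self.dirty[: self.report_limit]

    def handle_response(self, processed: list[str]) -> None:
        # Drop only the tablets the master acknowledged; the rest are resent
        # on the next heartbeat.
        done = set(processed)
        self.dirty = [t for t in self.dirty if t not in done]


class Master:
    """Toy master: processes a report in batches and honors the RPC deadline."""

    def __init__(self, batch_size: int, safety_margin_s: float,
                 clock=time.monotonic):
        self.batch_size = batch_size
        self.safety_margin_s = safety_margin_s  # how close counts as "close"
        self.clock = clock
        self.known: set[str] = set()

    def process_heartbeat(self, report: list[str], deadline: float) -> list[str]:
        processed: list[str] = []
        for start in range(0, len(report), self.batch_size):
            batch = report[start:start + self.batch_size]
            self.known.update(batch)  # stand-in for the real per-tablet work
            processed.extend(batch)
            # After every batch, exit early if we are close to the deadline.
            if self.clock() + self.safety_margin_s >= deadline:
                break
        # The response lists only what was actually processed.
        return processed


def drain(ts: TabletServer, master: Master, rpc_timeout_s: float) -> int:
    """Send heartbeats until the TS has nothing left to report."""
    heartbeats = 0
    while ts.dirty:
        deadline = master.clock() + rpc_timeout_s
        ts.handle_response(master.process_heartbeat(ts.next_report(), deadline))
        heartbeats += 1
    return heartbeats
```

In this model, a TS with 10 changed tablets and a limit of 4 per heartbeat drains in three heartbeats (4, 4, then 2 tablets) when the deadline is never approached. Because the deadline check runs after a batch rather than before it, every heartbeat makes progress on at least one batch even under heavy load.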

Summary: Currently, we send all tablets that changed on the TServer to the Master within a single heartbeat. This can get extremely large and processing it on the server can exceed the Heartbeat timeout on large clusters, causing server restarts and further instability. We want to limit heartbeat timeouts to more severe problems, like resource issues and software deadlocks. To be more flexible:
1: The TS negotiates a limit on the max number of tablets included in a single heartbeat. With a fixed size, we can now bound the amount of processing in a single RPC.
2: Honor deadline on Master. After every batch of tablet heartbeat processing, the Master checks to see if it is close to the deadline and exits early if so. The heartbeat response only includes the processed tablets so the TS knows what information to send on the next heartbeat.
Original commit: D9974 / 8bac374
Test Plan:
Jenkins: rebase: 2.4
MultiHeartbeat/CreateMultiHBTableStressTest.CreateAndDeleteBigTable/1
MultiHeartbeat/CreateMultiHBTableStressTest.RestartServersAfterCreation/1
CreateSmallHBTableStressTest.TestRestartMasterDuringFullHeartbeat
CreateTableStressTest.TestHeartbeatDeadline
TsTabletManagerTest.TestTabletReportLimit
TsTabletManagerTest.TestTabletReports
ClientTest.Capability
Reviewers: timur, amitanand, rahuldesirazu, rao, bogdan
Reviewed By: bogdan
Subscribers: ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D10649

Summary: Small fix for an existing unit test. We were occasionally getting a "Bad status: Timed out" error in the test. It runs with low timeout values, which can cause failures under TSAN. Applied kTimeMultiplier, the standard fix for this problem.
Test Plan: TestHeartbeatDeadline
Reviewers: timur, amitanand, rahuldesirazu, bogdan
Reviewed By: bogdan
Subscribers: ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D13365

nspiegelberg added the area/docdb YugabyteDB core features label Nov 6, 2020

nspiegelberg assigned nspiegelberg, bmatican and rahuldesirazu and unassigned bmatican and rahuldesirazu Nov 6, 2020

nspiegelberg mentioned this issue Nov 10, 2020

[docdb] Improve Master Scalability with High Tablet Count #6304

Closed

bmatican mentioned this issue Dec 16, 2020

[docdb] Implement pagination for TS heartbeats #6189

Closed

bmatican mentioned this issue Apr 26, 2021

[Docs] 2.4.2.0 release notes #8171

Merged

yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jun 9, 2022

yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Jul 30, 2022

yugabyte-ci assigned lingamsandeep and unassigned nspiegelberg Sep 20, 2023
