
[docdb] Improve Master Reliability for Large Clusters. #6305

Open
nspiegelberg opened this issue Nov 6, 2020 · 0 comments
Labels: area/docdb (YugabyteDB core features), kind/enhancement (This is an enhancement of an existing feature), priority/medium (Medium priority issue)

Comments


nspiegelberg commented Nov 6, 2020

Jira Link: DB-2023
We have been having issues handling master and tserver failover on clusters with high tablet counts. This is a high-level task to track progress on addressing that issue. Currently anticipated work:

  1. Handle partial processing of TServer tablet reports on the Master when it is heavily loaded and would otherwise exceed the RPC timeout.
  2. Add a master-side, lock-free cache for GetTableLocations / GetTabletLocations (see the sketch after this list).
  3. Coordinate large tablet processing with load balancing. Currently, on our larger clusters, we commonly disable the load balancer by hand during rolling restarts and trigger it manually afterwards.
  4. Figure out why the main Master UI views are slow on large cluster setups.
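
As a rough illustration of item 2, here is a minimal C++ sketch of one way a lock-free read path for tablet location lookups could work, using an atomically swapped copy-on-write snapshot. All names here (TabletLocationCache, LocationMap) are hypothetical; this is not YugabyteDB's actual implementation.

```cpp
// Hypothetical sketch: a copy-on-write map published via atomic shared_ptr
// swaps, so readers never take a lock. Not YugabyteDB's actual code.
#include <atomic>
#include <map>
#include <memory>
#include <string>

struct TabletLocation { std::string leader_addr; };
using LocationMap = std::map<std::string, TabletLocation>;

class TabletLocationCache {
 public:
  // Readers grab an immutable snapshot without locking, so lookups stay
  // cheap even while the map is being rebuilt.
  std::shared_ptr<const LocationMap> Snapshot() const {
    return std::atomic_load(&snapshot_);
  }

  // A writer builds a new map off to the side and publishes it atomically.
  void Update(LocationMap next) {
    std::atomic_store(&snapshot_,
                      std::make_shared<const LocationMap>(std::move(next)));
  }

 private:
  std::shared_ptr<const LocationMap> snapshot_ =
      std::make_shared<const LocationMap>();
};
```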
@nspiegelberg nspiegelberg added the area/docdb YugabyteDB core features label Nov 6, 2020
nspiegelberg added a commit that referenced this issue Nov 11, 2020
Summary:
In preparation for making changes to the heartbeat path, this adds compile-time thread-safety
annotations to TabletManager. This required adding capability annotations to RWMutex as well.
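
For context, here is a minimal sketch of how Clang capability annotations on a reader-writer mutex enable compile-time checking (with -Wthread-safety). The macro spellings and class shapes below are illustrative, not YugabyteDB's actual headers.

```cpp
// Illustrative macros wrapping Clang's thread-safety attributes; real code
// would use a shared header. Checked by building with -Wthread-safety.
#include <shared_mutex>

#define CAPABILITY(x)    __attribute__((capability(x)))
#define ACQUIRE()        __attribute__((acquire_capability()))
#define RELEASE()        __attribute__((release_capability()))
#define ACQUIRE_SHARED() __attribute__((acquire_shared_capability()))
#define RELEASE_SHARED() __attribute__((release_shared_capability()))
#define GUARDED_BY(x)    __attribute__((guarded_by(x)))

// Annotating RWMutex as a capability lets the compiler track who holds it.
class CAPABILITY("mutex") RWMutex {
 public:
  void WriteLock()   ACQUIRE()        { mu_.lock(); }
  void WriteUnlock() RELEASE()        { mu_.unlock(); }
  void ReadLock()    ACQUIRE_SHARED() { mu_.lock_shared(); }
  void ReadUnlock()  RELEASE_SHARED() { mu_.unlock_shared(); }

 private:
  std::shared_mutex mu_;
};

class TabletManager {
 public:
  void AddTablet() {
    lock_.WriteLock();
    ++tablet_count_;  // OK: writer lock held.
    lock_.WriteUnlock();
  }

 private:
  RWMutex lock_;
  // Compile-time warning if accessed without holding lock_.
  int tablet_count_ GUARDED_BY(lock_) = 0;
};
```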

Test Plan: Jenkins

Reviewers: timur, zyu, rahuldesirazu, rsami

Reviewed By: zyu, rahuldesirazu, rsami

Subscribers: rsami, ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D9865
nspiegelberg added a commit that referenced this issue Dec 12, 2020
Summary:
Currently, we send all tablets that changed on the TServer to the Master within a single heartbeat.
This report can get extremely large, and processing it on the Master can exceed the heartbeat
timeout on large clusters, causing server restarts and further instability. We want to reserve
heartbeat timeouts for more severe problems, such as resource exhaustion and software deadlocks.
To be more flexible:

1: The TS negotiates a limit on the max number of tablets included in a single heartbeat.  With a
fixed size, we can now bound the amount of processing in a single RPC.

2: Honor the deadline on the Master.  After every batch of tablet heartbeat processing, the Master
checks whether it is close to the deadline and exits early if so.  The heartbeat response includes
only the processed tablets, so the TS knows what information to send in the next heartbeat.
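
A hedged sketch of the Master-side deadline check described in (2); the types and constants (TabletReport, kBatchSize, kSafetyMargin) are illustrative stand-ins, not the actual heartbeat code:

```cpp
// Hypothetical sketch of the Master-side loop: process tablet reports in
// fixed-size batches and stop before the RPC deadline is at risk.
#include <algorithm>
#include <chrono>
#include <vector>

struct TabletReport { int tablet_id; };

struct HeartbeatResponse {
  // Only the tablets actually processed; the TS resends the rest next time.
  std::vector<int> processed_tablet_ids;
};

HeartbeatResponse ProcessTabletReports(
    const std::vector<TabletReport>& reports,
    std::chrono::steady_clock::time_point deadline) {
  constexpr size_t kBatchSize = 64;  // Unit of work between deadline checks.
  constexpr auto kSafetyMargin = std::chrono::milliseconds(100);

  HeartbeatResponse resp;
  for (size_t i = 0; i < reports.size(); i += kBatchSize) {
    const size_t end = std::min(reports.size(), i + kBatchSize);
    for (size_t j = i; j < end; ++j) {
      // ... apply this tablet's state change to the catalog ...
      resp.processed_tablet_ids.push_back(reports[j].tablet_id);
    }
    // Exit early rather than risk overrunning the RPC deadline.
    if (std::chrono::steady_clock::now() + kSafetyMargin >= deadline) break;
  }
  return resp;
}
```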

Test Plan:
MultiHeartbeat/CreateMultiHBTableStressTest.CreateAndDeleteBigTable/1
MultiHeartbeat/CreateMultiHBTableStressTest.RestartServersAfterCreation/1
CreateSmallHBTableStressTest.TestRestartMasterDuringFullHeartbeat
CreateTableStressTest.TestHeartbeatDeadline
TsTabletManagerTest.TestTabletReportLimit
TsTabletManagerTest.TestTabletReports
ClientTest.Capability

Reviewers: timur, amitanand, rahuldesirazu, bogdan

Reviewed By: bogdan

Subscribers: kannan, sergei, ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D9974
nspiegelberg added a commit that referenced this issue Feb 20, 2021
Summary:
Currently, we send all tablets that changed on the TServer to the Master within a single heartbeat.
This report can get extremely large, and processing it on the Master can exceed the heartbeat
timeout on large clusters, causing server restarts and further instability. We want to reserve
heartbeat timeouts for more severe problems, such as resource exhaustion and software deadlocks.
To be more flexible:

1: The TS negotiates a limit on the max number of tablets included in a single heartbeat.  With a
fixed size, we can now bound the amount of processing in a single RPC.

2: Honor the deadline on the Master.  After every batch of tablet heartbeat processing, the Master
checks whether it is close to the deadline and exits early if so.  The heartbeat response includes
only the processed tablets, so the TS knows what information to send in the next heartbeat.

Original commit: D9974 / 8bac374

Test Plan:
Jenkins: rebase: 2.4

MultiHeartbeat/CreateMultiHBTableStressTest.CreateAndDeleteBigTable/1
MultiHeartbeat/CreateMultiHBTableStressTest.RestartServersAfterCreation/1
CreateSmallHBTableStressTest.TestRestartMasterDuringFullHeartbeat
CreateTableStressTest.TestHeartbeatDeadline
TsTabletManagerTest.TestTabletReportLimit
TsTabletManagerTest.TestTabletReports
ClientTest.Capability

Reviewers: timur, amitanand, rahuldesirazu, rao, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10649
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jun 9, 2022
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Jul 30, 2022
nspiegelberg added a commit that referenced this issue Aug 19, 2022
Summary:
Small fix for an existing unit test.  We noticed that the test was occasionally failing with a "Bad
status: Timed out" error.  It runs with low timeout values, which can cause problems under TSAN.
Added the standard kTimeMultiplier fix for this problem.
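
The kTimeMultiplier pattern scales test timeouts under slower instrumented builds. A minimal sketch, assuming Clang's thread_sanitizer feature check; the detection shim and the multiplier value are assumptions, not YugabyteDB's exact constants:

```cpp
// Illustrative sketch of the kTimeMultiplier pattern for sanitizer builds.
#include <chrono>

#ifndef __has_feature
#define __has_feature(x) 0  // Portability shim for non-Clang compilers.
#endif

#if __has_feature(thread_sanitizer)
constexpr int kTimeMultiplier = 3;  // TSAN-instrumented binaries run slower.
#else
constexpr int kTimeMultiplier = 1;
#endif

// Tests scale their deadlines instead of hard-coding wall-clock values:
const auto kHeartbeatTimeout = std::chrono::seconds(5) * kTimeMultiplier;
```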

Test Plan: TestHeartbeatDeadline

Reviewers: timur, amitanand, rahuldesirazu, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D13365