[docdb] Improve Master Reliability for Large Clusters. #6305
Labels
area/docdb
YugabyteDB core features
kind/enhancement
This is an enhancement of an existing feature
priority/medium
Medium priority issue
Comments
nspiegelberg
assigned nspiegelberg, bmatican and rahuldesirazu and unassigned bmatican and rahuldesirazu
Nov 6, 2020
nspiegelberg
added a commit
that referenced
this issue
Nov 11, 2020
Summary: In preparation for making changes to the Heartbeat path, adding compile-time thread safety annotations to TabletManager. This required adding capability annotations to RWMutex as well.
Test Plan: Jenkins
Reviewers: timur, zyu, rahuldesirazu, rsami
Reviewed By: zyu, rahuldesirazu, rsami
Subscribers: rsami, ybase, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D9865
nspiegelberg
added a commit
that referenced
this issue
Dec 12, 2020
Summary: Currently, we send all tablets that changed on the TServer to the Master within a single heartbeat. This can get extremely large and processing it on the server can exceed the Heartbeat timeout on large clusters, causing server restarts and further instability. We want to limit heartbeat timeouts to more severe problems, like resource issues and software deadlocks. To be more flexible:
1: The TS negotiates a limit on the max number of tablets included in a single heartbeat. With a fixed size, we can now bound the amount of processing in a single RPC.
2: Honor deadline on Master. After every batch of tablet heartbeat processing, the Master checks to see if it is close to the deadline and exits early if so. The heartbeat response only includes the processed tablets so the TS knows what information to send on the next heartbeat.
Test Plan: MultiHeartbeat/CreateMultiHBTableStressTest.CreateAndDeleteBigTable/1 MultiHeartbeat/CreateMultiHBTableStressTest.RestartServersAfterCreation/1 CreateSmallHBTableStressTest.TestRestartMasterDuringFullHeartbeat CreateTableStressTest.TestHeartbeatDeadline TsTabletManagerTest.TestTabletReportLimit TsTabletManagerTest.TestTabletReports ClientTest.Capability
Reviewers: timur, amitanand, rahuldesirazu, bogdan
Reviewed By: bogdan
Subscribers: kannan, sergei, ybase, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D9974
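The two mechanisms in this commit message (a per-heartbeat tablet limit and an early exit near the deadline) can be illustrated with a small model. This is only a sketch, not the YugabyteDB implementation: the class names `TabletServer` and `Master`, the parameters `report_limit`, `batch_size` and `safety_margin`, and the helper `run_heartbeats` are all hypothetical, and the limit negotiation itself is reduced to a constructor argument.

```python
import time
from dataclasses import dataclass


@dataclass
class TabletServer:
    """Hypothetical TS model: tracks tablets whose changes are unreported."""
    dirty: list          # tablet ids still to be reported
    report_limit: int    # assumed to be the negotiated max tablets per heartbeat

    def build_report(self):
        # Bound the RPC size: never send more than the limit.
        return self.dirty[: self.report_limit]

    def handle_response(self, processed):
        # Only tablets the Master acknowledged are dropped; the rest
        # are sent again on the next heartbeat.
        done = set(processed)
        self.dirty = [t for t in self.dirty if t not in done]


class Master:
    """Hypothetical Master model: processes a report in batches."""

    def __init__(self, batch_size, clock=time.monotonic):
        self.batch_size = batch_size
        self.clock = clock
        self.known = set()

    def process_heartbeat(self, report, deadline, safety_margin=0.0):
        processed = []
        for start in range(0, len(report), self.batch_size):
            batch = report[start:start + self.batch_size]
            self.known.update(batch)
            processed.extend(batch)
            # After every batch, exit early if close to the deadline.
            if self.clock() >= deadline - safety_margin:
                break
        return processed


def run_heartbeats(ts, master, timeout):
    """Send heartbeats until the TS has nothing left to report."""
    rounds = 0
    while ts.dirty:
        report = ts.build_report()
        deadline = master.clock() + timeout
        ts.handle_response(master.process_heartbeat(report, deadline))
        rounds += 1
    return rounds
```

Because the deadline check in this sketch runs after a batch rather than before it, every heartbeat makes progress on at least one batch, so a slow Master delays convergence instead of timing out the RPC.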
nspiegelberg
added a commit
that referenced
this issue
Feb 20, 2021
Summary: Currently, we send all tablets that changed on the TServer to the Master within a single heartbeat. This can get extremely large and processing it on the server can exceed the Heartbeat timeout on large clusters, causing server restarts and further instability. We want to limit heartbeat timeouts to more severe problems, like resource issues and software deadlocks. To be more flexible:
1: The TS negotiates a limit on the max number of tablets included in a single heartbeat. With a fixed size, we can now bound the amount of processing in a single RPC.
2: Honor deadline on Master. After every batch of tablet heartbeat processing, the Master checks to see if it is close to the deadline and exits early if so. The heartbeat response only includes the processed tablets so the TS knows what information to send on the next heartbeat.
Original commit: D9974 / 8bac374
Test Plan: Jenkins: rebase: 2.4 MultiHeartbeat/CreateMultiHBTableStressTest.CreateAndDeleteBigTable/1 MultiHeartbeat/CreateMultiHBTableStressTest.RestartServersAfterCreation/1 CreateSmallHBTableStressTest.TestRestartMasterDuringFullHeartbeat CreateTableStressTest.TestHeartbeatDeadline TsTabletManagerTest.TestTabletReportLimit TsTabletManagerTest.TestTabletReports ClientTest.Capability
Reviewers: timur, amitanand, rahuldesirazu, rao, bogdan
Reviewed By: bogdan
Subscribers: ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D10649
yugabyte-ci
added
kind/bug
This issue is a bug
priority/medium
Medium priority issue
labels
Jun 9, 2022
yugabyte-ci
added
kind/enhancement
This is an enhancement of an existing feature
and removed
kind/bug
This issue is a bug
labels
Jul 30, 2022
nspiegelberg
added a commit
that referenced
this issue
Aug 19, 2022
Summary: Small fix for an existing unit test. Noticed that we were occasionally getting a "Bad status: Timed out" error on the unit test. It runs with low timeout numbers, which can cause issues with TSAN. Added the kTimeMultiplier standard fix for this problem.
Test Plan: TestHeartbeatDeadline
Reviewers: timur, amitanand, rahuldesirazu, bogdan
Reviewed By: bogdan
Subscribers: ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D13365
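The general idea behind a time multiplier can be sketched as follows: test timeouts are written for a regular build and then scaled up when the build is instrumented and therefore slower. The function names, build-type strings and factors below are assumptions chosen for illustration; they are not the actual definition of kTimeMultiplier.

```python
def time_multiplier(build_type: str) -> int:
    """Hypothetical scaling factor; instrumented builds get more time."""
    # The factors here are illustrative, not taken from the codebase.
    return {"tsan": 3, "asan": 3}.get(build_type, 1)


def scaled_timeout_ms(base_ms: int, build_type: str) -> int:
    """Scale a timeout tuned for regular builds to the current build type."""
    return base_ms * time_multiplier(build_type)
```

A test would then pass `scaled_timeout_ms(base, build_type)` wherever it previously hard-coded `base`, so the tight timeout still applies to regular builds.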
Jira Link: DB-2023
We have been having issues with handling master and tserver failover with high tablet counts. This is a high-level task to track progress on addressing that issue. Current Anticipated Work: