Tablet throttler: multi-metric support #15988
Merged: shlomi-noach merged 221 commits into vitessio:main from planetscale:throttler-multi-metrics on Jul 11, 2024
harshit-gangal approved these changes on Jul 10, 2024:
LGTM
harshit-gangal added the release notes (needs details) label on Jul 10, 2024
I'll be creating a docs PR. Once it's up, I will merge this PR.
Documentation PR: vitessio/website#1786
shlomi-noach removed the release notes (needs details) label on Jul 11, 2024
Updated release notes.
To be followed up by a few extra PRs, see:
venkatraju pushed a commit to slackhq/vitess that referenced this pull request on Aug 29, 2024
Fixes #15624
Preface
This PR adds multi-metrics support to the tablet throttler. It is pretty large, as will be explained shortly, and is submitted as Draft. I will break it down into hopefully smaller and more manageable PRs, something I could not do in the process of writing this code. To explain a bit, adding multi-metrics support adds a new dimension of complexity to the already multi-dimensioned throttler. I chose to remove legacy or unused dimensions from the existing codebase so as to simplify the result. While this PR is not a redesign (the main elements remain the same), it shaves and refactors a lot of code.
This PR comment will detail all changes made, in the form of documentation. I expect this PR comment to later serve as the basis for documentation updates. We'll take a step back, define some concepts, explain how these concepts connect or interact, what user interface we have, and then also discuss internals.
Objective
Requested by multiple production scenarios, we wish to be able to, for example, throttle some workflows based on both replication lag and load average, while allowing others to throttle based on replication lag only, and while completely rejecting (or alternatively completely exempting) all other requests.
We seek a fine-grained approach that is still maintainable and comprehensible.
Timeline
While v20 is around the corner, let's not rush it. I am not expecting this PR to land in v20 (glad if it will, but only if it happens to).
Breakdown PRs
This branch (`throttler-multi-metrics`) will remain unmerged, and serve as the basis for followup incremental PRs. But I do think the functionality should be merged as a whole. I'm suggesting I'll create a new `throttler-multi-metrics-incremental` branch, and merge the incremental PRs into that branch. Finally, we will merge `throttler-multi-metrics` into `main`, at which time this branch, `throttler-multi-metrics`, will become implicitly merged.
Backwards compatibility
The changes are backwards compatible with single-metric throttlers, in both directions: multi-metric `Primary` to single-metric replica, and single-metric `Primary` to multi-metric replica.
Tests
Added entities and functionalities are all tested, the vast majority via unit testing. The `throttler_test.go` unit test now operates a full blown throttler, mocking a topology server and mocking replica results, or it can act as a replica. The case for `endtoend` is now mostly in testing the full flow through `vtctldclient` as well as controlling an actual lagging replica.
Documentation
Concepts
Tablets
The throttler runs as part of the tablet server. The throttler can be disabled or enabled, based on the tablet throttler configuration as part of the `Keyspace` or `SrvKeyspace` in the topo service. All tablets sharing the same keyspace read the same throttler configuration. Thus, all tablet throttlers are all enabled or all disabled, irrespective of shards and tablet types.
Tablets in the same shard collaborate. The `Primary` tablet polls the replica tablets, and replica tablets report and sometimes push throttler notifications to the `Primary`.
However, we limit the collaboration to specific tablet types, based on the `--throttle_tablet_types` VTTablet flag. By default, the `Primary` only collaborates with `replica` tablet types, which means tablets such as `backup` do not affect any throttling behavior.
Metrics
The objective of the throttler is to push back work based on database load. Previously, this was done based on a single metric, which could be either the replication lag, or the result of a custom query. Now, the throttler collects multiple metrics. The currently supported metrics are:
- Replication lag (`lag`), measured in seconds.
- Load average (`loadavg`), per core, on the tablet server/container.
- MySQL `Threads_running` value (`threads_running`).
- Custom query (`custom`) as defined by the user.
This list is expected to expand in the future.
All metrics are `float64` values, and are expected to be non-negative. Metrics are identified by names (`lag`, `loadavg`, etc.)
Thresholds
A metric value can be good or bad. Each metric is assigned a threshold. Below that threshold, the metric is good. As of the threshold (equal or higher), the metric is deemed bad. The higher the metric, the worse it is.
Each metric has a "factory default" threshold, e.g.:
- `5` (5 seconds) for `lag`.
- `1.0` (per core) for `loadavg`.
- `100` for `threads_running`.
Thresholds are positive values. A threshold of `0` is considered undefined.
The user can set their own thresholds, overriding the factory defaults. The user defined thresholds are persisted as part of the throttler configuration under the `Keyspace` entry in the topo service.
Scopes
We can observe metrics in two scopes: `self`, or `shard`.
Each tablet's throttler collects metrics from its own tablet and from the MySQL server operated by the tablet. Each tablet then refers to those metrics in the `self` scope.
The `Primary` tablet further collects metrics from shard tablets (limited by the `throttle_tablet_types` flag as mentioned above). It then uses the maximum (read: worst) value collected, including its own, as the `shard` metric value.
We can therefore refer to scoped metrics. On any tablet, we can query for `self` or `shard` metrics:
- `self/loadavg`: the load average on a specific tablet.
- `self/lag`: the lag on a specific tablet. While this makes most sense to query on a replica, it's also an indicative value on the `Primary`. The throttler measures lag using heartbeat injection. In the case of extremely high workload, this value can be indicative of transaction commit latencies.
- `shard/lag`: when querying the `Primary`, this returns the highest replication lag across the shard. A replica does not have the collective metrics across the shard, and the value effectively equals `self/lag`.
Each metric has a default scope:
- `lag` defaults to `shard` scope.
- All other metrics default to `self` scope.
Querying a `Primary` tablet for the `lag` metric is therefore equal to querying for `shard/lag`, and querying for `threads_running` is equal to querying for `self/threads_running`.
For backwards compatibility, it is also possible to query for the `self` or for the `shard` metrics, in which case the result is based on either the `lag` metric (if `custom-query` is undefined) or the `custom` metric (if `custom-query` is defined).
Apps
A client that connects to the throttler and asks for throttling advice identifies itself as an "app" (legacy term from a previous incarnation). Example apps are VReplication or the Table Lifecycle. Apps identify by name. Examples:
- `vreplication`: any VReplication workflow.
- `tablegc`: table lifecycle.
- `online-ddl`: any Online DDL operation, whether Vitess or `gh-ost`.
- `vplayer`: a submodule of VReplication.
- `schema-tracker`: the internal schema tracker.
Some app names are special:
- `vitess`: used by the throttlers themselves, when the `Primary` checks the shard replicas, or when a throttler checks itself.
- `always-throttled-app`: useful for testing/troubleshooting, an app whose checks the throttler will always reject.
- `test`: used in testing.
- `all`: a catch-all app, used by app rules and app metrics (see below). If defined, it applies to any app that doesn't have any explicit rules/metrics.
Clients can identify by multiple app names, separated with a colon. For example, the name `vcopier:d666bbfc_169e_11ef_b0b3_0a43f95f28a3:vreplication:online-ddl` stands for:
- `vreplication` strategy,
- `d666bbfc_169e_11ef_b0b3_0a43f95f28a3` workflow ID,
- `vcopier`.
The throttler treats such an app as the combined check of multiple apps, to each of which it applies app metrics and app rules, as discussed below.
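To make the multi-name convention concrete, here is a minimal Go sketch of splitting a colon-delimited app name into its individual app names. It is an illustration only, not the code in this PR: `splitAppName` is a hypothetical helper, and dropping empty tokens is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAppName breaks a colon-delimited app name into individual app names.
// Hypothetical helper for illustration; empty tokens are dropped by assumption.
func splitAppName(name string) []string {
	var apps []string
	for _, token := range strings.Split(name, ":") {
		if token = strings.TrimSpace(token); token != "" {
			apps = append(apps, token)
		}
	}
	return apps
}

func main() {
	name := "vcopier:d666bbfc_169e_11ef_b0b3_0a43f95f28a3:vreplication:online-ddl"
	// Each printed name is an app the throttler would consider separately.
	for _, app := range splitAppName(name) {
		fmt.Println(app)
	}
}
```

Each resulting name can then be looked up on its own when resolving app metrics and app rules.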
Checks
A check is a request made to the throttler, asking for go/no-go advice. The check identifies by an app name (defaults to `vitess`). The throttler looks at the metrics assigned to the app (see below). If all of them are below their respective thresholds, the throttler accepts the request (returns an OK response). If any of them reaches or exceeds its respective threshold, the throttler rejects the request (returns a non-OK response).
Checks are made internally by the various Vitess components, and the responses are likewise analyzed internally. The user is also able to invoke a check, for automation or troubleshooting purposes, for example via `vtctldclient CheckThrottler`.
The response includes:
- A status code (`200` for "OK")
How concepts are combined and used
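As a rough illustration of how metrics, thresholds and scopes from the Concepts section come together in a single check, here is a minimal Go sketch. It is not the throttler's implementation: every type and function name is hypothetical, only the three factory defaults listed above are modelled, and rejecting a metric that has no threshold at all is a choice made for the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// Factory default thresholds as listed in the Thresholds section.
var factoryThresholds = map[string]float64{"lag": 5, "loadavg": 1.0, "threads_running": 100}

// tabletMetrics maps a metric name to the value collected on one tablet.
type tabletMetrics map[string]float64

// parseScopedMetric splits "shard/loadavg" into ("shard", "loadavg").
// Unscoped names get their default scope: shard for lag, self otherwise.
func parseScopedMetric(name string) (scope, metric string) {
	if s, m, found := strings.Cut(name, "/"); found {
		return s, m
	}
	if name == "lag" {
		return "shard", name
	}
	return "self", name
}

// threshold prefers a user-defined value; 0 means undefined, so the
// factory default applies.
func threshold(metric string, user map[string]float64) float64 {
	if t := user[metric]; t > 0 {
		return t
	}
	return factoryThresholds[metric]
}

// check reports OK (true) only if every named metric is below its threshold.
// In shard scope the worst (maximum) value across the tablets is used.
func check(names []string, self tabletMetrics, replicas []tabletMetrics, user map[string]float64) bool {
	for _, name := range names {
		scope, metric := parseScopedMetric(name)
		value := self[metric]
		if scope == "shard" {
			for _, replica := range replicas {
				if replica[metric] > value {
					value = replica[metric]
				}
			}
		}
		// A value equal to the threshold is already bad. A metric with no
		// threshold (0) is always rejected here: an assumption of this sketch.
		if value >= threshold(metric, user) {
			return false
		}
	}
	return true
}

func main() {
	primary := tabletMetrics{"lag": 0.2, "loadavg": 0.4, "threads_running": 12}
	replicas := []tabletMetrics{{"lag": 7.5, "loadavg": 0.3}, {"lag": 1.1, "loadavg": 2.0}}
	overrides := map[string]float64{"loadavg": 2.5}

	fmt.Println(check([]string{"lag"}, primary, replicas, overrides))                         // false: shard lag is 7.5
	fmt.Println(check([]string{"self/lag", "threads_running"}, primary, replicas, overrides)) // true
	fmt.Println(check([]string{"shard/loadavg"}, primary, replicas, overrides))               // true: 2.0 is below 2.5
}
```

The sections below describe how the actual thresholds and per-app metric lists are configured.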
Metric thresholds
Each metric is assigned a threshold. Vitess supplies factory defaults for these thresholds, but the user may override them manually, like so:
In this example, the `loadavg` metric value is henceforth deemed good if below `2.5`. The threshold is stored as part of the keyspace entry in the topo service:
The threshold applies to any check for that specific metric (see App Metrics, below) on any tablet in this keyspace. The threshold is also reflected in the throttler status:
$ vtctldclient GetThrottlerStatus zone1-0000000101 | jq .metric_thresholds
Use a `0` threshold value to restore the threshold back to factory defaults.
App Metrics
By default, when an app checks the throttler, the result is based on replication lag. If the custom query is set, then the result is based on the custom query result. It is possible to assign specific metrics to specific apps, like so:
From that moment on, Online DDL operations will throttle on both high `lag` as well as on high `threads_running`. If either of these values exceeds its respective threshold, Online DDL gets throttled. However, it's important to note the scope of the metrics, which is left to the defaults here. To elaborate, it is possible to further indicate metric scopes, for example:
In this example, Online DDL will throttle when:
- The highest `lag` value in all shard tablets exceeds the lag threshold (`lag`'s default scope is `shard`), or
- `threads_running` on the `Primary` exceeds its threshold (`threads_running`'s default scope is `self`), or
- The highest `loadavg` value in all shard tablets exceeds its threshold (`loadavg`'s default scope is `self`, but the assignment explicitly required `shard` scope).
scope).It's possible to set metrics for the
all
app. Continuing our example setup, we now:Checks made to the throttler by
online-ddl
or any multi-named app such asvcopier:d666bbfc_169e_11ef_b0b3_0a43f95f28a3:vreplication:online-ddl
, throttle based onlag,threads_running,shard/loadavg
, because that's an explicit assignment:Checks made by other apps, e.g.
vreplication
, will now throttle based onlag,custom
.vreplication
does not have any assigned metrics, and therefore falls underall
's assignments.The assignments are visible in the throttler status:
$ vtctldclient GetThrottlerStatus zone1-0000000101 | jq .app_checked_metrics
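The resolution order described above (explicit assignment first, then `all`, then the default metric) can be sketched in Go as follows. This is an illustrative sketch under assumptions, not the PR's code: `metricsForApp` is a hypothetical function, and taking the union of assignments across the names of a multi-named app is an assumption about how the names combine.

```go
package main

import (
	"fmt"
	"strings"
)

// metricsForApp resolves which metrics a check by appName is validated against.
// Hypothetical helper: explicit assignments for any of the app's names win,
// then the "all" app, then defaultMetric ("lag", or "custom" if a custom
// query is defined).
func metricsForApp(appName string, assignments map[string][]string, defaultMetric string) []string {
	var metrics []string
	seen := map[string]bool{}
	add := func(names []string) {
		for _, name := range names {
			if !seen[name] {
				seen[name] = true
				metrics = append(metrics, name)
			}
		}
	}
	// Assumption: assignments of all names in a multi-named app are combined.
	for _, app := range strings.Split(appName, ":") {
		add(assignments[app])
	}
	if len(metrics) == 0 {
		add(assignments["all"])
	}
	if len(metrics) == 0 {
		add([]string{defaultMetric})
	}
	return metrics
}

func main() {
	assignments := map[string][]string{
		"online-ddl": {"lag", "threads_running", "shard/loadavg"},
		"all":        {"lag", "custom"},
	}
	fmt.Println(metricsForApp("vcopier:d666bbfc_169e_11ef_b0b3_0a43f95f28a3:vreplication:online-ddl", assignments, "lag"))
	fmt.Println(metricsForApp("vreplication", assignments, "lag")) // falls under "all"
	fmt.Println(metricsForApp("vreplication", nil, "lag"))         // nothing assigned: default metric
}
```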
To deassign metrics from an app, supply an empty value like so:
The special app `vitess` is internally assigned all known metrics, at all times.
App rules
This PR has no changes to app rules logic.
The user may impose additional throttling rules on any given app. A rule is limited by a duration (after which the rule expires and is removed), and can:
- Reject a given ratio of the app's checks (`0.0` for no extra rejection .. `1.0` for complete rejection) before even checking actual metrics/thresholds. This effectively "slows down" the app.
- Exempt the app from throttling altogether, even if metrics exceed their thresholds.
Examples:
Throttle the `vreplication` app, so that 80% of its checks are denied before even consulting actual metrics. The rule auto-expires after `30` minutes. Note: the remaining 20% of checks still need to comply with actual metrics/thresholds.
Exempt `vreplication` from being throttled, even if metrics exceed their thresholds (e.g. even if `lag` is high). Expire after `1` hour:
The `all` app is accepted, and applies to all apps that do not otherwise have a specific rule. Examples:
In the above we push back 25% of checks for all apps, irrespective of actual metrics, except for `online-ddl` checks, where we reject 80% of its checks.
In the above we push back 80% of checks from all apps, except for `vreplication` which is completely exempted.
It is possible to expire (remove the rule) early via:
$ vtctldclient UpdateThrottlerConfig --unthrottle-app "vreplication" commerce
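The rule semantics in this section (ratio-based rejection, exemption, expiry, and the `all` fallback) can be sketched in Go as below. This is a simplified model for illustration, not the throttler's code: `appRule`, `applyRule` and the decision constants are hypothetical, expiry is modelled as Unix seconds, and the random roll is passed in as a parameter so the behaviour is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// appRule is a hypothetical model of a throttling rule for one app.
type appRule struct {
	ratio     float64 // 0.0 = no extra rejection .. 1.0 = complete rejection
	exempt    bool    // true = approve checks irrespective of metrics
	expiresAt int64   // Unix seconds after which the rule no longer applies
}

type decision int

const (
	consultMetrics decision = iota // no rule verdict: go on to metrics/thresholds
	rejected                       // denied before consulting metrics
	exempted                       // approved without consulting metrics
)

// applyRule finds the app's own unexpired rule, else the unexpired "all" rule,
// and applies it. roll is a number in [0, 1), e.g. from rand.Float64().
func applyRule(app string, rules map[string]appRule, now int64, roll float64) decision {
	for _, name := range []string{app, "all"} {
		rule, found := rules[name]
		if !found || now >= rule.expiresAt {
			continue
		}
		switch {
		case rule.exempt:
			return exempted
		case roll < rule.ratio:
			return rejected
		default:
			return consultMetrics
		}
	}
	return consultMetrics
}

func main() {
	now := time.Now().Unix()
	rules := map[string]appRule{
		"online-ddl":   {ratio: 0.8, expiresAt: now + 30*60},
		"vreplication": {exempt: true, expiresAt: now + 60*60},
		"all":          {ratio: 0.25, expiresAt: now + 60*60},
	}
	fmt.Println(applyRule("online-ddl", rules, now, 0.5) == rejected)       // true: 0.5 < 0.8
	fmt.Println(applyRule("vreplication", rules, now, 0.5) == exempted)     // true
	fmt.Println(applyRule("tablegc", rules, now, 0.5) == consultMetrics)    // true: "all" ratio is 0.25
	fmt.Println(applyRule("online-ddl", rules, now+7200, 0.5) == rejected)  // false: rules expired
}
```

A `consultMetrics` outcome corresponds to the remaining checks that "still need to comply with actual metrics/thresholds".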
Commands and flags
These are the `vtctldclient` commands to control or query the tablet throttler:
UpdateThrottlerConfig
Enable or disable the throttler:
Set a metric threshold:
Clear a metric threshold (return to "factory defaults"):
Pre multi-metrics compliant, set the "default" threshold (applies to replication lag if custom query is undefined):
$ vtctldclient UpdateThrottlerConfig --threshold "10.0" commerce
Set a custom query:
$ vtctldclient UpdateThrottlerConfig --custom-query "show global status like 'Threads_connected'" commerce
This applies to the `custom` metric. In pre multi-metric throttlers, checks are validated against the custom value. In multi-metric throttlers, `lag` and `custom` are distinct metrics, and the user may assign different apps to different metrics as described in this doc.
Clear the custom query:
$ vtctldclient UpdateThrottlerConfig --custom-query "" commerce
Assign metrics to an app, use default metric scopes:
Assign metrics to an app, use explicit metric scopes:
Remove assignment from app:
Assign metrics to all apps, except for those which have an explicit assignment:
Throttle an app:
Unthrottle an app (expire early):
$ vtctldclient UpdateThrottlerConfig --unthrottle-app "online-ddl" commerce
Exempt an app:
Unexempting an app is done by removing the rule:
$ vtctldclient UpdateThrottlerConfig --unthrottle-app "vreplication" commerce
Throttle all apps except those that already have a specific rule:
CheckThrottler
Issue a check on a tablet's throttler, optionally identify as some app. Use in automation or in troubleshooting.
Get the response for a `vreplication` app check:
$ vtctldclient CheckThrottler --app-name "vreplication" zone1-0000000101
Normal checks do not renew heartbeat lease. Override to renew heartbeat lease:
$ vtctldclient CheckThrottler --app-name "vreplication" --requests-heartbeats zone1-0000000101
Check as `vitess` app:
Force a specific scope, overriding metric defaults or assigned metric scopes:
GetThrottlerStatus
See the state of the throttler, including what the throttler perceives to be current metric values, metrics health, metric thresholds, assigned metrics, app rules, and more.
End of docs.
Related Issue(s)
#15624
Checklist
Deployment Notes