Tablet throttler: be explicit about client app name, exempt some apps from checks and heartbeat renewals #13195
…r when they've been 'check'ed Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
…hey at all want to involve the throttler. Some lightweight clients, such as the schema tracker or the binlog watcher, or messager, do not need the throttler, and since some of these clients are _always on_, we also do not _want_ them to continuously approach the throttler. One side effect of always engaging with the throttler is the infinite renewal of on-demand heartbeats Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
… in vstreamer.Engine. The throttler exempts specific apps from checks and will not renew heartbeats leases for those apps Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Makes sense to me. The existing tests should cover things.
I only had minor questions and comments. Let me know what you think. I can quickly review again and approve as needed.
@@ -400,11 +398,11 @@ func TestSchemaChange(t *testing.T) {
 // to be strictly higher than started_timestamp
 assert.GreaterOrEqual(t, lastThrottledTimestamp, startedTimestamp)
 component := row.AsString("component_throttled", "")
-assert.Contains(t, []string{string(vreplication.VCopierComponentName), string(vreplication.VPlayerComponentName)}, component)
+assert.Contains(t, []string{throttlerapp.VCopierName, throttlerapp.VPlayerName}, component)
IMO it might be worth making a new type that's an alias for string so that what the string represents is even more explicit. This makes it clear to future code writers that we're not using arbitrary strings.
You're right. I initially did that but then that affected so many lines of code, casting into and out of string, that I decided to hold off. I can still do that for this PR if you think it's worthwhile.
I'll leave it up to you. I think it's probably worth it, but we can always do it later too.
Done in 37a7c2f. I'm not sure how I feel about it, what do you think?
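The trade-off discussed in this thread can be sketched as follows. This is a minimal illustration, not the code from 37a7c2f: the type `Name`, the constant identifiers, and the `describe` helper are hypothetical, and only the string values are taken from app names quoted in this PR.

```go
package main

import "fmt"

// Name is a hypothetical typed app name; the type actually added in the PR may differ.
type Name string

// Illustrative constants. The identifiers are assumptions; the string values
// are app names quoted in this PR's description.
const (
	OnlineDDLName    Name = "online-ddl"
	VReplicationName Name = "vreplication"
	GhostName        Name = "gh-ost"
)

// String converts back to a plain string, e.g. for logging or URL building.
func (n Name) String() string { return string(n) }

// describe only accepts a Name, so a string variable cannot be passed by accident.
func describe(app Name) string {
	return fmt.Sprintf("throttler check by app %q", app.String())
}

func main() {
	fmt.Println(describe(VReplicationName))

	// A string that arrives at runtime needs an explicit conversion. This is the
	// "casting into and out of string" cost mentioned in the thread.
	fromRequest := "online-ddl"
	fmt.Println(describe(Name(fromRequest)) == describe(OnlineDDLName))
}
```

Untyped string literals still convert implicitly (`describe("gh-ost")` compiles), so a named type mainly guards against passing unrelated string variables.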
@@ -1768,7 +1768,7 @@ export MYSQL_PWD
 my ($self, %args) = @_;

 return sub {
-if (head("http://localhost:{{VTTABLET_PORT}}/throttler/check?app={{THROTTLER_ONLINE_DDL_APP}}:pt-osc:{{MIGRATION_UUID}}&p=low")) {
+if (head("http://localhost:{{VTTABLET_PORT}}/throttler/check?app={{THROTTLER_ONLINE_DDL_APP}}:{{THROTTLER_PT_OSC_APP}}:{{MIGRATION_UUID}}&p=low")) {
I thought that we didn't support pt-osc? Might be a good opportunity to clean up the code if so.
It's labeled as "experimental". I'd rather not do everything in this PR. We can deprecate pt-osc later.
// recentCheckTickerValue is an ever increasing number, incrementing once per second.
recentCheckTickerValue int64
// recentCheckValue is set to match or exceed recentCheckTickerValue whenever a "check" was made (other than by the throttler itself).
// when recentCheckValue < recentCheckTickerValue that means there hasn't been a recent check.
IMO 1 second ago is recent.
Agreed, and that's what the code considers as "recent": `1s`. It's just that we have a 1 second granularity, and so the actual range can go as high as `1.999`.
if checkResult.RecentlyChecked {
// We have just probed a tablet, and it reported back that someone just recently "check"ed it.
// We therefore renew the heartbeats lease.
go throttler.heartbeatWriter.RequestHeartbeats()
}
My gut tells me there's the potential for a race here so that we don't request heartbeats for 1+ seconds which could throttle things and impact production traffic unnecessarily. Let's say that the throttler is checked once a second, but it's always just slightly behind the second tick.
Ah, but heartbeats are leased for some `5s`-`10s`, and so there is no race. Leasing heartbeats to `1s` is inadvisable. Now that you mention it, I should probably enforce a reasonable lower limit.
OK, makes sense. Thanks! Agreed on enforcing reasonable low and high limits on the on-demand heartbeat lease times.
// This check was made by someone other than the throttler itself, i.e. this came from online-ddl or vreplication or other.
// We mark the fact that someone just made a check. If this is a REPLICA or RDONLY tables, this will be reported back
// to the PRIMARY so that it knows it must renew the heartbeat lease.
atomic.StoreInt64(&throttler.recentCheckValue, 1+atomic.LoadInt64(&throttler.recentCheckTickerValue))
Ah, so we add 1 to it. I think this at least largely addresses my earlier noted concern around timing.
Threshold float64 `json:"Threshold"`
Error error `json:"-"`
Message string `json:"Message"`
RecentlyChecked bool `json:"RecentlyChecked"`
We have no synchronization around reading/writing this value, do we? If it's needed then we could make this an atomic.Bool.
This is not needed. We create a new CheckResult for each check.
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Thanks! ❤️
…io#13195 Signed-off-by: Shlomi Noach <[email protected]>
* Table throttler: --throttler-config-via-topo now defaults to 'true' Signed-off-by: Shlomi Noach <[email protected]> * add deprecation message Signed-off-by: Shlomi Noach <[email protected]> * endtoend tests: remove '--enable-lag-throttler' and use 'UpdateThrottlerConfig' everywhere Signed-off-by: Shlomi Noach <[email protected]> * always use vtctldclient Signed-off-by: Shlomi Noach <[email protected]> * use cluster.VtctldClientProcess Signed-off-by: Shlomi Noach <[email protected]> * disable --throttler-config-via-topo in old throttler tests Signed-off-by: Shlomi Noach <[email protected]> * Remove --throttler-config-via-topo where used, since it now defaults 'true' Signed-off-by: Shlomi Noach <[email protected]> * fix vreplication cluster setup, waiting for throttler config to apply Signed-off-by: Shlomi Noach <[email protected]> * changelog Signed-off-by: Shlomi Noach <[email protected]> * extend throttler threshold Signed-off-by: Shlomi Noach <[email protected]> * a bit more verbose Signed-off-by: Shlomi Noach <[email protected]> * fixed CLI test Signed-off-by: Shlomi Noach <[email protected]> * remove old '--enable-lag-throttler' flag, introduce '--heartbeat_on_demand_duration' Signed-off-by: Shlomi Noach <[email protected]> * more log info in throttler.Open() Signed-off-by: Shlomi Noach <[email protected]> * more logging Signed-off-by: Shlomi Noach <[email protected]> * Revert to --heartbeat_enable Signed-off-by: Shlomi Noach <[email protected]> * Protect throttler config change application with initMutex And in e2e test update the throttler config on the keyspace when it's created. Only wait for the new tablets in a shard to have the throttler enabled when adding a Shard. Signed-off-by: Matt Lord <[email protected]> * More CI testing Signed-off-by: Matt Lord <[email protected]> * CI testing cont Signed-off-by: Matt Lord <[email protected]> * Yes... 
Signed-off-by: Matt Lord <[email protected]> * Somebody doesn't like force pushes so msg here Signed-off-by: Matt Lord <[email protected]> * Increase on-demand heartbeat duration from 10s to 1m Signed-off-by: Matt Lord <[email protected]> * Use only on-demand heartbeats everywhere Signed-off-by: Matt Lord <[email protected]> * Use same throttler config everywhere Signed-off-by: Matt Lord <[email protected]> * Update all keyspaces and don't fail test on missing JSON keys Signed-off-by: Matt Lord <[email protected]> * Use constant heartbeats in vrepl e2e tests Until #13175 is fixed. Signed-off-by: Matt Lord <[email protected]> * Increase workflow command timeout Signed-off-by: Matt Lord <[email protected]> * Don't wait for throttler on non-serving primaries Signed-off-by: Matt Lord <[email protected]> * #13175 is fixed, therefore re-instating on-deman heartbeats Signed-off-by: Shlomi Noach <[email protected]> * Added ToC Signed-off-by: Shlomi Noach <[email protected]> * Tweak comment and kick CI Signed-off-by: Matt Lord <[email protected]> * Treat isOpen as the ready/running signal. Also align all initMutex usage. Signed-off-by: Matt Lord <[email protected]> * Re-adjust comment Signed-off-by: Matt Lord <[email protected]> * Adjust CheckIsReady() to match OnlineDDL's expectation/usage This was only using IsReady() before, now it's using IsOpen() and IsReady(). Signed-off-by: Matt Lord <[email protected]> * Get rid of log messages from SrvKeyspaceWatcher when no node/key Signed-off-by: Matt Lord <[email protected]> * More corrections/tweaks Signed-off-by: Matt Lord <[email protected]> * Use more convenient/clear new IsRunning function Signed-off-by: Matt Lord <[email protected]> * Revert "Use more convenient/clear new IsRunning function" This reverts commit 9aef276 as this change was not correct. 
Signed-off-by: Matt Lord <[email protected]> * Further fix correct use of IsOpen(), IsRunning(), IsEnabled() Signed-off-by: Shlomi Noach <[email protected]> * throttler.throttledApps cannot be nil Signed-off-by: Shlomi Noach <[email protected]> * Remove --enable_lag_throttler flag Signed-off-by: Shlomi Noach <[email protected]> * Deprecate --throttler_config_via_topo Signed-off-by: Shlomi Noach <[email protected]> * remove throttler mitigation code, as the problem was solved in #13195 Signed-off-by: Shlomi Noach <[email protected]> * deperecate throttler config flags Signed-off-by: Shlomi Noach <[email protected]> * Removed tabletmanager_throttler and tabletmanager_throttler_custom_config tests Signed-off-by: Shlomi Noach <[email protected]> * changelog Signed-off-by: Shlomi Noach <[email protected]> * remove EnableThrottler() call Signed-off-by: Shlomi Noach <[email protected]> * restore default value Signed-off-by: Shlomi Noach <[email protected]> * update threshold Signed-off-by: Shlomi Noach <[email protected]> * update flags desc Signed-off-by: Shlomi Noach <[email protected]> * using atomic.Bool Signed-off-by: Shlomi Noach <[email protected]> * Update changelog/18.0/18.0.0/summary.md Co-authored-by: Matt Lord <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> * use MarkDeprecated Signed-off-by: Shlomi Noach <[email protected]> * do not expect flags in vttablet --help Signed-off-by: Shlomi Noach <[email protected]> * remove --throttler-config-via-topo from examples scripts Signed-off-by: Shlomi Noach <[email protected]> --------- Signed-off-by: Shlomi Noach <[email protected]> Signed-off-by: Matt Lord <[email protected]> Co-authored-by: Matt Lord <[email protected]>
Description
An extension to #13177. This PR includes everything in #13177 and then adds more logic. I intentionally split the two apart so we can discuss the two changes separately.
This PR is an alternative approach to #13187. It has more changes, but I prefer the approach in this PR over #13187
#13177 says: "if the replica tablet's throttler is checked, then as result the primary tablet's throttler should request more heartbeats".
However, there are some clients that:
The particular clients are these three members of `tabletserver`:
For now, we consider all other clients as "need to use the throttler".
With this PR, we formalize the "app name": the name by which a client identifies itself to the throttler.
Generally speaking, this is an arbitrary string, and we've been using these names semi-formally thus far. Some names are `"online-ddl"`, `"vreplication"`, `"gh-ost"` etc. And commands like `ALTER VITESS_MIGRATION THROTTLE ALL` tell the throttler to explicitly throttle calls made by specific apps (`"online-ddl"` in this example).
What's formalized in this PR is:
* … `throttler/app.go`. The names are constants, and are now used throughout the code, including `endtoend` tests.
* … `"rowstreamer"`, and the external connector uses the name `"external-connector"`.
* … `tabletserver`'s Engine's throttler. This throttler is used to `VStream` rows, and is shared by multiple clients. These particular three clients are of special interest:
What's so special about these three? When enabled, they're always running. Always checking. So these three, as others, clearly identify themselves by name when using the `tabletserver`'s Engine's throttler.
So what we get is:
* … `PRIMARY` or on `REPLICA`, ensure to renew the on-demand heartbeat lease
Related Issue(s)
Checklist
Deployment Notes