-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Heartbeat writer can always generate on-demand leased heartbeats, even if not at all configured #16014
Heartbeat writer can always generate on-demand leased heartbeats, even if not at all configured #16014
Conversation
Signed-off-by: Shlomi Noach <[email protected]>
…ill deliver on-demand heartbeats when requested even if configured as disabled Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Momentarily converting to EDIT: done. |
Signed-off-by: Shlomi Noach <[email protected]>
…ectively. We use golang atomic LoadPointer/StorePointer to avoid using mutexes in RequestHeartbeats Signed-off-by: Shlomi Noach <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #16014 +/- ##
==========================================
+ Coverage 68.69% 68.72% +0.02%
==========================================
Files 1547 1547
Lines 198260 198270 +10
==========================================
+ Hits 136202 136261 +59
+ Misses 62058 62009 -49 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this LGTM. Thank you for working on this! ❤️ This will make things much nicer for users.
I had a few questions/concerns that are beyond simple nits.
I'd personally feel better about this not going in just before RC as it could take some time for subtle bugs and unexpected/unintended behaviors to surface. I'd also feel better if at least one important endtoend test which relies on on-demand heartbeats that was previously setting an explicit duration value no longer set a default so that we're testing this new behavior.
@@ -52,8 +52,7 @@ var ( | |||
|
|||
// ReplTracker tracks replication lag. | |||
type ReplTracker struct { | |||
mode string | |||
forceHeartbeat bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so in effect the old forceHeartbeat
behavior is always enabled now rather than ONLY being enabled if the on-demand heartbeat interval is > 0.
Along with the new relevant defaults being:
defaultOnDemandDuration = 10 * time.Second
defaultHeartbeatInterval = 1 * time.Second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct!
|
||
var ( | ||
// Can be overridden by unit tests | ||
defaultOnDemandDuration = 10 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't we make this the default for the --heartbeat_on_demand_duration
flag? That way the default is visible to users.
It also simplifies the logic here IMO, as then the two modes are HeartbeatConfigTypeOnDemand
and HeartbeatConfigTypeAlways
. It feels awkward to me that HeartbeatConfigTypeNone
effectively becomes HeartbeatConfigTypeOnDemand
and we use a default duration value that is not exposed/visible to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't we make this the default for the --heartbeat_on_demand_duration flag?
Because if we made that the default, then that would change behavior to existing users, who perhaps do not want to enable on-demand heartbeat, and want full heartbeat.
That was also why we did not proceed with #15099
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then could we make/say that this flag is ignored when --heartbeat_enable
is used? I think that's already the case isn't it? So I'm still not sure what we gain by doing things this way. Do you NOT have on demand heartbeats enabled by default as things are in the PR currently? I thought so. And if so, why would we not enforce the default duration via the flag instead of internally?
Maybe the difference is that the PR as-is limits on demand heartbeats to ONLY being used by the tablet throttler specifically whereas changing the flag would affect other components as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then could we make/say that this flag is ignored when --heartbeat_enable is used?
Because again that would break existing setups. Think of the current configuration schema as an ancient mistake. Today we would have made the configuration different, but right now we need to stick to the existing rules.
I think that's already the case isn't it?
It is not.
Today, in main
:
- If non of the flags is defined, there's no heartbeat whatsoever (which is what we're addressing here in this PR, enabling a lease of heartbeats by throttler's demand)
- If only
--heartbeat_enable
is set, then heartbeats are constant and always enabled. - If
--heartbeat_on_demand_duration
is defined (whether--heartbeat_enable
is set or not) then heartbeat is generally speaking enabled, but only producing heartbeats on-deman (and on open).
This means if someone only has --heartbeat_enable
in their setup, and I set the flag default to 10s
, this would reduce the generated heartbeats to on-demand duration only, even though the user expects them to be constant and continuous.
However, If we take your suggestion:
could we make/say that this flag is ignored when --heartbeat_enable is used?
Then people already using both flags --heartbeat_enable
+ --heartbeat_on_demand_duration
and who are expecting only a limited lease of heartbeats, wil suddenly see an influx of heartbeats and a bloat of binary logs.
I hope this clarifies a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then people already using both flags --heartbeat_enable + --heartbeat_on_demand_duration and who are expecting only a limited lease of heartbeats, wil suddenly see an influx of heartbeats and a bloat of binary logs.
I thought that they both served the same purpose with regards to the throttler, the only thing --heartbeat_enable
got you was sub-second replication lag metrics via the replication reporter vs. instead using the polling based reporter via --enable_replication_reporter
. Perhaps I'm wrong on that part?
I thought that using both flags would also potentially lead to failures? #14978
I was thinking that 1) (virtually) nobody is using both 2) if they are, they shouldn't be as that can case failures 3) we could take this opportunity to clean this up a bit. Maybe I misunderstood though?
In any event, you've explained that it's more complicated than I had initially thought and we don't need to solve these issues in this PR. 🙂 Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could take this opportunity to clean this up a bit.
I'm in favor. I don't like the existing configuration setup. If we can change things in one shot and without disrupting existing setups, I'm happy to. In which case, as you say, the changes in this PR can be greatly simplified. Let's perhaps make that discussion tomorrow.
t.Run("exhaust lease, no heartbeats", func(t *testing.T) { | ||
<-time.After(tw.onDemandDuration) | ||
currentWrites := writes.Get() | ||
<-time.After(3 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should use the onDemandDuration
value again here and everywhere else that we're using the 3 second literal in these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually explicitly using a different value. It would be confusing to reuse onDemandDuration
. What we want to say is: after onDemandDuration
is expired, no new heartbeats are written, regardless of how many seconds we wait.
If we reused onDemandDuration
, that could imply to the reader that there is come mysterious connection, where after the duration expires, you need it to expire again so as to eliminate heartbeats.
Does that make sense?
} | ||
tw.Open() | ||
defer tw.Close() | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personal preference, but I don't see the value that these code blocks add and use of them is pretty non-standard within Vitess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll explain why I use this: it's allowing me to redeclare a variable (rateLimiter :=
in this case) that I do not need to fear is "leaking" elsewhere. Also, I dislike how in unit testing you want to duplicate some tests, but golang
requires that the first assignment be :=
and then you can only use =
. But then as you shuffle things, you need to again modify :=
to =
or =
to :=
in places that are unrelated to your change. This causes git diffs that are irrelevant to your change (not unlike trailing comma in a table definition on an unrelated index when you add a new index).
So I feel like these code blocks help me isolate:
- actual code/variables
- git diffs
It's a personal preference and I recognize this is not standard in Vitess. FWIW I've already spread that around quite a bit.
assert.NotNil(t, rateLimiter) | ||
} | ||
t.Run("open, initial heartbeats", func(t *testing.T) { | ||
ctx, cancel := context.WithTimeout(context.Background(), tw.onDemandDuration-time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to use the -time.Second
part? That would put us at 2 seconds and it seems like it may be possible that we then only tick with the ticker once before the context is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The reasoning is that I don't want to get past the expiration of the heartbeat lease. This function is checking:
- That heartbeats are generated when on-demand is engaged (upon
Open()
in this case) - And that they keep being generated for the duration of the on-demand lease. The heartbeat interval in this test is
250ms
, and we test every 1 second, just shy of the expiration.
If we go past the expiration, then we can wrongly condlude that heartbeats are not being produced when they should (they shouldn't, past the expiration).
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
…te we still generate heartbeats when both this flag as well as '--heartbeat_enable' are disabled Signed-off-by: Shlomi Noach <[email protected]>
Done in 236f074, affecting |
Signed-off-by: Shlomi Noach <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work on this! ❤️
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
If we're not going to merge it for RC1, then I have further simplification for the concurrency logic, effectively reducing |
This will be the followup to clean up the concurrency logic: |
Signed-off-by: Shlomi Noach <[email protected]>
All right. Since we're still here, I did merge |
Nice work on this RFC and PR, I like the idea and code 👍. I'm giving this a stamp but I suggest there's a 3rd reviewer as I'm still new at this 😅 Related and certainly out-of-scope, this reminded me of a hack Slack made to enable the "constant" heartbeat on some keyspaces: slackhq#207. TL;DR: it's difficult to enable the heartbeat on a live keyspace that never had it - if you add |
Signed-off-by: Shlomi Noach <[email protected]>
Call for additional reviews 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had only a couple of nits. This is much cleaner than the previous forceHeartbeat
implementation 👍
@@ -54,6 +54,23 @@ func TestRateLimiterShort(t *testing.T) { | |||
assert.Less(t, val, 10) | |||
} | |||
|
|||
func TestRateLimiterAllowOne(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always nice to see a new unit test for a new function :)
@@ -33,12 +33,17 @@ import ( | |||
"vitess.io/vitess/go/vt/vttablet/tabletserver/tabletenv" | |||
) | |||
|
|||
const ( | |||
testKeyspaceShard = "test:0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two nits:
- This constant seems to be used exactly once. What is the value of defining it outside of the function where it is used?
- The standard delimiter is
/
so that this would betest/0
. It would be even better to usetest/-
which we have been moving to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant used to be used elsewhere -- but that has changed. Good catch. Now inside to the function and is test/0
.
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Doc update is in vitessio/website#1762 |
Description
Fixes #15303 : the throttler wishes to
RequestHeartbeats()
and be awarded with a leased heartbeat injection.With this change, the heartbeats writer (as part of repl tracker) is always enabled. It will
open()
andclose()
irrespective of heartbeats configuration (specifically irrespective--heartbeat_enable
). We now identify three different configuration types for the heartbeat writer:--heartbeat_on_demand_duration
) was specified.--heartbeat_enable
is set, andheartbeat_on_demand_duration
is not set)The way heartbeats are produced now depend on said configuration:
HeartbeatConfigTypeAlways
: while open, the writer produces heartbeats at a regular interval.RequetHeartbeats() is meaningless in this mode.
HeartbeatConfigTypeOnDemand
: when opened, the writer produces heartbeats for the configured lease.The heartbeats then expire. Lease can be renewed (after expired) or extended (while running) via
RequetHeartbeats().
HeartbeatConfigTypeNone
: the writer does not initiate any heartbeats. However, RequetHeartbeats()can be called to request a heartbeat lease. The lease duration is predetermined as
defaultOnDemandDuration
.To clarify, if the heartbeat writer is not configured, then:
open()
s andclose()
sopen()ed
RateLimiter
) which is then unused.And so an unconfigured heartbeat writer now does consume some resources, but we can argue those are negligible. The flip side is that a
RequestHeartbeats()
is then always answered.Docs PR: vitessio/website#1762
Related Issue(s)
Checklist
Deployment Notes