Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heartbeat writer can always generate on-demand leased heartbeats, even if not at all configured #16014

Merged
merged 17 commits into from
Jul 3, 2024

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented May 27, 2024

Description

Fixes #15303 : the throttler wishes to RequestHeartbeats() and be awarded with a leased heartbeat injection.

With this change, the heartbeats writer (as part of repl tracker) is always enabled. It will open() and close() irrespective of heartbeats configuration (specifically irrespective --heartbeat_enable). We now identify three different configuration types for the heartbeat writer:

  • No configuration: the heartbeat writer is generally disabled and does not produce heartbeats (but see below).
  • On-demand: on-demand heartbeat interval (--heartbeat_on_demand_duration) was specified.
  • Always: the heartbeat writer is always on and produces heartbeats at a regular interval (--heartbeat_enable is set, and heartbeat_on_demand_duration is not set)

The way heartbeats are produced now depend on said configuration:

  • HeartbeatConfigTypeAlways: while open, the writer produces heartbeats at a regular interval.
    RequetHeartbeats() is meaningless in this mode.
  • HeartbeatConfigTypeOnDemand: when opened, the writer produces heartbeats for the configured lease.
    The heartbeats then expire. Lease can be renewed (after expired) or extended (while running) via
    RequetHeartbeats().
  • HeartbeatConfigTypeNone: the writer does not initiate any heartbeats. However, RequetHeartbeats()
    can be called to request a heartbeat lease. The lease duration is predetermined as defaultOnDemandDuration.

To clarify, if the heartbeat writer is not configured, then:

  • It still open()s and close()s
  • It still creates pool connections when open()ed
  • It does activate a ticker (via RateLimiter) which is then unused.

And so an unconfigured heartbeat writer now does consume some resources, but we can argue those are negligible. The flip side is that a RequestHeartbeats() is then always answered.

Docs PR: vitessio/website#1762

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Shlomi Noach <[email protected]>
…ill deliver on-demand heartbeats when requested even if configured as disabled

Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Copy link
Contributor

vitess-bot bot commented May 27, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels May 27, 2024
@github-actions github-actions bot added this to the v20.0.0 milestone May 27, 2024
@shlomi-noach shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication Component: Throttler Component: VTTablet and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 27, 2024
@shlomi-noach shlomi-noach marked this pull request as draft May 27, 2024 08:03
@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented May 27, 2024

Momentarily converting to Draft to work on proper goroutine cleanup.

EDIT: done.

…ectively. We use golang atomic LoadPointer/StorePointer to avoid using mutexes in RequestHeartbeats

Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach shlomi-noach marked this pull request as ready for review May 27, 2024 09:45
Copy link

codecov bot commented May 27, 2024

Codecov Report

Attention: Patch coverage is 93.97590% with 5 lines in your changes missing coverage. Please review.

Project coverage is 68.72%. Comparing base (780abdd) to head (81b9d53).
Report is 2 commits behind head on main.

Files Patch % Lines
...t/vttablet/tabletserver/repltracker/repltracker.go 60.00% 4 Missing ⚠️
go/vt/vttablet/tabletserver/repltracker/writer.go 98.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16014      +/-   ##
==========================================
+ Coverage   68.69%   68.72%   +0.02%     
==========================================
  Files        1547     1547              
  Lines      198260   198270      +10     
==========================================
+ Hits       136202   136261      +59     
+ Misses      62058    62009      -49     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this LGTM. Thank you for working on this! ❤️ This will make things much nicer for users.

I had a few questions/concerns that are beyond simple nits.

I'd personally feel better about this not going in just before RC as it could take some time for subtle bugs and unexpected/unintended behaviors to surface. I'd also feel better if at least one important endtoend test which relies on on-demand heartbeats that was previously setting an explicit duration value no longer set a default so that we're testing this new behavior.

@@ -52,8 +52,7 @@ var (

// ReplTracker tracks replication lag.
type ReplTracker struct {
mode string
forceHeartbeat bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so in effect the old forceHeartbeat behavior is always enabled now rather than ONLY being enabled if the on-demand heartbeat interval is > 0.

Along with the new relevant defaults being:

defaultOnDemandDuration  = 10 * time.Second
defaultHeartbeatInterval = 1 * time.Second

Copy link
Contributor Author

@shlomi-noach shlomi-noach Jun 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct!

go/vt/vttablet/tabletserver/repltracker/writer.go Outdated Show resolved Hide resolved

var (
// Can be overridden by unit tests
defaultOnDemandDuration = 10 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't we make this the default for the --heartbeat_on_demand_duration flag? That way the default is visible to users.

It also simplifies the logic here IMO, as then the two modes are HeartbeatConfigTypeOnDemand and HeartbeatConfigTypeAlways. It feels awkward to me that HeartbeatConfigTypeNone effectively becomes HeartbeatConfigTypeOnDemand and we use a default duration value that is not exposed/visible to users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't we make this the default for the --heartbeat_on_demand_duration flag?

Because if we made that the default, then that would change behavior to existing users, who perhaps do not want to enable on-demand heartbeat, and want full heartbeat.

That was also why we did not proceed with #15099

Copy link
Contributor

@mattlord mattlord Jun 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then could we make/say that this flag is ignored when --heartbeat_enable is used? I think that's already the case isn't it? So I'm still not sure what we gain by doing things this way. Do you NOT have on demand heartbeats enabled by default as things are in the PR currently? I thought so. And if so, why would we not enforce the default duration via the flag instead of internally?

Maybe the difference is that the PR as-is limits on demand heartbeats to ONLY being used by the tablet throttler specifically whereas changing the flag would affect other components as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then could we make/say that this flag is ignored when --heartbeat_enable is used?

Because again that would break existing setups. Think of the current configuration schema as an ancient mistake. Today we would have made the configuration different, but right now we need to stick to the existing rules.

I think that's already the case isn't it?

It is not.

Today, in main:

  • If non of the flags is defined, there's no heartbeat whatsoever (which is what we're addressing here in this PR, enabling a lease of heartbeats by throttler's demand)
  • If only --heartbeat_enable is set, then heartbeats are constant and always enabled.
  • If --heartbeat_on_demand_duration is defined (whether --heartbeat_enable is set or not) then heartbeat is generally speaking enabled, but only producing heartbeats on-deman (and on open).

This means if someone only has --heartbeat_enable in their setup, and I set the flag default to 10s, this would reduce the generated heartbeats to on-demand duration only, even though the user expects them to be constant and continuous.

However, If we take your suggestion:

could we make/say that this flag is ignored when --heartbeat_enable is used?

Then people already using both flags --heartbeat_enable + --heartbeat_on_demand_duration and who are expecting only a limited lease of heartbeats, wil suddenly see an influx of heartbeats and a bloat of binary logs.

I hope this clarifies a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then people already using both flags --heartbeat_enable + --heartbeat_on_demand_duration and who are expecting only a limited lease of heartbeats, wil suddenly see an influx of heartbeats and a bloat of binary logs.

I thought that they both served the same purpose with regards to the throttler, the only thing --heartbeat_enable got you was sub-second replication lag metrics via the replication reporter vs. instead using the polling based reporter via --enable_replication_reporter. Perhaps I'm wrong on that part?

I thought that using both flags would also potentially lead to failures? #14978

I was thinking that 1) (virtually) nobody is using both 2) if they are, they shouldn't be as that can case failures 3) we could take this opportunity to clean this up a bit. Maybe I misunderstood though?

In any event, you've explained that it's more complicated than I had initially thought and we don't need to solve these issues in this PR. 🙂 Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could take this opportunity to clean this up a bit.

I'm in favor. I don't like the existing configuration setup. If we can change things in one shot and without disrupting existing setups, I'm happy to. In which case, as you say, the changes in this PR can be greatly simplified. Let's perhaps make that discussion tomorrow.

go/vt/vttablet/tabletserver/repltracker/writer.go Outdated Show resolved Hide resolved
t.Run("exhaust lease, no heartbeats", func(t *testing.T) {
<-time.After(tw.onDemandDuration)
currentWrites := writes.Get()
<-time.After(3 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should use the onDemandDuration value again here and everywhere else that we're using the 3 second literal in these tests.

Copy link
Contributor Author

@shlomi-noach shlomi-noach Jun 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually explicitly using a different value. It would be confusing to reuse onDemandDuration. What we want to say is: after onDemandDuration is expired, no new heartbeats are written, regardless of how many seconds we wait.

If we reused onDemandDuration, that could imply to the reader that there is come mysterious connection, where after the duration expires, you need it to expire again so as to eliminate heartbeats.

Does that make sense?

}
tw.Open()
defer tw.Close()
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference, but I don't see the value that these code blocks add and use of them is pretty non-standard within Vitess.

Copy link
Contributor Author

@shlomi-noach shlomi-noach Jun 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll explain why I use this: it's allowing me to redeclare a variable (rateLimiter := in this case) that I do not need to fear is "leaking" elsewhere. Also, I dislike how in unit testing you want to duplicate some tests, but golang requires that the first assignment be := and then you can only use =. But then as you shuffle things, you need to again modify := to = or = to := in places that are unrelated to your change. This causes git diffs that are irrelevant to your change (not unlike trailing comma in a table definition on an unrelated index when you add a new index).
So I feel like these code blocks help me isolate:

  • actual code/variables
  • git diffs

It's a personal preference and I recognize this is not standard in Vitess. FWIW I've already spread that around quite a bit.

assert.NotNil(t, rateLimiter)
}
t.Run("open, initial heartbeats", func(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), tw.onDemandDuration-time.Second)
Copy link
Contributor

@mattlord mattlord May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to use the -time.Second part? That would put us at 2 seconds and it seems like it may be possible that we then only tick with the ticker once before the context is done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The reasoning is that I don't want to get past the expiration of the heartbeat lease. This function is checking:

  • That heartbeats are generated when on-demand is engaged (upon Open() in this case)
  • And that they keep being generated for the duration of the on-demand lease. The heartbeat interval in this test is 250ms, and we test every 1 second, just shy of the expiration.

If we go past the expiration, then we can wrongly condlude that heartbeats are not being produced when they should (they shouldn't, past the expiration).

Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
…te we still generate heartbeats when both this flag as well as '--heartbeat_enable' are disabled

Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach
Copy link
Contributor Author

I'd also feel better if at least one important endtoend test which relies on on-demand heartbeats that was previously setting an explicit duration value no longer set a default so that we're testing this new behavior.

Done in 236f074, affecting onlineddl_vrepl test.

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this! ❤️

@shlomi-noach
Copy link
Contributor Author

If we're not going to merge it for RC1, then I have further simplification for the concurrency logic, effectively reducing n concurrent goroutines to 1.

@shlomi-noach
Copy link
Contributor Author

@shlomi-noach
Copy link
Contributor Author

All right. Since we're still here, I did merge throttler-dynamic-heartbeat-2, see in particular a8673c1.

@timvaillancourt
Copy link
Contributor

Nice work on this RFC and PR, I like the idea and code 👍. I'm giving this a stamp but I suggest there's a 3rd reviewer as I'm still new at this 😅

Related and certainly out-of-scope, this reminded me of a hack Slack made to enable the "constant" heartbeat on some keyspaces: slackhq#207. TL;DR: it's difficult to enable the heartbeat on a live keyspace that never had it - if you add --enable-heartbeat on the replicas first lag will grow forever because the primary hasn't enabled the heartbeat yet. I saw mention of deferred refactoring - let me know if I can help with that and resolve this upstream while I'm at it!

Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach
Copy link
Contributor Author

Call for additional reviews 🙏

@frouioui frouioui modified the milestones: v20.0.0, v21.0.0 Jun 20, 2024
Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had only a couple of nits. This is much cleaner than the previous forceHeartbeat implementation 👍

@@ -54,6 +54,23 @@ func TestRateLimiterShort(t *testing.T) {
assert.Less(t, val, 10)
}

func TestRateLimiterAllowOne(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always nice to see a new unit test for a new function :)

@@ -33,12 +33,17 @@ import (
"vitess.io/vitess/go/vt/vttablet/tabletserver/tabletenv"
)

const (
testKeyspaceShard = "test:0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two nits:

  • This constant seems to be used exactly once. What is the value of defining it outside of the function where it is used?
  • The standard delimiter is / so that this would be test/0. It would be even better to use test/- which we have been moving to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constant used to be used elsewhere -- but that has changed. Good catch. Now inside to the function and is test/0.

@shlomi-noach
Copy link
Contributor Author

Doc update is in vitessio/website#1762

@shlomi-noach shlomi-noach merged commit 79c54e5 into vitessio:main Jul 3, 2024
94 checks passed
@shlomi-noach shlomi-noach deleted the throttler-dynamic-heartbeat branch July 3, 2024 05:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change needed in heartbeat setup due to dynamic throttler config
6 participants