circuit: add probing-based circuit breaker #73641
Conversation
77d948a to 046008f
Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
-- commits, line 5 at r1:
Consider a link to the relevant issue(s) too.
pkg/util/circuit/circuitbreaker.go, line 77 at r1 (raw file):
// Signal is low-overhead and thus suitable for use in hot code paths.
func (b *Breaker) Signal() (chan struct{}, func() error) {
	return b.errAndCh().ch, b.definitelyErr
If the caller has a need to see the tripping error, shouldn't we make sure to propagate the correct one?
pkg/util/circuit/circuitbreaker.go, line 91 at r1 (raw file):
// already tripped), the error that would be returned from subsequent calls to
// Err() is returned.
func (b *Breaker) Report(err error) error {
nit: maybe give the return value a name here, to clarify that, unusually, an error response indicates success.
Why do we even return the error here? Looks like there's only a single test call that checks it. Do we expect anyone to care about it?
pkg/util/circuit/circuitbreaker.go, line 106 at r1 (raw file):
}
opts := b.Opts()
err = opts.ShouldTrip(err)
nit: let's allow this to be unspecified too.
pkg/util/circuit/circuitbreaker.go, line 220 at r1 (raw file):
// from Breaker.Err(), i.e. `errors.Is(err, ErrBreakerOpen())` can be
// used to check whether an error originated from some Breaker.
func ErrBreakerOpen() error {
nit: why is this a function, rather than exporting errBreakerOpen?
pkg/util/circuit/circuitbreaker.go, line 224 at r1 (raw file):
}
type errAndCh struct {
nit: since we now protect this with a mutex, I don't think there's any point in having this type anymore? We can just use separate fields in Breaker.
pkg/util/circuit/options.go, line 18 at r1 (raw file):
type Options struct {
	// Name is the name of a Breaker and will be mentioned in the errors
	// that particular Breaker generates.
nit: "that a particular"
pkg/util/circuit/options.go, line 24 at r3 (raw file):
// `Breaker.Report`. In particular, it can return nil to ignore certain
// errors.
ShouldTrip func(err error) error
nit: "should" implies a boolean response, and the error semantics here are unusual. Consider clarifying the name, or using two separate functions: one to decide whether to trip the breaker, and one to annotate the error.
83fb87b to 7c26c5e
Thanks for the review! I made changes that I would call substantial and I'm pretty happy with the result.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/util/circuit/circuitbreaker.go, line 91 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: maybe give the return value a name here, to clarify that, unusually, an error response indicates success.
Why do we even return the error here? Looks like there's only a single test call that checks it. Do we expect anyone to care about it?
The original motivation for returning the error was so that the caller could log the annotated error. But this is sort of going the wrong way: we don't want to make it look like that caller failed on the breaker, so it is better to log the un-wrapped error. Removed the error return, which also means the linter won't yell at us for not handling it.
pkg/util/circuit/circuitbreaker.go, line 220 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: why is this a function, rather than exporting errBreakerOpen?
So that you can't do `circuit.ErrBreakerOpen = something`. I realize there's plenty of precedent for ignoring that "problem". Switched to `var ErrBreakerOpen`.
pkg/util/circuit/circuitbreaker.go, line 224 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: since we now protect this with a mutex, I don't think there's any point in having this type anymore? We can just use separate fields in Breaker.
This was initially left over from the `unsafe.Pointer` approach, but in the meantime I have removed this type and found that I needed to re-add it, as it seems like the best way to get error stability from Signal() as well as avoiding allocations in that path.
pkg/util/circuit/options.go, line 24 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: "should" implies a boolean response, and the error semantics here are unusual. Consider clarifying the name, or using two separate functions: one to decide whether to trip the breaker, and one to annotate the error.
I removed this option altogether, as I don't anticipate a need for it (and if we ever do, easy enough to add back).
7c26c5e to 5bc60ff
Nice, I really like the changes!
Reviewed 30 of 30 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/util/circuit/circuitbreaker.go, line 114 at r4 (raw file):
// We get to write the error since we have exclusive access via b.mu.
b.mu.errAndCh.err = storeErr
close(b.mu.errAndCh.ch)
I had to read this a couple of times before I got why this closes the initial channel on the first error. Maybe add a comment clarifying how only on the nil -> !nil transition do we close the channel that actually has any waiters. After that, we close it pre-emptively for later callers.
pkg/util/circuit/circuitbreaker.go, line 115 at r4 (raw file):
b.mu.errAndCh.err = storeErr
close(b.mu.errAndCh.ch)
b.mu.Unlock()
nano-optimization (?): we can unlock before closing.
pkg/util/circuit/circuitbreaker.go, line 213 at r4 (raw file):
select {
case <-eac.ch:
	eac.maybeTriggerProbe()
Why are we triggering the probe here?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/util/circuit/circuitbreaker.go, line 115 at r4 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nano-optimization (?): we can unlock before closing.
Actually, we clearly can't without another reference to the channel. Doesn't seem worth it, nevermind.
5bc60ff to 57745fe
These are initially for use in cockroachdb#71806 but were originally conceived of for cockroachdb#70485, which we are not currently prioritizing.

Importantly, this circuit breaker does not recruit a fraction of requests to do the probing, which is desirable for both PR cockroachdb#71806 and PR cockroachdb#70485; requests recruited as probes tend to incur high latency and errors, and we don't want SQL client traffic to experience those.

Touches cockroachdb#33007.

Release note: None
57745fe to eed8603
TFTR! Will merge on green (tomorrow).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/util/circuit/circuitbreaker.go, line 213 at r4 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Why are we triggering the probe here?
I wrote more comments about how this works in general. Basically, when a probe has returned and failed to heal the breaker, we're going to launch the next one only reactively, after the next client bounces off the circuit breaker (which means a call to Err() that returns non-nil). I think this behavior makes sense (if a probe wants to keep trying for 10 minutes, it can do that rather easily); added a test for it as well.
... or today. bors r=erikgrinaker
Build succeeded.
Fixes cockroachdb#33007. Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang. Concretely, every request to the Replica gets a cancelable context that is sensitive to the circuit breaker tripping (relying on context cancellation makes sure we don't end up with a second channel that needs to be plumbed to everyone and their dog; Go makes it really difficult to join two cancellation chains); if the breaker is tripped when the request arrives, it is refused right away. In either case, the outgoing error is augmented to carry information about the tripped breaker. This isn't 100% trivial, since retreating from in-flight replication typically causes an `AmbiguousResultError`, and while we could avoid it in some cases we can't in all. A similar problem occurs when the canceled request was waiting on a lease, in which case the result is a NotLeaseholderError.

For any request that made it in, if it enters replication but does not manage to succeed within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped, cancelling all inflight and future requests until the breaker heals. Perhaps surprisingly, the existing "slowness" detection (the informational "have been waiting ... for proposal" warning in `executeWriteBatch`) was moved deeper into replication (`refreshProposalsLocked`), where it now trips the breaker. This has the advantage of providing a unified path for lease requests (which don't go through `executeWriteBatch`) and pipelined writes (which return before waiting on the inflight replication process). To make this work, we need to pick a little fight with how leases are (not) refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal was first inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a `ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is roughly the case whenever replicating a command through Raft is possible for the Replica, either by appending to the log as the Raft leader, or by forwarding to the Raft leader. A tiny bit of special casing allows requests made by the probe to bypass the breaker.

As of this PR, all of this is off by default via the `kv.replica_circuit_breakers.enabled` cluster setting while we determine which of the many follow-up items to target for 22.1 (see cockroachdb#74705). Primarily though, the breaker-sensitive context cancelation is a toy implementation that comes with too much of a performance overhead (one extra goroutine per request); cockroachdb#74707 will address that. Other things not done here that we certainly want in the 22.1 release are:

- UI work (cockroachdb#74713)
- Metrics (cockroachdb#74505)

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
Fixes cockroachdb#33007. Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang. Concretely, every request to the Replica gets a cancelable context that is sensitive to the circuit breaker tripping (relying on context cancellation makes sure we don't end up with a second channel that needs to be plumbed to everyone and their dog; Go makes it really difficult to join two cancellation chains); if the breaker is tripped when the request arrives, it is refused right away. In either case, the outgoing error is augmented to carry information about the tripped breaker. This isn't 100% trivial, since retreating from in-flight replication typically causes an `AmbiguousResultError`, and while we could avoid it in some cases we can't in all. A similar problem occurs when the canceled request was waiting on a lease, in which case the result is a NotLeaseholderError.

For any request that made it in, if it enters replication but does not manage to succeed within the timeout specified by the `kv.replica_circuit_breaker.slow_replication_threshold` cluster setting, the breaker is tripped, cancelling all inflight and future requests until the breaker heals. Perhaps surprisingly, the existing "slowness" detection (the informational "have been waiting ... for proposal" warning in `executeWriteBatch`) was moved deeper into replication (`refreshProposalsLocked`), where it now trips the breaker. This has the advantage of providing a unified path for lease requests (which don't go through `executeWriteBatch`) and pipelined writes (which return before waiting on the inflight replication process). To make this work, we need to pick a little fight with how leases are (not) refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal was first inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to the Replica gets the fail-fast behavior, not just mutations. This is because even though a read may look good to serve based on the lease, we would also need to check for latches, and in particular we would need to fail-fast if there is a transitive dependency on any write (since that write is presumably stuck). This is not trivial and so we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a `ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is roughly the case whenever replicating a command through Raft is possible for the Replica, either by appending to the log as the Raft leader, or by forwarding to the Raft leader. A tiny bit of special casing allows requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the entire mechanism) until some necessary follow-up items have been addressed (see cockroachdb#74705). For example, the breaker-sensitive context cancelation is a toy implementation that comes with too much of a performance overhead (one extra goroutine per request); cockroachdb#74707 will address that. Other things not done here that we certainly want in the 22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
71806: kvserver: circuit-break requests to unavailable ranges r=erikgrinaker a=tbg

Fixes #33007. Closes #61311.

This PR uses the circuit breaker package introduced in #73641 to address issue #33007: When a replica becomes unavailable, it should eagerly refuse traffic that it believes would simply hang. Concretely, every request to the Replica gets a cancelable context that is sensitive to the circuit breaker tripping (relying on context cancellation makes sure we don't end up with a second channel that needs to be plumbed to everyone and their dog; Go makes it really difficult to join two cancellation chains); if the breaker is tripped when the request arrives, it is refused right away. In either case, the outgoing error is augmented to carry information about the tripped breaker. This isn't 100% trivial, since retreating from in-flight replication typically causes an `AmbiguousResultError`, and while we could avoid it in some cases we can't in all. A similar problem occurs when the canceled request was waiting on a lease, in which case the result is a NotLeaseholderError.

For any request that made it in, if it enters replication but does not manage to succeed within `base.SlowRequestThreshold` (15s at time of writing), the breaker is tripped, cancelling all inflight and future requests until the breaker heals. Perhaps surprisingly, the existing "slowness" detection (the informational "have been waiting ... for proposal" warning in `executeWriteBatch`) was moved deeper into replication (`refreshProposalsLocked`), where it now trips the breaker. This has the advantage of providing a unified path for lease requests (which don't go through `executeWriteBatch`) and pipelined writes (which return before waiting on the inflight replication process). To make this work, we need to pick a little fight with how leases are (not) refreshed (#74711) and we need to remember the ticks at which a proposal was first inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a `ProbeRequest` (introduced in #72972) to determine when to heal; this is roughly the case whenever replicating a command through Raft is possible for the Replica, either by appending to the log as the Raft leader, or by forwarding to the Raft leader. A tiny bit of special casing allows requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the `kv.replica_circuit_breakers.enabled` cluster setting while we determine which of the many follow-up items to target for 22.1 (see #74705). For example, the breaker-sensitive context cancelation is a toy implementation that comes with too much of a performance overhead (one extra goroutine per request); #74707 will address that. Other things not done here that we certainly want in the 22.1 release are UI work (#74713) and metrics (#74505).

The release note is deferred to #74705 (where breakers are turned on).

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>