
kvserver: allow reads on circuit-broken replicas #74799

Closed
tbg opened this issue Jan 13, 2022 · 9 comments
@tbg
Member

tbg commented Jan 13, 2022

Touches #33007.

Is your feature request related to a problem? Please describe.
As of #71806, when a replica's circuit breaker trips, reads bounce on the breaker. This can be unnecessarily strict: if the replica holds a valid lease, the read might well succeed.

Reads may still get stuck, however, since inflight replication proposals hold latches that can block them.

Describe the solution you'd like

Serve reads "when it is possible" (i.e. when they don't get stuck).

Two solutions present themselves:

Option 1: use two circuit breakers

To the replication circuit breaker (that's the one we have today), add a latching circuit breaker which trips when latch acquisition takes a long time (or alternatively, if a latch is held for a long time, which catches more problems but may also be more prone to false positives).

Both breakers are checked on the write path, i.e. writes fail fast if either is tripped.
For reads, only the latching breaker is checked, i.e. if the replication circuit breaker is tripped, reads continue to be served (or at least attempted) until the latching breaker trips as well. A rough sketch of this check follows.
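For illustration, a minimal self-contained sketch of that request-path check; the breaker and replica types here are toy stand-ins, not the actual kvserver or `circuit` package API:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// breaker is a toy stand-in for a real circuit breaker.
type breaker struct {
	tripped atomic.Bool
	reason  string
}

// Err returns nil while the breaker is healthy and an error once it trips.
func (b *breaker) Err() error {
	if b.tripped.Load() {
		return errors.New("breaker tripped: " + b.reason)
	}
	return nil
}

// replicaBreakers pairs the existing replication breaker with the proposed
// latching breaker that trips when latch acquisition takes too long.
type replicaBreakers struct {
	replication breaker
	latching    breaker
}

// checkRequest applies the policy sketched above: writes fail fast if either
// breaker is tripped, reads fail fast only on the latching breaker.
func (rb *replicaBreakers) checkRequest(isReadOnly bool) error {
	if err := rb.latching.Err(); err != nil {
		return err
	}
	if !isReadOnly {
		return rb.replication.Err()
	}
	return nil
}

func main() {
	var rb replicaBreakers
	rb.replication.reason = "replication stalled"
	rb.replication.tripped.Store(true)

	fmt.Println("read: ", rb.checkRequest(true))  // read:  <nil> (still attempted)
	fmt.Println("write:", rb.checkRequest(false)) // write: breaker tripped: replication stalled
}
```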

Option 2: check transitive dependencies for writes during latching when the breaker is open

We stick with just one breaker. If a read is waiting for latches and the breaker trips (or is tripped to begin with), visit the transitive set of dependencies and fail fast if it contains a mutation.

This is probably awkward to implement.
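For what it's worth, a minimal sketch of the transitive walk this would require; the `latch` type and its `waitsFor` edges are invented for illustration and do not correspond to the real spanlatch manager:

```go
package main

import "fmt"

// latch is a toy node in a latch wait-for graph; the real spanlatch manager
// keeps no such explicit graph, which is part of why this option is awkward.
type latch struct {
	isWrite  bool
	waitsFor []*latch // latches this request is currently blocked on
}

// dependsOnWrite reports whether the transitive dependencies of l contain a
// write latch, i.e. whether a waiting read could be stuck behind a mutation.
func dependsOnWrite(l *latch) bool {
	seen := map[*latch]bool{}
	stack := append([]*latch(nil), l.waitsFor...)
	for len(stack) > 0 {
		cur := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[cur] {
			continue
		}
		seen[cur] = true
		if cur.isWrite {
			return true
		}
		stack = append(stack, cur.waitsFor...)
	}
	return false
}

func main() {
	stuckWrite := &latch{isWrite: true} // stuck in replication
	blockedWrite := &latch{isWrite: true, waitsFor: []*latch{stuckWrite}}
	read := &latch{waitsFor: []*latch{blockedWrite}}
	fmt.Println(dependsOnWrite(read)) // true: fail fast while the breaker is open
}
```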

Describe alternatives you've considered

Additional context
Transitively computing latch dependencies also came up in the context of #65099 (comment).

Jira issue: CRDB-12271

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 13, 2022
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to
the Replica gets the fail-fast behavior, not just mutations. This is
because even though a read may look good to serve based on the lease,
we would also need to check for latches, and in particular we would
need to fail-fast if there is a transitive dependency on any write
(since that write is presumably stuck). This is not trivial and so
we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None

touchie
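As an aside, the "one extra goroutine per request" overhead mentioned in the commit message comes from the usual way of joining two cancellation chains in Go; roughly the following pattern (a simplified sketch, not the code from the PR):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// withBreakerCancelation derives a request context that is additionally
// canceled when breakerTripped is closed (i.e. when the breaker trips).
// The watcher goroutine is the per-request overhead referred to above.
func withBreakerCancelation(
	ctx context.Context, breakerTripped <-chan struct{},
) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(ctx)
	go func() {
		select {
		case <-breakerTripped:
			cancel() // fail the in-flight request fast
		case <-ctx.Done():
			// Request finished (or was canceled) normally; stop watching.
		}
	}()
	return ctx, cancel
}

func main() {
	tripped := make(chan struct{})
	ctx, cancel := withBreakerCancelation(context.Background(), tripped)
	defer cancel()

	go func() {
		time.Sleep(10 * time.Millisecond)
		close(tripped) // simulate the breaker tripping mid-request
	}()

	<-ctx.Done()
	fmt.Println("request canceled:", ctx.Err())
}
```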
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
@tbg tbg self-assigned this Jan 14, 2022
@tbg
Member Author

tbg commented Jan 26, 2022

@nvanbenschoten pointed out (internal) a simplification that I had overlooked: since reads don't block on reads during latching, a read that enters the replica while the breaker is tripped can be let in anyway, provided it is "somehow" given a non-blocking latch policy under which an error is returned if a blocking latch is encountered. The only way for a read to block on a latch is to encounter a write, and it is reasonable to assume that such a write latch is held until the replica becomes available again. This isn't quite enough, since reads may also trigger lease acquisitions, which would stall as well, so the read must additionally bail out if the lease is not valid (in effect behaving like a follower read). In fact, follower reads should never encounter latches, so perhaps we can unify things by making follower reads latch-rejectable and simply turning leaseholder-but-tripped reads into follower reads.

Note that there may be some reads that bounce right as the breaker trips, since reads that enter under a healthy breaker will not have this non-blocking policy. Instead, their context will be cancelled as the breaker trips, sending them back up the stack. If we wanted to, we could let them re-enter but it may not be a big enough issue to add complexity for.
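For illustration, a toy sketch of what such an eager non-blocking latch policy amounts to; the types and the `waitPolicy` knob are invented here, and (as the discussion below shows) the idea was ultimately superseded by latch poisoning:

```go
package main

import (
	"errors"
	"fmt"
)

type waitPolicy int

const (
	waitBlock waitPolicy = iota // wait for conflicting latches (default)
	waitError                   // return an error instead of blocking
)

var errWouldBlock = errors.New("read would block on a conflicting write latch")

// toyLatchManager tracks only whether a write latch is currently held over
// the requested span; real latching is span- and timestamp-aware.
type toyLatchManager struct {
	writeLatchHeld bool
}

// acquireRead models a read acquiring latches under a given policy.
func (m *toyLatchManager) acquireRead(p waitPolicy) error {
	if m.writeLatchHeld {
		if p == waitError {
			return errWouldBlock
		}
		// waitBlock: a real implementation would block here until the
		// conflicting write latch is released.
	}
	return nil
}

func main() {
	m := &toyLatchManager{writeLatchHeld: true} // a stuck proposal holds its latches
	fmt.Println(m.acquireRead(waitError))       // fail fast instead of hanging
}
```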

@nvanbenschoten
Member

> This isn't quite enough, since reads may also trigger lease acquisitions, which would stall as well, so the read must additionally bail out if the lease is not valid (in effect behaving like a follower read).

If the lease is not valid then the read will be promoted to a write when trying to acquire the lease. This write will then hit the breaker and return an error. Is there anything more needed?

> Note that there may be some reads that bounce right as the breaker trips, since reads that enter under a healthy breaker will not have this non-blocking policy. Instead, their context will be cancelled as the breaker trips, sending them back up the stack. If we wanted to, we could let them re-enter but it may not be a big enough issue to add complexity for.

This is a good point that indicates that eagerly setting a policy is probably not the right solution.

An alternative that seems to have legs is to replace the non-blocking policy with a way to "poison" latches. Take Rust's support for Mutex poisoning as an example: https://doc.rust-lang.org/std/sync/struct.Mutex.html#poisoning. After a write has been proposed, it can't drop its latches until the proposal completes. However, if the proposed write were to somehow learn about the unavailability (does it already?), it could then poison its latches while continuing to hold them. This would instruct any request that is currently or will later wait on these latches to fail fast instead of waiting.

A few benefits of this change are that:

  1. it does not require eagerly deciding on a wait policy, which bloats the latch manager API and has the race discussed above.
  2. the poisoning is targeted to exactly the set of spans that are stuck on replication. A request would only be rejected if it directly conflicted with a poisoned latch, not if it indirectly conflicted through a series of dependent latches.
  3. the behavior would apply to writes blocking on latches in the same way as it would apply to reads.

I think this kind of approach could allow us to push down the replication circuit breaker into the replication layer (e.g. down into Replica.propose), instead of dealing with context cancellation all the way up in Replica.sendWithoutRangeID.
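To make the poisoning idea concrete, here is a self-contained toy sketch (not the real `concurrency` package): a stuck proposal keeps holding its latch but marks it poisoned, and any request waiting on it fails fast instead of blocking.

```go
package main

import (
	"errors"
	"fmt"
)

// toyLatch is held for the duration of a request. A stuck write cannot
// release it, but it can poison it to let waiters fail fast.
type toyLatch struct {
	poisoned bool
	done     chan struct{} // closed when the latch is released
}

func newToyLatch() *toyLatch { return &toyLatch{done: make(chan struct{})} }

// poison marks the latch as poisoned while continuing to hold it.
func (l *toyLatch) poison() { l.poisoned = true }

var errPoisoned = errors.New("conflicting latch is poisoned: replication is unavailable")

// wait blocks until the latch is released, unless it is poisoned. This toy
// version only checks at entry; a real implementation would also wake up
// existing waiters when the latch becomes poisoned.
func (l *toyLatch) wait() error {
	if l.poisoned {
		return errPoisoned
	}
	<-l.done
	return nil
}

func main() {
	stuckWrite := newToyLatch()
	stuckWrite.poison() // the breaker tripped; the proposal poisons its latches

	// A read that conflicts with the poisoned write latch fails fast.
	fmt.Println(stuckWrite.wait())
}
```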

@nvanbenschoten
Member

> If the lease is not valid then the read will be promoted to a write when trying to acquire the lease. This write will then hit the breaker and return an error. Is there anything more needed?

This same line of reasoning should also apply to transaction lock contention. I was concerned that with this latch-poisoning proposal, read requests could get blocked on locks indefinitely due to unavailability. However, that does not seem to be the case.

If the unavailability is on the lock holder's txn record's range, then the read-only request will hit the breaker error when it goes to push the record. Either the range will be unavailable with a lease and so the push will fail directly, or it will be unavailable without a lease and so the push will fail when it tries to acquire the lease.

If the unavailability is on the lock's range then the read-only request will hit the breaker error when it goes to resolve the lock.

Either way, the read-only request should not be blocked indefinitely.

@tbg
Member Author

tbg commented Feb 15, 2022

> If the lease is not valid then the read will be promoted to a write when trying to acquire the lease. This write will then hit the breaker and return an error. Is there anything more needed?

No, that's correct, good point.

> However, if the proposed write were to somehow learn about the unavailability (does it already?)

Not in the way that you probably need. The client's context will be canceled. The proposal will remain in the proposals map. However, we could notify pending proposals in refreshProposalsLocked, which is currently tasked with tripping the breaker (though in the future there may be other mechanisms that do so, too).

```go
func (r *Replica) refreshProposalsLocked(
	ctx context.Context, refreshAtDelta int, reason refreshRaftReason,
) {
```
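A hedged sketch of what such a notification could look like: when the refresh loop decides replication is stuck, it trips the breaker and poisons the latches of every pending proposal. All types and method names below are toy stand-ins, not the actual kvserver code.

```go
package main

import "fmt"

// Toy stand-ins; the real Replica, proposals map, and breaker look different.
type toyProposal struct{ poisoned bool }

func (p *toyProposal) poisonLatches() { p.poisoned = true }

type toyBreaker struct{ reason string }

func (b *toyBreaker) trip(reason string) { b.reason = reason }

type toyReplica struct {
	breaker   toyBreaker
	proposals map[int64]*toyProposal // keyed by a toy command ID
}

// tripBreakerAndPoisonLocked sketches the hook discussed above: when the
// refresh loop decides replication is stuck, it trips the breaker and
// poisons the latches of every pending proposal so waiters fail fast.
func (r *toyReplica) tripBreakerAndPoisonLocked(reason string) {
	r.breaker.trip(reason)
	for _, p := range r.proposals {
		p.poisonLatches()
	}
}

func main() {
	r := &toyReplica{proposals: map[int64]*toyProposal{1: {}, 2: {}}}
	r.tripBreakerAndPoisonLocked("slow proposal")
	fmt.Println(r.proposals[1].poisoned, r.proposals[2].poisoned) // true true
}
```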

> I think this kind of approach could allow us to push down the replication circuit breaker into the replication layer (e.g. down into Replica.propose), instead of dealing with context cancellation all the way up in Replica.sendWithoutRangeID.

That's a good observation and everything you say above the quote is appealing to me as well. There is a question about what kinds of "unavailability" the circuit breaker is supposed to track, and ideally the answer is "any possible one", which would indicate that moving it completely to the replication layer isn't the right approach (as we'd want "slow requests" of any type to consider tripping the breaker, and its probe to exercise bits that require not only replication). But we could still move large parts of it down, definitely the replication/proposal-centric pieces.

@nvanbenschoten
Member

> There is a question about what kinds of "unavailability" the circuit breaker is supposed to track, and ideally the answer is "any possible one", which would indicate that moving it completely to the replication layer isn't the right approach (as we'd want "slow requests" of any type to consider tripping the breaker, and its probe to exercise bits that require not only replication).

This is interesting. I did not realize the goal was for the circuit-breaker logic to be generalized beyond replication failures. What other kinds of unavailability do you have in mind? Perhaps stalled disks?

> An alternative that seems to have legs is to replace the non-blocking policy with a way to "poison" latches.

For what it's worth, this wouldn't be very difficult. Here's a basic prototype that includes most of the necessary plumbing: nvanbenschoten@c1f2f47. I'd be happy to review any changes or work together on this issue if that helps, as I want to make sure we don't reduce follower read availability in this release.

@tbg
Member Author

tbg commented Feb 17, 2022

@lunevalex made the good point that follower reads may not "always" be affected in case of a quorum loss. In steady state, only the leaseholder proposes commands. If quorum is lost, since the relevant leases are epoch-based, the lease remains valid (unless liveness heartbeats are affected, which for now I'll assume they're not). So followers should find no reason to trip their circuit breaker; they would only do so when trying to replicate a command, which they would only do to acquire a lease. The leaseholder, on the other hand, would trip (assuming it saw write activity), so we would be in the weird situation in which the leaseholder can't serve follower reads but the followers can. That discrepancy is unfortunate, and your suggestion would fix it. We should have a crack at it.

> This is interesting. I did not realize the goal was for the circuit-breaker logic to be generalized beyond replication failures. What other kinds of unavailability do you have in mind? Perhaps stalled disks?

Slow evaluation, deadlocks, that sort of thing, but we haven't really had a conversation. Disk health should be adequately represented by replication; if disks are "slow" but fast enough to replicate commands in time, I don't think the circuit breaker should necessarily get involved unless we're really confident that traffic should be redirected elsewhere (which may not even be possible; if the lease is local, then we're the only place this traffic can be served from).

@tbg
Member Author

tbg commented Feb 18, 2022

Working on #76673 I had another thought, which is that the poisoning idea gets mildly more complex if we allow certain requests to bypass the breaker. Basically, if you decide to wait on a poisoned latch anyway, you need to poison your own latch, too. (I think that's it).
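A tiny self-contained sketch of that rule: a request that chooses to wait on a poisoned latch anyway must poison its own latch, so that requests queued behind it still fail fast. The policy names and types here are invented for illustration.

```go
package main

import "fmt"

type poisonPolicy int

const (
	policyError poisonPolicy = iota // fail fast on a poisoned latch
	policyWait                      // wait anyway (e.g. breaker-bypassing requests)
)

type toyLatch struct{ poisoned bool }

func (l *toyLatch) poison() { l.poisoned = true }

// waitOn models encountering a poisoned latch under a given policy. A request
// that decides to wait must poison its own latch, so that requests queued
// behind it inherit the fail-fast signal.
func waitOn(conflicting, own *toyLatch, p poisonPolicy) error {
	if conflicting.poisoned {
		if p == policyError {
			return fmt.Errorf("poisoned latch encountered")
		}
		own.poison()
	}
	// ...then wait for the conflicting latch as usual (elided).
	return nil
}

func main() {
	stuck := &toyLatch{poisoned: true}
	mine := &toyLatch{}
	_ = waitOn(stuck, mine, policyWait)
	fmt.Println(mine.poisoned) // true: poison propagated
}
```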

@tbg
Member Author

tbg commented Feb 21, 2022

Nathan and I met late last week to discuss this. Allowing requests to opt out of fail-fast behavior required some tweaks to Nathan's prototype above. I am starting to polish the result in #76858.

The poisoning machinery works like a charm. The remaining challenge amounts to a fairly extensive rework of the Replica circuit breaker plumbing, where we get rid of context cancellation as a primary mechanism. There are some technical challenges to solve (and fixing up the tests might reveal more issues), but nothing insurmountable.

tbg added a commit to tbg/cockroach that referenced this issue Feb 23, 2022
Informs cockroachdb#74799. I hope it will ultimately close it, but not quite there
yet. I left extensive `TODO(during review)` comments.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Mar 1, 2022
tbg added a commit to tbg/cockroach that referenced this issue Mar 2, 2022
This commit revamps an earlier implementation (cockroachdb#71806) of per-Replica
circuit breakers (cockroachdb#33007). The earlier implementation relied on context
cancellation and coarsely failed all requests addressing the Replica
when the breaker was tripped.

This had two downsides: First, there was a (small) performance overhead
for implementing the cancellation that was paid even in the common case
of a healthy Replica. Second, and more importantly, the coarseness meant
that we'd potentially fail many requests that would otherwise succeed,
and in particular follower reads.

@nvanbenschoten suggested in cockroachdb#74799 that latching could be extended with
the concept of "poisoning" and that this could result in fine-grained
circuit breaker behavior where only requests that are truly affected by
unavailability (at the replication layer) would be rejected.

This commit implements that strategy:

A request's latches are poisoned if its completion is predicated on the
replication layer being healthy.  In other words, when the breaker
trips, all inflight proposals have their latches poisoned and new
proposals are failed fast. However, and this is the big difference,
reads can potentially still be accepted in either of two scenarios:

- a valid follower read remains valid regardless of the circuit breaker
  status, and also regardless of inflight proposals (follower reads
  don't observe latches).
- a read that can be served under the current lease and which does
  not conflict with any of the stuck proposals in the replication
  layer (= poisoned latches) can also be served.

In short, reads only fail fast if they encounter a poisoned latch or
need to request a lease. (If they opted out of fail-fast behavior,
they behave as today).

Latch poisoning is added as a first-class concept in the `concurrency`
package, and a structured error `PoisonError` is introduced. This error
in particular contains the span and timestamp of the poisoned latch that
prompted the fail-fast.

Lease proposals now always use `poison.Policy_Wait`, leaving the
fail-fast behavior to the caller. This simplifies things since multiple
callers with their own `poison.Policy` can multiplex onto a single
inflight lease proposal.

Addresses cockroachdb#74799.

Release note: None
Release justification: 22.1 project work
tbg added a commit to tbg/cockroach that referenced this issue Mar 3, 2022
craig bot pushed a commit that referenced this issue Mar 3, 2022
76858: kvserver: allow circuit-breaker to serve reads r=erikgrinaker a=tbg

77221: changefeedccl: Fix flaky test. r=miretskiy a=miretskiy

Fix flaky TestChangefeedHandlesDrainingNodes test.
The source of the flake was that cluster setting updates propagate
asynchronously to the other nodes in the cluster.  Thus, it was possible
for the test to flake because some of the nodes were observing the
old value for the setting.

The flake is fixed by introducing testing utility function that
sets the setting and ensures the setting propagates to all nodes in
the test cluster.

Fixes #76806

Release Notes: none

77317: util/log: remove Safe in favor of redact.Safe r=yuzefovich a=yuzefovich

My desire to make this change is to break the dependency of `treewindow`
on `util/log` (which is a part of the effort to clean up the
dependencies of `execgen`).

Addresses: #77234.

Release note: None

Release justification: low risk change to clean up the dependencies
a bit.

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@tbg
Member Author

tbg commented Mar 3, 2022

Implemented in #76858.

@tbg tbg closed this as completed Mar 3, 2022