
kvserver: allow reads on circuit-broken replicas #74799

Closed
tbg opened this issue Jan 13, 2022 · 9 comments
@tbg
Member

tbg commented Jan 13, 2022

Touches #33007.

Is your feature request related to a problem? Please describe.
As of #71806, when a replica's circuit breaker trips, reads bounce on the breaker. This can be unnecessarily strict: if the replica holds a valid lease, the read might well succeed.

Reads may still get stuck, however, since inflight replication proposals hold latches that can block them.

Describe the solution you'd like

Serve reads "when it is possible" (i.e. when they don't get stuck).

Two solutions present themselves:

Option 1: use two circuit breakers

To the replication circuit breaker (that's the one we have today), add a latching circuit breaker which trips when latch acquisition takes a long time (or alternatively, if a latch is held for a long time, which catches more problems but may also be more prone to false positives).

Both breakers are checked on the write path, i.e. writes fail fast if either is tripped.
For reads, only the latching breaker is checked, i.e. if the replication circuit breaker is tripped, reads continue to be served (or at least attempted) until the latching breaker trips as well. A rough sketch of this check follows.
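For illustration, a minimal self-contained sketch of that request-path check; the breaker and replica types here are toy stand-ins, not the actual kvserver or `circuit` package API:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// breaker is a toy stand-in for a real circuit breaker.
type breaker struct {
	tripped atomic.Bool
	reason  string
}

// Err returns nil while the breaker is healthy and an error once it trips.
func (b *breaker) Err() error {
	if b.tripped.Load() {
		return errors.New("breaker tripped: " + b.reason)
	}
	return nil
}

// replicaBreakers pairs the existing replication breaker with the proposed
// latching breaker that trips when latch acquisition takes too long.
type replicaBreakers struct {
	replication breaker
	latching    breaker
}

// checkRequest applies the policy sketched above: writes fail fast if either
// breaker is tripped, reads fail fast only on the latching breaker.
func (rb *replicaBreakers) checkRequest(isReadOnly bool) error {
	if err := rb.latching.Err(); err != nil {
		return err
	}
	if !isReadOnly {
		return rb.replication.Err()
	}
	return nil
}

func main() {
	var rb replicaBreakers
	rb.replication.reason = "replication stalled"
	rb.replication.tripped.Store(true)

	fmt.Println("read: ", rb.checkRequest(true))  // read:  <nil> (still attempted)
	fmt.Println("write:", rb.checkRequest(false)) // write: breaker tripped: replication stalled
}
```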

Option 2: check transitive dependencies for writes during latching when the breaker is open

We stick with just one breaker. If a read is waiting for latches and the breaker trips (or is tripped to begin with), visit the transitive set of dependencies and fail fast if it contains a mutation.

This is probably awkward to implement.
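For what it's worth, a minimal sketch of the transitive walk this would require; the `latch` type and its `waitsFor` edges are invented for illustration and do not correspond to the real spanlatch manager:

```go
package main

import "fmt"

// latch is a toy node in a latch wait-for graph; the real spanlatch manager
// keeps no such explicit graph, which is part of why this option is awkward.
type latch struct {
	isWrite  bool
	waitsFor []*latch // latches this request is currently blocked on
}

// dependsOnWrite reports whether the transitive dependencies of l contain a
// write latch, i.e. whether a waiting read could be stuck behind a mutation.
func dependsOnWrite(l *latch) bool {
	seen := map[*latch]bool{}
	stack := append([]*latch(nil), l.waitsFor...)
	for len(stack) > 0 {
		cur := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[cur] {
			continue
		}
		seen[cur] = true
		if cur.isWrite {
			return true
		}
		stack = append(stack, cur.waitsFor...)
	}
	return false
}

func main() {
	stuckWrite := &latch{isWrite: true} // stuck in replication
	blockedWrite := &latch{isWrite: true, waitsFor: []*latch{stuckWrite}}
	read := &latch{waitsFor: []*latch{blockedWrite}}
	fmt.Println(dependsOnWrite(read)) // true: fail fast while the breaker is open
}
```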

Describe alternatives you've considered

Additional context
Transitively computing latch dependencies also came up in the context of #65099 (comment).

Jira issue: CRDB-12271

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 13, 2022
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to
the Replica gets the fail-fast behavior, not just mutations. This is
because even though a read may look good to serve based on the lease,
we would also need to check for latches, and in particular we would
need to fail-fast if there is a transitive dependency on any write
(since that write is presumably stuck). This is not trivial and so
we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None

touchie
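As an aside, the "one extra goroutine per request" overhead mentioned in the commit message comes from the usual way of joining two cancellation chains in Go; roughly the following pattern (a simplified sketch, not the code from the PR):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// withBreakerCancelation derives a request context that is additionally
// canceled when breakerTripped is closed (i.e. when the breaker trips).
// The watcher goroutine is the per-request overhead referred to above.
func withBreakerCancelation(
	ctx context.Context, breakerTripped <-chan struct{},
) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(ctx)
	go func() {
		select {
		case <-breakerTripped:
			cancel() // fail the in-flight request fast
		case <-ctx.Done():
			// Request finished (or was canceled) normally; stop watching.
		}
	}()
	return ctx, cancel
}

func main() {
	tripped := make(chan struct{})
	ctx, cancel := withBreakerCancelation(context.Background(), tripped)
	defer cancel()

	go func() {
		time.Sleep(10 * time.Millisecond)
		close(tripped) // simulate the breaker tripping mid-request
	}()

	<-ctx.Done()
	fmt.Println("request canceled:", ctx.Err())
}
```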
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
@tbg tbg self-assigned this Jan 14, 2022
@tbg
Member Author

tbg commented Jan 26, 2022

@nvanbenschoten pointed out (internal) a simplification that I had overlooked: since reads don't block on reads during latching, a read that enters the replica while the breaker is tripped can be let in anyway, provided it is "somehow" given a non-blocking latch policy under which an error is returned if a blocking latch is encountered. The only way for a read to block on a latch is to encounter a write, and it is reasonable to assume that such a write latch is held until the replica becomes available again. This isn't quite enough, since reads may also trigger lease acquisitions, which would stall as well, so the read must additionally bail out if the lease is not valid (in effect behaving like a follower read). In fact, follower reads should never encounter latches, so perhaps we can unify things by making follower reads latch-rejectable and simply turning leaseholder-but-tripped reads into follower reads.

Note that there may be some reads that bounce right as the breaker trips, since reads that enter under a healthy breaker will not have this non-blocking policy. Instead, their context will be cancelled as the breaker trips, sending them back up the stack. If we wanted to, we could let them re-enter but it may not be a big enough issue to add complexity for.
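For illustration, a toy sketch of what such an eager non-blocking latch policy amounts to; the types and the `waitPolicy` knob are invented here, and (as the discussion below shows) the idea was ultimately superseded by latch poisoning:

```go
package main

import (
	"errors"
	"fmt"
)

type waitPolicy int

const (
	waitBlock waitPolicy = iota // wait for conflicting latches (default)
	waitError                   // return an error instead of blocking
)

var errWouldBlock = errors.New("read would block on a conflicting write latch")

// toyLatchManager tracks only whether a write latch is currently held over
// the requested span; real latching is span- and timestamp-aware.
type toyLatchManager struct {
	writeLatchHeld bool
}

// acquireRead models a read acquiring latches under a given policy.
func (m *toyLatchManager) acquireRead(p waitPolicy) error {
	if m.writeLatchHeld {
		if p == waitError {
			return errWouldBlock
		}
		// waitBlock: a real implementation would block here until the
		// conflicting write latch is released.
	}
	return nil
}

func main() {
	m := &toyLatchManager{writeLatchHeld: true} // a stuck proposal holds its latches
	fmt.Println(m.acquireRead(waitError))       // fail fast instead of hanging
}
```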

@nvanbenschoten
Member

> This isn't quite enough, since reads may also trigger lease acquisitions, which would stall as well, so the read must additionally bail out if the lease is not valid (in effect behaving like a follower read).

If the lease is not valid then the read will be promoted to a write when trying to acquire the lease. This write will then hit the breaker and return an error. Is there anything more needed?

> Note that there may be some reads that bounce right as the breaker trips, since reads that enter under a healthy breaker will not have this non-blocking policy. Instead, their context will be cancelled as the breaker trips, sending them back up the stack. If we wanted to, we could let them re-enter but it may not be a big enough issue to add complexity for.

This is a good point that indicates that eagerly setting a policy is probably not the right solution.

An alternative that seems to have legs is to replace the non-blocking policy with a way to "poison" latches. Take Rust's support for Mutex poisoning as an example: https://doc.rust-lang.org/std/sync/struct.Mutex.html#poisoning. After a write has been proposed, it can't drop its latches until the proposal completes. However, if the proposed write were to somehow learn about the unavailability (does it already?), it could then poison its latches while continuing to hold them. This would instruct any request that is currently or will later wait on these latches to fail fast instead of waiting.

A few benefits of this change are that:

  1. it does not require eagerly deciding on a wait policy, which bloats the latch manager API and has the race discussed above.
  2. the poisoning is targeted to exactly the set of spans that are stuck on replication. A request would only be rejected if it directly conflicted with a poisoned latch, not if it indirectly conflicted through a series of dependent latches.
  3. the behavior would apply to writes blocking on latches in the same way as it would apply to reads.

I think this kind of approach could allow us to push down the replication circuit breaker into the replication layer (e.g. down into Replica.propose), instead of dealing with context cancellation all the way up in Replica.sendWithoutRangeID.
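To make the poisoning idea concrete, here is a self-contained toy sketch (not the real `concurrency` package): a stuck proposal keeps holding its latch but marks it poisoned, and any request waiting on it fails fast instead of blocking.

```go
package main

import (
	"errors"
	"fmt"
)

// toyLatch is held for the duration of a request. A stuck write cannot
// release it, but it can poison it to let waiters fail fast.
type toyLatch struct {
	poisoned bool
	done     chan struct{} // closed when the latch is released
}

func newToyLatch() *toyLatch { return &toyLatch{done: make(chan struct{})} }

// poison marks the latch as poisoned while continuing to hold it.
func (l *toyLatch) poison() { l.poisoned = true }

var errPoisoned = errors.New("conflicting latch is poisoned: replication is unavailable")

// wait blocks until the latch is released, unless it is poisoned. This toy
// version only checks at entry; a real implementation would also wake up
// existing waiters when the latch becomes poisoned.
func (l *toyLatch) wait() error {
	if l.poisoned {
		return errPoisoned
	}
	<-l.done
	return nil
}

func main() {
	stuckWrite := newToyLatch()
	stuckWrite.poison() // the breaker tripped; the proposal poisons its latches

	// A read that conflicts with the poisoned write latch fails fast.
	fmt.Println(stuckWrite.wait())
}
```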

@nvanbenschoten
Member

> If the lease is not valid then the read will be promoted to a write when trying to acquire the lease. This write will then hit the breaker and return an error. Is there anything more needed?

This same line of reasoning should also apply to transaction lock contention. I was concerned that with this latch-poisoning proposal, read requests could get blocked on locks indefinitely due to unavailability. However, that does not seem to be the case.

If the unavailability is on the lock holder's txn record's range, then the read-only request will hit the breaker error when it goes to push the record. Either the range will be unavailable with a lease and so the push will fail directly, or it will be unavailable without a lease and so the push will fail when it tries to acquire the lease.

If the unavailability is on the lock's range then the read-only request will hit the breaker error when it goes to resolve the lock.

Either way, the read-only request should not be blocked indefinitely.

@tbg
Member Author

tbg commented Feb 15, 2022

> If the lease is not valid then the read will be promoted to a write when trying to acquire the lease. This write will then hit the breaker and return an error. Is there anything more needed?

No, that's correct, good point.

> However, if the proposed write were to somehow learn about the unavailability (does it already?)

Not in the way that you probably need. The client's context will be canceled. The proposal will remain in the proposals map. However, we could notify pending proposals in refreshProposalsLocked, which is currently tasked with tripping the breaker (though in the future there may be other mechanisms that do so, too).

```go
func (r *Replica) refreshProposalsLocked(
	ctx context.Context, refreshAtDelta int, reason refreshRaftReason,
) {
```
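A hedged sketch of what such a notification could look like: when the refresh loop decides replication is stuck, it trips the breaker and poisons the latches of every pending proposal. All types and method names below are toy stand-ins, not the actual kvserver code.

```go
package main

import "fmt"

// Toy stand-ins; the real Replica, proposals map, and breaker look different.
type toyProposal struct{ poisoned bool }

func (p *toyProposal) poisonLatches() { p.poisoned = true }

type toyBreaker struct{ reason string }

func (b *toyBreaker) trip(reason string) { b.reason = reason }

type toyReplica struct {
	breaker   toyBreaker
	proposals map[int64]*toyProposal // keyed by a toy command ID
}

// tripBreakerAndPoisonLocked sketches the hook discussed above: when the
// refresh loop decides replication is stuck, it trips the breaker and
// poisons the latches of every pending proposal so waiters fail fast.
func (r *toyReplica) tripBreakerAndPoisonLocked(reason string) {
	r.breaker.trip(reason)
	for _, p := range r.proposals {
		p.poisonLatches()
	}
}

func main() {
	r := &toyReplica{proposals: map[int64]*toyProposal{1: {}, 2: {}}}
	r.tripBreakerAndPoisonLocked("slow proposal")
	fmt.Println(r.proposals[1].poisoned, r.proposals[2].poisoned) // true true
}
```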

> I think this kind of approach could allow us to push down the replication circuit breaker into the replication layer (e.g. down into Replica.propose), instead of dealing with context cancellation all the way up in Replica.sendWithoutRangeID.

That's a good observation and everything you say above the quote is appealing to me as well. There is a question about what kinds of "unavailability" the circuit breaker is supposed to track, and ideally the answer is "any possible one", which would indicate that moving it completely to the replication layer isn't the right approach (as we'd want "slow requests" of any type to consider tripping the breaker, and its probe to exercise bits that require not only replication). But we could still move large parts of it down, definitely the replication/proposal-centric pieces.

@nvanbenschoten
Member

> There is a question about what kinds of "unavailability" the circuit breaker is supposed to track, and ideally the answer is "any possible one", which would indicate that moving it completely to the replication layer isn't the right approach (as we'd want "slow requests" of any type to consider tripping the breaker, and its probe to exercise bits that require not only replication).

This is interesting. I did not realize the goal was for the circuit-breaker logic to be generalized beyond replication failures. What other kinds of unavailability do you have in mind? Perhaps stalled disks?

> An alternative that seems to have legs is to replace the non-blocking policy with a way to "poison" latches.

For what it's worth, this wouldn't be very difficult. Here's a basic prototype that includes most of the necessary plumbing: nvanbenschoten@c1f2f47. I'd be happy to review any changes or work together on this issue if that helps, as I want to make sure we don't reduce follower read availability in this release.

@tbg
Member Author

tbg commented Feb 17, 2022

@lunevalex made the good point that follower reads may not "always" be affected in case of a quorum loss. In steady state, only the leaseholder proposes commands. If quorum is lost, since the relevant leases are epoch-based, the lease remains valid (unless liveness heartbeats are affected, which for now I'll assume they're not). So followers should find no reason to trip their circuit breaker; they would only do so when trying to replicate a command, which they would only do to acquire a lease. The leaseholder, on the other hand, would trip (assuming it saw write activity), so we would be in the weird situation in which the leaseholder can't serve follower reads but the followers can. That discrepancy is unfortunate, and your suggestion would fix it. We should have a crack at it.

> This is interesting. I did not realize the goal was for the circuit-breaker logic to be generalized beyond replication failures. What other kinds of unavailability do you have in mind? Perhaps stalled disks?

Slow evaluation, deadlocks, that sort of thing, but we haven't really had a conversation. Disk health should be adequately represented by replication; if disks are "slow" but fast enough to replicate commands in time, I don't think the circuit breaker should necessarily get involved unless we're really confident that traffic should be redirected elsewhere (which may not even be possible; if the lease is local, then we're the only place this traffic can be served from).

@tbg
Member Author

tbg commented Feb 18, 2022

Working on #76673 I had another thought, which is that the poisoning idea gets mildly more complex if we allow certain requests to bypass the breaker. Basically, if you decide to wait on a poisoned latch anyway, you need to poison your own latch, too. (I think that's it).
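A tiny self-contained sketch of that rule: a request that chooses to wait on a poisoned latch anyway must poison its own latch, so that requests queued behind it still fail fast. The policy names and types here are invented for illustration.

```go
package main

import "fmt"

type poisonPolicy int

const (
	policyError poisonPolicy = iota // fail fast on a poisoned latch
	policyWait                      // wait anyway (e.g. breaker-bypassing requests)
)

type toyLatch struct{ poisoned bool }

func (l *toyLatch) poison() { l.poisoned = true }

// waitOn models encountering a poisoned latch under a given policy. A request
// that decides to wait must poison its own latch, so that requests queued
// behind it inherit the fail-fast signal.
func waitOn(conflicting, own *toyLatch, p poisonPolicy) error {
	if conflicting.poisoned {
		if p == policyError {
			return fmt.Errorf("poisoned latch encountered")
		}
		own.poison()
	}
	// ...then wait for the conflicting latch as usual (elided).
	return nil
}

func main() {
	stuck := &toyLatch{poisoned: true}
	mine := &toyLatch{}
	_ = waitOn(stuck, mine, policyWait)
	fmt.Println(mine.poisoned) // true: poison propagated
}
```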

@tbg
Member Author

tbg commented Feb 21, 2022

Nathan and I met late last week to discuss this. Allowing requests to opt out of fail-fast behavior required some tweaks to Nathan's prototype above. I am starting to polish the result in #76858.

The poisoning machinery works like a charm. The remaining challenge amounts to a fairly extensive rework of the Replica circuit breaker plumbing, where we get rid of context cancellation as a primary mechanism. There are some technical challenges to solve (and fixing up the tests might reveal more issues), but nothing insurmountable.

tbg added a commit to tbg/cockroach that referenced this issue Feb 23, 2022
Informs cockroachdb#74799. I hope it will ultimately close it, but not quite there
yet. I left extensive `TODO(during review)` comments.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Mar 1, 2022
tbg added a commit to tbg/cockroach that referenced this issue Mar 2, 2022
This commit revamps an earlier implementation (cockroachdb#71806) of per-Replica
circuit breakers (cockroachdb#33007). The earlier implementation relied on context
cancellation and coarsely failed all requests addressing the Replica
when the breaker was tripped.

This had two downsides: First, there was a (small) performance overhead
for implementing the cancellation that was paid even in the common case
of a healthy Replica. Second, and more importantly, the coarseness meant
that we'd potentially fail many requests that would otherwise succeed,
and in particular follower reads.

@nvanbenschoten suggested in cockroachdb#74799 that latching could be extended with
the concept of "poisoning" and that this could result in fine-grained
circuit breaker behavior where only requests that are truly affected by
unavailability (at the replication layer) would be rejected.

This commit implements that strategy:

A request's latches are poisoned if its completion is predicated on the
replication layer being healthy.  In other words, when the breaker
trips, all inflight proposals have their latches poisoned and new
proposals are failed fast. However, and this is the big difference,
reads can potentially still be accepted in either of two scenarios:

- a valid follower read remains valid regardless of the circuit breaker
  status, and also regardless of inflight proposals (follower reads
  don't observe latches).
- a read that can be served under the current lease and which does
  not conflict with any of the stuck proposals in the replication
  layer (= poisoned latches) can also be served.

In short, reads only fail fast if they encounter a poisoned latch or
need to request a lease. (If they opted out of fail-fast behavior,
they behave as today).

Latch poisoning is added as a first-class concept in the `concurrency`
package, and a structured error `PoisonError` is introduced. This error
in particular contains the span and timestamp of the poisoned latch that
prompted the fail-fast.

Lease proposals now always use `poison.Policy_Wait`, leaving the
fail-fast behavior to the caller. This simplifies things since multiple
callers with their own `poison.Policy` can multiplex onto a single
inflight lease proposal.

Addresses cockroachdb#74799.

Release note: None
Release justification: 22.1 project work
tbg added a commit to tbg/cockroach that referenced this issue Mar 3, 2022
craig bot pushed a commit that referenced this issue Mar 3, 2022
76858: kvserver: allow circuit-breaker to serve reads r=erikgrinaker a=tbg

77221: changefeedccl: Fix flaky test. r=miretskiy a=miretskiy

Fix flaky TestChangefeedHandlesDrainingNodes test.
The source of the flake was that cluster setting updates propagate
asynchronously to the other nodes in the cluster.  Thus, it was possible
for the test to flake because some of the nodes were observing the
old value for the setting.

The flake is fixed by introducing testing utility function that
sets the setting and ensures the setting propagates to all nodes in
the test cluster.

Fixes #76806

Release Notes: none

77317: util/log: remove Safe in favor of redact.Safe r=yuzefovich a=yuzefovich

My desire to make this change is to break the dependency of `treewindow`
on `util/log` (which is a part of the effort to clean up the
dependencies of `execgen`).

Addresses: #77234.

Release note: None

Release justification: low risk change to clean up the dependencies
a bit.

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@tbg
Member Author

tbg commented Mar 3, 2022

Implemented in #76858.

@tbg tbg closed this as completed Mar 3, 2022