kv: circuit-break requests to unavailable replicas
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation avoids plumbing a second channel to everyone and their
dog; Go makes it really difficult to join two cancellation chains). If
the breaker is already tripped when the request arrives, the request is
refused right away. In either case, the outgoing error is augmented to
carry information about the tripped breaker. This isn't entirely
trivial, since retreating from in-flight replication typically results
in an `AmbiguousResultError`, and while we can avoid that in some cases
we can't in all of them. A similar problem occurs when the canceled
request was waiting on a lease, in which case the result is a
`NotLeaseHolderError`.
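
For illustration, here is a minimal Go sketch of that pattern; the
`breakerSignal` and `withBreaker` names are invented for this example
and are not the helpers added in this PR. It shows the extra goroutine
per request that joins the two cancellation chains (the overhead noted
further below):

    package main

    import (
    	"context"
    	"errors"
    	"fmt"
    	"time"
    )

    // breakerSignal stands in for the circuit breaker: tripped is
    // closed when the breaker trips, and reason records why.
    type breakerSignal struct {
    	tripped chan struct{}
    	reason  error
    }

    // withBreaker derives a context that is canceled either when the
    // parent is canceled or when the breaker trips, whichever happens
    // first. This spawns one goroutine per request.
    func withBreaker(parent context.Context, s *breakerSignal) (context.Context, context.CancelFunc) {
    	ctx, cancel := context.WithCancel(parent)
    	go func() {
    		select {
    		case <-s.tripped:
    			cancel() // breaker tripped: fail the in-flight request
    		case <-ctx.Done():
    			// Request finished or parent canceled; nothing to do.
    		}
    	}()
    	return ctx, cancel
    }

    func main() {
    	sig := &breakerSignal{
    		tripped: make(chan struct{}),
    		reason:  errors.New("replica unavailable: breaker tripped"),
    	}
    	ctx, cancel := withBreaker(context.Background(), sig)
    	defer cancel()

    	// Simulate the breaker tripping while a request is in flight.
    	go func() {
    		time.Sleep(10 * time.Millisecond)
    		close(sig.tripped)
    	}()

    	<-ctx.Done()
    	// Augment the outgoing error with the breaker's reason,
    	// mirroring how the errors above are decorated.
    	fmt.Println(fmt.Errorf("%v: %v", ctx.Err(), sig.reason))
    }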

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at the time
of writing), the breaker is tripped, canceling all in-flight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes
(which return before waiting on the in-flight replication process). To
make this work, we need to pick a little fight with how leases are
(not) refreshed (cockroachdb#74711) and we need to remember the tick at
which a proposal was first inserted (across potential reproposals).
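
As a rough illustration of the tick-based check, here is a sketch;
the names and the tick threshold are made up for this example, and the
real logic lives in `refreshProposalsLocked`:

    package main

    import "fmt"

    // slowProposalThresholdTicks is a made-up stand-in for the
    // threshold derived from base.SlowRequestThreshold and the tick
    // interval.
    const slowProposalThresholdTicks = 75

    type proposal struct {
    	id           string
    	insertedTick int64 // tick at which the proposal was first inserted; kept across reproposals
    }

    type breaker struct{ tripErr error }

    func (b *breaker) trip(err error) { b.tripErr = err }

    // refreshProposals mirrors the idea described above: on every
    // tick, any proposal that has been pending for longer than the
    // threshold trips the breaker instead of merely logging a warning.
    func refreshProposals(nowTick int64, props []proposal, b *breaker) {
    	for _, p := range props {
    		if ticks := nowTick - p.insertedTick; ticks > slowProposalThresholdTicks {
    			b.trip(fmt.Errorf("proposal %s stuck for %d ticks; replica unavailable", p.id, ticks))
    		}
    	}
    }

    func main() {
    	var b breaker
    	refreshProposals(100, []proposal{{id: "put(k1)", insertedTick: 10}}, &b)
    	fmt.Println(b.tripErr)
    }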

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to
heal; roughly, the breaker heals once the Replica can again replicate a
command through Raft, either by appending to the log as the Raft leader
or by forwarding to the Raft leader. A tiny bit of special casing
allows requests made by the probe to bypass the breaker.
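
A sketch of that probe loop follows; the breaker type and
`sendProbeRequest` are stand-ins invented for this example, and the
real plumbing lives in `pkg/util/circuit` and
`replica_circuit_breaker.go`:

    package main

    import (
    	"context"
    	"errors"
    	"fmt"
    	"time"
    )

    type breaker struct{ tripped bool }

    func (b *breaker) heal() { b.tripped = false }

    // sendProbeRequest stands in for issuing a ProbeRequest through
    // the replication layer. It succeeds once the Replica can again
    // replicate a command through Raft (as the leader, or by
    // forwarding to one).
    func sendProbeRequest(ctx context.Context, attempt int) error {
    	if attempt < 3 {
    		return errors.New("no raft leader")
    	}
    	return nil
    }

    // runProbe retries the probe in the background until it succeeds
    // and then heals the breaker so regular traffic is admitted again.
    // Probe-issued requests themselves bypass the breaker (not shown).
    func runProbe(ctx context.Context, b *breaker) {
    	for attempt := 0; ; attempt++ {
    		if err := sendProbeRequest(ctx, attempt); err != nil {
    			fmt.Println("probe failed:", err)
    			select {
    			case <-time.After(10 * time.Millisecond): // backoff between attempts
    			case <-ctx.Done():
    				return
    			}
    			continue
    		}
    		b.heal()
    		fmt.Println("probe succeeded; breaker healed")
    		return
    	}
    }

    func main() {
    	b := &breaker{tripped: true}
    	runProbe(context.Background(), b)
    }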

As of this PR, all of this is disabled by default via the
`kv.replica_circuit_breakers.enabled` cluster setting while we
determine which of the many follow-up items to target for 22.1 (see
cockroachdb#74705). Primarily, though, the breaker-sensitive context
cancellation is a toy implementation that comes with too much
performance overhead (one extra goroutine per request);
cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are

- UI work (cockroachdb#74713)
- Metrics (cockroachdb#74505)

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg committed Jan 11, 2022
1 parent f68a5a7 commit 50cc08c
Showing 20 changed files with 1,003 additions and 48 deletions.
26 changes: 26 additions & 0 deletions demo.sh
@@ -0,0 +1,26 @@
#!/bin/bash

set -euo pipefail

roachprod destroy local || true
roachprod create -n 5 local

roachprod put local cockroach
roachprod start local:1-5
sleep 60
roachprod run local:1 -- ./cockroach workload init kv --splits 100
roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --duration=10s {pgurl:1-5}

roachprod sql local:1 -- -e 'set cluster setting kv.replica_circuit_breakers.enabled = true'

roachprod stop local:1,2

roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --tolerate-errors --duration=60000s {pgurl:3-5} 2> errors.txt

# TODO: sometimes we don't get 100 errors/sec but get stuck somewhere; use the
# curl below to chase down the affected ranges and look into it. Consider
# filtering on numPending > 0 to get the more interesting ones.
# curl http://localhost:26264/_status/ranges/local | jq '.ranges[] | select((.state.circuitBreakerError == "") and (.span.startKey | test("Table/56")))' | less

# TODO: problem ranges should highlight circ breakers
# TODO: careful with whole-cluster restarts (or restarts of nodes that are required for quorum); is there any KV access in the start path that will fatal the node on a hard error (such as an open circuit breaker error from somewhere)?
4 changes: 4 additions & 0 deletions pkg/kv/kvserver/BUILD.bazel
@@ -34,6 +34,7 @@ go_library(
"replica_application_state_machine.go",
"replica_backpressure.go",
"replica_batch_updates.go",
"replica_circuit_breaker.go",
"replica_closedts.go",
"replica_command.go",
"replica_consistency.go",
@@ -162,6 +163,7 @@ go_library(
"//pkg/util",
"//pkg/util/admission",
"//pkg/util/bufalloc",
"//pkg/util/circuit",
"//pkg/util/contextutil",
"//pkg/util/ctxgroup",
"//pkg/util/encoding",
@@ -221,6 +223,7 @@ go_test(
"client_rangefeed_test.go",
"client_relocate_range_test.go",
"client_replica_backpressure_test.go",
"client_replica_circuit_breaker_test.go",
"client_replica_gc_test.go",
"client_replica_test.go",
"client_spanconfigs_test.go",
@@ -374,6 +377,7 @@ go_test(
"//pkg/ts/tspb",
"//pkg/util",
"//pkg/util/caller",
"//pkg/util/circuit",
"//pkg/util/contextutil",
"//pkg/util/ctxgroup",
"//pkg/util/encoding",