kvserver: circuit-break requests to unavailable ranges #33007
Comments
We had a chat on Gitter with @mberhault. He kindly explained how a CockroachDB cluster works across multiple datacenters, and suggested this doc and this doc, which were helpful. But unfortunately, with 2 datacenters, if one DC is lost, the cluster in the other DC is not available to our services, which means no data is accessible afterwards. (The same holds for a 3-DC cluster if 2 of the 3 DCs are lost.) I think CockroachDB needs a mechanism to promote the last available DC to stay online when the others are down, whether manual or automatic. I hit issue #32524 in a single-DC, 3-node cluster and still cannot recover some rows from some tables; SQL statements on these tables wait forever. This wait-forever behaviour should not exist either: the statements should raise an exception that some ranges are not accessible, because applications are blocked by these waits. |
@tim-o please accompany this discussion |
Hey @derkan - it sounds like there are two requests here:
1. Requests to unavailable ranges should fail with an error after a timeout instead of waiting forever.
2. A cluster should be able to keep serving (e.g. by promoting the last surviving DC) even when a majority of replicas is lost.
Is that correct? If so, the first certainly seems reasonable, but the second would be a very large change. We do have ongoing work to allow single replicas to return reads when quorum can't be reached ("follower reads"), but this would not allow writes. Let me know if that would suit your use case and whether I've captured your feedback above. Thanks! |
Thanks @tim-o |
The first definitely seems uncontroversial to me. @awoods187 can you comment on the second request above? Does this fall under an existing project or is it net new / unspecced? @derkan we definitely do want to be fault tolerant, but I'd be concerned that there's no way to break a 'tie' in a world where we allow a cluster to operate with consensus from only 50% of the nodes. I'll leave it to someone more familiar with core to comment on that though. |
Falling back to some kind of "inconsistent reads" mode when quorum is lost is very difficult to implement correctly. For starters, none of the internal invariants of the SQL layer are expected to hold in that situation. |
I have requalified the issue to become a feature request for a timeout on requests to unavailable ranges. Unfortunately, our internal work backlog is a bit congested at this time, but we'll be on the lookout for this UX improvement. Thank you for the suggestion. @awoods187 @piyush-singh this request pertains to the UX of partially unavailable clusters. Maybe this fits into a larger roadmap picture you already have? |
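Until such a server-side timeout exists, one client-side mitigation (ordinary `database/sql` usage, not CockroachDB functionality) is to put a deadline on each statement so the application at least unblocks instead of waiting forever. A minimal sketch, assuming a pgwire-compatible driver and a placeholder table `t`:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // example driver choice; any pgwire-compatible driver works
)

func main() {
	// Hypothetical connection string for a local insecure node.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bound how long the statement may block, e.g. when the ranges it
	// touches are unavailable. The 5s budget is an arbitrary example value.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var n int
	if err := db.QueryRowContext(ctx, `SELECT count(*) FROM t`).Scan(&n); err != nil {
		// A deadline-exceeded error here surfaces the unavailability to the
		// application instead of blocking it indefinitely.
		log.Fatalf("query did not complete: %v", err)
	}
	fmt.Println("rows:", n)
}
```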
FWIW, this is straightforward to implement. Tentatively putting this on my plate for 2.2; it would also aid our debugging immensely. |
34296: storage: improve message on slow Raft proposal r=petermattis a=tbg

Touches #33007.

Release note: None

34589: importccl: fix flaky test TestImportCSVStmt r=rytaft a=rytaft

`TestImportCSVStmt` tests that `IMPORT` jobs appear in a certain order in the `system.jobs` table. Automatic statistics were causing this test to be flaky since `CreateStats` jobs were present in the jobs table as well, in an unpredictable order. This commit fixes the problem by only selecting `IMPORT` jobs from the jobs table.

Fixes #34568

Release note: None

34660: storage: make RaftTruncatedState unreplicated r=bdarnell a=tbg

This isn't 100% polished yet, but generally ready for review.

----

See #34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that is, log entries which are already reflected in the state of the snapshot. Fundamentally, this is because we have historically used a replicated TruncatedState.

TruncatedState essentially tells us what the first index in the log is (though it also includes a Term). If the TruncatedState cannot diverge across replicas, we *must* send the whole log in snapshots, as the first log index must match what the TruncatedState claims it is.

The Raft log is typically, but not necessarily, small. Log truncations are driven by a queue and use a complex decision process. That decision process can be faulty and even if it isn't, the queue could be held up. Besides, even when the Raft log contains only very few entries, these entries may be quite large (see SSTable ingestion during RESTORE).

All this motivates that we don't want to (be forced to) send the Raft log as part of snapshots, and in turn we need the TruncatedState to be unreplicated.

This change migrates the TruncatedState into unreplicated keyspace. It does not yet allow snapshots to avoid sending the past Raft log, but that is a relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated state into unreplicated keyspace on log truncations. The migration works as follows:

1. At any log position, the replicas of a Range either use the new (unreplicated) key or the old one, and exactly one of them exists.
2. When a log truncation evaluates under the new cluster version, it initiates the migration by deleting the old key. Under the old cluster version, it behaves like today, updating the replicated truncated state.
3. The deletion signals new code downstream of Raft and triggers a write to the new, unreplicated, key (atomic with the deletion of the old key).
4. Future log truncations don't write any replicated data any more, but (like before) send along the TruncatedState which is written downstream of Raft atomically with the deletion of the log entries. This actually uses the same code as 3. What's new is that the truncated state needs to be verified before replacing a previous one. If replicas disagree about their truncated state, it's possible for replica X at FirstIndex=100 to apply a truncated state update that sets FirstIndex to, say, 50 (proposed by a replica with a "longer" historical log). In that case, the truncated state update must be ignored (this is straightforward downstream-of-Raft code).
5. When a split trigger evaluates, it seeds the RHS with the legacy key iff the LHS uses the legacy key, and the unreplicated key otherwise. This makes sure that the invariant that all replicas agree on the state of the migration is upheld.
6. When a snapshot is applied, the receiver is told whether the snapshot contains a legacy key. If not, it writes the truncated state (which is part of the snapshot metadata) in its unreplicated version. Otherwise it doesn't have to do anything (the range will migrate later).

The following diagram visualizes the above. Note that it abuses sequence diagrams to get a nice layout; the vertical lines belonging to NewState and OldState don't imply any particular ordering of operations.

[ASCII sequence diagram: lifelines for OldState and NewState; "Bootstrap under old version" lands on OldState, "Bootstrap under new version" lands on NewState; OldState has self-messages "Log truncation under old version" and "Snapshot"; NewState has a "Snapshot" self-message; a "Log truncation under new version" arrow goes from OldState to NewState; NewState then has self-messages "Log truncation under new version" and "Log truncation under old version (necessarily running new binary)".]

Release note: None

34762: distsqlplan: fix error in union planning r=jordanlewis a=jordanlewis

Previously, if 2 inputs to a UNION ALL had identical post processing except for renders, further post processing on top of that union all could invalidate the plan and cause errors or crashes.

Fixes #34437.

Release note (bug fix): fix a planning crash during UNION ALL operations that had projections, filters or renders directly on top of the UNION ALL in some cases.

34767: sql: reuse already allocated memory for the cache in a row container r=yuzefovich a=yuzefovich

Previously, we would always allocate new memory for every row that we put in the cache of DiskBackedIndexedRowContainer and simply discard the memory underlying the row that we remove from the cache. Now, we're reusing that memory.

Release note: None

34779: opt: add stats to tpch xform test r=justinj a=justinj

Since we have stats by default now, this should be the default testing mechanism. I left in tpch-no-stats since we also have that for tpcc, just for safety.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Justin Jaffray <[email protected]>
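Step 4 above hinges on ignoring truncated-state updates that would move FirstIndex backwards. A toy sketch of that check, with hypothetical types rather than the actual kvserver code:

```go
package truncsketch

// TruncatedState mirrors the fields discussed above: the first retained log
// index and its term. This is a simplified stand-in, not the real proto.
type TruncatedState struct {
	Index uint64
	Term  uint64
}

// maybeApplyTruncatedState accepts an incoming truncated state only if it
// actually advances the local first index. A proposal from a replica with a
// "longer" historical log could otherwise move FirstIndex backwards (e.g.
// from 100 down to 50), so such updates are ignored.
func maybeApplyTruncatedState(current, proposed TruncatedState) (TruncatedState, bool) {
	if proposed.Index <= current.Index {
		// Stale or regressing update: keep the current state.
		return current, false
	}
	return proposed, true
}
```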
@nvanbenschoten also points out that with parallel commits in a single phase, even txns that have their first write to an unavailable range may not realize they should stop there. This is due to pipelining. From that pov it also makes sense to fail-fast so that these txns don't keep their unrelated intents alive. Also, our deadlock detection won't work if txns are anchored on unavailable ranges. |
This commit revamps an earlier implementation (cockroachdb#71806) of per-Replica circuit breakers (cockroachdb#33007). The earlier implementation relied on context cancellation and coarsely failed all requests addressing the Replica when the breaker was tripped. This had two downsides: First, there was a (small) performance overhead for implementing the cancellation that was paid even in the common case of a healthy Replica. Second, and more importantly, the coarseness meant that we'd potentially fail many requests that would otherwise succeed, and in particular follower reads.

@nvanbenschoten suggested in cockroachdb#74799 that latching could be extended with the concept of "poisoning" and that this could result in fine-grained circuit breaker behavior where only requests that are truly affected by unavailability (at the replication layer) would be rejected.

This commit implements that strategy: A request's latches are poisoned if its completion is predicated on the replication layer being healthy. In other words, when the breaker trips, all inflight proposals have their latches poisoned and new proposals are failed fast. However, and this is the big difference, reads can potentially still be accepted in either of two scenarios:

- a valid follower read remains valid regardless of the circuit breaker status, and also regardless of inflight proposals (follower reads don't observe latches).
- a read that can be served under the current lease and which does not conflict with any of the stuck proposals in the replication layer (= poisoned latches) can also be served.

In short, reads only fail fast if they encounter a poisoned latch or need to request a lease. (If they opted out of fail-fast behavior, they behave as today.)

Latch poisoning is added as a first-class concept in the `concurrency` package, and a structured error `PoisonError` is introduced. This error in particular contains the span and timestamp of the poisoned latch that prompted the fail-fast.

Lease proposals now always use `poison.Policy_Wait`, leaving the fail-fast behavior to the caller. This simplifies things since multiple callers with their own `poison.Policy` can multiplex onto a single inflight lease proposal.

Addresses cockroachdb#74799.

Release note: None
Release justification: 22.1 project work
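To make the fail-fast rules concrete, here is a schematic sketch of the read-admission decision described above. The names `PoisonError` and `poison.Policy` come from the commit message; everything else (the types, fields, and `admitRead` helper) is illustrative only, not the actual `concurrency` package code:

```go
package breakersketch

import "errors"

// ErrPoisoned stands in for the structured PoisonError mentioned above; the
// real error carries the span and timestamp of the poisoned latch.
var ErrPoisoned = errors.New("encountered poisoned latch")

// Policy mirrors the idea of poison.Policy: callers either fail fast on a
// poisoned latch or wait it out (as lease proposals do with Policy_Wait).
type Policy int

const (
	PolicyError Policy = iota // fail fast on poisoning
	PolicyWait                // keep waiting despite poisoning
)

// Request is a minimal stand-in for an incoming read.
type Request struct {
	IsFollowerRead  bool // valid follower reads don't observe latches at all
	NeedsLease      bool // acquiring a lease needs a healthy replication layer
	ConflictsPoison bool // overlaps a poisoned (stuck) proposal's latches
	Policy          Policy
}

// admitRead mirrors the rules above: reads only fail fast when they hit a
// poisoned latch or must request a lease, and only if they asked to fail fast.
func admitRead(r Request, breakerTripped bool) error {
	if r.IsFollowerRead {
		return nil // unaffected by breaker state and inflight proposals
	}
	if r.Policy == PolicyWait {
		return nil // opted out of fail-fast; behaves as today
	}
	if breakerTripped && (r.NeedsLease || r.ConflictsPoison) {
		return ErrPoisoned
	}
	return nil
}
```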
This adds a roachtest that runs TPCC (with a low warehouse count and `--tolerate-errors`), loses quorum half-way through, and checks the prometheus metrics of the workload for fail-fast behavior. Touches cockroachdb#33007. Release note: None Release justification: testing-only change
76858: kvserver: allow circuit-breaker to serve reads r=erikgrinaker a=tbg

This commit revamps an earlier implementation (#71806) of per-Replica circuit breakers (#33007). The earlier implementation relied on context cancellation and coarsely failed all requests addressing the Replica when the breaker was tripped. This had two downsides: First, there was a (small) performance overhead for implementing the cancellation that was paid even in the common case of a healthy Replica. Second, and more importantly, the coarseness meant that we'd potentially fail many requests that would otherwise succeed, and in particular follower reads.

`@nvanbenschoten` suggested in #74799 that latching could be extended with the concept of "poisoning" and that this could result in fine-grained circuit breaker behavior where only requests that are truly affected by unavailability (at the replication layer) would be rejected.

This commit implements that strategy: A request's latches are poisoned if its completion is predicated on the replication layer being healthy. In other words, when the breaker trips, all inflight proposals have their latches poisoned and new proposals are failed fast. However, and this is the big difference, reads can potentially still be accepted in either of two scenarios:

- a valid follower read remains valid regardless of the circuit breaker status, and also regardless of inflight proposals (follower reads don't observe latches).
- a read that can be served under the current lease and which does not conflict with any of the stuck proposals in the replication layer (= poisoned latches) can also be served.

In short, reads only fail fast if they encounter a poisoned latch or need to request a lease. (If they opted out of fail-fast behavior, they behave as today.)

Latch poisoning is added as a first-class concept in the `concurrency` package, and a structured error `PoisonError` is introduced. This error in particular contains the span and timestamp of the poisoned latch that prompted the fail-fast.

Lease proposals now always use `poison.Policy_Wait`, leaving the fail-fast behavior to the caller. This simplifies things since multiple callers with their own `poison.Policy` can multiplex onto a single inflight lease proposal.

Addresses #74799.

Release note: None
Release justification: 22.1 project work

77221: changefeedccl: Fix flaky test. r=miretskiy a=miretskiy

Fix flaky TestChangefeedHandlesDrainingNodes test. The source of the flake was that cluster setting updates propagate asynchronously to the other nodes in the cluster. Thus, it was possible for the test to flake because some of the nodes were observing the old value for the setting. The flake is fixed by introducing a testing utility function that sets the setting and ensures the setting propagates to all nodes in the test cluster.

Fixes #76806

Release Notes: none

77317: util/log: remove Safe in favor of redact.Safe r=yuzefovich a=yuzefovich

My desire to make this change is to break the dependency of `treewindow` on `util/log` (which is a part of the effort to clean up the dependencies of `execgen`).

Addresses: #77234.

Release note: None
Release justification: low risk change to clean up the dependencies a bit.

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
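The fix for the flaky changefeed test above (77221) boils down to setting a cluster setting and then waiting until every node reports the new value. A generic sketch of that idea over plain `database/sql` connections (not the actual test helper; the 45-second budget is arbitrary and the setting name/value come from the test, not user input):

```go
package testsketch

import (
	"database/sql"
	"fmt"
	"time"
)

// setClusterSettingEverywhere applies a cluster setting via the first
// connection and then polls every node until it reports the new value,
// bounding the overall wait.
func setClusterSettingEverywhere(conns []*sql.DB, name, value string) error {
	if len(conns) == 0 {
		return fmt.Errorf("no connections supplied")
	}
	if _, err := conns[0].Exec(
		fmt.Sprintf("SET CLUSTER SETTING %s = '%s'", name, value),
	); err != nil {
		return err
	}
	deadline := time.Now().Add(45 * time.Second)
	for _, conn := range conns {
		for {
			var got string
			if err := conn.QueryRow("SHOW CLUSTER SETTING " + name).Scan(&got); err != nil {
				return err
			}
			if got == value {
				break // this node has observed the update
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("setting %s did not propagate in time", name)
			}
			time.Sleep(100 * time.Millisecond)
		}
	}
	return nil
}
```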
`ReplicaUnavailableError` was previously a leaf error, so the only way to attach additional information "on the way out" was to wrap *it* with wrapper errors. This made it difficult to read the messages since the result and cause were reversed. This became exacerbated with the recent addition of `PoisonedError`, which should really be a cause of `ReplicaUnavailableError` too.

In this commit, we make ReplicaUnavailableError a wrapper error and rearrange breaker errors such that if a PoisonError occurs, it is a cause of the ReplicaUnavailableError. The change in `pkg/kv/kvserver/testdata/replica_unavailable_error.txt` illustrates the improved error users will see (and which will thus be reported to us).

As a wrapping error, I needed to register the error with `cockroachdb/errors` to allow for proper encoding/decoding. I'm unsure whether this worked properly before, but now it definitely does (as it is tested). Testing was improved to check for presence of `ReplicaUnavailableError` in all breaker errors.

Touches cockroachdb#33007.

Release justification: UX improvement for existing functionality
Release note: None
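For readers unfamiliar with the leaf-vs-wrapper distinction, here is a minimal sketch using only the standard library's wrapping conventions. The real types live in `roachpb` and are additionally registered with `cockroachdb/errors` for network encoding, which this sketch omits:

```go
package errsketch

import (
	"errors"
	"fmt"
)

// replicaUnavailableError is a stand-in for the wrapper error described
// above: it wraps its cause instead of being a leaf that others wrap.
type replicaUnavailableError struct {
	desc  string // e.g. a replica descriptor, simplified to a string here
	cause error
}

func (e *replicaUnavailableError) Error() string {
	return fmt.Sprintf("replica unavailable: %s: %v", e.desc, e.cause)
}

// Unwrap exposes the cause so errors.Is/As can look through the wrapper.
func (e *replicaUnavailableError) Unwrap() error { return e.cause }

// poisonError stands in for the PoisonError mentioned above.
var poisonError = errors.New("poisoned latch")

func example() {
	err := &replicaUnavailableError{desc: "r42/1", cause: poisonError}
	// With the wrapper arrangement, the message reads result-then-cause, and
	// the cause remains programmatically reachable:
	fmt.Println(err)                         // replica unavailable: r42/1: poisoned latch
	fmt.Println(errors.Is(err, poisonError)) // true
}
```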
It's usually 1, but it can't hurt to detect cases in which we're looping around. This came up when writing tests for cockroachdb#33007. Release note: None
`RangeFeed`s are long-running operations with a high fixed cost (catch-up scan). They can fall behind for various reasons, replicas becoming unavailable being one of them. However, when a replica's circuit breaker trips, this should not abort any pending (or new, for that matter) RangeFeeds. Add a test that asserts that this is the case. Touches cockroachdb#33007. Touches cockroachdb#76146. Release note: None
While running the nightly roachtest suite with circuit breakers enabled (cockroachdb#76146) I observed undesirable interactions between circuit breakers and bulk operations.

Bulk operations tend to overload the cluster. When this happens, circuit breakers may fire "spuriously" (at the very least if liveness is lost across multiple nodes for some time), which returns hard errors to the bulk job. The job will then roll back any work it has done, which may also fail due to the circuit breaker (leaving the job in limbo).

For long-running bulk jobs, it seems desirable to bypass the circuit breaker entirely. When unavailability occurs, the job will not be able to make progress, but it should be left to the operator whether to cancel it or not; after all, once the outage resolves the job may be able to resume normally.

This PR adds machinery that allows request types to bypass circuit breakers. Concretely, any batch that contains any of the following request types bypasses the breaker:

- Export
- AddSSTable
- RevertRange
- ClearRange
- GC
- Probe

Touches cockroachdb#33007.

Release note: None
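A schematic sketch of that bypass check, with a hypothetical enum standing in for the actual `roachpb` request types:

```go
package bypasssketch

// requestKind enumerates just the request types named above; the real code
// inspects roachpb request methods, which this sketch does not depend on.
type requestKind int

const (
	exportKind requestKind = iota
	addSSTableKind
	revertRangeKind
	clearRangeKind
	gcKind
	probeKind
	otherKind
)

// bypassesCircuitBreaker reports whether a batch containing these request
// kinds should skip the replica circuit breaker, per the policy above: bulk
// work opts out so that an outage pauses the job rather than failing it.
func bypassesCircuitBreaker(kinds []requestKind) bool {
	for _, k := range kinds {
		switch k {
		case exportKind, addSSTableKind, revertRangeKind, clearRangeKind, gcKind, probeKind:
			return true
		}
	}
	return false
}
```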
77365: roachpb,kvserver: make ReplicaUnavailableError a wrapping error r=erikgrinaker a=tbg

`ReplicaUnavailableError` was previously a leaf error, so the only way to attach additional information "on the way out" was to wrap *it* with wrapper errors. This made it difficult to read the messages since the result and cause were reversed. This became exacerbated with the recent addition of `PoisonedError`, which should really be a cause of `ReplicaUnavailableError` too.

In this commit, we make ReplicaUnavailableError a wrapper error and rearrange breaker errors such that if a PoisonError occurs, it is a cause of the ReplicaUnavailableError. The change in `pkg/kv/kvserver/testdata/replica_unavailable_error.txt` illustrates the improved error users will see (and which will thus be reported to us).

As a wrapping error, I needed to register the error with `cockroachdb/errors` to allow for proper encoding/decoding. I'm unsure whether this worked properly before, but now it definitely does (as it is tested). Testing was improved to check for presence of `ReplicaUnavailableError` in all breaker errors.

Touches #33007.

Release justification: UX improvement for existing functionality
Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
76989: importccl: import avro files with logical time types r=HonoreDB,stevendanna a=msbutler

Previously, the user had to stringify all time cols in their avro file before importing into a CRDB cluster. This change allows avro columns with the following avro types to be imported directly: long.time-micros, int.time-millis, long.timestamp-micros, long.timestamp-millis, and int.date. Notably, we still don't allow local-timestamp-micros and local-timestamp-millis to be imported directly because the avro file reader we use, goAvro, does not support them. If there's demand, future work could be done to support importing the decimal avro type, the only other avro logical type goAvro supports.

Release justification: low risk, high benefit changes to existing functionality

Release note (sql change): IMPORT INTO with AVRO now supports avro files with the following avro types: long.time-micros, int.time-millis, long.timestamp-micros, long.timestamp-millis, and int.date. This feature only works if the user has created a crdb table with column types that match certain avro types. Specifically:

AVRO        | CRDB
time-*      | TIME
timestamp_* | TIMESTAMP
date        | DATE

77360: roachpb: modernize AmbiguousResultError r=erikgrinaker a=tbg

AmbiguousResultError previously had a `WrappedErr *Error` field. Wrapping `Error` inside of itself is a really bad idea! Luckily, it wasn't used by anyone, but it is being used now, with replica circuit breakers. Before we cement this in a production release, give this error type a do-over and let it "properly" wrap an error.

I performed an audit to make sure we were always checking for ambiguous results first, i.e. looked for bugs of the form

```go
switch err.(type) {
case *roachpb.SomeRetriableError:
	// Retry request.
	continue
case *AmbiguousResultError:
	return err
}
```

where `err` might be an `AmbiguousResultError` wrapping a `NotRetriableError`. I didn't find anything; we always carry out the ambiguous result check first. Also, very few errors are wrapped in an ambiguous result, and most of them could actually take precedence. If for example an ambiguous result wraps a lease error, the result is not ambiguous. We don't try to exploit that at all in this commit, and likely never will since this isn't a common situation.

Release justification: improvement to new functionality (#33007)
Release note: None

77416: kvserver: change invariant checking in raftLogQueue r=erikgrinaker a=sumeerbhola

The meaning of truncateDecisionIndex.FirstIndex has changed in the presence of loosely coupled truncation. We now consider the raft log to be empty for purposes of truncation when [FirstIndex,LastIndex] is an empty interval.

Fixes #77394

Release justification: Bug fix by changing an assertion that was no longer correct.
Release note: None

77437: sql/catalog/lease: update a comment r=ajwerner a=ajwerner

The comment referenced a variable which has since been renamed.

Release justification: non-production code change
Release note: None

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
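The audit described for 77360 looks for code that checks retriable errors before ambiguous ones. A minimal sketch of the correct ordering, with placeholder errors instead of the real `roachpb` types:

```go
package retrysketch

import "errors"

// Placeholder errors standing in for the roachpb error types in the text above.
var (
	errAmbiguous = errors.New("ambiguous result")
	errRetriable = errors.New("retriable")
)

// classify checks for an ambiguous result *before* any retriable error, since
// an ambiguous result may wrap a retriable cause; retrying in that case could
// re-apply a write whose outcome is unknown.
func classify(err error) (retry bool, retErr error) {
	if errors.Is(err, errAmbiguous) {
		return false, err // surface the ambiguity; never retry blindly
	}
	if errors.Is(err, errRetriable) {
		return true, nil
	}
	return false, err
}
```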
Replica circuit breakers are enabled on master and release-22.1. There are follow-up issues filed for future extensions, see above. |
This adds a roachtest that runs TPCC (with a low warehouse count and `--tolerate-errors`), loses quorum half-way through, and checks the prometheus metrics of the workload for fail-fast behavior. It also checks that follower reads and bounded staleness reads work as expected at timestamps at which the cluster was known to have been healthy. Touches cockroachdb#33007. Release note: None
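For reference, the follower-read and bounded-staleness checks amount to queries like the ones below. This is a client-side sketch assuming a `database/sql` connection, a placeholder table `t(k, v)`, and an arbitrary 10-second staleness bound; `follower_read_timestamp()` and `with_max_staleness()` are existing CockroachDB builtins, but the surrounding code is illustrative only:

```go
package readsketch

import (
	"database/sql"
	"fmt"
)

// exampleHistoricalReads issues the two flavors of reads the test checks:
// a follower read and a bounded-staleness read.
func exampleHistoricalReads(db *sql.DB) error {
	var count int
	// Follower read: an exact read at a timestamp old enough to be served by
	// any replica, so it can succeed even while the range cannot accept writes.
	if err := db.QueryRow(
		`SELECT count(*) FROM t AS OF SYSTEM TIME follower_read_timestamp()`,
	).Scan(&count); err != nil {
		return err
	}

	var v string
	// Bounded-staleness read: serve the freshest locally available timestamp,
	// but never one more than 10 seconds stale. Must be a point lookup.
	if err := db.QueryRow(
		`SELECT v FROM t AS OF SYSTEM TIME with_max_staleness('10s') WHERE k = 1`,
	).Scan(&v); err != nil {
		return err
	}
	fmt.Println("historical reads succeeded:", count, v)
	return nil
}
```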
Describe the problem
I'm testing a multi-region deployment. I've used the performance tuning docs and installed a multi-datacenter cluster. The DC names are `DCPRI` and `DCSEC`, with 3 virtual machines in each. When I shut down the systems in `DCSEC`, the cluster in `DCPRI` is unresponsive: no SQL statements work, they just wait. The admin GUI also times out. Is this because of my configuration, a bug, or expected behaviour?
Here are the start parameters of the cockroach instances:
Logs from the nodes in DCPRI
DCPRI/Node1:
DCPRI/Node4:
DCPRI/Node3:
Epic: CRDB-2553
Jira issue: CRDB-6349