kvserver: consider additional triggers for replica circuit breakers #74712

Closed

tbg opened this issue Jan 11, 2022 · 1 comment
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), no-issue-activity, X-stale
tbg commented Jan 11, 2022

Touches #33007.

Is your feature request related to a problem? Please describe.

As of #71806, the per-replica circuit breaker trips when a proposal is in-flight for more than 15s. However, there are other scenarios in which tripping the probe may be worthwhile. For example, a read-only command may have deadlocked. It would thus never release its latches, possibly blocking all write traffic to the range.
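For concreteness, a deadlocked-latch trigger could look roughly like the following. This is only a sketch with made-up types and fields (`replicaState`, `latchAcquiredAt`), not the actual kvserver API:

```go
package sketch

import (
	"fmt"
	"time"
)

// replicaState is a stand-in for the per-replica bookkeeping that kvserver
// keeps; the field names are illustrative, not the real ones.
type replicaState struct {
	latchAcquiredAt time.Time // zero value: no latch currently held
	breakerTripped  bool
}

// maybeTripForStuckLatches trips the breaker when a latch has been held longer
// than the threshold, which would catch e.g. a deadlocked read-only command
// that never releases its latches and so blocks all writes to the range.
func maybeTripForStuckLatches(r *replicaState, now time.Time, threshold time.Duration) {
	if r.breakerTripped || r.latchAcquiredAt.IsZero() {
		return
	}
	if held := now.Sub(r.latchAcquiredAt); held > threshold {
		r.breakerTripped = true
		fmt.Printf("tripping breaker: latch held for %s (> %s)\n", held, threshold)
	}
}
```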

The current trigger may also never fire. If there isn't a lease and the client uses a statement timeout of less than 15s throughout, lease proposals (which get their own reproposal treatment, #74711) never stick around long enough to trip the breaker. If there is a lease, the actual client write will sit in the proposals map and stay there even if the client gives up, so in that case the trigger works, or should, anyway; most client traffic consists of pipelined writes, so this is also the common case, in some sense.

Another scenario is when a node goes non-live, an event associated with potential loss of quorum. Each Store could iterate through its replicas and, based on the active descriptor, trip the breaker when it expects the range (and thus the replica) to now be unavailable. This would generally move affected replicas into the desirable fail-fast regime more quickly; at the time of writing (#71806), a replica only fails fast 15s after it has received a request that got stuck. Since most ranges receive only sporadic writes, clients would see 15+s latencies for some time, which is not the UX we're ultimately after.
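A rough sketch of that store-level scan, again with illustrative types (`storeState`, `rangeState`) rather than the real Store/Replica/descriptor plumbing:

```go
package sketch

// storeState and rangeState are illustrative stand-ins; the real kvserver
// types (Store, Replica, the range descriptor) look quite different.
type storeState struct {
	live     map[int]bool // node liveness as last observed by this store
	replicas []*rangeState
}

type rangeState struct {
	voterNodes     []int // node IDs of voting replicas per the active descriptor
	breakerTripped bool
}

// onNodeNonLive walks the store's replicas when a node is detected as non-live
// and trips the breaker on any range that, per its active descriptor, is now
// expected to have lost quorum.
func onNodeNonLive(s *storeState, deadNode int) {
	s.live[deadNode] = false
	for _, r := range s.replicas {
		liveVoters := 0
		for _, nodeID := range r.voterNodes {
			if s.live[nodeID] {
				liveVoters++
			}
		}
		if 2*liveVoters <= len(r.voterNodes) { // majority quorum lost
			r.breakerTripped = true
		}
	}
}
```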

As more triggers are introduced, the risk of false positives (i.e. tripping the breaker of a healthy replica) increases. We could consider a semantic change: instead of outright tripping the breaker, a trigger only kicks off the breaker's probe. If the probe passes, all is well and the breaker never trips; if the probe fails, it trips the breaker.
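Sketched out, the probe-first semantics could look like this (hypothetical `breakerState`; the real breaker and probe plumbing differ):

```go
package sketch

import "sync"

// breakerState is an illustrative stand-in for the replica circuit breaker.
type breakerState struct {
	mu      sync.Mutex
	tripErr error // non-nil once the breaker has tripped
}

func (b *breakerState) trip(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tripErr == nil {
		b.tripErr = err
	}
}

// triggerProbe implements the probe-first semantics: a trigger no longer trips
// the breaker directly, it only launches the probe; the breaker trips only if
// the probe itself fails.
func triggerProbe(b *breakerState, probe func() error) {
	go func() {
		if err := probe(); err != nil {
			b.trip(err) // probe failed: the replica really is unhealthy
		}
		// Probe passed: false positive, the breaker never trips.
	}()
}
```

This keeps false positives cheap: a spurious trigger costs one probe rather than an availability blip for the range.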

Related (figuring out what "slow" request means): #74509
Related (more coverage for the probe): #74701
Related (limits of per-Replica circuit breakers): #74616 #74503

Jira issue: CRDB-12225

@tbg added the C-enhancement label Jan 11, 2022
@tbg self-assigned this Jan 14, 2022
nicktrav added a commit to nicktrav/cockroach that referenced this issue May 5, 2022
Currently, Pebble will emit a fatal (or error, if configured) log event
when a single write or sync operation exceeds the `MaxSyncDuration`. By
default, this value is set to `60s`, but it can be configured with the
`storage.max_sync_duration` setting.

Recent incidents have demonstrated that the current default value is
most likely too high. For example, stalled disk operations that prevent
a node from heartbeating within 4.5 seconds will result in the node
shedding all leases. Failing faster in this case is desirable.

There also exist situations in which stalled disk operations on a single
node can adversely affect throughput for an entire cluster (see
cockroachlabs/support#1571 and cockroachlabs/support#1564). Lowering the
timeout improves the recovery time.

Lower the default value to `20s` to strike a balance between crashing
the process earlier in the event of a hardware failure (hard or soft)
and allowing ample time for a slow disk operation to clear in the
transient case.

Update the corresponding value in the logging package.

Release note (ops change): The default value for
`storage.max_sync_duration` has been lowered from `60s` to `20s`.
Cockroach will exit sooner with a fatal error if a single slow disk
operation exceeds this value.

Touches cockroachdb#80942, cockroachdb#74712.
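
For illustration, the behavior this commit tunes amounts to a watchdog along these lines (a simplified sketch, not Pebble's actual implementation, which instruments more operations than a single `Sync`):

```go
package sketch

import (
	"log"
	"os"
	"time"
)

// syncWithWatchdog times a sync and treats anything that exceeds
// maxSyncDuration as a fatal disk stall, mirroring the behavior the commit
// message describes (the real detector lives in Pebble, not here).
func syncWithWatchdog(f *os.File, maxSyncDuration time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- f.Sync() }()
	select {
	case err := <-done:
		return err
	case <-time.After(maxSyncDuration):
		// With the new default, a stalled disk takes the node down after 20s
		// rather than 60s, shrinking the window of cluster-wide impact.
		log.Fatalf("disk stall: sync exceeded %s (storage.max_sync_duration)", maxSyncDuration)
		return nil // unreachable
	}
}
```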
craig bot pushed a commit that referenced this issue May 18, 2022
81075: storage,log: reduce max sync duration default timeouts r=erikgrinaker,sumeerbhola a=nicktrav

Touches #80942, #74712.

81468: docs: clarify descriptions for tracing cluster settings r=andreimatei,arulajmani a=michae2

The descriptions for sql.trace.txn.enable_threshold and
sql.trace.stmt.enable_threshold suggested that tracing was only enabled
for transactions and statements longer than the threshold duration. This
is not true, however: tracing is enabled for everything, and only
_logged_ for transactions and statements longer than the threshold
duration. Make this clear in the descriptions. Also make the description
for sql.trace.session_eventlog.enabled a "single" sentence to match the
style of other descriptions.

Release note: None

Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
blathers-crl bot pushed a commit that referenced this issue May 19, 2022
Touches #80942, #74712.
@tbg removed their assignment May 31, 2022
github-actions bot commented

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions bot closed this as not planned Dec 11, 2023