kvserver: consider additional triggers for replica circuit breakers #74712

Closed

tbg opened this issue Jan 11, 2022 · 1 comment
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), no-issue-activity, X-stale
tbg commented Jan 11, 2022

Touches #33007.

Is your feature request related to a problem? Please describe.

As of #71806, the per-replica circuit breaker trips when a proposal is in-flight for more than 15s. However, there are other scenarios in which tripping the probe may be worthwhile. For example, a read-only command may have deadlocked. It would thus never release its latches, possibly blocking all write traffic to the range.
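For concreteness, a deadlocked-latch trigger could look roughly like the following. This is only a sketch with made-up types and fields (`replicaState`, `latchAcquiredAt`), not the actual kvserver API:

```go
package sketch

import (
	"fmt"
	"time"
)

// replicaState is a stand-in for the per-replica bookkeeping that kvserver
// keeps; the field names are illustrative, not the real ones.
type replicaState struct {
	latchAcquiredAt time.Time // zero value: no latch currently held
	breakerTripped  bool
}

// maybeTripForStuckLatches trips the breaker when a latch has been held longer
// than the threshold, which would catch e.g. a deadlocked read-only command
// that never releases its latches and so blocks all writes to the range.
func maybeTripForStuckLatches(r *replicaState, now time.Time, threshold time.Duration) {
	if r.breakerTripped || r.latchAcquiredAt.IsZero() {
		return
	}
	if held := now.Sub(r.latchAcquiredAt); held > threshold {
		r.breakerTripped = true
		fmt.Printf("tripping breaker: latch held for %s (> %s)\n", held, threshold)
	}
}
```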

The current trigger may also never fire. If there isn't a lease and the client uses a statement timeout of less than 15s throughout, lease proposals (which get their own reproposal treatment, #74711) never stick around long enough to trip the breaker. If there is a lease, the actual client write will sit in the proposals map and stay there even if the client gives up, so in that case the trigger works, or should, anyway; most client traffic consists of pipelined writes, so this is also the common case, in some sense.

Another scenario is when a node goes non-live, an event associated with potential loss of quorum. Each Store could iterate through its replicas and, based on the active descriptor, trip the breaker when it expects the range (and thus the replica) to now be unavailable. This would generally move affected replicas into the desirable fail-fast regime more quickly; at the time of writing (#71806), a replica only fails fast 15s after it has received a request that got stuck. Since most ranges receive only sporadic writes, clients would see 15+s latencies for some time, which is not the UX we're ultimately after.
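A rough sketch of that store-level scan, again with illustrative types (`storeState`, `rangeState`) rather than the real Store/Replica/descriptor plumbing:

```go
package sketch

// storeState and rangeState are illustrative stand-ins; the real kvserver
// types (Store, Replica, the range descriptor) look quite different.
type storeState struct {
	live     map[int]bool // node liveness as last observed by this store
	replicas []*rangeState
}

type rangeState struct {
	voterNodes     []int // node IDs of voting replicas per the active descriptor
	breakerTripped bool
}

// onNodeNonLive walks the store's replicas when a node is detected as non-live
// and trips the breaker on any range that, per its active descriptor, is now
// expected to have lost quorum.
func onNodeNonLive(s *storeState, deadNode int) {
	s.live[deadNode] = false
	for _, r := range s.replicas {
		liveVoters := 0
		for _, nodeID := range r.voterNodes {
			if s.live[nodeID] {
				liveVoters++
			}
		}
		if 2*liveVoters <= len(r.voterNodes) { // majority quorum lost
			r.breakerTripped = true
		}
	}
}
```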

As more triggers are introduced, the risk of false positives (i.e. tripping the breaker of a healthy replica) increases. We could consider a semantic change: instead of outright tripping the breaker, a trigger only kicks off the breaker's probe. If the probe passes, all is well and the breaker never trips; if the probe fails, it trips the breaker.
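Sketched out, the probe-first semantics could look like this (hypothetical `breakerState`; the real breaker and probe plumbing differ):

```go
package sketch

import "sync"

// breakerState is an illustrative stand-in for the replica circuit breaker.
type breakerState struct {
	mu      sync.Mutex
	tripErr error // non-nil once the breaker has tripped
}

func (b *breakerState) trip(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tripErr == nil {
		b.tripErr = err
	}
}

// triggerProbe implements the probe-first semantics: a trigger no longer trips
// the breaker directly, it only launches the probe; the breaker trips only if
// the probe itself fails.
func triggerProbe(b *breakerState, probe func() error) {
	go func() {
		if err := probe(); err != nil {
			b.trip(err) // probe failed: the replica really is unhealthy
		}
		// Probe passed: false positive, the breaker never trips.
	}()
}
```

This keeps false positives cheap: a spurious trigger costs one probe rather than an availability blip for the range.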

Related (figuring out what "slow" request means): #74509
Related (more coverage for the probe): #74701
Related (limits of per-Replica circuit breakers): #74616 #74503

Jira issue: CRDB-12225

@tbg added the C-enhancement label Jan 11, 2022
@tbg self-assigned this Jan 14, 2022
nicktrav added a commit to nicktrav/cockroach that referenced this issue May 5, 2022
Currently, Pebble will emit a fatal (or error, if configured) log event
when a single write or sync operation exceeds the `MaxSyncDuration`. By
default, this value is set to `60s`, but it can be configured with the
`storage.max_sync_duration` setting.

Recent incidents have demonstrated that the current default value is
most likely too high. For example, stalled disk operations that prevent
a node from heartbeating within 4.5 seconds will result in the node
shedding all leases. Failing faster in this case is desirable.

There also exist situations in which stalled disk operations on a single
node can adversely affect throughput for an entire cluster (see
cockroachlabs/support#1571 and cockroachlabs/support#1564). Lowering the
timeout improves the recovery time.

Lower the default value to `20s` to strike a balance between crashing
the process earlier in the event of a hardware failure (hard or soft)
and allowing ample time for a slow disk operation to clear in the
transient case.

Update the corresponding value in the logging package.

Release note (ops change): The default value for
`storage.max_sync_duration` has been lowered from `60s` to `20s`.
Cockroach will exit sooner with a fatal error if a single slow disk
operation exceeds this value.

Touches cockroachdb#80942, cockroachdb#74712.
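
For illustration, the behavior this commit tunes amounts to a watchdog along these lines (a simplified sketch, not Pebble's actual implementation, which instruments more operations than a single `Sync`):

```go
package sketch

import (
	"log"
	"os"
	"time"
)

// syncWithWatchdog times a sync and treats anything that exceeds
// maxSyncDuration as a fatal disk stall, mirroring the behavior the commit
// message describes (the real detector lives in Pebble, not here).
func syncWithWatchdog(f *os.File, maxSyncDuration time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- f.Sync() }()
	select {
	case err := <-done:
		return err
	case <-time.After(maxSyncDuration):
		// With the new default, a stalled disk takes the node down after 20s
		// rather than 60s, shrinking the window of cluster-wide impact.
		log.Fatalf("disk stall: sync exceeded %s (storage.max_sync_duration)", maxSyncDuration)
		return nil // unreachable
	}
}
```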
craig bot pushed a commit that referenced this issue May 18, 2022
81075: storage,log: reduce max sync duration default timeouts r=erikgrinaker,sumeerbhola a=nicktrav

Touches #80942, #74712.

81468: docs: clarify descriptions for tracing cluster settings r=andreimatei,arulajmani a=michae2

The descriptions for sql.trace.txn.enable_threshold and
sql.trace.stmt.enable_threshold suggested that tracing was only enabled
for transactions and statements longer than the threshold duration. This
is not true, however: tracing is enabled for everything, and only
_logged_ for transactions and statements longer than the threshold
duration. Make this clear in the descriptions. Also make the description
for sql.trace.session_eventlog.enabled a "single" sentence to match the
style of other descriptions.

Release note: None

Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
blathers-crl bot pushed a commit that referenced this issue May 19, 2022
Touches #80942, #74712.
@tbg removed their assignment May 31, 2022
github-actions bot commented

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions bot closed this as not planned Dec 11, 2023