kvserver: consider additional triggers for replica circuit breakers #74712
Labels
C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception)
no-issue-activity
X-stale
Comments
tbg added the C-enhancement label on Jan 11, 2022
nicktrav added a commit to nicktrav/cockroach that referenced this issue on May 5, 2022:
Currently, Pebble will emit a fatal (or error, if configured) log event when a single write or sync operation exceeds `MaxSyncDuration`. By default, this value is set to `60s`, but it can be configured with the `storage.max_sync_duration` setting.

Recent incidents have demonstrated that the current default is most likely too high. For example, stalled disk operations that prevent a node from heartbeating within 4.5 seconds will result in the node shedding all of its leases; failing faster in this case is desirable. There also exist situations in which stalled disk operations on a single node can adversely affect throughput for an entire cluster (see cockroachlabs/support#1571 and cockroachlabs/support#1564). Lowering the timeout improves the recovery time.

Lower the default value to `20s`, striking a balance between crashing the process earlier in the event of a hardware failure (hard or soft) and allowing ample time for a slow disk operation to clear in the transient case. Update the corresponding value in the logging package.

Release note (ops change): The default value for `storage.max_sync_duration` has been lowered from `60s` to `20s`. Cockroach will exit sooner with a fatal error if a single slow disk operation exceeds this value.

Touches cockroachdb#80942, cockroachdb#74712.
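To illustrate the fail-fast behavior this commit tunes, here is a minimal Go sketch of a sync watchdog. It is not the actual Pebble implementation; `maxSyncDuration` and `syncWithWatchdog` are hypothetical stand-ins for the behavior governed by `storage.max_sync_duration`.

```go
package main

import (
	"log"
	"time"
)

// maxSyncDuration stands in for the storage.max_sync_duration cluster
// setting; 20s is the new default described in this commit.
var maxSyncDuration = 20 * time.Second

// syncWithWatchdog runs a disk sync and terminates the process with a
// fatal error if the operation stalls for longer than maxSyncDuration.
// This mirrors the fail-fast behavior described above; the real Pebble
// code is structured differently.
func syncWithWatchdog(sync func() error) error {
	done := make(chan error, 1)
	go func() { done <- sync() }()
	select {
	case err := <-done:
		return err
	case <-time.After(maxSyncDuration):
		log.Fatalf("disk stall detected: sync exceeded %s", maxSyncDuration)
		return nil // unreachable
	}
}

func main() {
	// Simulate a fast sync; a stalled one would trip the fatal above.
	if err := syncWithWatchdog(func() error {
		time.Sleep(10 * time.Millisecond)
		return nil
	}); err != nil {
		log.Println("sync error:", err)
	}
}
```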
craig bot pushed a commit that referenced this issue on May 18, 2022:
81075: storage,log: reduce max sync duration default timeouts r=erikgrinaker,sumeerbhola a=nicktrav

Currently, Pebble will emit a fatal (or error, if configured) log event when a single write or sync operation exceeds `MaxSyncDuration`. By default, this value is set to `60s`, but it can be configured with the `storage.max_sync_duration` setting.

Recent incidents have demonstrated that the current default is most likely too high. For example, stalled disk operations that prevent a node from heartbeating within 4.5 seconds will result in the node shedding all of its leases; failing faster in this case is desirable. There also exist situations in which stalled disk operations on a single node can adversely affect throughput for an entire cluster (see cockroachlabs/support#1571 and cockroachlabs/support#1564). Lowering the timeout improves the recovery time.

Lower the default value to `20s`, striking a balance between crashing the process earlier in the event of a hardware failure (hard or soft) and allowing ample time for a slow disk operation to clear in the transient case. Update the corresponding value in the logging package.

Release note (ops change): The default value for `storage.max_sync_duration` has been lowered from `60s` to `20s`. Cockroach will exit sooner with a fatal error if a single slow disk operation exceeds this value.

Touches #80942, #74712.

81468: docs: clarify descriptions for tracing cluster settings r=andreimatei,arulajmani a=michae2

The descriptions for sql.trace.txn.enable_threshold and sql.trace.stmt.enable_threshold suggested that tracing was only enabled for transactions and statements longer than the threshold duration. This is not true, however: tracing is enabled for everything, and only _logged_ for transactions and statements longer than the threshold duration. Make this clear in the descriptions. Also make the description for sql.trace.session_eventlog.enabled a single sentence to match the style of the other descriptions.

Release note: None

Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
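The tracing clarification in #81468 boils down to: the trace is always collected, and the threshold only gates whether it is logged. A rough Go sketch of that semantic, with `stmtTraceThreshold` and `runTraced` as illustrative stand-ins for `sql.trace.stmt.enable_threshold` (not the actual CockroachDB code):

```go
package main

import (
	"log"
	"time"
)

// stmtTraceThreshold stands in for sql.trace.stmt.enable_threshold as
// clarified above: tracing is collected for every statement, but the
// recording is only logged when execution exceeds the threshold.
var stmtTraceThreshold = 100 * time.Millisecond

func runTraced(stmt string, run func()) {
	start := time.Now()
	// In the real system a full trace recording is collected here
	// regardless of the threshold; we model it as a string.
	recording := "trace recording for: " + stmt
	run()
	if elapsed := time.Since(start); elapsed > stmtTraceThreshold {
		// Only slow statements have their (already collected)
		// trace emitted to the logs.
		log.Printf("slow statement (%s): %s", elapsed, recording)
	}
}

func main() {
	runTraced("SELECT 1", func() { time.Sleep(10 * time.Millisecond) })
	runTraced("SELECT pg_sleep(0.2)", func() { time.Sleep(200 * time.Millisecond) })
}
```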
blathers-crl bot pushed a commit that referenced this issue on May 19, 2022 (same commit message as above).
We have marked this issue as stale because it has been inactive for an extended period.
Touches #33007.
Is your feature request related to a problem? Please describe.
As of #71806, the per-replica circuit breaker trips when a proposal has been in-flight for more than 15s. However, there are other scenarios in which tripping the breaker may be worthwhile. For example, a read-only command may have deadlocked; it would then never release its latches, possibly blocking all write traffic to the range.
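A hedged sketch of what such an additional trigger could look like. All names here (`replica`, `inflightSince`, `maybeTrip`) are illustrative inventions, not the actual kvserver code; the idea is simply to scan in-flight commands, including read-only ones holding latches, and trip the breaker for any that exceed the threshold:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// breaker is a stand-in for the per-replica circuit breaker; the real
// implementation lives elsewhere in CockroachDB.
type breaker struct{ tripped error }

func (b *breaker) Trip(err error) { b.tripped = err }

// replica models only what this sketch needs: when each in-flight
// command (write proposal or read holding latches) started.
type replica struct {
	inflightSince map[string]time.Time
	br            breaker
}

// maybeTrip is the kind of additional trigger proposed here: trip the
// breaker not only for slow write proposals, but for any command --
// including a read-only one -- that has held its latches too long.
func (r *replica) maybeTrip(now time.Time, threshold time.Duration) {
	for cmd, since := range r.inflightSince {
		if now.Sub(since) > threshold {
			r.br.Trip(errors.New("command stuck: " + cmd))
		}
	}
}

func main() {
	r := &replica{inflightSince: map[string]time.Time{
		"read-only-scan": time.Now().Add(-20 * time.Second),
	}}
	r.maybeTrip(time.Now(), 15*time.Second)
	fmt.Println("breaker tripped:", r.br.tripped)
}
```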
The current trigger may also never fire. If there is no lease, and clients use statement timeouts below 15s throughout, lease proposals (which receive their own reproposal treatment, #74711) never stick around long enough to trip the breaker. If there is a lease, the actual client write sits in the proposals map and stays there even if the client gives up, so in that case the trigger works, or should, anyway; since most client writes are pipelined, that is also the common case, in some sense.
Another scenario is when a node goes non-live, an event associated with potential loss of quorum. Each Store could iterate through its replicas and, based on the active descriptor, trip the breaker for each range it expects to have become unavailable. This would generally move affected replicas into the desirable fail-fast regime more quickly; at the time of writing (#71806), replicas only fail fast 15s after they have received a request that got stuck. Since most ranges receive only sporadic writes, clients would see 15+s latencies for some time, which is not the UX we're ultimately after.
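A rough Go sketch of the store-level scan described above, under invented names (`store`, `onNodeNonLive`, `rangeDescriptor`); the real descriptor and liveness plumbing in kvserver are more involved:

```go
package main

import "fmt"

// rangeDescriptor and store are simplified stand-ins for the real
// kvserver types; only what the sketch needs is modeled.
type rangeDescriptor struct {
	replicaNodeIDs []int // nodes holding voting replicas of this range
}

type replica struct {
	desc    rangeDescriptor
	tripped bool
}

type store struct{ replicas []*replica }

// onNodeNonLive is the proposed trigger: when nodes are detected as
// non-live, scan all local replicas and trip (or probe, see below) the
// breaker for every range that can no longer reach quorum.
func (s *store) onNodeNonLive(deadNodes map[int]bool) {
	for _, r := range s.replicas {
		live := 0
		for _, n := range r.desc.replicaNodeIDs {
			if !deadNodes[n] {
				live++
			}
		}
		// Quorum needs a strict majority of the voting replicas.
		if live <= len(r.desc.replicaNodeIDs)/2 {
			r.tripped = true
		}
	}
}

func main() {
	s := &store{replicas: []*replica{
		{desc: rangeDescriptor{replicaNodeIDs: []int{1, 2, 3}}},
	}}
	s.onNodeNonLive(map[int]bool{2: true, 3: true}) // two of three down
	fmt.Println("tripped:", s.replicas[0].tripped)
}
```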
As more triggers are introduced, the risk of false positives (i.e., tripping the breaker of a healthy replica) increases. We could consider a semantic change: instead of outright tripping the breaker, a trigger merely kicks off the breaker's probe. If the probe passes, all is well and the breaker never trips; if the probe fails, it trips the breaker.
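The probe-first semantics can be sketched in a few lines of Go; `breaker.trigger` here is a hypothetical name, not the actual circuit-breaker API:

```go
package main

import (
	"errors"
	"fmt"
)

// breaker captures the semantic change proposed above: a trigger does
// not trip the breaker directly, it only launches the probe. If the
// probe succeeds the breaker is never tripped; if it fails, the
// probe's error trips it.
type breaker struct{ err error }

func (b *breaker) trigger(probe func() error) {
	if err := probe(); err != nil {
		b.err = err // probe failed: now the breaker trips
	}
	// Probe passed: a false-positive trigger is harmless.
}

func main() {
	var b breaker
	b.trigger(func() error { return nil }) // healthy replica
	fmt.Println("after healthy probe:", b.err)
	b.trigger(func() error { return errors.New("replica unavailable") })
	fmt.Println("after failed probe:", b.err)
}
```

This keeps each new trigger cheap to add: a false positive costs one probe rather than an availability blip on a healthy range.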
Related (figuring out what "slow" request means): #74509
Related (more coverage for the probe): #74701
Related (limits of per-Replica circuit breakers): #74616 #74503
Jira issue: CRDB-12225