storage,log: reduce max sync duration default timeouts #81075
Conversation
Currently, Pebble will emit a fatal (or error, if configured) log event in a situation where a single write or sync operation exceeds the `MaxSyncDuration`. By default, this value is set to `60s`, but it can be configured with the `storage.max_sync_duration` setting.

Recent incidents have demonstrated that the current default value is most likely too high. For example, stalled disk operations that prevent a node from heartbeating within 4.5 seconds will result in the node shedding all leases. Failing faster in this case is desirable. There also exist situations in which stalled disk operations on a single node can adversely affect throughput for an entire cluster (see cockroachlabs/support#1571 and cockroachlabs/support#1564). Lowering the timeout improves the recovery time.

Lower the default value to `20s`, to strike a balance between crashing the process earlier in the event of a hardware failure (hard or soft) and allowing ample time for a slow disk operation to clear in the transient case.

Update the corresponding value in the logging package.

Release note (ops change): The default value for `storage.max_sync_duration` has been lowered from `60s` to `20s`. Cockroach will exit sooner with a fatal error if a single slow disk operation exceeds this value.

Touches cockroachdb#80942, cockroachdb#74712.
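For illustration, here is a minimal Go sketch of the general watchdog pattern described above: run the sync on a separate goroutine and crash the process if it has not returned within the threshold. This is not Pebble's actual implementation; `syncWithTimeout` and `maxSyncDuration` are names invented for this sketch, and the `20s` value simply mirrors the new default.

```go
package main

import (
	"log"
	"os"
	"time"
)

// maxSyncDuration mirrors the role of the storage.max_sync_duration
// setting; 20s matches the new default proposed in this PR.
const maxSyncDuration = 20 * time.Second

// syncWithTimeout runs f.Sync() in a goroutine and treats the sync as a
// disk stall if it has not completed within maxSyncDuration, exiting the
// process with a fatal log message (roughly analogous to the fatal event
// described above).
func syncWithTimeout(f *os.File) error {
	done := make(chan error, 1)
	go func() { done <- f.Sync() }()
	select {
	case err := <-done:
		return err
	case <-time.After(maxSyncDuration):
		log.Fatalf("disk stall detected: sync on %s exceeded %s", f.Name(), maxSyncDuration)
		return nil // unreachable; log.Fatalf exits the process
	}
}

func main() {
	f, err := os.CreateTemp("", "sync-example")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())

	if _, err := f.WriteString("hello"); err != nil {
		log.Fatal(err)
	}
	if err := syncWithTimeout(f); err != nil {
		log.Fatal(err)
	}
	log.Printf("sync completed within %s", maxSyncDuration)
}
```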
LGTM!
Do we happen to have any metrics/logs on disk stalls? Would be interesting to survey the CC clusters and look at the tail distribution.
Good call. Let me see what I can dig up.
Seems like panics due to logging stalling happen a bunch in the wild. This could all be from perpetually broken clusters though, so results might be skewed: https://sentry.io/organizations/cockroach-labs/issues/?query=%22disk+stall+detected%22&statsPeriod=14d Logs from our own clusters will be better. Will keep digging.
LGTM once @erikgrinaker's question is addressed.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nicktrav)
There's some more context in this internal thread. Given there's a cloud provider persistent disk outage happening at the time I write this (internal ticket), it would be interesting to see if that shakes out some more useful data on this.
I took a look at what we observe on our CC clusters. In the last 14 days, the instances of "disk slowness" were very localized (as opposed to there being some constant "baseline" of events happening in the background). The threshold for logging these events is 2s, so the vast majority of faults clear within that time (we should be able to safely assume that we have sufficient exposure to these types of storage faults, across multiple clouds and multiple regions). Based on this, I don't believe there's a risk of seeing a sudden uptick in these fatal log events.

During this 14-day window there was a GCP persistent disk outage in us-central1 (mentioned above) that seemed to affect a handful of nodes on various clusters. I did see some fatal log lines (at the 60s+ time scale). There were also some nodes on which the fault cleared well in excess of the proposed 20s limit, but below the 60s threshold. These nodes would have crashed with the proposed change, whereas before they would not have. If we assume that the node's leases were already shed, crashing the node earlier seems like a win in this case.

@erikgrinaker - wdyt? Also - how do we feel about backporting this? Seems like a behavior change.
Thanks for checking Nick! Since the background rate is ~0, and the failure mode during an actual outage is stalled ranges (due to lease transfer issues in #81100), I think this makes sense.
The 22.1 backport should be fine, since it's still early days. Not so sure about 21.2 and 21.1 -- I'd lean towards not backporting there, out of an abundance of caution. Can we make these parameters a cluster setting? If so, affected users can drop this themselves on older versions, which is probably sufficient.
👍
Yeah, it would be nice if this was configurable. I'm going to dig into this a little more, separately, as the plumbing required in the logging package was not as straightforward as I'd anticipated. I'll land this as is, and follow up. TFTRs!
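As suggested in the discussion above, affected users on versions that still default to `60s` can lower the threshold themselves, assuming `storage.max_sync_duration` is available as a cluster setting on the version in question (the PR description indicates it is configurable). Below is a minimal sketch of doing so over the pgwire protocol with Go's `database/sql`; the connection string, credentials, and TLS settings are assumptions to adjust for your own cluster.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// The connection string is illustrative; adjust host, port, database,
	// credentials, and TLS settings for your own cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Lower the stall threshold cluster-wide; requires admin privileges.
	if _, err := db.Exec("SET CLUSTER SETTING storage.max_sync_duration = '20s'"); err != nil {
		log.Fatal(err)
	}
	log.Println("storage.max_sync_duration lowered to 20s")
}
```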
bors r=erikgrinaker,sumeerbhola
Build succeeded.