
storage,log: reduce max sync duration default timeouts #81075

Merged · 1 commit · May 19, 2022

Conversation

@nicktrav (Collaborator) commented May 5, 2022

Currently, Pebble will emit a fatal (or error, if configured) log event
when a single write or sync operation exceeds the MaxSyncDuration. By
default, this value is set to 60s, but it can be configured with the
storage.max_sync_duration setting.

Recent incidents have demonstrated that the current default value is
most likely too high. For example, stalled disk operations that prevent
a node from heartbeating within 4.5 seconds will result in the node
shedding all leases. Failing faster in this case is desirable.

There also exist situations in which stalled disk operations on a single
node can adversely affect throughput for an entire cluster (see
cockroachlabs/support#1571 and cockroachlabs/support#1564). Lowering the
timeout improves the recovery time.

Lower the default value to 20s to strike a balance between crashing the
process earlier in the event of a hardware failure (hard or soft) and
allowing ample time for a slow disk operation to clear in the transient
case.

Update the corresponding value in the logging package.

Release note (ops change): The default value for
storage.max_sync_duration has been lowered from 60s to 20s.
Cockroach will exit sooner with a fatal error if a single slow disk
operation exceeds this value.

Touches #80942, #74712.
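
To make the behavior concrete, here is a minimal Go sketch of the stall-detection idea described above (not Pebble's actual implementation; withStallDetection and the hard-coded constant are hypothetical names for illustration): an operation that fails to complete within the limit triggers a fatal exit.

```go
package main

import (
	"log"
	"os"
	"time"
)

// maxSyncDuration mirrors the default proposed in this PR. Hypothetical
// constant for illustration; the real value comes from the
// storage.max_sync_duration setting.
const maxSyncDuration = 20 * time.Second

// withStallDetection runs op and terminates the process with a fatal log
// line if op has not completed within maxSyncDuration. Sketch only.
func withStallDetection(name string, op func() error) error {
	done := make(chan error, 1)
	go func() { done <- op() }()

	select {
	case err := <-done:
		return err
	case <-time.After(maxSyncDuration):
		log.Printf("fatal: disk stall detected: %s exceeded %s", name, maxSyncDuration)
		os.Exit(2)
		return nil // unreachable
	}
}

func main() {
	// Example: a sync that completes well under the threshold.
	err := withStallDetection("WAL sync", func() error {
		time.Sleep(10 * time.Millisecond) // stand-in for f.Sync()
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("sync completed within the max sync duration")
}
```

In the real system the threshold is read from storage.max_sync_duration rather than a hard-coded constant, and whether the event is fatal or merely an error is configurable, as described above.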

@nicktrav requested review from a team as code owners May 5, 2022 22:03
@cockroach-teamcity (Member)

This change is Reviewable

@erikgrinaker (Contributor) left a comment

LGTM!

Do we happen to have any metrics/logs on disk stalls? Would be interesting to survey the CC clusters and look at the tail distribution.

@nicktrav (Collaborator, Author) commented May 5, 2022

Do we happen to have any metrics/logs on disk stalls?

Good call. Let me see what I can dig up.

@nicktrav (Collaborator, Author) commented May 5, 2022

Seems like panics due to log stalling happen a bunch in the wild. This could all be from perpetually broken clusters though, so the results might be skewed: https://sentry.io/organizations/cockroach-labs/issues/?query=%22disk+stall+detected%22&statsPeriod=14d

Logs from our own clusters will be better. Will keep digging.

@sumeerbhola (Collaborator) left a comment

:lgtm: once @erikgrinaker's question is addressed.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nicktrav)

@nicktrav (Collaborator, Author) commented May 6, 2022

There's some more context in this internal thread.

Given there's a cloud provider persistent disk outage happening as I write this (internal ticket), it would be interesting to see whether that shakes out some more useful data on this.

@nicktrav (Collaborator, Author)

I took a look at what we observe on our CC clusters.

In the last 14 days, the instances of "disk slowness" were very localized (as opposed to there being some constant "baseline" of events happening in the background). The threshold for logging these events is 2s, so the vast majority of faults clear within that time (we can safely assume we have sufficient exposure to these types of storage faults, across multiple clouds and regions). Based on this, I don't believe there's a risk of seeing a sudden uptick in these fatal log events.

During this 14-day window there was a GCP persistent disk outage in us-central1 (mentioned above) that seemed to affect a handful of nodes on various clusters. I did see some fatal log lines (at the 60s+ time scale). There were also some nodes on which the fault took well in excess of the proposed 20s limit, but less than the 60s threshold, to clear. These nodes would have crashed with the proposed change, whereas before they would not have. If we assume that the node's leases were already shed, crashing the node earlier seems like a win in this case.

@erikgrinaker - wdyt?

Also - how do we feel about backporting this? Seems like a behavior change.

@erikgrinaker (Contributor)

Thanks for checking, Nick! Since the background rate is ~0, and the failure mode during an actual outage is stalled ranges (due to lease transfer issues in #81100), I think this makes sense.

Also - how do we feel about backporting this? Seems like a behavior change.

The 22.1 backport should be fine, since it's still early days. Not so sure about 21.2 and 21.1 -- I'd lean towards not backporting there, out of an abundance of caution. Can we make these parameters a cluster setting? If so, affected users can drop this themselves on older versions, which is probably sufficient.

@nicktrav (Collaborator, Author)

The 22.1 backport should be fine

👍

Can we make these parameters a cluster setting?

Yeah, it would be nice if this were configurable. I'm going to dig into this a little more, separately, as the plumbing required in the logging package was not as straightforward as I'd anticipated. One shape the plumbing could take is sketched below.
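
For illustration only, a minimal Go sketch with hypothetical names, assuming the low-level package cannot import the cluster-settings registry directly: publish the configured value through an atomic that a settings on-change hook updates. This is one possible approach, not the plumbing that eventually landed.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// maxSyncDurationNanos holds the currently configured limit. Packages that
// cannot depend on the cluster-settings registry (such as a low-level
// logging package) read it atomically. All names here are hypothetical.
var maxSyncDurationNanos atomic.Int64

func init() {
	maxSyncDurationNanos.Store(int64(20 * time.Second)) // default from this PR
}

// SetMaxSyncDuration would be called from a settings on-change hook whenever
// an operator updates storage.max_sync_duration.
func SetMaxSyncDuration(d time.Duration) {
	maxSyncDurationNanos.Store(int64(d))
}

// MaxSyncDuration is what the stall-detection path reads on each check.
func MaxSyncDuration() time.Duration {
	return time.Duration(maxSyncDurationNanos.Load())
}

func main() {
	fmt.Println("default:", MaxSyncDuration())
	SetMaxSyncDuration(60 * time.Second) // e.g. an operator raising the limit
	fmt.Println("updated:", MaxSyncDuration())
}
```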

I'll land this as is, and follow up.

TFTRs!

@nicktrav (Collaborator, Author)

bors r=erikgrinaker,sumeerbhola

@craig (bot) commented May 19, 2022

Build succeeded.
