
storage,log: reduce max sync duration default timeouts #81075

Merged · 1 commit · May 19, 2022

Conversation

@nicktrav (Collaborator) commented May 5, 2022

Currently, Pebble will emit a fatal (or error, if configured) log event
when a single write or sync operation exceeds the MaxSyncDuration. By
default, this value is set to 60s, but it can be configured with the
storage.max_sync_duration setting.

Recent incidents have demonstrated that the current default value is
most likely too high. For example, stalled disk operations that prevent
a node from heartbeating within 4.5 seconds will result in the node
shedding all leases. Failing faster in this case is desirable.

There also exist situations in which stalled disk operations on a single
node can adversely affect throughput for an entire cluster (see
cockroachlabs/support#1571 and cockroachlabs/support#1564). Lowering the
timeout improves the recovery time.

Lower the default value to 20s to strike a balance between crashing the
process earlier in the event of a hardware failure (hard or soft) and
allowing ample time for a slow disk operation to clear in the transient
case.

Update the corresponding value in the logging package.

Release note (ops change): The default value for
storage.max_sync_duration has been lowered from 60s to 20s.
Cockroach will exit sooner with a fatal error if a single slow disk
operation exceeds this value.

Touches #80942, #74712.
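
To make the behavior concrete, here is a minimal Go sketch of the stall-detection idea described above (not Pebble's actual implementation; withStallDetection and the hard-coded constant are hypothetical names for illustration): an operation that fails to complete within the limit triggers a fatal exit.

```go
package main

import (
	"log"
	"os"
	"time"
)

// maxSyncDuration mirrors the default proposed in this PR. Hypothetical
// constant for illustration; the real value comes from the
// storage.max_sync_duration setting.
const maxSyncDuration = 20 * time.Second

// withStallDetection runs op and terminates the process with a fatal log
// line if op has not completed within maxSyncDuration. Sketch only.
func withStallDetection(name string, op func() error) error {
	done := make(chan error, 1)
	go func() { done <- op() }()

	select {
	case err := <-done:
		return err
	case <-time.After(maxSyncDuration):
		log.Printf("fatal: disk stall detected: %s exceeded %s", name, maxSyncDuration)
		os.Exit(2)
		return nil // unreachable
	}
}

func main() {
	// Example: a sync that completes well under the threshold.
	err := withStallDetection("WAL sync", func() error {
		time.Sleep(10 * time.Millisecond) // stand-in for f.Sync()
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("sync completed within the max sync duration")
}
```

In the real system the threshold is read from storage.max_sync_duration rather than a hard-coded constant, and whether the event is fatal or merely an error is configurable, as described above.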

@nicktrav requested review from a team as code owners May 5, 2022 22:03
@cockroach-teamcity (Member)

This change is Reviewable

@erikgrinaker (Contributor) left a comment

LGTM!

Do we happen to have any metrics/logs on disk stalls? Would be interesting to survey the CC clusters and look at the tail distribution.

@nicktrav (Collaborator, Author) commented May 5, 2022

Do we happen to have any metrics/logs on disk stalls?

Good call. Let me see what I can dig up.

@nicktrav (Collaborator, Author) commented May 5, 2022

Seems like panics due to log stalling happen a bunch in the wild. This could all be from perpetually broken clusters though, so the results might be skewed: https://sentry.io/organizations/cockroach-labs/issues/?query=%22disk+stall+detected%22&statsPeriod=14d

Logs from our own clusters will be better. Will keep digging.

@sumeerbhola (Collaborator) left a comment

:lgtm: once @erikgrinaker's question is addressed.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nicktrav)

@nicktrav (Collaborator, Author) commented May 6, 2022

There's some more context in this internal thread.

Given there's a cloud provider persistent disk outage happening as I write this (internal ticket), it would be interesting to see whether that shakes out some more useful data on this.

@nicktrav (Collaborator, Author)

I took a look at what we observe on our CC clusters.

In the last 14 days, the instances of "disk slowness" were very localized (as opposed to there being some constant "baseline" of events happening in the background). The threshold for logging these events is 2s, so the vast majority of faults clear within that time (we can safely assume we have sufficient exposure to these types of storage faults, across multiple clouds and regions). Based on this, I don't believe there's a risk of seeing a sudden uptick in these fatal log events.

During this 14-day window there was a GCP persistent disk outage in us-central1 (mentioned above) that seemed to affect a handful of nodes on various clusters. I did see some fatal log lines (at the 60s+ time scale). There were also some nodes on which the fault took well in excess of the proposed 20s limit, but less than the 60s threshold, to clear. These nodes would have crashed with the proposed change, whereas before they would not have. If we assume that the node's leases were already shed, crashing the node earlier seems like a win in this case.

@erikgrinaker - wdyt?

Also - how do we feel about backporting this? Seems like a behavior change.

@erikgrinaker (Contributor)

Thanks for checking, Nick! Since the background rate is ~0, and the failure mode during an actual outage is stalled ranges (due to lease transfer issues in #81100), I think this makes sense.

Also - how do we feel about backporting this? Seems like a behavior change.

The 22.1 backport should be fine, since it's still early days. Not so sure about 21.2 and 21.1 -- I'd lean towards not backporting there, out of an abundance of caution. Can we make these parameters a cluster setting? If so, affected users can drop this themselves on older versions, which is probably sufficient.

@nicktrav (Collaborator, Author)

The 22.1 backport should be fine

👍

Can we make these parameters a cluster setting?

Yeah, it would be nice if this were configurable. I'm going to dig into this a little more, separately, as the plumbing required in the logging package was not as straightforward as I'd anticipated. One shape the plumbing could take is sketched below.
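
For illustration only, a minimal Go sketch with hypothetical names, assuming the low-level package cannot import the cluster-settings registry directly: publish the configured value through an atomic that a settings on-change hook updates. This is one possible approach, not the plumbing that eventually landed.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// maxSyncDurationNanos holds the currently configured limit. Packages that
// cannot depend on the cluster-settings registry (such as a low-level
// logging package) read it atomically. All names here are hypothetical.
var maxSyncDurationNanos atomic.Int64

func init() {
	maxSyncDurationNanos.Store(int64(20 * time.Second)) // default from this PR
}

// SetMaxSyncDuration would be called from a settings on-change hook whenever
// an operator updates storage.max_sync_duration.
func SetMaxSyncDuration(d time.Duration) {
	maxSyncDurationNanos.Store(int64(d))
}

// MaxSyncDuration is what the stall-detection path reads on each check.
func MaxSyncDuration() time.Duration {
	return time.Duration(maxSyncDurationNanos.Load())
}

func main() {
	fmt.Println("default:", MaxSyncDuration())
	SetMaxSyncDuration(60 * time.Second) // e.g. an operator raising the limit
	fmt.Println("updated:", MaxSyncDuration())
}
```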

I'll land this as is, and follow up.

TFTRs!

@nicktrav (Collaborator, Author)

bors r=erikgrinaker,sumeerbhola

@craig (bot) commented May 19, 2022

Build succeeded.
