storage,log: reduce max sync duration default timeouts
Currently, Pebble will emit a fatal (or error, if configured) log event
when a single write or sync operation exceeds the
`MaxSyncDuration`. By default, this value is set to `60s`, but can be
configured with the `storage.max_sync_duration` setting.

Recent incidents have demonstrated that the current default value is
most likely too high. For example, stalled disk operations that prevent
a node from heartbeating within 4.5 seconds will result in the node shedding
all leases. Failing faster in this case is desirable.

There also exist situations in which stalled disk operations on a single
node can adversely affect throughput for an entire cluster (see
cockroachlabs/support#1571 and cockroachlabs/support#1564). Lowering the
timeout improves the recovery time.

Lower the default value to `20s`, to strike a balance between being able
to crash the process earlier in the event of a hardware failure (hard or
soft), while also allowing ample time for a slow disk operation to clear
in the transient case.

Update the corresponding value in the logging package.

Release note (ops change): The default value for
`storage.max_sync_duration` has been lowered from `60s` to `20s`.
Cockroach will exit sooner with a fatal error if a single slow disk
operation exceeds this value.

Touches cockroachdb#80942, cockroachdb#74712.
nicktrav committed May 5, 2022
1 parent da04bc2 commit c57dd84
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion pkg/storage/pebble.go
@@ -59,7 +59,7 @@ import (
const maxSyncDurationFatalOnExceededDefault = true

// Default for MaxSyncDuration below.
-var maxSyncDurationDefault = envutil.EnvOrDefaultDuration("COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT", 60*time.Second)
+var maxSyncDurationDefault = envutil.EnvOrDefaultDuration("COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT", 20*time.Second)

// MaxSyncDuration is the threshold above which an observed engine sync duration
// triggers either a warning or a fatal error.
2 changes: 1 addition & 1 deletion pkg/util/log/log_flush.go
@@ -52,7 +52,7 @@ const syncInterval = 30
// In practice, even a fraction of that would indicate a problem. This metric's
// default should ideally match its sister metric in the storage engine, set by
// COCKROACH_ENGINE_MAX_SYNC_DURATION.
-var maxSyncDuration = envutil.EnvOrDefaultDuration("COCKROACH_LOG_MAX_SYNC_DURATION", 60*time.Second)
+var maxSyncDuration = envutil.EnvOrDefaultDuration("COCKROACH_LOG_MAX_SYNC_DURATION", 20*time.Second)

// syncWarnDuration is the threshold after which a slow disk warning is written
// to the log and to stderr.