81075: storage,log: reduce max sync duration default timeouts r=erikgrinaker,sumeerbhola a=nicktrav

Currently, Pebble will emit a fatal (or error, if configured) log event
in a situation where a single write or sync operation exceeds the
`MaxSyncDuration`. By default, this value is set to `60s`, but can be
configured with the `storage.max_sync_duration` setting.

Recent incidents have demonstrated that the current default value is
most likely too high. For example, stalled disk operations that prevent
a node from heartbeating within 4.5 seconds will result in the node shedding
all leases. Failing faster in this case is desirable.

There also exist situations in which stalled disk operations on a single
node can adversely affect throughput for an entire cluster (see
cockroachlabs/support#1571 and cockroachlabs/support#1564). Lowering the
timeout improves the recovery time.

Lower the default value to `20s`, to strike a balance between being able
to crash the process earlier in the event of a hardware failure (hard or
soft), while also allowing ample time for a slow disk operation to clear
in the transient case.

Update the corresponding value in the logging package.

Release note (ops change): The default value for
`storage.max_sync_duration` has been lowered from `60s` to `20s`.
Cockroach will exit sooner with a fatal error if a single slow disk
operation exceeds this value.

Touches cockroachdb#80942, cockroachdb#74712.

81468: docs: clarify descriptions for tracing cluster settings r=andreimatei,arulajmani a=michae2

The descriptions for sql.trace.txn.enable_threshold and
sql.trace.stmt.enable_threshold suggested that tracing was only enabled
for transactions and statements longer than the threshold duration. This
is not true, however: tracing is enabled for everything, and only
_logged_ for transactions and statements longer than the threshold
duration. Make this clear in the descriptions. Also make the description
for sql.trace.session_eventlog.enabled a "single" sentence to match the
style of other descriptions.

Release note: None

Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
3 people committed May 18, 2022
3 parents 1e1ff14 + c57dd84 + 381c3c4 commit 392d62a
Showing 5 changed files with 21 additions and 17 deletions.
6 changes: 3 additions & 3 deletions docs/generated/settings/settings-for-tenants.txt
@@ -265,9 +265,9 @@ sql.telemetry.query_sampling.enabled boolean false when set to true, executed qu
sql.temp_object_cleaner.cleanup_interval duration 30m0s how often to clean up orphaned temporary objects
sql.temp_object_cleaner.wait_interval duration 30m0s how long after creation a temporary object will be cleaned up
sql.trace.log_statement_execute boolean false set to true to enable logging of executed statements
sql.trace.session_eventlog.enabled boolean false set to true to enable session tracing. Note that enabling this may have a non-trivial negative performance impact.
sql.trace.stmt.enable_threshold duration 0s duration beyond which all statements are traced (set to 0 to disable). This applies to individual statements within a transaction and is therefore finer-grained than sql.trace.txn.enable_threshold.
sql.trace.txn.enable_threshold duration 0s duration beyond which all transactions are traced (set to 0 to disable). This setting is coarser grained thansql.trace.stmt.enable_threshold because it applies to all statements within a transaction as well as client communication (e.g. retries).
sql.trace.session_eventlog.enabled boolean false set to true to enable session tracing; note that enabling this may have a negative performance impact
sql.trace.stmt.enable_threshold duration 0s enables tracing on all statements; statements executing for longer than this duration will have their trace logged (set to 0 to disable); note that enabling this may have a negative performance impact; this setting applies to individual statements within a transaction and is therefore finer-grained than sql.trace.txn.enable_threshold
sql.trace.txn.enable_threshold duration 0s enables tracing on all transactions; transactions open for longer than this duration will have their trace logged (set to 0 to disable); note that enabling this may have a negative performance impact; this setting is coarser-grained than sql.trace.stmt.enable_threshold because it applies to all statements within a transaction as well as client communication (e.g. retries)
sql.ttl.default_delete_batch_size integer 100 default amount of rows to delete in a single query during a TTL job
sql.ttl.default_delete_rate_limit integer 0 default delete rate limit for all TTL jobs. Use 0 to signify no rate limit.
sql.ttl.default_range_concurrency integer 1 default amount of ranges to process at once during a TTL delete
6 changes: 3 additions & 3 deletions docs/generated/settings/settings.html
@@ -196,9 +196,9 @@
<tr><td><code>sql.temp_object_cleaner.cleanup_interval</code></td><td>duration</td><td><code>30m0s</code></td><td>how often to clean up orphaned temporary objects</td></tr>
<tr><td><code>sql.temp_object_cleaner.wait_interval</code></td><td>duration</td><td><code>30m0s</code></td><td>how long after creation a temporary object will be cleaned up</td></tr>
<tr><td><code>sql.trace.log_statement_execute</code></td><td>boolean</td><td><code>false</code></td><td>set to true to enable logging of executed statements</td></tr>
<tr><td><code>sql.trace.session_eventlog.enabled</code></td><td>boolean</td><td><code>false</code></td><td>set to true to enable session tracing. Note that enabling this may have a non-trivial negative performance impact.</td></tr>
<tr><td><code>sql.trace.stmt.enable_threshold</code></td><td>duration</td><td><code>0s</code></td><td>duration beyond which all statements are traced (set to 0 to disable). This applies to individual statements within a transaction and is therefore finer-grained than sql.trace.txn.enable_threshold.</td></tr>
<tr><td><code>sql.trace.txn.enable_threshold</code></td><td>duration</td><td><code>0s</code></td><td>duration beyond which all transactions are traced (set to 0 to disable). This setting is coarser grained thansql.trace.stmt.enable_threshold because it applies to all statements within a transaction as well as client communication (e.g. retries).</td></tr>
<tr><td><code>sql.trace.session_eventlog.enabled</code></td><td>boolean</td><td><code>false</code></td><td>set to true to enable session tracing; note that enabling this may have a negative performance impact</td></tr>
<tr><td><code>sql.trace.stmt.enable_threshold</code></td><td>duration</td><td><code>0s</code></td><td>enables tracing on all statements; statements executing for longer than this duration will have their trace logged (set to 0 to disable); note that enabling this may have a negative performance impact; this setting applies to individual statements within a transaction and is therefore finer-grained than sql.trace.txn.enable_threshold</td></tr>
<tr><td><code>sql.trace.txn.enable_threshold</code></td><td>duration</td><td><code>0s</code></td><td>enables tracing on all transactions; transactions open for longer than this duration will have their trace logged (set to 0 to disable); note that enabling this may have a negative performance impact; this setting is coarser-grained than sql.trace.stmt.enable_threshold because it applies to all statements within a transaction as well as client communication (e.g. retries)</td></tr>
<tr><td><code>sql.ttl.default_delete_batch_size</code></td><td>integer</td><td><code>100</code></td><td>default amount of rows to delete in a single query during a TTL job</td></tr>
<tr><td><code>sql.ttl.default_delete_rate_limit</code></td><td>integer</td><td><code>0</code></td><td>default delete rate limit for all TTL jobs. Use 0 to signify no rate limit.</td></tr>
<tr><td><code>sql.ttl.default_range_concurrency</code></td><td>integer</td><td><code>1</code></td><td>default amount of ranges to process at once during a TTL delete</td></tr>
22 changes: 13 additions & 9 deletions pkg/sql/exec_util.go
@@ -221,10 +221,12 @@ var secondaryTenantZoneConfigsEnabled = settings.RegisterBoolSetting(
var traceTxnThreshold = settings.RegisterDurationSetting(
settings.TenantWritable,
"sql.trace.txn.enable_threshold",
"duration beyond which all transactions are traced (set to 0 to "+
"disable). This setting is coarser grained than"+
"sql.trace.stmt.enable_threshold because it applies to all statements "+
"within a transaction as well as client communication (e.g. retries).", 0,
"enables tracing on all transactions; transactions open for longer than "+
"this duration will have their trace logged (set to 0 to disable); "+
"note that enabling this may have a negative performance impact; "+
"this setting is coarser-grained than sql.trace.stmt.enable_threshold "+
"because it applies to all statements within a transaction as well as "+
"client communication (e.g. retries)", 0,
).WithPublic()

// TraceStmtThreshold is identical to traceTxnThreshold except it applies to
@@ -234,9 +236,11 @@ var traceTxnThreshold = settings.RegisterDurationSetting(
var TraceStmtThreshold = settings.RegisterDurationSetting(
settings.TenantWritable,
"sql.trace.stmt.enable_threshold",
"duration beyond which all statements are traced (set to 0 to disable). "+
"This applies to individual statements within a transaction and is therefore "+
"finer-grained than sql.trace.txn.enable_threshold.",
"enables tracing on all statements; statements executing for longer than "+
"this duration will have their trace logged (set to 0 to disable); "+
"note that enabling this may have a negative performance impact; "+
"this setting applies to individual statements within a transaction and "+
"is therefore finer-grained than sql.trace.txn.enable_threshold",
0,
).WithPublic()

@@ -247,8 +251,8 @@ var TraceStmtThreshold = settings.RegisterDurationSetting(
var traceSessionEventLogEnabled = settings.RegisterBoolSetting(
settings.TenantWritable,
"sql.trace.session_eventlog.enabled",
"set to true to enable session tracing. "+
"Note that enabling this may have a non-trivial negative performance impact.",
"set to true to enable session tracing; "+
"note that enabling this may have a negative performance impact",
false,
).WithPublic()

2 changes: 1 addition & 1 deletion pkg/storage/pebble.go
@@ -59,7 +59,7 @@ import (
const maxSyncDurationFatalOnExceededDefault = true

// Default for MaxSyncDuration below.
var maxSyncDurationDefault = envutil.EnvOrDefaultDuration("COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT", 60*time.Second)
var maxSyncDurationDefault = envutil.EnvOrDefaultDuration("COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT", 20*time.Second)

// MaxSyncDuration is the threshold above which an observed engine sync duration
// triggers either a warning or a fatal error.
2 changes: 1 addition & 1 deletion pkg/util/log/log_flush.go
@@ -52,7 +52,7 @@ const syncInterval = 30
// In practice, even a fraction of that would indicate a problem. This metric's
// default should ideally match its sister metric in the storage engine, set by
// COCKROACH_ENGINE_MAX_SYNC_DURATION.
var maxSyncDuration = envutil.EnvOrDefaultDuration("COCKROACH_LOG_MAX_SYNC_DURATION", 60*time.Second)
var maxSyncDuration = envutil.EnvOrDefaultDuration("COCKROACH_LOG_MAX_SYNC_DURATION", 20*time.Second)

// syncWarnDuration is the threshold after which a slow disk warning is written
// to the log and to stderr.