storage: improve disk stall error message #67856

Closed
shermanCRL opened this issue Jul 21, 2021 · 11 comments · Fixed by cockroachdb/pebble#2281
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. E-quick-win Likely to be a quick win for someone experienced. E-starter Might be suitable for a starter project for new employees or team members. sync-me-8 T-storage Storage Team

Comments

@shermanCRL
Contributor

shermanCRL commented Jul 21, 2021

A client recently saw “disk stall” errors that led them to assume a hardware problem, and so they took disks offline, which led to further problems.

It appears that several situations can lead to such an error -- hitting the max number of files (1000), among other things. Is it correct that “disk stall” errors can result from distinct underlying situations? If so, is it possible to return different error messages so that users can better troubleshoot?

(If this should be in the Pebble repo, let me know.)

Jira issue: CRDB-8742

Epic CRDB-20293

@shermanCRL shermanCRL added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Jul 21, 2021
@mwang1026

Whoever picks this up, let's do a kickoff where we vet the right UX around which messages to display upon detecting which disk issues.

@nicktrav
Collaborator

nicktrav commented Jun 8, 2022

I think cockroachlabs/support#1630 also highlighted the need for some disambiguation. In that case, what we needed was some way of telling the difference between "slow", "a huge write going as fast as it can", and "stalled". We incorrectly assumed the latter, and therefore a hardware issue, which lessened the efficacy of our initial response.

cc: @jbowens @bananabrick

@jbowens
Collaborator

jbowens commented Jun 9, 2022

For cockroachlabs/support#1630, I think it would've been helpful for the logged message to include the length of the write buffer that failed to write within the configured duration.
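
A rough sketch of the kind of log line this is asking for, with the stalled write's size included; the writeBufLen value is hypothetical here, since at this point DiskSlowInfo appears to carry only the path and the duration:

```go
// Sketch only: report how much data was waiting to be written when the
// stall was detected. writeBufLen is a hypothetical value, not an existing
// DiskSlowInfo field.
log.Errorf(ctx,
    "disk stall detected: pebble unable to write %d bytes to %s in %.2f seconds",
    writeBufLen, info.Path, redact.Safe(info.Duration.Seconds()))
```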

@nicktrav nicktrav added E-starter Might be suitable for a starter project for new employees or team members. E-quick-win Likely to be a quick win for someone experienced. labels Jun 10, 2022
@joshimhoff
Collaborator

@jbowens, I have a few Qs for you as TL! I'm taking this over as a starter issue. Looks like cockroachdb/pebble#1672 from @nicktrav adds the op type but not the length of the write buffer. Unless someone tells me not to, I will add the length of the write buffer too, in case of a "sized write", as this will make the starter issue a bit more difficult, which will help me learn storage stuff.

In my starter doc, I see the following comment:

We’d ideally have the same context provided up in Cockroach too.

What is the concrete goal of providing this context to the CRDB layer? Is it that the fatal log message here (`log.Fatalf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds", ...)`) includes the same extra context that cockroachdb/pebble#1672 adds to the pebble log?

Also, should I backport these changes into some older CRDB versions?

@jbowens
Collaborator

jbowens commented Jan 17, 2023

I will add the length of the write buffer too, in case of a "sized write"

Sounds good

In my starter doc, I see the following comment:

Mind linking me? I'm not sure off-hand what that comment is referring to.

I think this original issue was created to update the language for the disk stall error to be a little less accusatory towards the hard disks. Maybe something like a "filesystem write stall," plus a comment that it may be a hard disk stall? I think making those changes, plus including the write buffer size should cover it.

```go
DiskSlow: func(info pebble.DiskSlowInfo) {
    maxSyncDuration := maxSyncDurationDefault
    fatalOnExceeded := maxSyncDurationFatalOnExceededDefault
    if p.settings != nil {
        maxSyncDuration = MaxSyncDuration.Get(&p.settings.SV)
        fatalOnExceeded = MaxSyncDurationFatalOnExceeded.Get(&p.settings.SV)
    }
    if info.Duration.Seconds() >= maxSyncDuration.Seconds() {
        atomic.AddInt64(&p.diskStallCount, 1)
        // Note that the below log messages go to the main cockroach log, not
        // the pebble-specific log.
        if fatalOnExceeded {
            log.Fatalf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds",
                info.Path, redact.Safe(info.Duration.Seconds()))
        } else {
            log.Errorf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds",
                info.Path, redact.Safe(info.Duration.Seconds()))
        }
        return
    }
    atomic.AddInt64(&p.diskSlowCount, 1)
},
```
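
A hedged sketch of what that suggestion could look like in the callback above; the "write stall" style wording and the info.WriteSize field are assumptions for illustration, not the final change:

```go
// Sketch only: less disk-accusatory wording plus the size of the stalled
// write. info.WriteSize is assumed here; the field is not present in the
// snippet above.
log.Fatalf(ctx,
    "file write stall detected: unable to write %d bytes to %s in %.2f seconds; "+
        "this may indicate a stalled disk",
    info.WriteSize, info.Path, redact.Safe(info.Duration.Seconds()))
```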

@joshimhoff
Collaborator

Thanks, @jbowens! All makes sense.

Mind linking me?

Here ya go: https://docs.google.com/document/d/1zgpqbSd1RbCSaDh36Sbd8UgG4XgS8rP41GGhBrFXyRs/edit

Specifically, the below part, which I think is suggesting that the context be included in the fatal message in the CRDB logs in addition to the pebble logs (as is already partially handled by cockroachdb/pebble#1672):

We’d ideally have the same context provided up in Cockroach too.

@joshimhoff
Collaborator

joshimhoff commented Jan 17, 2023

I think I now get what the bit from the starter doc was getting at: cockroachdb/pebble#1672 adds more context to DiskSlowInfo but doesn't actually plumb that context into a log file, IIUC.

I think I will probably adjust the below CRDB code to call DiskSlowInfo's SafeFormat method; that way, whatever context is available in the pebble layer will be in the CRDB logs:

```go
DiskSlow: func(info pebble.DiskSlowInfo) {
    maxSyncDuration := maxSyncDurationDefault
    fatalOnExceeded := maxSyncDurationFatalOnExceededDefault
    if p.settings != nil {
        maxSyncDuration = MaxSyncDuration.Get(&p.settings.SV)
        fatalOnExceeded = MaxSyncDurationFatalOnExceeded.Get(&p.settings.SV)
    }
    if info.Duration.Seconds() >= maxSyncDuration.Seconds() {
        atomic.AddInt64(&p.diskStallCount, 1)
        // Note that the below log messages go to the main cockroach log, not
        // the pebble-specific log.
        if fatalOnExceeded {
            log.Fatalf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds",
                info.Path, redact.Safe(info.Duration.Seconds()))
        } else {
            log.Errorf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds",
                info.Path, redact.Safe(info.Duration.Seconds()))
        }
        return
    }
    atomic.AddInt64(&p.diskSlowCount, 1)
},
```
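
A minimal sketch of that adjustment, assuming pebble.DiskSlowInfo's String/SafeFormat output can be interpolated directly; this matches the direction later merged in #95436 but is not the exact final code:

```go
if fatalOnExceeded {
    // Rely on pebble to describe the event; any extra context pebble adds to
    // DiskSlowInfo (op type, write size, ...) then shows up in the CRDB log
    // without a further CRDB change.
    log.Fatalf(ctx, "disk stall detected: %s", info)
} else {
    log.Errorf(ctx, "disk stall detected: %s", info)
}
```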

craig bot pushed a commit that referenced this issue Jan 24, 2023
95436: storage: rely on pebble for log message in case of write stall r=joshimhoff a=joshimhoff

Part of #67856. Haven't added additional context re: op type & size of write to `pebble.DiskSlowInfo` but will do that shortly. With this change, if we make changes to how much context is included in `pebble.DiskSlowInfo`, we will not need to make a CRDB change to have the context in the CRDB logs. Note also that I will adjust the pebble String method to be less accusatory to disks ("disk slow" -> "write slow").

See below for log message with this commit in.

```
$ ./dev test pkg/storage -v --filter='TestPebbleMetricEventListener' --show-logs
$ bazel test pkg/storage:all --test_env=GOTRACEBACK=all --test_filter=TestPebbleMetricEventListener --test_arg -test.v --test_arg -show-logs --test_sharding_strategy=disabled --test_output all
Starting local Bazel server and connecting to it...
INFO: Invocation ID: a4365f4b-9190-4ef3-8a54-de3b4c0c46b7
INFO: Analyzed 3 targets (626 packages loaded, 16556 targets configured).
INFO: Found 2 targets and 1 test target...
INFO: From Testing //pkg/storage:storage_test:
==================== Test output for //pkg/storage:storage_test:
initialized metamorphic constant "span-reuse-rate" with value 58
initialized metamorphic constant "mvcc-value-disable-simple-encoding" with value true
initialized metamorphic constant "mvcc-max-iters-before-seek" with value 2
initialized metamorphic constant "span-reuse-rate" with value 39
initialized metamorphic constant "mvcc-value-disable-simple-encoding" with value true
initialized metamorphic constant "mvcc-max-iters-before-seek" with value 0
=== RUN   TestPebbleMetricEventListener
    test_log_scope.go:76: test logs captured to: /private/var/tmp/_bazel_joshimhoff/69f147d3e5e8d9a9bc7bc343361a1a94/sandbox/darwin-sandbox/6/execroot/com_github_cockroachdb_cockroach/_tmp/88d640957a51906868fc787a6e8a38b4/logTestPebbleMetricEventListener440509452
E230118 21:01:15.941881 28 storage/pebble.go:1001  [-] 1  write stall detected: disk slowness detected: write to file  has been ongoing for 70.0s
    pebble_test.go:356: -- test log scope end --
--- PASS: TestPebbleMetricEventListener (0.05s)
PASS
================================================================================
INFO: Elapsed time: 29.194s, Critical Path: 8.66s
INFO: 7 processes: 1 internal, 6 darwin-sandbox.
INFO: Build completed successfully, 7 total actions
//pkg/storage:storage_test                                               PASSED in 0.6s

Executed 1 out of 1 test: 1 test passes.
INFO: Build completed successfully, 7 total actions
```

**storage: rely on pebble for log message in case of write stall**

In case of a write stall, CRDB logs out details re: the write stall. By
default, CRDB logs at fatal level & then crashes. Before this commit, though the
pebble event had a String method, it was not used in the CRDB log message.
This means adding additional context to the pebble event will not add the
context to the CRDB logs, unless a CRDB change is made in addition to a pebble
change. With this commit, the log message calls the pebble event's String
method.

Release note (ops change): CRDB log message in case of write stall has been
adjusted slightly.

95697: cloud/kubernetes: update to v22.2.3 r=celiala a=cameronnunez

Part of https://cockroachlabs.atlassian.net/browse/REL-248

Release note: None

95698: roachtest: update version map for 22.2.3 r=rail,renatolabs a=cameronnunez

Part of: https://cockroachlabs.atlassian.net/browse/REL-248

Release note: None

95714: logictest: fix retry on query that returns a fleeting error r=cucaroach a=cucaroach

Fixes: #95664
Release note: None
Epic: CRDB-20535

95751: cdc: add transformations to create changefeed telemetry r=jayshrivastava a=jayshrivastava

Previously, there were no telemetry events emitted to track the usage of changefeed transformations. This change adds the `transformation` field to the `CREATE CHANGEFEED` telemetry log event to track that. Example:
```
{
  "Timestamp": 1674574083953989000, "EventType": "create_changefeed",
  "Description": "CREATE CHANGEFEED INTO 'gcpubsub://testfeed?region=testfeedRegion' AS SELECT b FROM foo",
  "SinkType": "gcpubsub", "NumTables": 1, "Resolved": "no", "Transformation": true
}
```

Epic: CRDB-17161
Closes: #95126

Release note: None

95784: dev: in `doctor`, include output of failed `git submodule` invocations r=healthy-pod a=rickystewart

Epic: none
Release note: None

Co-authored-by: Josh Imhoff <[email protected]>
Co-authored-by: Cameron Nunez <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@joshimhoff
Collaborator

Merged #95436 & cockroachdb/pebble#2255. Just put up cockroachdb/pebble#2281. Once that one goes in, I think I'm done with this issue, modulo a quick e2e test plus any branch stuff we might want to do, e.g. backports.

@nicktrav
Collaborator

@joshimhoff - going to send this one your way, as it's in progress.

@joshimhoff
Collaborator

joshimhoff commented Feb 10, 2023

Thanks. I'll close this out the week after next (I am oncall next week). I got excited about the disagg testing, but it will be good to close this out.

@joshimhoff
Collaborator

cockroachdb/pebble#2281
