-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: improve disk stall error message #67856
Comments
Whoever picks this up, let's do a kickoff where we vet the right UX around what messages to display upon detecting which disk issues |
I think cockroachlabs/support#1630 also highlighted the need for some disambiguation. In this case it was some way of determining the difference between "slow", "huge write going as fast as it can" and "stalled". We incorrectly assumed the latter, and therefore a hardware issue, which lessened the efficacy of our initial response. cc: @jbowens @bananabrick |
For cockroachlabs/support#1630, I think it would've been helpful for the logged message to include the length of the write buffer that failed to write within the configured duration. |
@jbowens, I have a few Qs for you as TL! I'm taking this over as a starter issue. Looks like cockroachdb/pebble#1672 from @nicktrav adds the op type but not the length of the write buffer. Unless someone tells me not to, I will add the length of the write buffer too, in case of a "sized write", as this will make the starter issue a bit more difficult, which will help me learn storage stuff. In my starter doc, I see the following comment:
What is the concrete goal of providing this context into the CRDB layer? That the fatal log message here ( cockroach/pkg/storage/pebble.go Line 1076 in bfd1b3f
Also, should I backport these changes into some older CRDB versions? |
Sounds good
Mind linking me? I'm not sure off-hand what that comment is referring to. I think this original issue was created to update the language for the disk stall error to be a little less accusatory towards the hard disks. Maybe something like a "filesystem write stall," plus a comment that it may be a hard disk stall? I think making those changes, plus including the write buffer size should cover it. cockroach/pkg/storage/pebble.go Lines 1065 to 1086 in 20b2e81
|
Thanks, @jbowens! All makes sense.
Here ya go: https://docs.google.com/document/d/1zgpqbSd1RbCSaDh36Sbd8UgG4XgS8rP41GGhBrFXyRs/edit Specifically, the below part, which I think is suggesting the context is included in fatal message in the CRDB logs in addition to the pebble logs (as is already partially handled by cockroachdb/pebble#1672):
|
I think I get now what the bit from the starter doc was getting at; cockroachdb/pebble#1672 adds more context to I think I will prob adjust the below CRDB code to call cockroach/pkg/storage/pebble.go Lines 1065 to 1086 in 20b2e81
|
95436: storage: rely on pebble for log message in case of write stall r=joshimhoff a=joshimhoff Part of #67856. Haven't added additional context re: op type & size of write to `pebble.DiskSlowInfo` but will do that shortly. With this change, if we make changes to how much context is included in `pebble.DiskSlowInfo`, we will not need to make a CRDB change to have the context in the CRDB logs. Note also that I will adjust the pebble String method to be less accusatory to disks ("disk slow" -> "write slow"). See below for log message with this commit in. ``` $ ./dev test pkg/storage -v --filter='TestPebbleMetricEventListener' --show-logs $ bazel test pkg/storage:all --test_env=GOTRACEBACK=all --test_filter=TestPebbleMetricEventListener --test_arg -test.v --test_arg -show-logs --test_sharding_strategy=disabled --test_output all Starting local Bazel server and connecting to it... INFO: Invocation ID: a4365f4b-9190-4ef3-8a54-de3b4c0c46b7 INFO: Analyzed 3 targets (626 packages loaded, 16556 targets configured). INFO: Found 2 targets and 1 test target... INFO: From Testing //pkg/storage:storage_test: ==================== Test output for //pkg/storage:storage_test: initialized metamorphic constant "span-reuse-rate" with value 58 initialized metamorphic constant "mvcc-value-disable-simple-encoding" with value true initialized metamorphic constant "mvcc-max-iters-before-seek" with value 2 initialized metamorphic constant "span-reuse-rate" with value 39 initialized metamorphic constant "mvcc-value-disable-simple-encoding" with value true initialized metamorphic constant "mvcc-max-iters-before-seek" with value 0 === RUN TestPebbleMetricEventListener test_log_scope.go:76: test logs captured to: /private/var/tmp/_bazel_joshimhoff/69f147d3e5e8d9a9bc7bc343361a1a94/sandbox/darwin-sandbox/6/execroot/com_github_cockroachdb_cockroach/_tmp/88d640957a51906868fc787a6e8a38b4/logTestPebbleMetricEventListener440509452 E230118 21:01:15.941881 28 storage/pebble.go:1001 [-] 1 write stall detected: disk slowness detected: write to file has been ongoing for 70.0s pebble_test.go:356: -- test log scope end -- --- PASS: TestPebbleMetricEventListener (0.05s) PASS ================================================================================ INFO: Elapsed time: 29.194s, Critical Path: 8.66s INFO: 7 processes: 1 internal, 6 darwin-sandbox. INFO: Build completed successfully, 7 total actions //pkg/storage:storage_test PASSED in 0.6s Executed 1 out of 1 test: 1 test passes. INFO: Build completed successfully, 7 total actions ``` **storage: rely on pebble for log message in case of write stall** In case of a write stall, CRDB logs out details re: the write stall. By default, CRDB logs at fatal level & then crashes. Before this commit, tho the pebble event had a String method, it was not used in the CRDB log message. This means adding additional context to the pebble event will not add the context to the CRDB logs, unless a CRDB change is made in addition to a pebble change. With this commit, the log message calls the pebble event's String method. Release note (ops change): CRDB log message in case of write stall has been adjusted slightly. 95697: cloud/kubernetes: update to v22.2.3 r=celiala a=cameronnunez Part of https://cockroachlabs.atlassian.net/browse/REL-248 Release note: None 95698: roachtest: update version map for 22.2.3 r=rail,renatolabs a=cameronnunez Part of: https://cockroachlabs.atlassian.net/browse/REL-248 Release note: None 95714: logictest: fix retry on query that returns a fleeting error r=cucaroach a=cucaroach Fixes: #95664 Release note: None Epic: CRDB-20535 95751: cdc: add transformations to create changefeed telemetry r=jayshrivastava a=jayshrivastava Previously, there were no telemetry events emitted to track the usage of changefeed transformations. This change adds the `transformation` field to the `CREATE CHANGEFEED` telemetry log event to track that. Example: ``` { "Timestamp": 1674574083953989000, "EventType": "create_changefeed", "Description": "CREATE CHANGEFEED INTO 'gcpubsub://testfeed?region=testfeedRegion' AS SELECT b FROM foo", "SinkType": "gcpubsub", "NumTables": 1, "Resolved": "no", "Transformation": true } ``` Epic: CRDB-17161 Closes: #95126 Release note: None 95784: dev: in `doctor`, include output of failed `git submodule` invocations r=healthy-pod a=rickystewart Epic: none Release note: None Co-authored-by: Josh Imhoff <[email protected]> Co-authored-by: Cameron Nunez <[email protected]> Co-authored-by: Tommy Reilly <[email protected]> Co-authored-by: Jayant Shrivastava <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
Merged #95436 & cockroachdb/pebble#2255. Just put up cockroachdb/pebble#2281. Once that one goes in, I think I'm done with this issue, modulo a quick e2e test plus any branch stuff we might want to do, e.g. backports. |
@joshimhoff - going to send this one your way, as it's in progress. |
Thanks. I'll close this out next next week (I am oncall next week). Got excited with the disagg testing but will be good to close this out. |
A client recently saw “disk stall” errors that led them to assume a hardware problem, and so they took disks offline, which led to further problems.
It appears that several situations can lead to such an error -- hitting the max # of files (1000), or other things. Is it correct that “disk stall” errors can result from distinct underlying situations? If so, is it possible to return different error messages, so users can better troubleshoot?
(If this should be in the Pebble repo, let me know.)
Jira issue: CRDB-8742
Epic CRDB-20293
The text was updated successfully, but these errors were encountered: