storage: improve disk stall error message #67856

Closed
shermanCRL opened this issue Jul 21, 2021 · 11 comments · Fixed by cockroachdb/pebble#2281
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. E-quick-win Likely to be a quick win for someone experienced. E-starter Might be suitable for a starter project for new employees or team members. sync-me-8 T-storage Storage Team

Comments

@shermanCRL
Contributor

shermanCRL commented Jul 21, 2021

A client recently saw “disk stall” errors that led them to assume a hardware problem, and so they took disks offline, which led to further problems.

It appears that several situations can lead to such an error -- hitting the max number of files (1000), among other things. Is it correct that “disk stall” errors can result from distinct underlying situations? If so, is it possible to return different error messages so that users can better troubleshoot?

(If this should be in the Pebble repo, let me know.)

Jira issue: CRDB-8742

Epic CRDB-20293

@shermanCRL shermanCRL added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Jul 21, 2021
@mwang1026

Whoever picks this up, let's do a kickoff where we vet the right UX around which messages to display upon detecting which disk issues.

@nicktrav
Collaborator

nicktrav commented Jun 8, 2022

I think cockroachlabs/support#1630 also highlighted the need for some disambiguation. In that case, what we needed was some way of telling the difference between "slow", "a huge write going as fast as it can", and "stalled". We incorrectly assumed the latter, and therefore a hardware issue, which lessened the efficacy of our initial response.

cc: @jbowens @bananabrick

@jbowens
Collaborator

jbowens commented Jun 9, 2022

For cockroachlabs/support#1630, I think it would've been helpful for the logged message to include the length of the write buffer that failed to write within the configured duration.
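
A rough sketch of the kind of log line this is asking for, with the stalled write's size included; the writeBufLen value is hypothetical here, since at this point DiskSlowInfo appears to carry only the path and the duration:

```go
// Sketch only: report how much data was waiting to be written when the
// stall was detected. writeBufLen is a hypothetical value, not an existing
// DiskSlowInfo field.
log.Errorf(ctx,
    "disk stall detected: pebble unable to write %d bytes to %s in %.2f seconds",
    writeBufLen, info.Path, redact.Safe(info.Duration.Seconds()))
```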

@nicktrav nicktrav added E-starter Might be suitable for a starter project for new employees or team members. E-quick-win Likely to be a quick win for someone experienced. labels Jun 10, 2022
@joshimhoff
Collaborator

@jbowens, I have a few Qs for you as TL! I'm taking this over as a starter issue. Looks like cockroachdb/pebble#1672 from @nicktrav adds the op type but not the length of the write buffer. Unless someone tells me not to, I will add the length of the write buffer too, in case of a "sized write", as this will make the starter issue a bit more difficult, which will help me learn storage stuff.

In my starter doc, I see the following comment:

We’d ideally have the same context provided up in Cockroach too.

What is the concrete goal of providing this context to the CRDB layer? Is it that the fatal log message here (`log.Fatalf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds", ...)`) includes the same extra context that cockroachdb/pebble#1672 adds to the pebble log?

Also, should I backport these changes into some older CRDB versions?

@jbowens
Collaborator

jbowens commented Jan 17, 2023

I will add the length of the write buffer too, in case of a "sized write"

Sounds good

In my starter doc, I see the following comment:

Mind linking me? I'm not sure off-hand what that comment is referring to.

I think this original issue was created to update the language for the disk stall error to be a little less accusatory towards the hard disks. Maybe something like a "filesystem write stall," plus a comment that it may be a hard disk stall? I think making those changes, plus including the write buffer size should cover it.

```go
DiskSlow: func(info pebble.DiskSlowInfo) {
    maxSyncDuration := maxSyncDurationDefault
    fatalOnExceeded := maxSyncDurationFatalOnExceededDefault
    if p.settings != nil {
        maxSyncDuration = MaxSyncDuration.Get(&p.settings.SV)
        fatalOnExceeded = MaxSyncDurationFatalOnExceeded.Get(&p.settings.SV)
    }
    if info.Duration.Seconds() >= maxSyncDuration.Seconds() {
        atomic.AddInt64(&p.diskStallCount, 1)
        // Note that the below log messages go to the main cockroach log, not
        // the pebble-specific log.
        if fatalOnExceeded {
            log.Fatalf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds",
                info.Path, redact.Safe(info.Duration.Seconds()))
        } else {
            log.Errorf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds",
                info.Path, redact.Safe(info.Duration.Seconds()))
        }
        return
    }
    atomic.AddInt64(&p.diskSlowCount, 1)
},
```
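
A hedged sketch of what that suggestion could look like in the callback above; the "write stall" style wording and the info.WriteSize field are assumptions for illustration, not the final change:

```go
// Sketch only: less disk-accusatory wording plus the size of the stalled
// write. info.WriteSize is assumed here; the field is not present in the
// snippet above.
log.Fatalf(ctx,
    "file write stall detected: unable to write %d bytes to %s in %.2f seconds; "+
        "this may indicate a stalled disk",
    info.WriteSize, info.Path, redact.Safe(info.Duration.Seconds()))
```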

@joshimhoff
Collaborator

Thanks, @jbowens! All makes sense.

Mind linking me?

Here ya go: https://docs.google.com/document/d/1zgpqbSd1RbCSaDh36Sbd8UgG4XgS8rP41GGhBrFXyRs/edit

Specifically, the below part, which I think is suggesting that the context be included in the fatal message in the CRDB logs in addition to the pebble logs (as is already partially handled by cockroachdb/pebble#1672):

We’d ideally have the same context provided up in Cockroach too.

@joshimhoff
Collaborator

joshimhoff commented Jan 17, 2023

I think I now get what the bit from the starter doc was getting at: cockroachdb/pebble#1672 adds more context to DiskSlowInfo but doesn't actually plumb that context into a log file, IIUC.

I think I will probably adjust the below CRDB code to call DiskSlowInfo's SafeFormat method; that way, whatever context is available in the pebble layer will be in the CRDB logs:

```go
DiskSlow: func(info pebble.DiskSlowInfo) {
    maxSyncDuration := maxSyncDurationDefault
    fatalOnExceeded := maxSyncDurationFatalOnExceededDefault
    if p.settings != nil {
        maxSyncDuration = MaxSyncDuration.Get(&p.settings.SV)
        fatalOnExceeded = MaxSyncDurationFatalOnExceeded.Get(&p.settings.SV)
    }
    if info.Duration.Seconds() >= maxSyncDuration.Seconds() {
        atomic.AddInt64(&p.diskStallCount, 1)
        // Note that the below log messages go to the main cockroach log, not
        // the pebble-specific log.
        if fatalOnExceeded {
            log.Fatalf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds",
                info.Path, redact.Safe(info.Duration.Seconds()))
        } else {
            log.Errorf(ctx, "disk stall detected: pebble unable to write to %s in %.2f seconds",
                info.Path, redact.Safe(info.Duration.Seconds()))
        }
        return
    }
    atomic.AddInt64(&p.diskSlowCount, 1)
},
```
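
A minimal sketch of that adjustment, assuming pebble.DiskSlowInfo's String/SafeFormat output can be interpolated directly; this matches the direction later merged in #95436 but is not the exact final code:

```go
if fatalOnExceeded {
    // Rely on pebble to describe the event; any extra context pebble adds to
    // DiskSlowInfo (op type, write size, ...) then shows up in the CRDB log
    // without a further CRDB change.
    log.Fatalf(ctx, "disk stall detected: %s", info)
} else {
    log.Errorf(ctx, "disk stall detected: %s", info)
}
```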

craig bot pushed a commit that referenced this issue Jan 24, 2023
95436: storage: rely on pebble for log message in case of write stall r=joshimhoff a=joshimhoff

Part of #67856. Haven't added additional context re: op type & size of write to `pebble.DiskSlowInfo` but will do that shortly. With this change, if we make changes to how much context is included in `pebble.DiskSlowInfo`, we will not need to make a CRDB change to have the context in the CRDB logs. Note also that I will adjust the pebble String method to be less accusatory to disks ("disk slow" -> "write slow").

See below for log message with this commit in.

```
$ ./dev test pkg/storage -v --filter='TestPebbleMetricEventListener' --show-logs
$ bazel test pkg/storage:all --test_env=GOTRACEBACK=all --test_filter=TestPebbleMetricEventListener --test_arg -test.v --test_arg -show-logs --test_sharding_strategy=disabled --test_output all
Starting local Bazel server and connecting to it...
INFO: Invocation ID: a4365f4b-9190-4ef3-8a54-de3b4c0c46b7
INFO: Analyzed 3 targets (626 packages loaded, 16556 targets configured).
INFO: Found 2 targets and 1 test target...
INFO: From Testing //pkg/storage:storage_test:
==================== Test output for //pkg/storage:storage_test:
initialized metamorphic constant "span-reuse-rate" with value 58
initialized metamorphic constant "mvcc-value-disable-simple-encoding" with value true
initialized metamorphic constant "mvcc-max-iters-before-seek" with value 2
initialized metamorphic constant "span-reuse-rate" with value 39
initialized metamorphic constant "mvcc-value-disable-simple-encoding" with value true
initialized metamorphic constant "mvcc-max-iters-before-seek" with value 0
=== RUN   TestPebbleMetricEventListener
    test_log_scope.go:76: test logs captured to: /private/var/tmp/_bazel_joshimhoff/69f147d3e5e8d9a9bc7bc343361a1a94/sandbox/darwin-sandbox/6/execroot/com_github_cockroachdb_cockroach/_tmp/88d640957a51906868fc787a6e8a38b4/logTestPebbleMetricEventListener440509452
E230118 21:01:15.941881 28 storage/pebble.go:1001  [-] 1  write stall detected: disk slowness detected: write to file  has been ongoing for 70.0s
    pebble_test.go:356: -- test log scope end --
--- PASS: TestPebbleMetricEventListener (0.05s)
PASS
================================================================================
INFO: Elapsed time: 29.194s, Critical Path: 8.66s
INFO: 7 processes: 1 internal, 6 darwin-sandbox.
INFO: Build completed successfully, 7 total actions
//pkg/storage:storage_test                                               PASSED in 0.6s

Executed 1 out of 1 test: 1 test passes.
INFO: Build completed successfully, 7 total actions
```

**storage: rely on pebble for log message in case of write stall**

In case of a write stall, CRDB logs out details re: the write stall. By
default, CRDB logs at fatal level & then crashes. Before this commit, though the
pebble event had a String method, it was not used in the CRDB log message.
This means adding additional context to the pebble event will not add the
context to the CRDB logs, unless a CRDB change is made in addition to a pebble
change. With this commit, the log message calls the pebble event's String
method.

Release note (ops change): CRDB log message in case of write stall has been
adjusted slightly.

95697: cloud/kubernetes: update to v22.2.3 r=celiala a=cameronnunez

Part of https://cockroachlabs.atlassian.net/browse/REL-248

Release note: None

95698: roachtest: update version map for 22.2.3 r=rail,renatolabs a=cameronnunez

Part of: https://cockroachlabs.atlassian.net/browse/REL-248

Release note: None

95714: logictest: fix retry on query that returns a fleeting error r=cucaroach a=cucaroach

Fixes: #95664
Release note: None
Epic: CRDB-20535

95751: cdc: add transformations to create changefeed telemetry r=jayshrivastava a=jayshrivastava

Previously, there were no telemetry events emitted to track the usage of changefeed transformations. This change adds the `transformation` field to the `CREATE CHANGEFEED` telemetry log event to track that. Example:
```
{
  "Timestamp": 1674574083953989000, "EventType": "create_changefeed",
  "Description": "CREATE CHANGEFEED INTO 'gcpubsub://testfeed?region=testfeedRegion' AS SELECT b FROM foo",
  "SinkType": "gcpubsub", "NumTables": 1, "Resolved": "no", "Transformation": true
}
```

Epic: CRDB-17161
Closes: #95126

Release note: None

95784: dev: in `doctor`, include output of failed `git submodule` invocations r=healthy-pod a=rickystewart

Epic: none
Release note: None

Co-authored-by: Josh Imhoff <[email protected]>
Co-authored-by: Cameron Nunez <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@joshimhoff
Collaborator

Merged #95436 & cockroachdb/pebble#2255. Just put up cockroachdb/pebble#2281. Once that one goes in, I think I'm done with this issue, modulo a quick e2e test plus any branch stuff we might want to do, e.g. backports.

@nicktrav
Collaborator

@joshimhoff - going to send this one your way, as it's in progress.

@joshimhoff
Collaborator

joshimhoff commented Feb 10, 2023

Thanks. I'll close this out the week after next (I am oncall next week). I got excited about the disagg testing, but it will be good to close this out.

@joshimhoff
Collaborator

cockroachdb/pebble#2281
