vfs: capture operation type affected by disk slowness #2255

joshimhoff · 2023-01-19T20:17:00Z

This is #1672 from @nicktrav, rebased on top of #1677 and extended so that the op type of filesystem metadata operations is also captured. I also addressed @sumeerbhola's last comments over at #1672.

Once this one is merged, I still have one more thing to do: extend `DiskSlowInfo` to track the write size for sized writes (e.g. a call to `Write`).

As an aside, there was some discussion at https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1674138778490679 about moving from a single long-lived monitor goroutine per file to a shorter-lived monitor goroutine per file op. That would mean no shared mutable state and no packed int64. We prefer to keep the existing long-lived monitor goroutine per file, as the performance impact of the suggested change may be problematic (and that impact may also be hard to observe via pebble workload, etc.).

vfs: capture operation type affected by disk slowness

Currently, if a Pebble DB is backed by a vfs.FS that is wrapped with a vfs.diskHealthCheckingFS, the DB can be made aware of operations that are taking longer than some threshold. The current implementation does not distinguish between the types of operation (write, sync, etc.) that were observed as slow.

Capture the type of operation being performed when emitting a disk slowness event.


joshimhoff · 2023-01-19T20:26:57Z

Since this change was mostly already reviewed by @sumeerbhola over at #1672, I have assigned Sumeer directly. Storage folks, let me know if it would be better to just assign the storage team, as per the recent discussion.

nicktrav

Basically reviewing my own code ... it's been a while!

Reviewed 4 of 6 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @sumeerbhola)

joshimhoff · 2023-01-24T17:55:00Z

TTFR!

Currently, if a Pebble DB is backed by `vfs.FS` that is wrapped with a `vfs.diskHealthCheckingFS`, the DB can be made aware of operations that are taking longer than some threshold. The current implementation does not make any distinction between the operation type (write, sync, etc.) that was observed as slow. Capture the type of operation being performed when emitting a disk slowness event.

nicktrav

Reviewed 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @sumeerbhola)

blathers-crl bot added the T-storage label Jan 19, 2023

joshimhoff requested review from sumeerbhola and a team January 19, 2023 20:25

joshimhoff force-pushed the improve_write_stall_info_safe branch 2 times, most recently from 5164e72 to 5ace9bf January 19, 2023 20:37

joshimhoff mentioned this pull request Jan 19, 2023

vfs: capture operation type affected by disk slowness #1672

Closed

sumeerbhola requested a review from nicktrav January 20, 2023 16:10

nicktrav approved these changes Jan 24, 2023

View reviewed changes

joshimhoff force-pushed the improve_write_stall_info_safe branch from 5ace9bf to 1de7da9 January 24, 2023 17:54

joshimhoff force-pushed the improve_write_stall_info_safe branch from 1de7da9 to d9b570e January 24, 2023 18:19

nicktrav reviewed Jan 24, 2023

View reviewed changes

joshimhoff mentioned this pull request Jan 25, 2023

TestDiskHealthChecking_Filesystem/remove might be flaky #2272

Closed

joshimhoff merged commit 169e5db into cockroachdb:master Jan 25, 2023

This was referenced Jan 25, 2023

vfs: flaky test TestDiskHealthChecking_Filesystem #1718

Closed

vfs: include size of write in DiskSlowInfo #2281

Merged

storage: improve disk stall error message cockroachdb/cockroach#67856

Closed