Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vfs: capture operation type affected by disk slowness #2255

Merged

Conversation

joshimhoff
Copy link
Contributor

@joshimhoff joshimhoff commented Jan 19, 2023

Part of cockroachdb/cockroach#67856.

This is #1672 from @nicktrav, rebased on top of #1677 and extended so that the op type of filesystem metadata operations is also captured. I also addressed @sumeerbhola's last comments over at #1672.

With this one merged, I still have more thing to do. I will extend DiskSlowInfo to track write size in case a sized write (e.g. a call to Write).

As an aside, some discussion was had at https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1674138778490679 about moving from a single long-lived monitor goroutine per file to a shorter-lived monitor goroutine per file op. Would mean no shared mutable state and no packed int64. We prefer the existing long-lived monitor goroutine per file, as the perf impacts of the suggested change may be problematic (it may also be hard to observe these bad impacts via pebble workload, etc.).

vfs: capture operation type affected by disk slowness

Currently, if a Pebble DB is backed by vfs.FS that is wrapped with a vfs.diskHealthCheckingFS, the DB can be made aware of operations that are taking longer than some threshold. The current implementation does not make any distinction between the operation type (write, sync, etc.) that was observed as slow.

Capture the type of operation being performed when emitting a disk slowness event.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@joshimhoff joshimhoff requested review from sumeerbhola and a team January 19, 2023 20:25
@joshimhoff
Copy link
Contributor Author

Since this change was mostly already reviewed by @sumeerbhola over at #1672, I have assigned Sumeer directly. Storage folks, let me know if this is still a better time to just assign storage team as per the recent discussion.

@joshimhoff joshimhoff force-pushed the improve_write_stall_info_safe branch 2 times, most recently from 5164e72 to 5ace9bf Compare January 19, 2023 20:37
@sumeerbhola sumeerbhola requested a review from nicktrav January 20, 2023 16:10
Copy link
Contributor

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically reviewing my own code ... it's been a while! :lgtm:

Reviewed 4 of 6 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @sumeerbhola)

@joshimhoff joshimhoff force-pushed the improve_write_stall_info_safe branch from 5ace9bf to 1de7da9 Compare January 24, 2023 17:54
@joshimhoff
Copy link
Contributor Author

TTFR!

Currently, if a Pebble DB is backed by `vfs.FS` that is wrapped with a
`vfs.diskHealthCheckingFS`, the DB can be made aware of operations that
are taking longer than some threshold. The current implementation does not make
any distinction between the operation type (write, sync, etc.) that was
observed as slow.

Capture the type of operation being performed when emitting a disk
slowness event.
@joshimhoff joshimhoff force-pushed the improve_write_stall_info_safe branch from 1de7da9 to d9b570e Compare January 24, 2023 18:19
Copy link
Contributor

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @sumeerbhola)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants