-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: don't double-count RAID volumes in disk metrics #104640
Conversation
5f8fb4b
to
ffc5c5c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)
pkg/server/status/runtime.go
line 261 at r1 (raw file):
// for instance, RAID volumes under both the logical volume and under the // physical volume(s). var diskMetricsFilter = envutil.EnvOrDefaultString("COCKROACH_DISK_METRICS_IGNORED_DEVICES", getDefaultDiskMetricsFilter())
[nit] "Filter" implies that the regex has to pass for something to count. Consider just flipping its meaning and renaming the env var to COCKROACH_DISK_METRICS_DEVICE_FILTER
pkg/server/status/runtime.go
line 726 at r1 (raw file):
if diskMetricsFilter != "" { var err error filter, err = regexp.Compile(diskMetricsFilter)
If an invalid regex is specified in the env var, is erroring here too late? I assume this error won't show up in a very visible way
ffc5c5c
to
37a499e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)
pkg/server/status/runtime_linux.go
line 17 at r2 (raw file):
func getDefaultDiskMetricsFilter() string { return "^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p)\\d+$"
[nit] It would be useful to add a comment explaining the choices here. e.g. what is nvmeXnYpZ
and why is nvmeX
not ok?
pkg/server/status/runtime_linux_test.go
line 37 at r2 (raw file):
{ Name: "sda1", ReadBytes: 1,
Use 10s here and 100s below so it's clear from the output which one was excluded
Previously, RaduBerinde wrote…
(I get it now, |
37a499e
to
dd10d17
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)
pkg/server/status/runtime.go
line 261 at r1 (raw file):
Previously, RaduBerinde wrote…
[nit] "Filter" implies that the regex has to pass for something to count. Consider just flipping its meaning and renaming the env var to
COCKROACH_DISK_METRICS_DEVICE_FILTER
Good point. Rather I renamed the variables to *ignoredDevices*
since I'd rather keep the 1:1 match with Prometheus instead of invert the meaning of the filter.
pkg/server/status/runtime.go
line 726 at r1 (raw file):
Previously, RaduBerinde wrote…
If an invalid regex is specified in the env var, is erroring here too late? I assume this error won't show up in a very visible way
It's going to zero out the disk-based metrics and print a Warning in the logs, so I think it's going to be pretty visible:
log.Ops.Warningf(ctx, "problem fetching disk stats: %s; disk stats will be empty.", err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal)
pkg/server/status/runtime.go
line 261 at r1 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Good point. Rather I renamed the variables to
*ignoredDevices*
since I'd rather keep the 1:1 match with Prometheus instead of invert the meaning of the filter.
If the default regex is taken from Prometheus, it would be good to mention that in a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @RaduBerinde)
pkg/server/status/runtime_linux.go
line 17 at r2 (raw file):
Previously, RaduBerinde wrote…
(I get it now,
nvme1n1p1
is a specific partition (likesda1
) whereasnvme1n1
is the entire device, likesda
. Still useful to shed some light on this in a comment.)
+1 to a long detailed comment, so the existing understanding is fully captured here.
Previously, we wouldn't exclude volumes from disk counters that are likely to be double-counted such as RAID logical volumes that are composed of physical volumes that are also independently present in disk metrics. This change adds a regex-based filter, overridable with env vars, that excludes common double-counted volume patterns. Fixes cockroachdb#97867. Epic: none Release note (bug fix): Avoids double-counting disk read/write bytes in disk metrics if Cockroach observes volumes that are likely to be duplicated in reported disk counters, such as RAID logical vs physical volumes.
dd10d17
to
479effa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal, @RaduBerinde, and @sumeerbhola)
pkg/server/status/runtime.go
line 261 at r1 (raw file):
Previously, RaduBerinde wrote…
If the default regex is taken from Prometheus, it would be good to mention that in a comment.
Done.
pkg/server/status/runtime_linux.go
line 17 at r2 (raw file):
Previously, sumeerbhola wrote…
+1 to a long detailed comment, so the existing understanding is fully captured here.
Incorporated some of this into a comment. Not a long one because the main logic (based on this and the conversation in Slack) is "just be consistent with Prometheus".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 6 files at r1, 2 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal)
TFTR! bors r=RaduBerinde |
Build failed (retrying...): |
This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried. Additional information: {"message":"Changes must be made through a pull request.","documentation_url":"https://docs.github.com/articles/about-protected-branches"} |
bors r+ |
Build succeeded: |
Should we backport this to 23.1? |
Previously, we wouldn't exclude volumes from disk counters that are likely to be double-counted such as RAID logical volumes that are composed of physical volumes that are also independently present in disk metrics. This change adds a regex-based filter, overridable with env vars, that excludes common double-counted volume patterns.
Fixes #97867.
Epic: none
Release note (bug fix): Avoids double-counting disk read/write bytes in disk metrics if Cockroach observes volumes that are likely to be duplicated in reported disk counters, such as RAID logical vs physical volumes.