Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: don't double-count RAID volumes in disk metrics #104640

Merged
merged 1 commit into from
Jun 9, 2023

Conversation

itsbilal
Copy link
Contributor

@itsbilal itsbilal commented Jun 9, 2023

Previously, we wouldn't exclude volumes from disk counters that are likely to be double-counted such as RAID logical volumes that are composed of physical volumes that are also independently present in disk metrics. This change adds a regex-based filter, overridable with env vars, that excludes common double-counted volume patterns.

Fixes #97867.

Epic: none

Release note (bug fix): Avoids double-counting disk read/write bytes in disk metrics if Cockroach observes volumes that are likely to be duplicated in reported disk counters, such as RAID logical vs physical volumes.

@itsbilal itsbilal requested review from a team June 9, 2023 01:33
@itsbilal itsbilal requested a review from a team as a code owner June 9, 2023 01:33
@itsbilal itsbilal self-assigned this Jun 9, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@itsbilal itsbilal force-pushed the disk-filter-regex branch from 5f8fb4b to ffc5c5c Compare June 9, 2023 01:35
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)


pkg/server/status/runtime.go line 261 at r1 (raw file):

// for instance, RAID volumes under both the logical volume and under the
// physical volume(s).
var diskMetricsFilter = envutil.EnvOrDefaultString("COCKROACH_DISK_METRICS_IGNORED_DEVICES", getDefaultDiskMetricsFilter())

[nit] "Filter" implies that the regex has to pass for something to count. Consider just flipping its meaning and renaming the env var to COCKROACH_DISK_METRICS_DEVICE_FILTER


pkg/server/status/runtime.go line 726 at r1 (raw file):

	if diskMetricsFilter != "" {
		var err error
		filter, err = regexp.Compile(diskMetricsFilter)

If an invalid regex is specified in the env var, is erroring here too late? I assume this error won't show up in a very visible way

@itsbilal itsbilal force-pushed the disk-filter-regex branch from ffc5c5c to 37a499e Compare June 9, 2023 14:20
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)


pkg/server/status/runtime_linux.go line 17 at r2 (raw file):

func getDefaultDiskMetricsFilter() string {
	return "^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p)\\d+$"

[nit] It would be useful to add a comment explaining the choices here. e.g. what is nvmeXnYpZ and why is nvmeX not ok?


pkg/server/status/runtime_linux_test.go line 37 at r2 (raw file):

		{
			Name:           "sda1",
			ReadBytes:      1,

Use 10s here and 100s below so it's clear from the output which one was excluded

@RaduBerinde
Copy link
Member

pkg/server/status/runtime_linux.go line 17 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] It would be useful to add a comment explaining the choices here. e.g. what is nvmeXnYpZ and why is nvmeX not ok?

(I get it now, nvme1n1p1 is a specific partition (like sda1) whereas nvme1n1 is the entire device, like sda. Still useful to shed some light on this in a comment.)

@itsbilal itsbilal force-pushed the disk-filter-regex branch from 37a499e to dd10d17 Compare June 9, 2023 14:44
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)


pkg/server/status/runtime.go line 261 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] "Filter" implies that the regex has to pass for something to count. Consider just flipping its meaning and renaming the env var to COCKROACH_DISK_METRICS_DEVICE_FILTER

Good point. Rather I renamed the variables to *ignoredDevices* since I'd rather keep the 1:1 match with Prometheus instead of invert the meaning of the filter.


pkg/server/status/runtime.go line 726 at r1 (raw file):

Previously, RaduBerinde wrote…

If an invalid regex is specified in the env var, is erroring here too late? I assume this error won't show up in a very visible way

It's going to zero out the disk-based metrics and print a Warning in the logs, so I think it's going to be pretty visible:

log.Ops.Warningf(ctx, "problem fetching disk stats: %s; disk stats will be empty.", err)

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal)


pkg/server/status/runtime.go line 261 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Good point. Rather I renamed the variables to *ignoredDevices* since I'd rather keep the 1:1 match with Prometheus instead of invert the meaning of the filter.

If the default regex is taken from Prometheus, it would be good to mention that in a comment.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @RaduBerinde)


pkg/server/status/runtime_linux.go line 17 at r2 (raw file):

Previously, RaduBerinde wrote…

(I get it now, nvme1n1p1 is a specific partition (like sda1) whereas nvme1n1 is the entire device, like sda. Still useful to shed some light on this in a comment.)

+1 to a long detailed comment, so the existing understanding is fully captured here.

Previously, we wouldn't exclude volumes from disk counters
that are likely to be double-counted such as RAID logical
volumes that are composed of physical volumes that are also
independently present in disk metrics. This change adds a
regex-based filter, overridable with env vars, that excludes
common double-counted volume patterns.

Fixes cockroachdb#97867.

Epic: none

Release note (bug fix): Avoids double-counting disk read/write bytes
in disk metrics if Cockroach observes volumes that are likely to
be duplicated in reported disk counters, such as RAID logical vs
physical volumes.
@itsbilal itsbilal force-pushed the disk-filter-regex branch from dd10d17 to 479effa Compare June 9, 2023 15:20
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal, @RaduBerinde, and @sumeerbhola)


pkg/server/status/runtime.go line 261 at r1 (raw file):

Previously, RaduBerinde wrote…

If the default regex is taken from Prometheus, it would be good to mention that in a comment.

Done.


pkg/server/status/runtime_linux.go line 17 at r2 (raw file):

Previously, sumeerbhola wrote…

+1 to a long detailed comment, so the existing understanding is fully captured here.

Incorporated some of this into a comment. Not a long one because the main logic (based on this and the conversation in Slack) is "just be consistent with Prometheus".

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 6 files at r1, 2 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal)

@itsbilal
Copy link
Contributor Author

itsbilal commented Jun 9, 2023

TFTR!

bors r=RaduBerinde

@craig
Copy link
Contributor

craig bot commented Jun 9, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jun 9, 2023

This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried.

Additional information:

{"message":"Changes must be made through a pull request.","documentation_url":"https://docs.github.com/articles/about-protected-branches"}

@RaduBerinde
Copy link
Member

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 9, 2023

Build succeeded:

@craig craig bot merged commit e342c56 into cockroachdb:master Jun 9, 2023
@itsbilal
Copy link
Contributor Author

Should we backport this to 23.1?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

server: raided disks can cause erroneous disk metrics
4 participants