kvserver: replica load distribution metric #98267

Merged
merged 4 commits into cockroachdb:master from 230308.hottest-k-replicas on Apr 3, 2023

Conversation

kvoli
Collaborator

@kvoli kvoli commented Mar 9, 2023

This PR adds two histograms that track the percentiles of replica CPU time and of batch requests received per replica. Previously, the only insight into either distribution was point-in-time, via hotranges. This change enables historical timeseries tracking via the metrics `rebalancing.replicas.cpunanospersecond` for CPU and `rebalancing.replicas.queriespersecond` for batch requests.

Informs: #98255
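
A minimal sketch of what tracking percentiles of per-replica load means, assuming exact nearest-rank percentiles over a slice of samples. The real metric is a bucketed histogram, so this is illustrative only; `percentile` and the sample values are hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of vals.
// It returns 0 for an empty slice and does not modify vals.
func percentile(vals []float64, p float64) float64 {
	if len(vals) == 0 {
		return 0
	}
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical per-replica CPU usage on one store, in nanoseconds per second.
	replicaCPU := []float64{1e6, 2e6, 2e6, 3e6, 5e6, 8e6, 9e6, 4e7, 9e7, 5e8}
	for _, p := range []float64{50, 90, 99} {
		fmt.Printf("p%.0f = %.0f ns/s\n", p, percentile(replicaCPU, p))
	}
}
```

In this made-up sample most replicas are nearly idle while one is very hot, which is the kind of skew a point-in-time view can miss once the moment has passed.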

@blathers-crl

blathers-crl bot commented Mar 9, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@kvoli kvoli self-assigned this Mar 9, 2023
@kvoli kvoli force-pushed the 230308.hottest-k-replicas branch from 3515d1d to 34bba9a Compare March 13, 2023 14:07
@kvoli kvoli changed the title kvserver: [wip] add replica cpu histogram metric kvserver: replica load distribution metric Mar 13, 2023
@kvoli kvoli force-pushed the 230308.hottest-k-replicas branch from 34bba9a to 316a6e0 Compare March 13, 2023 14:20
@kvoli kvoli marked this pull request as ready for review March 13, 2023 14:36
@kvoli kvoli requested a review from a team March 13, 2023 14:36
@kvoli kvoli requested a review from a team as a code owner March 13, 2023 14:36
@kvoli
Collaborator Author

kvoli commented Mar 13, 2023

Going to hold off requesting reviews until #98266 is settled.

@kvoli
Collaborator Author

kvoli commented Mar 13, 2023

@aadityasondhi mentioned that they expect #98266 to be resolved before GA. I'm going to go ahead and request reviews, noting that the timeseries currently has periodic gaps.

@kvoli kvoli force-pushed the 230308.hottest-k-replicas branch from 316a6e0 to 21387e5 Compare March 13, 2023 22:43
@kvoli kvoli marked this pull request as draft March 14, 2023 18:41
@kvoli
Collaborator Author

kvoli commented Mar 14, 2023

Reverted to draft. I'm going to rework the approach here using a ManualHistogram after some changes are made to ManualHistogram.

From @aadityasondhi

For histograms such as the one above, I would suggest using ManualWindowHistogram. I do recognize that the current interface is cumbersome and I will rework it to make it more easily usable. When you create this histogram, you will be required to provide the promised update interval. The poller will report the last known value of these histograms as long as T-t < update interval. To avoid race conditions such as the one in this issue, we will allow for 1 missed update but start reporting 0 if the poller misses twice in a row.

@kvoli kvoli force-pushed the 230308.hottest-k-replicas branch 7 times, most recently from 73d7ca1 to 88ddb0b Compare March 24, 2023 13:58
@kvoli kvoli marked this pull request as ready for review March 24, 2023 15:00
@kvoli
Collaborator Author

kvoli commented Mar 24, 2023

I took a pretty hacky path to use the ManualHistogram, with the assumption that we will update the iface in 23.2 - lmk what you think @aadityasondhi.

Collaborator

@aadityasondhi aadityasondhi left a comment

Only reviewed the first commit (related to the refactoring of the manual histogram), but overall it looks great! Some of the ideas here are what I had in mind for the refactor in #98622.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)

@kvoli
Collaborator Author

kvoli commented Mar 29, 2023

@andrewbaptist, could you review the last 3 commits? They are adding two metrics using the histogram modified in c1.

Collaborator

@andrewbaptist andrewbaptist left a comment

:lgtm: Thanks for getting this in, this will be useful to see as we continue to make improvements to the allocator.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)


pkg/kv/kvserver/metrics.go line 363 at r1 (raw file):

		Unit:        metric.Unit_NANOSECONDS,
	}
	metaAverageReplicaCPUNanosPerSecond = metric.Metadata{

nit: The name is confusing. In general average for a histogram is confusing since there are two different ways it could be interpreted: 1) average across replicas, 2) average across time. Maybe something like metaRecentReplicaCPUNanosPerSecond. I think part of the problem is that it is just a complex topic because of the two dimensions. Also, can you add a comment to clarify if this applies to all replicas including quiesced ones. I think it does.
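
To illustrate the two readings of "average" mentioned above, here is a small sketch over hypothetical samples indexed by replica and then by time. The function names and data are made up for the example and do not come from the PR.

```go
package main

import "fmt"

// meanAcrossTime averages one replica's samples over all ticks.
func meanAcrossTime(samples [][]float64, replica int) float64 {
	sum := 0.0
	for _, v := range samples[replica] {
		sum += v
	}
	return sum / float64(len(samples[replica]))
}

// meanAcrossReplicas averages all replicas' samples at a single tick.
func meanAcrossReplicas(samples [][]float64, tick int) float64 {
	sum := 0.0
	for _, perReplica := range samples {
		sum += perReplica[tick]
	}
	return sum / float64(len(samples))
}

func main() {
	// samples[replica][tick]: replica 0 is steady, replica 1 spikes at the last tick.
	samples := [][]float64{
		{10, 10, 10},
		{0, 0, 90},
	}
	fmt.Println("replica 0 over time:", meanAcrossTime(samples, 0))       // 10
	fmt.Println("replica 1 over time:", meanAcrossTime(samples, 1))       // 30
	fmt.Println("all replicas, tick 0:", meanAcrossReplicas(samples, 0)) // 5
	fmt.Println("all replicas, tick 2:", meanAcrossReplicas(samples, 2)) // 50
}
```

The same data gives different answers depending on which dimension is collapsed, which is why a name containing only "average" is ambiguous for a histogram metric.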


pkg/kv/kvserver/metrics.go line 2376 at r1 (raw file):

		AverageWriteBytesPerSecond: metric.NewGaugeFloat64(metaAverageWriteBytesPerSecond),
		AverageReadBytesPerSecond:  metric.NewGaugeFloat64(metaAverageReadBytesPerSecond),
		AverageCPUNanosPerSecond:   metric.NewGaugeFloat64(metaAverageCPUNanosPerSecond),

nit: Is this stat redundant with the AverageReplicaCPUNanosPerSecond now? If so can we remove it? If it is still necessary since ManualHistograms can't be used for computing this average just add a short comment why not.

@kvoli kvoli force-pushed the 230308.hottest-k-replicas branch from 88ddb0b to b09af2e Compare April 3, 2023 13:25
@kvoli kvoli requested a review from andrewbaptist April 3, 2023 13:25
Collaborator Author

@kvoli kvoli left a comment

TYFTR

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)


pkg/kv/kvserver/metrics.go line 363 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: The name is confusing. In general average for a histogram is confusing since there are two different ways it could be interpreted: 1) average across replicas, 2) average across time. Maybe something like metaRecentReplicaCPUNanosPerSecond. I think part of the problem is that it is just a complex topic because of the two dimensions. Also, can you add a comment to clarify if this applies to all replicas including quiesced ones. I think it does.

Added comment to clarify it is for all replicas and renamed to recent....


pkg/kv/kvserver/metrics.go line 2376 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: Is this stat redundant with the AverageReplicaCPUNanosPerSecond now? If so can we remove it? If it is still necessary since ManualHistograms can't be used for computing this average just add a short comment why not.

It is; however, I'd rather not break dependencies (lots of tests, probably users, and definitely the UI) in this PR, since I intend to backport to 23.1.

@kvoli kvoli force-pushed the 230308.hottest-k-replicas branch 2 times, most recently from 60ea36d to 4108223 Compare April 3, 2023 15:14
kvoli added 4 commits April 3, 2023 15:52
This commit extends the ManualWindowHistogram to support RecordValue and
Rotate. Previously, it was necessary to maintain duplicate cumulative
histograms in order to batch update the manual histogram. This update
adds a quality of life feature, enabling recording to the
ManualWindowHistogram, then once finished, rotating the batch of
recorded values into the current window for the internal tsdb to query.

Touches: cockroachdb#98266

Release note: None
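
A rough sketch of the record-then-rotate pattern this commit message describes, using a hypothetical `windowHistogram` type with fixed bucket bounds. It is not the actual ManualWindowHistogram code; every name other than `RecordValue` and `Rotate` is an assumption, and the real signatures may differ.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// windowHistogram is a hypothetical bucketed histogram with a pending batch
// and a current window that readers query.
type windowHistogram struct {
	mu      sync.Mutex
	bounds  []float64 // ascending upper bounds; one extra overflow bucket follows
	pending []uint64  // counts recorded since the last Rotate
	current []uint64  // counts visible to readers
}

func newWindowHistogram(bounds []float64) *windowHistogram {
	return &windowHistogram{
		bounds:  bounds,
		pending: make([]uint64, len(bounds)+1),
		current: make([]uint64, len(bounds)+1),
	}
}

// RecordValue adds v to the pending batch; readers do not see it yet.
func (h *windowHistogram) RecordValue(v float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	// SearchFloat64s returns the first index whose bound is >= v,
	// or len(bounds) for values above every bound (the overflow bucket).
	h.pending[sort.SearchFloat64s(h.bounds, v)]++
}

// Rotate publishes the pending batch as the current window and starts a new batch.
func (h *windowHistogram) Rotate() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.current = h.pending
	h.pending = make([]uint64, len(h.bounds)+1)
}

// Snapshot returns a copy of the current window's bucket counts.
func (h *windowHistogram) Snapshot() []uint64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	return append([]uint64(nil), h.current...)
}

func main() {
	h := newWindowHistogram([]float64{1e6, 1e7, 1e8})
	for _, cpu := range []float64{5e5, 2e6, 3e6, 5e8} {
		h.RecordValue(cpu)
	}
	fmt.Println("before Rotate:", h.Snapshot()) // [0 0 0 0]
	h.Rotate()
	fmt.Println("after Rotate: ", h.Snapshot()) // [1 2 0 1]
}
```

Swapping the whole batch in at once means a reader never observes a half-filled window, which is one way to avoid keeping a separate cumulative histogram just to batch updates.
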

This commit introduces a histogram tracking percentiles of replica CPU
time. Previously, there was only point in time insight into replica CPU
distribution via hotranges. This change enables historical timeseries
tracking via the metric `rebalancing.replicas.cpunanospersecond`.

Part of: cockroachdb#98255

Release note (ops change): The `rebalancing.replicas.cpunanospersecond`
histogram metric is added, which provides insight into the distribution
of replica CPU usage within a store.

The `rebalancing.queriespersecond` metric incorrectly used `Keys/Sec` as
a measurement. This commit updates the measurement to be `Queries/Sec`,
as implied by the name.

Release note: None

This commit introduces a histogram tracking percentiles of replica QPS.
Previously, there was only point in time insight into replica QPS
distribution via hot ranges. This change enables historical timeseries
tracking via the metric `rebalancing.replicas.queriespersecond`.

Part of: cockroachdb#98255

Release note (ops change): The `rebalancing.replicas.queriespersecond`
histogram metric is added, which provides insight into the distribution
of queries per replica within a store.
@kvoli kvoli force-pushed the 230308.hottest-k-replicas branch from 4108223 to 7b6c751 Compare April 3, 2023 15:53
@kvoli kvoli added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Apr 3, 2023
@kvoli
Collaborator Author

kvoli commented Apr 3, 2023

bors r=andrewbaptist

@craig craig bot merged commit 6cd1f1b into cockroachdb:master Apr 3, 2023
@craig
Contributor

craig bot commented Apr 3, 2023

Build succeeded:

Labels
backport-23.1.x Flags PRs that need to be backported to 23.1

4 participants