-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: replica load distribution metric #98267
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
3515d1d
to
34bba9a
Compare
34bba9a
to
316a6e0
Compare
Going to hold off requesting reviews until #98266 is settled. |
@aadityasondhi mentioned that they expect #98266 to be resolved before GA. I'm going to go ahead and req reviews - noting that currently the timseries has periodic gaps. |
316a6e0
to
21387e5
Compare
Reverted to draft. I'm going to rework the approach here using a ManualHistogram after some changes are made to ManualHistogram. From @aadityasondhi
|
73d7ca1
to
88ddb0b
Compare
I took a pretty hacky path to use the ManualHistogram, with the assumption that we will update the iface in 23.2 - lmk what you think @aadityasondhi. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only reviewed the first commit (related to the refactoring of the manual histogram), but overall it looks great! Some of the ideas here are what I had in mind for the refactor in #98622.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
@andrewbaptist, could you review the last 3 commits? They are adding two metrics using the histogram modified in c1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting this in, this will be useful to see as we continue to make improvements to the allocator.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/metrics.go
line 363 at r1 (raw file):
Unit: metric.Unit_NANOSECONDS, } metaAverageReplicaCPUNanosPerSecond = metric.Metadata{
nit: The name is confusing. In general average for a histogram is confusing since there are two different ways it could be interpreted: 1) average across replicas, 2) average across time. Maybe something like metaRecentReplicaCPUNanosPerSecond. I think part of the problem is that it is just a complex topic because of the two dimensions. Also, can you add a comment to clarify if this applies to all replicas including quiesced ones. I think it does.
pkg/kv/kvserver/metrics.go
line 2376 at r1 (raw file):
AverageWriteBytesPerSecond: metric.NewGaugeFloat64(metaAverageWriteBytesPerSecond), AverageReadBytesPerSecond: metric.NewGaugeFloat64(metaAverageReadBytesPerSecond), AverageCPUNanosPerSecond: metric.NewGaugeFloat64(metaAverageCPUNanosPerSecond),
nit: Is this stat redundant with the AverageReplicaCPUNanosPerSecond now? If so can we remove it? If it is still necessary since ManualHistograms can't be used for computing this average just add a short comment why not.
88ddb0b
to
b09af2e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TYFTR
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)
pkg/kv/kvserver/metrics.go
line 363 at r1 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
nit: The name is confusing. In general average for a histogram is confusing since there are two different ways it could be interpreted: 1) average across replicas, 2) average across time. Maybe something like metaRecentReplicaCPUNanosPerSecond. I think part of the problem is that it is just a complex topic because of the two dimensions. Also, can you add a comment to clarify if this applies to all replicas including quiesced ones. I think it does.
Added comment to clarify it is for all replicas and renamed to recent...
.
pkg/kv/kvserver/metrics.go
line 2376 at r1 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
nit: Is this stat redundant with the AverageReplicaCPUNanosPerSecond now? If so can we remove it? If it is still necessary since ManualHistograms can't be used for computing this average just add a short comment why not.
It is however I'd rather not break dependencies (lots of tests, probably users and definitely UI) in this PR, since I intend to backport to 23.1.
60ea36d
to
4108223
Compare
This commit extends the ManualWindowHistogram to support RecordValue and Rotate. Previously, it was necessary to maintain duplicate cumulative histograms in order to batch update the manual histogram. This update adds a quality of life feature, enabling recording to the ManualWindowHistogram, then once finished, rotating the batch of recorded values into the current window for the internal tsdb to query. Touches: cockroachdb#98266 Release note: None
This commit introduces a histogram tracking percentiles of replica CPU time. Previously, there was only point in time insight into replica CPU distribution via hotranges. This change enables historical timeseries tracking via the metric `rebalancing.replicas.cpunanospersecond`. Part of: cockroachdb#98255 Release note (ops change): The `rebalancing.replicas.cpunanospersecond` histogram metric is added, which provides insight into the distribution of replica CPU usage within a store.
The `rebalancing.queriespersecond` metric incorrectly used `Keys/Sec` as a measurement. This commit updates the measurement to be `Queries/Sec`, as implied by the name. Release note: None
This commit introduces a histogram tracking percentiles of replica QPS time. Previously, there was only point in time insight into replica QPS distribution via hot ranges. This change enables historical timeseries tracking via the metric `rebalancing.replicas.queriespersecond`. Part of: cockroachdb#98255 Release note (ops change): The `rebalancing.replicas.queriespersecond` histogram metric is added, which provides insight into the distribution of queries per replica within a store.
4108223
to
7b6c751
Compare
bors r=andrewbaptist |
Build succeeded: |
This PR adds two histograms, tracking the percentiles of replica CPU time and replica Batch requests received. Previously, there was only point in time insight into either distribution via hotranges. This change enables historical timeseries tracking via the metric
rebalancing.replicas.cpunanospersecond
for CPU andrebalancing.replicas.queriespersecond
for Batch Requests.Informs: #98255