-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql/stats: convert between histograms and quantile functions #83796
Conversation
This PR is based on #83123. The first commit is from that PR and can be ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
Reviewed 5 of 5 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @yuzefovich)
pkg/sql/stats/quantile.go
line 82 at r2 (raw file):
// 190 + - - - - - - - - - - // 0 .2 .4 .6 .8 1 //
Love this comment! Super helpful.
pkg/sql/stats/quantile.go
line 118 at r2 (raw file):
types.TimestampTZFamily: // TODO(michae2): Check that there are no constraints making this a de facto // ENUM. (Could also check histogram for sum(NumRange) > 0.)
How is this an enum? I'm a bit confused by this comment.... (I know it's a note to yourself, but wouldn't hurt to make it understandable by others too)
pkg/sql/stats/quantile.go
line 275 at r2 (raw file):
lowerBound := getNextLowerBound(nilEvalContext, currentUpperBound) distinctRange := estimatedDistinctValuesInRange( nilEvalContext, currentBucket.NumRange, lowerBound, currentUpperBound,
seems like lowerBound
and currentUpperBound
are going to be basically the same here. Maybe you need to get the value of lowerBound
earlier (i.e. from the previous bucket)?
pkg/sql/stats/quantile.go
line 337 at r2 (raw file):
} func isPositive(x float64) bool {
nit: should this be isNonNegative
?
pkg/sql/stats/quantile_test.go
line 27 at r2 (raw file):
) type testHistogram []testBucket
Do you think you could add a test where you create a random histogram or quantile function and test that it round-trips?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work - it was an interesting read!
Reviewed 1 of 2 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)
pkg/sql/stats/quantile.go
line 82 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Love this comment! Super helpful.
+1
pkg/sql/stats/quantile.go
line 102 at r2 (raw file):
// makeQuantile and quantile.toHistogram might need to change if we introduce a // new histogram version. const _ uint = 1 - uint(histVersion)
nit: is this supposed to fail the compilation once histVersion
is increased? If so, why that would be the case? Or this just a usage of histVersion
value so that in the future this comment would be found via "Find usages"?
pkg/sql/stats/quantile.go
line 141 at r2 (raw file):
// To produce a quantile with first point at p=0 and at least two points, we // need the first bucket to have NumRange == 0. if hist.buckets[0].NumRange != 0 {
Just a question: is this a limiting assumption or can we always have such histograms that satisfy the assumption?
pkg/sql/stats/quantile.go
line 151 at r2 (raw file):
// zero-row buckets from the beginning and end of the histogram. qfTrimLo, qfTrimHi quantileIndex qf quantile
How large can qf
get? Will it be limited by the number of histogram buckets times two? Should we do memory accounting somewhere? (It doesn't seem like we're using the quantiles yet, and maybe the responsibility will be pushed to the caller - might be worth mentioning it in the comment to makeQuantile
.)
pkg/sql/stats/quantile.go
line 214 at r2 (raw file):
// Fix any floating point errors or histogram errors (e.g. sum of bucket row // counts < total row count) causing p to be below 1 at the end. qf[len(qf)-1].p = 1
nit: shouldn't this already be fixed in addPoint
?
pkg/sql/stats/quantile.go
line 238 at r2 (raw file):
var i int // Skip any leading p=0 points instead of emitting zero-row buckets.
IIUC we must have already trimmed all but one zero-row buckets from the start in makeQuantile
- do we do this out of caution? Maybe we should have an assertion check that qf[0].p == 0 && qf[1].p > 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)
pkg/sql/stats/quantile.go
line 82 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
+1
+2
pkg/sql/stats/quantile.go
line 102 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: is this supposed to fail the compilation once
histVersion
is increased? If so, why that would be the case? Or this just a usage ofhistVersion
value so that in the future this comment would be found via "Find usages"?
Compilation will fail when histVersion
increases because a negative value overflows uint
.
pkg/sql/stats/quantile.go
line 156 at r2 (raw file):
) addPoint := func(num, v float64) error {
nit: might be helpful to add a comment explaining this function and its arguments. I'm assuming num
is the number times value v
exists in the histogram?
pkg/sql/stats/quantile.go
line 214 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: shouldn't this already be fixed in
addPoint
?
Looks like that only fixes a p
if p > 1
. The final p
could be slightly less than 1
, so this is fixing that - 100% of the values should be less than or equal to the maximum value in the histogram.
pkg/sql/stats/quantile.go
line 258 at r2 (raw file):
var pEq float64 closeCurrentBucket := func() error {
nit: describe briefly what this function does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback so far!
- Added a test that tries to round-trip a random histogram.
- Added a
histogram.String
method for easier debugging. - Now preallocating slices in
makeQuantile
andquantile.toHistogram
.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @rytaft, and @yuzefovich)
pkg/sql/stats/quantile.go
line 102 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Compilation will fail when
histVersion
increases because a negative value overflowsuint
.
Yep, it's a compile-time check to make sure someone looks at this file when changing histogram formats. I tried making the comment clearer.
pkg/sql/stats/quantile.go
line 118 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
How is this an enum? I'm a bit confused by this comment.... (I know it's a note to yourself, but wouldn't hurt to make it understandable by others too)
This is about columns like the hidden shard column of a hash-sharded index, which have the correct type but behave more like enums than integers. I tried making the comment a little better, lmk if it needs more work.
pkg/sql/stats/quantile.go
line 141 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Just a question: is this a limiting assumption or can we always have such histograms that satisfy the assumption?
We currently always produce histograms that satisfy this assumption. I don't think NumRange != 0
in the first bucket would ever really make sense (there is only one endpoint for the range counted by NumRange of the first bucket) but I wanted to be explicit.
pkg/sql/stats/quantile.go
line 151 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
How large can
qf
get? Will it be limited by the number of histogram buckets times two? Should we do memory accounting somewhere? (It doesn't seem like we're using the quantiles yet, and maybe the responsibility will be pushed to the caller - might be worth mentioning it in the comment tomakeQuantile
.)
Good questions. Yes, qf
is limited to len(hist.buckets) * 2
so I went ahead and pre-allocated it (and also pre-allocated the histogram slice in toHistogram
). The quantiles will only exist briefly whenever calculating a statistics forecast, after which they will be thrown away. The end result will be an in-memory histogram in the stats cache. I'm not sure the stats cache actually has memory accounting, I should probably add that in a future PR.
pkg/sql/stats/quantile.go
line 156 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: might be helpful to add a comment explaining this function and its arguments. I'm assuming
num
is the number times valuev
exists in the histogram?
Done. (num
could also be counting values < v
if we're adding a point for NumRange
.)
pkg/sql/stats/quantile.go
line 214 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Looks like that only fixes a
p
ifp > 1
. The finalp
could be slightly less than1
, so this is fixing that - 100% of the values should be less than or equal to the maximum value in the histogram.
Yes, the check in addPoint
handles p > 1
, this handles p < 1
.
pkg/sql/stats/quantile.go
line 238 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
IIUC we must have already trimmed all but one zero-row buckets from the start in
makeQuantile
- do we do this out of caution? Maybe we should have an assertion check thatqf[0].p == 0 && qf[1].p > 0
?
It's a little silly in this PR, but it will make more sense when I open the next PR adding operations on quantiles for performing linear regression. Those operations could create a quantile with multiple p=0 points.
pkg/sql/stats/quantile.go
line 258 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: describe briefly what this function does
Done.
pkg/sql/stats/quantile.go
line 275 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
seems like
lowerBound
andcurrentUpperBound
are going to be basically the same here. Maybe you need to get the value oflowerBound
earlier (i.e. from the previous bucket)?
Gah, good catch! Thank you. Done.
pkg/sql/stats/quantile.go
line 337 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: should this be
isNonNegative
?
Yes, it should be. I wanted to avoid the double negative (if !isNonNegative
) (or is that a triple negative? 😁) so I went with isInvalidCount
instead, lmk what you think.
pkg/sql/stats/quantile_test.go
line 27 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Do you think you could add a test where you create a random histogram or quantile function and test that it round-trips?
Yes, definitely. It's not particularly succinct, but here's my attempt at a random histogram generator. This caught one bug, which was that I needed a real non-nil eval.Context
in quantile.toHistogram
. After fixing that, it seems to be round-tripping successfully so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @rytaft, and @yuzefovich)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
pkg/sql/stats/quantile_test.go
line 37 at r4 (raw file):
// changing. func TestRandomQuantileRoundTrip(t *testing.T) { colTypes := []*types.T{
nit: so that in the future we don't forget to include newly-supported types, it might be better to use CanMakeQuantile
and types.Scalar
here to populate colTypes
(we'll then need to add ints and floats of smaller widths separately).
pkg/sql/stats/quantile_test.go
line 53 at r4 (raw file):
t.Run(fmt.Sprintf("%v/%v", colType.Name(), i), func(t *testing.T) { hist, rowCount := randHist(evalCtx, colType, rng) //t.Log(hist)
nit: seems like a leftover.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
pkg/sql/stats/quantile.go
line 337 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Yes, it should be. I wanted to avoid the double negative (
if !isNonNegative
) (or is that a triple negative? 😁) so I went withisInvalidCount
instead, lmk what you think.
How about leaving the logic as it was before and changing the name to isValidCount
?
pkg/sql/stats/quantile_test.go
line 27 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Yes, definitely. It's not particularly succinct, but here's my attempt at a random histogram generator. This caught one bug, which was that I needed a real non-nil
eval.Context
inquantile.toHistogram
. After fixing that, it seems to be round-tripping successfully so far.
Nice!
To predict histograms in statistics forecasts, we will use linear regression over quantile functions. (Quantile functions are another representation of histogram data, in a form more amenable to statistical manipulation.) This commit defines quantile functions and adds methods to convert between histograms and quantile functions. This code was originally part of cockroachdb#77070 but has been pulled out to simplify that PR. A few changes have been made: - Common code has been factored into closures. - More checks have been added for positive values. - In `makeQuantile` we now trim leading empty buckets as well as trailing empty buckets. - The logic in `quantile.toHistogram` to steal from `NumRange` if `NumEq` is zero now checks that `NumRange` will still be >= 1. - More tests have been added. Assists: cockroachdb#79872 Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner, @michae2, @rytaft, and @yuzefovich)
pkg/sql/stats/quantile.go
line 337 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
How about leaving the logic as it was before and changing the name to
isValidCount
?
Done.
pkg/sql/stats/quantile_test.go
line 37 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: so that in the future we don't forget to include newly-supported types, it might be better to use
CanMakeQuantile
andtypes.Scalar
here to populatecolTypes
(we'll then need to add ints and floats of smaller widths separately).
Ah, great idea! Done.
pkg/sql/stats/quantile_test.go
line 53 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: seems like a leftover.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)
TFTRs! bors r=mgartner,rytaft,yuzefovich |
Build succeeded: |
To predict histograms in statistics forecasts, we will use linear
regression over quantile functions. (Quantile functions are another
representation of histogram data, in a form more amenable to statistical
manipulation.)
This commit defines quantile functions and adds methods to convert
between histograms and quantile functions.
This code was originally part of #77070 but has been pulled out to
simplify that PR. A few changes have been made:
makeQuantile
we now trim leading empty buckets as well astrailing empty buckets.
quantile.toHistogram
to steal fromNumRange
ifNumEq
is zero now checks thatNumRange
will still be >= 1.Assists: #79872
Release note: None