Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/stats: convert between histograms and quantile functions #83796

Merged
merged 1 commit into from
Jul 11, 2022

Conversation

michae2
Copy link
Collaborator

@michae2 michae2 commented Jul 5, 2022

To predict histograms in statistics forecasts, we will use linear
regression over quantile functions. (Quantile functions are another
representation of histogram data, in a form more amenable to statistical
manipulation.)

This commit defines quantile functions and adds methods to convert
between histograms and quantile functions.

This code was originally part of #77070 but has been pulled out to
simplify that PR. A few changes have been made:

  • Common code has been factored into closures.
  • More checks have been added for positive values.
  • In makeQuantile we now trim leading empty buckets as well as
    trailing empty buckets.
  • The logic in quantile.toHistogram to steal from NumRange if
    NumEq is zero now checks that NumRange will still be >= 1.
  • More tests have been added.

Assists: #79872

Release note: None

@michae2 michae2 requested a review from a team July 5, 2022 06:52
@michae2 michae2 requested a review from a team as a code owner July 5, 2022 06:52
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@michae2
Copy link
Collaborator Author

michae2 commented Jul 5, 2022

This PR is based on #83123. The first commit is from that PR and can be ignored.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

Reviewed 5 of 5 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @yuzefovich)


pkg/sql/stats/quantile.go line 82 at r2 (raw file):

//   190 + - - - - - - - - - -
//       0  .2  .4  .6  .8   1
//

Love this comment! Super helpful.


pkg/sql/stats/quantile.go line 118 at r2 (raw file):

		types.TimestampTZFamily:
		// TODO(michae2): Check that there are no constraints making this a de facto
		// ENUM. (Could also check histogram for sum(NumRange) > 0.)

How is this an enum? I'm a bit confused by this comment.... (I know it's a note to yourself, but wouldn't hurt to make it understandable by others too)


pkg/sql/stats/quantile.go line 275 at r2 (raw file):

		lowerBound := getNextLowerBound(nilEvalContext, currentUpperBound)
		distinctRange := estimatedDistinctValuesInRange(
			nilEvalContext, currentBucket.NumRange, lowerBound, currentUpperBound,

seems like lowerBound and currentUpperBound are going to be basically the same here. Maybe you need to get the value of lowerBound earlier (i.e. from the previous bucket)?


pkg/sql/stats/quantile.go line 337 at r2 (raw file):

}

func isPositive(x float64) bool {

nit: should this be isNonNegative?


pkg/sql/stats/quantile_test.go line 27 at r2 (raw file):

)

type testHistogram []testBucket

Do you think you could add a test where you create a random histogram or quantile function and test that it round-trips?

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work - it was an interesting read!

Reviewed 1 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)


pkg/sql/stats/quantile.go line 82 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Love this comment! Super helpful.

+1


pkg/sql/stats/quantile.go line 102 at r2 (raw file):

// makeQuantile and quantile.toHistogram might need to change if we introduce a
// new histogram version.
const _ uint = 1 - uint(histVersion)

nit: is this supposed to fail the compilation once histVersion is increased? If so, why that would be the case? Or this just a usage of histVersion value so that in the future this comment would be found via "Find usages"?


pkg/sql/stats/quantile.go line 141 at r2 (raw file):

	// To produce a quantile with first point at p=0 and at least two points, we
	// need the first bucket to have NumRange == 0.
	if hist.buckets[0].NumRange != 0 {

Just a question: is this a limiting assumption or can we always have such histograms that satisfy the assumption?


pkg/sql/stats/quantile.go line 151 at r2 (raw file):

		// zero-row buckets from the beginning and end of the histogram.
		qfTrimLo, qfTrimHi quantileIndex
		qf                 quantile

How large can qf get? Will it be limited by the number of histogram buckets times two? Should we do memory accounting somewhere? (It doesn't seem like we're using the quantiles yet, and maybe the responsibility will be pushed to the caller - might be worth mentioning it in the comment to makeQuantile.)


pkg/sql/stats/quantile.go line 214 at r2 (raw file):

		// Fix any floating point errors or histogram errors (e.g. sum of bucket row
		// counts < total row count) causing p to be below 1 at the end.
		qf[len(qf)-1].p = 1

nit: shouldn't this already be fixed in addPoint?


pkg/sql/stats/quantile.go line 238 at r2 (raw file):

	var i int
	// Skip any leading p=0 points instead of emitting zero-row buckets.

IIUC we must have already trimmed all but one zero-row buckets from the start in makeQuantile - do we do this out of caution? Maybe we should have an assertion check that qf[0].p == 0 && qf[1].p > 0?

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! :lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


pkg/sql/stats/quantile.go line 82 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

+1

+2


pkg/sql/stats/quantile.go line 102 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: is this supposed to fail the compilation once histVersion is increased? If so, why that would be the case? Or this just a usage of histVersion value so that in the future this comment would be found via "Find usages"?

Compilation will fail when histVersion increases because a negative value overflows uint.


pkg/sql/stats/quantile.go line 156 at r2 (raw file):

	)

	addPoint := func(num, v float64) error {

nit: might be helpful to add a comment explaining this function and its arguments. I'm assuming num is the number times value v exists in the histogram?


pkg/sql/stats/quantile.go line 214 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: shouldn't this already be fixed in addPoint?

Looks like that only fixes a p if p > 1. The final p could be slightly less than 1, so this is fixing that - 100% of the values should be less than or equal to the maximum value in the histogram.


pkg/sql/stats/quantile.go line 258 at r2 (raw file):

	var pEq float64

	closeCurrentBucket := func() error {

nit: describe briefly what this function does

@mgartner mgartner requested a review from a team July 7, 2022 14:06
Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback so far!

  • Added a test that tries to round-trip a random histogram.
  • Added a histogram.String method for easier debugging.
  • Now preallocating slices in makeQuantile and quantile.toHistogram.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @rytaft, and @yuzefovich)


pkg/sql/stats/quantile.go line 102 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Compilation will fail when histVersion increases because a negative value overflows uint.

Yep, it's a compile-time check to make sure someone looks at this file when changing histogram formats. I tried making the comment clearer.


pkg/sql/stats/quantile.go line 118 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

How is this an enum? I'm a bit confused by this comment.... (I know it's a note to yourself, but wouldn't hurt to make it understandable by others too)

This is about columns like the hidden shard column of a hash-sharded index, which have the correct type but behave more like enums than integers. I tried making the comment a little better, lmk if it needs more work.


pkg/sql/stats/quantile.go line 141 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Just a question: is this a limiting assumption or can we always have such histograms that satisfy the assumption?

We currently always produce histograms that satisfy this assumption. I don't think NumRange != 0 in the first bucket would ever really make sense (there is only one endpoint for the range counted by NumRange of the first bucket) but I wanted to be explicit.


pkg/sql/stats/quantile.go line 151 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

How large can qf get? Will it be limited by the number of histogram buckets times two? Should we do memory accounting somewhere? (It doesn't seem like we're using the quantiles yet, and maybe the responsibility will be pushed to the caller - might be worth mentioning it in the comment to makeQuantile.)

Good questions. Yes, qf is limited to len(hist.buckets) * 2 so I went ahead and pre-allocated it (and also pre-allocated the histogram slice in toHistogram). The quantiles will only exist briefly whenever calculating a statistics forecast, after which they will be thrown away. The end result will be an in-memory histogram in the stats cache. I'm not sure the stats cache actually has memory accounting, I should probably add that in a future PR.


pkg/sql/stats/quantile.go line 156 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: might be helpful to add a comment explaining this function and its arguments. I'm assuming num is the number times value v exists in the histogram?

Done. (num could also be counting values < v if we're adding a point for NumRange.)


pkg/sql/stats/quantile.go line 214 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Looks like that only fixes a p if p > 1. The final p could be slightly less than 1, so this is fixing that - 100% of the values should be less than or equal to the maximum value in the histogram.

Yes, the check in addPoint handles p > 1, this handles p < 1.


pkg/sql/stats/quantile.go line 238 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

IIUC we must have already trimmed all but one zero-row buckets from the start in makeQuantile - do we do this out of caution? Maybe we should have an assertion check that qf[0].p == 0 && qf[1].p > 0?

It's a little silly in this PR, but it will make more sense when I open the next PR adding operations on quantiles for performing linear regression. Those operations could create a quantile with multiple p=0 points.


pkg/sql/stats/quantile.go line 258 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: describe briefly what this function does

Done.


pkg/sql/stats/quantile.go line 275 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

seems like lowerBound and currentUpperBound are going to be basically the same here. Maybe you need to get the value of lowerBound earlier (i.e. from the previous bucket)?

Gah, good catch! Thank you. Done.


pkg/sql/stats/quantile.go line 337 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: should this be isNonNegative?

Yes, it should be. I wanted to avoid the double negative (if !isNonNegative) (or is that a triple negative? 😁) so I went with isInvalidCount instead, lmk what you think.


pkg/sql/stats/quantile_test.go line 27 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Do you think you could add a test where you create a random histogram or quantile function and test that it round-trips?

Yes, definitely. It's not particularly succinct, but here's my attempt at a random histogram generator. This caught one bug, which was that I needed a real non-nil eval.Context in quantile.toHistogram. After fixing that, it seems to be round-tripping successfully so far.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @rytaft, and @yuzefovich)

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)


pkg/sql/stats/quantile_test.go line 37 at r4 (raw file):

// changing.
func TestRandomQuantileRoundTrip(t *testing.T) {
	colTypes := []*types.T{

nit: so that in the future we don't forget to include newly-supported types, it might be better to use CanMakeQuantile and types.Scalar here to populate colTypes (we'll then need to add ints and floats of smaller widths separately).


pkg/sql/stats/quantile_test.go line 53 at r4 (raw file):

			t.Run(fmt.Sprintf("%v/%v", colType.Name(), i), func(t *testing.T) {
				hist, rowCount := randHist(evalCtx, colType, rng)
				//t.Log(hist)

nit: seems like a leftover.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)


pkg/sql/stats/quantile.go line 337 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Yes, it should be. I wanted to avoid the double negative (if !isNonNegative) (or is that a triple negative? 😁) so I went with isInvalidCount instead, lmk what you think.

How about leaving the logic as it was before and changing the name to isValidCount?


pkg/sql/stats/quantile_test.go line 27 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Yes, definitely. It's not particularly succinct, but here's my attempt at a random histogram generator. This caught one bug, which was that I needed a real non-nil eval.Context in quantile.toHistogram. After fixing that, it seems to be round-tripping successfully so far.

Nice!

To predict histograms in statistics forecasts, we will use linear
regression over quantile functions. (Quantile functions are another
representation of histogram data, in a form more amenable to statistical
manipulation.)

This commit defines quantile functions and adds methods to convert
between histograms and quantile functions.

This code was originally part of cockroachdb#77070 but has been pulled out to
simplify that PR. A few changes have been made:
- Common code has been factored into closures.
- More checks have been added for positive values.
- In `makeQuantile` we now trim leading empty buckets as well as
  trailing empty buckets.
- The logic in `quantile.toHistogram` to steal from `NumRange` if
  `NumEq` is zero now checks that `NumRange` will still be >= 1.
- More tests have been added.

Assists: cockroachdb#79872

Release note: None
Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner, @michae2, @rytaft, and @yuzefovich)


pkg/sql/stats/quantile.go line 337 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

How about leaving the logic as it was before and changing the name to isValidCount?

Done.


pkg/sql/stats/quantile_test.go line 37 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: so that in the future we don't forget to include newly-supported types, it might be better to use CanMakeQuantile and types.Scalar here to populate colTypes (we'll then need to add ints and floats of smaller widths separately).

Ah, great idea! Done.


pkg/sql/stats/quantile_test.go line 53 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: seems like a leftover.

Done.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @rytaft)

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @rytaft)

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: :shipit:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)

@michae2
Copy link
Collaborator Author

michae2 commented Jul 11, 2022

TFTRs!

bors r=mgartner,rytaft,yuzefovich

@craig craig bot merged commit bef1101 into cockroachdb:master Jul 11, 2022
@craig
Copy link
Contributor

craig bot commented Jul 11, 2022

Build succeeded:

@michae2 michae2 deleted the more_quantile branch July 12, 2022 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants