Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/stats: conversion of datums to and from quantile function values #83123

Merged
merged 1 commit into from
Jul 7, 2022

Conversation

michae2
Copy link
Collaborator

@michae2 michae2 commented Jun 21, 2022

To predict histograms in statistics forecasts, we will use linear
regression over quantile functions. (Quantile functions are another
representation of histogram data, in a form more amenable to statistical
manipulation.)

The conversion of histograms to quantile functions will require
conversion of histogram bounds (datums) to quantile values (float64s).
And likewise, the inverse conversion from quantile functions back to
histograms will require the inverse conversion of float64 quantile
values back to datums. These conversions are a little different from our
usual SQL conversions in eval.PerformCast, so we add them to a new
quantile file in the sql/stats module.

This code was originally part of #77070 but has been pulled out to
simplify that PR. A few changes have been made:

  • histogramValue has been renamed to FromQuantileValue.
  • Support for DECIMAL, TIME, TIMETZ, and INTERVAL has been
    dropped. Clamping these types in FromQuantileValue was too complex
    for the first iteration of statistics forecasting. We expect the
    overwhelming majority of ascending keys to use INT or TIMESTAMP
    types.
  • Bugs in FLOAT4, TIMESTAMP and TIMESTAMPTZ conversions have been
    fixed.
  • We're now clamping timestamps to slightly tighter bounds to avoid the
    problems with infinite timestamps (see sql: 'infinity' returned as timestamp rather than 'infinity' #41564).

Assists: #79872

Release note: None

@michae2 michae2 requested review from mgartner, rharding6373 and a team June 21, 2022 14:24
@michae2 michae2 requested a review from a team as a code owner June 21, 2022 14:24
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for breaking this out! Looks pretty good, mostly just some nits.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


-- commits line 29 at r1:
TIL Assists is a keyword.


pkg/sql/stats/quantile.go line 120 at r1 (raw file):

				return tree.NewDInt(tree.DInt(math.MinInt32)), nil
			}
			if i >= math.MaxInt16 {

MaxInt32?

Could use another clamping test case.


pkg/util/timeutil/pgdate/pgdate.go line 148 at r1 (raw file):

func MakeDateFromPGEpochClampFinite(days int32) Date {
	if days < lowDays {
		return Date{days: lowDays}

Why can't we just return LowDate here?


pkg/sql/stats/quantile_test.go line 32 at r1 (raw file):

		val float64
		err bool
		res tree.Datum

There doesn't seem to be a test case that populates res. Is that intentional? I infer from below that we always expect the roundtrip to equal dat, but in that case I think it's better to remove this and just compare to the original dat below.

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @rharding6373)


pkg/sql/stats/quantile.go line 120 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

MaxInt32?

Could use another clamping test case.

Gaaaah! Good catch! Done, and added more test cases.


pkg/util/timeutil/pgdate/pgdate.go line 148 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Why can't we just return LowDate here?

Good point! Done.


pkg/sql/stats/quantile_test.go line 32 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

There doesn't seem to be a test case that populates res. Is that intentional? I infer from below that we always expect the roundtrip to equal dat, but in that case I think it's better to remove this and just compare to the original dat below.

Good point, done.

@michae2 michae2 force-pushed the quantile branch 3 times, most recently from d6db735 to c7f1d40 Compare June 30, 2022 00:58
@michae2
Copy link
Collaborator Author

michae2 commented Jun 30, 2022

Bazel CI failures are just flakes. This is RFAL.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I have some nits, but it :lgtm:

Another suggestion would be to write a randomized test for round-tripping thought the quantile functions, but it doesn't have to be in this PR, so a TODO will suffice.

Reviewed 1 of 5 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rharding6373)


pkg/sql/stats/quantile.go line 112 at r2 (raw file):

		case 16:
			if i <= math.MinInt16 {
				return tree.NewDInt(tree.DInt(math.MinInt16)), nil

nit: for consistency with other types below maybe update i in-place to have only a single return per type?


pkg/sql/stats/quantile.go line 155 at r2 (raw file):

		// Then clamp to pgdate.Date.
		return tree.NewDDate(pgdate.MakeDateFromPGEpochClampFinite(int32(days))), nil
	case types.TimestampFamily:

nit: cases for Timestamp and TimestampTZ look exactly the same, with one line difference, maybe squash them?


pkg/sql/stats/quantile_test.go line 294 at r2 (raw file):

}

// Test conversions from quantile value to datum and back. TestToQuantileValue

nit: consider renaming the tests to something like TestQuantileValueRoundTrip and TestQuantileValueRoundTripOverflow. When I started reading the code in this file I was confused why for TestToQuantileValue test we would not do the whole round trip.

@michae2 michae2 requested review from rytaft and a team July 6, 2022 05:09
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Another suggestion would be to write a randomized test for round-tripping thought the quantile functions

+1

Reviewed 1 of 5 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rharding6373)


pkg/sql/stats/quantile.go line 75 at r2 (raw file):

			return 0, tree.ErrFloatOutOfRange
		}
		return float64(v.Unix()) + float64(v.Nanosecond())*1e-9, nil

Doesn't seem like this will round-trip since the location isn't included here. Is there any possible issue with that?

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. I added a TODO and will do that in another PR. TFTRs!

bors r=yuzefovich,rytaft

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner, @rharding6373, @rytaft, and @yuzefovich)


pkg/sql/stats/quantile.go line 75 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Doesn't seem like this will round-trip since the location isn't included here. Is there any possible issue with that?

My understanding is that TIMESTAMPTZ doesn't actually store a location when written to disk (and neither does TIMESTAMP). When we decode them, the time.Time in the datum has nil location. When comparing, casting, and printing, we assume that TIMESTAMPTZ is in the session time zone and TIMESTAMP is UTC, and make adjustments to the time.Time to account for these assumptions.

So, I believe in histograms we can treat them the same way as each other, because they are "timezoneless" until comparing, casting, or printing.


pkg/sql/stats/quantile.go line 112 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: for consistency with other types below maybe update i in-place to have only a single return per type?

Done. Had to special case MaxInt64.


pkg/sql/stats/quantile.go line 155 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: cases for Timestamp and TimestampTZ look exactly the same, with one line difference, maybe squash them?

Done.


pkg/sql/stats/quantile_test.go line 294 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: consider renaming the tests to something like TestQuantileValueRoundTrip and TestQuantileValueRoundTripOverflow. When I started reading the code in this file I was confused why for TestToQuantileValue test we would not do the whole round trip.

Done.

@craig
Copy link
Contributor

craig bot commented Jul 7, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jul 7, 2022

Build failed:

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done!

Reviewed 1 of 5 files at r1, 1 of 4 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @rharding6373 and @rytaft)


-- commits line 29 at r1:

Previously, rharding6373 (Rachael Harding) wrote…

TIL Assists is a keyword.

It's not a keyword: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword

Using a non-keyword like "assists" or "informs" is a way to link to the issue without auto-closing it when the PR is merged.

@michae2
Copy link
Collaborator Author

michae2 commented Jul 7, 2022

Failing a linter.

bors r-

To predict histograms in statistics forecasts, we will use linear
regression over quantile functions. (Quantile functions are another
representation of histogram data, in a form more amenable to statistical
manipulation.)

The conversion of histograms to quantile functions will require
conversion of histogram bounds (datums) to quantile values (float64s).
And likewise, the inverse conversion from quantile functions back to
histograms will require the inverse conversion of float64 quantile
values back to datums. These conversions are a little different from our
usual SQL conversions in `eval.PerformCast`, so we add them to a new
quantile file in the `sql/stats` module.

This code was originally part of cockroachdb#77070 but has been pulled out to
simplify that PR. A few changes have been made:
- `histogramValue` has been renamed to `FromQuantileValue`.
- Support for `DECIMAL`, `TIME`, `TIMETZ`, and `INTERVAL` has been
  dropped. Clamping these types in `FromQuantileValue` was too complex
  for the first iteration of statistics forecasting. We expect the
  overwhelming majority of ascending keys to use `INT` or `TIMESTAMP`
  types.
- Bugs in `FLOAT4`, `TIMESTAMP` and `TIMESTAMPTZ` conversions have been
  fixed.
- We're now clamping timestamps to slightly tighter bounds to avoid the
  problems with infinite timestamps (see cockroachdb#41564).

Assists: cockroachdb#79872

Release note: None
@michae2
Copy link
Collaborator Author

michae2 commented Jul 7, 2022

bors r=yuzefovich,rytaft,mgartner

@craig craig bot merged commit bce2267 into cockroachdb:master Jul 7, 2022
@craig
Copy link
Contributor

craig bot commented Jul 7, 2022

Build succeeded:

@michae2 michae2 deleted the quantile branch July 7, 2022 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants