opt: ensure validation of unique constraints is efficient #65355

rytaft · 2021-05-17T21:26:09Z

opt: ensure validation of unique constraints is efficient

This commit adds two new exploration rules: SplitGroupByScanIntoUnionScans
and SplitGroupByFilteredScanIntoUnionScans.

SplitGroupByScanIntoUnionScans splits a non-inverted scan under a GroupBy,
DistinctOn, or EnsureUpsertDistinctOn into a union-all of scans, where each
scan is ordered on the grouping columns. This ordering is then maintained by
the union-all operation and passed on to the grouping operation. Ordering on
the grouping columns is important since it enables the grouping operation to
execute in a streaming fashion, which is more efficient. Example:

   CREATE TABLE tab (
     region STRING NOT NULL CHECK (region IN ('ASIA', 'EUROPE')),
     data INT NOT NULL,
     INDEX (region, data)
   );

   SELECT DISTINCT data
   FROM tab;

   =>

   SELECT DISTINCT data
   FROM (SELECT * FROM tab WHERE region='ASIA')
   UNION ALL (SELECT * FROM tab WHERE region='EUROPE');

This rule does not actually build the streaming grouping operation, but it
allows another rule, GenerateStreamingGroupBy, to fire and use the new
interesting orderings provided by the UnionAll of scans to build a streaming
operation.

SplitGroupByFilteredScanIntoUnionScans is like SplitGroupByScanIntoUnionScans,
but the scan is wrapped in a Select.

These transformations are important for ensuring that validation of the
unique constraint in an implicitly-partitioned unique index is efficient. The
validation query to verify that (a, b) is UNIQUE on table tbl looks like this:

   SELECT a, b
   FROM tbl
   WHERE a IS NOT NULL AND b IS NOT NULL
   GROUP BY a, b
   HAVING count(*) > 1
   LIMIT 1;

Without SplitGroupByFilteredScanIntoUnionScans, this query would require an
inefficient and memory-intensive hash group by operation. Note that the
previous rule, SplitGroupByScanIntoUnionScans, is also needed since it would
apply in cases where a and b are not nullable.

Fixes #56201

Release note (performance improvement): Validation of a new UNIQUE index in a
REGIONAL BY ROW table no longer requires an inefficient and memory-intensive
hash aggregation query. The optimizer can now plan the validation query so
that it uses all streaming operations, which are much more efficient.

opt: move splitScanIntoUnionScans to general_funcs.go

This commit moves splitScanIntoUnionScans and all helper functions
into general_funcs.go since it's now used in both limit_funcs.go and
groupby_funcs.go. This is a separate commit for ease of reviewing.

Release note: None

cockroach-teamcity · 2021-05-17T21:26:17Z

This change is

rytaft · 2021-05-17T21:26:50Z

Please ignore the first commit; it's from #65350.

RaduBerinde

Nicely done!

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @rytaft)

pkg/sql/opt/xform/limit_funcs.go, line 215 at r2 (raw file):

// SplitScanIntoUnionScans returns a UnionAll of Scan operators (with an
// optional hard limit) that each scan over a single key from the original
// Scan's constraints. If no such UnionAll of Scans can be found, ok=false is

[nit] returns a UnionAll [..] if the scans can provide the given ordering, and if the statistics suggest that splitting the scan could be beneficial.

pkg/sql/opt/xform/limit_funcs.go, line 263 at r2 (raw file):

	if scan.Relational().Stats.Available &&
		float64(scanCount*limit*threshold) >= scan.Relational().Stats.RowCount {

[nit] I would put all the code that is limit specific inside an if limit > 0. It's more or less luck that it all works out with 0, better to not have to think about it.

pkg/sql/opt/xform/limit_funcs.go, line 272 at r2 (raw file):

	// The index ordering must have a prefix of columns of length keyLength
	// followed by the ordering columns either in order or in reverse order.
	hasLimitOrderingSeq, reverse := indexHasOrderingSequence(

[nit] hasOrderingSeq

pkg/sql/opt/xform/rules/groupby.opt, line 163 at r2 (raw file):

)
=>
((OpName) $unionScans $aggs $private)

After this rule runs, we are relying on the existing exploration rules to generate a streaming GroupBy, correct?

rytaft

TFTR!

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner)

pkg/sql/opt/xform/limit_funcs.go, line 215 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] returns a UnionAll [..] if the scans can provide the given ordering, and if the statistics suggest that splitting the scan could be beneficial.

Done.

pkg/sql/opt/xform/limit_funcs.go, line 263 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] I would put all the code that is limit specific inside an if limit > 0. It's more or less luck that it all works out with 0, better to not have to think about it.

Done.

pkg/sql/opt/xform/limit_funcs.go, line 272 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] hasOrderingSeq

Done.

pkg/sql/opt/xform/rules/groupby.opt, line 163 at r2 (raw file):

Previously, RaduBerinde wrote…

After this rule runs, we are relying on the existing exploration rules to generate a streaming GroupBy, correct?

Yep! Added a comment.

mgartner

[nit] the PR description is missing the second commit

Looks great!

Reviewed 5 of 7 files at r2, 2 of 2 files at r3, 3 of 3 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @rytaft)

pkg/sql/opt/xform/groupby_funcs.go, line 213 at r3 (raw file):

	scan memo.RelExpr, sp *memo.ScanPrivate, private *memo.GroupingPrivate,
) (_ memo.RelExpr, ok bool) {
	cons, ok := c.getKnownScanConstraint(sp)

[nit] move the definition of getKnownScanConstraint from limit_funcs.go to general_funcs.go (probably in a separate commit)

pkg/sql/opt/xform/rules/groupby.opt, line 185 at r3 (raw file):

# inefficient and memory-intensive hash group by operation. Note that the
# previous rule, SplitGroupByScanIntoUnionScans, is also needed since it would
# apply in cases where a and b are not nullable.

[nit] It might be nice to expand this sentence a bit more—"are not nullable" confused me a bit when I read it because I thought it should read "are nullable". Something like "it would apply in cases where a and b are not nullable (the IS NOT NULL filters would be eliminated during normalization)". (Or is it that the IS NOT NULL filters are not part of the query to begin with?)

DrewKimball

Really cool!

It might be nice to have a comment explaining why zero is passed as the limit to SplitScanIntoUnionScans.

This commit adds two new exploration rules: SplitGroupByScanIntoUnionScans and SplitGroupByFilteredScanIntoUnionScans. SplitGroupByScanIntoUnionScans splits a non-inverted scan under a GroupBy, DistinctOn, or EnsureUpsertDistinctOn into a union-all of scans, where each scan is ordered on the grouping columns. This ordering is then maintained by the union-all operation and passed on to the grouping operation. Ordering on the grouping columns is important since it enables the grouping operation to execute in a streaming fashion, which is more efficient. Example: CREATE TABLE tab ( region STRING NOT NULL CHECK (region IN ('ASIA', 'EUROPE')), data INT NOT NULL, INDEX (region, data) ); SELECT DISTINCT data FROM tab; => SELECT DISTINCT data FROM (SELECT * FROM tab WHERE region='ASIA') UNION ALL (SELECT * FROM tab WHERE region='EUROPE'); This rule does not actually build the streaming grouping operation, but it allows another rule, GenerateStreamingGroupBy, to fire and use the new interesting orderings provided by the UnionAll of scans to build a streaming operation. SplitGroupByFilteredScanIntoUnionScans is like SplitGroupByScanIntoUnionScans, but the scan is wrapped in a Select. These transformations are important for ensuring that validation of the unique constraint in an implicitly-partitioned unique index is efficient. The validation query to verify that (a, b) is UNIQUE on table tbl looks like this: SELECT a, b FROM tbl WHERE a IS NOT NULL AND b IS NOT NULL GROUP BY a, b HAVING count(*) > 1 LIMIT 1; Without SplitGroupByFilteredScanIntoUnionScans, this query would require an inefficient and memory-intensive hash group by operation. Note that the previous rule, SplitGroupByScanIntoUnionScans, is also needed since it would apply in cases where a and b are not nullable. Fixes cockroachdb#56201 Release note (performance improvement): Validation of a new UNIQUE index in a REGIONAL BY ROW table no longer requires an inefficient and memory-intensive hash aggregation query. The optimizer can now plan the validation query so that it uses all streaming operations, which are much more efficient.

This commit moves splitScanIntoUnionScans and all helper functions into general_funcs.go since it's now used in both limit_funcs.go and groupby_funcs.go. This is a separate commit for ease of reviewing. Release note: None

rytaft

TFTRs!

[nit] the PR description is missing the second commit

Done.

It might be nice to have a comment explaining why zero is passed as the limit to SplitScanIntoUnionScans.

Done.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner)

pkg/sql/opt/xform/groupby_funcs.go, line 213 at r3 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

[nit] move the definition of getKnownScanConstraint from limit_funcs.go to general_funcs.go (probably in a separate commit)

It was moved in the second commit.

pkg/sql/opt/xform/rules/groupby.opt, line 185 at r3 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

[nit] It might be nice to expand this sentence a bit more—"are not nullable" confused me a bit when I read it because I thought it should read "are nullable". Something like "it would apply in cases where a and b are not nullable (the IS NOT NULL filters would be eliminated during normalization)". (Or is it that the IS NOT NULL filters are not part of the query to begin with?)

Done.

mgartner

Reviewed 4 of 4 files at r5, 3 of 3 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)

pkg/sql/opt/xform/groupby_funcs.go, line 213 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

It was moved in the second commit.

Ahh I missed that...

rytaft

Thanks!

bors r+

Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)

craig · 2021-05-19T17:16:33Z

Build succeeded:

GitHub CI (Cockroach)

rytaft requested review from mgartner, RaduBerinde and a team May 17, 2021 21:26

RaduBerinde approved these changes May 18, 2021

View reviewed changes

rytaft force-pushed the efficient-validate branch from a1b3ca6 to 7b9ad1f Compare May 18, 2021 15:18

rytaft commented May 18, 2021

View reviewed changes

rytaft requested a review from DrewKimball May 18, 2021 15:39

mgartner approved these changes May 18, 2021

View reviewed changes

DrewKimball approved these changes May 18, 2021

View reviewed changes

rytaft added 2 commits May 19, 2021 08:42

opt: move splitScanIntoUnionScans to general_funcs.go

35b1ec6

This commit moves splitScanIntoUnionScans and all helper functions into general_funcs.go since it's now used in both limit_funcs.go and groupby_funcs.go. This is a separate commit for ease of reviewing. Release note: None

rytaft force-pushed the efficient-validate branch from 7b9ad1f to 35b1ec6 Compare May 19, 2021 13:43

rytaft commented May 19, 2021

View reviewed changes

mgartner approved these changes May 19, 2021

View reviewed changes

rytaft commented May 19, 2021

View reviewed changes

craig bot merged commit c305dbc into cockroachdb:master May 19, 2021

rytaft mentioned this pull request May 19, 2021

opt,sql: ensure validation of implicitly-partitioned partial unique indexes is efficient #65473

Closed

stevendanna mentioned this pull request Jun 28, 2021

investigate: kv roachtest performance drop on 20 May #66961

Closed

jseldess mentioned this pull request Sep 8, 2021

opt: ensure validation of unique constraints is efficient cockroachdb/docs#11585

Open

exalate-issue-sync bot mentioned this pull request Jun 8, 2022

opt: ensure validation of unique constraints is efficient cockroachdb/docs#14196

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

opt: ensure validation of unique constraints is efficient #65355

opt: ensure validation of unique constraints is efficient #65355

rytaft commented May 17, 2021 •

edited

Loading

cockroach-teamcity commented May 17, 2021

rytaft commented May 17, 2021

RaduBerinde left a comment

rytaft left a comment

mgartner left a comment

DrewKimball left a comment

rytaft left a comment

mgartner left a comment

rytaft left a comment

craig bot commented May 19, 2021

opt: ensure validation of unique constraints is efficient #65355

opt: ensure validation of unique constraints is efficient #65355

Conversation

rytaft commented May 17, 2021 • edited Loading

cockroach-teamcity commented May 17, 2021

rytaft commented May 17, 2021

RaduBerinde left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

mgartner left a comment

Choose a reason for hiding this comment

DrewKimball left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

mgartner left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

craig bot commented May 19, 2021

rytaft commented May 17, 2021 •

edited

Loading