Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: ensure validation of unique constraints is efficient #65355

Merged
merged 2 commits into from
May 19, 2021

Conversation

rytaft
Copy link
Collaborator

@rytaft rytaft commented May 17, 2021

opt: ensure validation of unique constraints is efficient

This commit adds two new exploration rules: SplitGroupByScanIntoUnionScans
and SplitGroupByFilteredScanIntoUnionScans.

SplitGroupByScanIntoUnionScans splits a non-inverted scan under a GroupBy,
DistinctOn, or EnsureUpsertDistinctOn into a union-all of scans, where each
scan is ordered on the grouping columns. This ordering is then maintained by
the union-all operation and passed on to the grouping operation. Ordering on
the grouping columns is important since it enables the grouping operation to
execute in a streaming fashion, which is more efficient. Example:

   CREATE TABLE tab (
     region STRING NOT NULL CHECK (region IN ('ASIA', 'EUROPE')),
     data INT NOT NULL,
     INDEX (region, data)
   );

   SELECT DISTINCT data
   FROM tab;

   =>

   SELECT DISTINCT data
   FROM (SELECT * FROM tab WHERE region='ASIA')
   UNION ALL (SELECT * FROM tab WHERE region='EUROPE');

This rule does not actually build the streaming grouping operation, but it
allows another rule, GenerateStreamingGroupBy, to fire and use the new
interesting orderings provided by the UnionAll of scans to build a streaming
operation.

SplitGroupByFilteredScanIntoUnionScans is like SplitGroupByScanIntoUnionScans,
but the scan is wrapped in a Select.

These transformations are important for ensuring that validation of the
unique constraint in an implicitly-partitioned unique index is efficient. The
validation query to verify that (a, b) is UNIQUE on table tbl looks like this:

   SELECT a, b
   FROM tbl
   WHERE a IS NOT NULL AND b IS NOT NULL
   GROUP BY a, b
   HAVING count(*) > 1
   LIMIT 1;

Without SplitGroupByFilteredScanIntoUnionScans, this query would require an
inefficient and memory-intensive hash group by operation. Note that the
previous rule, SplitGroupByScanIntoUnionScans, is also needed since it would
apply in cases where a and b are not nullable.

Fixes #56201

Release note (performance improvement): Validation of a new UNIQUE index in a
REGIONAL BY ROW table no longer requires an inefficient and memory-intensive
hash aggregation query. The optimizer can now plan the validation query so
that it uses all streaming operations, which are much more efficient.

opt: move splitScanIntoUnionScans to general_funcs.go

This commit moves splitScanIntoUnionScans and all helper functions
into general_funcs.go since it's now used in both limit_funcs.go and
groupby_funcs.go. This is a separate commit for ease of reviewing.

Release note: None

@rytaft rytaft requested review from mgartner, RaduBerinde and a team May 17, 2021 21:26
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rytaft
Copy link
Collaborator Author

rytaft commented May 17, 2021

Please ignore the first commit; it's from #65350.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Nicely done!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @rytaft)


pkg/sql/opt/xform/limit_funcs.go, line 215 at r2 (raw file):

// SplitScanIntoUnionScans returns a UnionAll of Scan operators (with an
// optional hard limit) that each scan over a single key from the original
// Scan's constraints. If no such UnionAll of Scans can be found, ok=false is

[nit] returns a UnionAll [..] if the scans can provide the given ordering, and if the statistics suggest that splitting the scan could be beneficial.


pkg/sql/opt/xform/limit_funcs.go, line 263 at r2 (raw file):

	if scan.Relational().Stats.Available &&
		float64(scanCount*limit*threshold) >= scan.Relational().Stats.RowCount {

[nit] I would put all the code that is limit specific inside an if limit > 0. It's more or less luck that it all works out with 0, better to not have to think about it.


pkg/sql/opt/xform/limit_funcs.go, line 272 at r2 (raw file):

	// The index ordering must have a prefix of columns of length keyLength
	// followed by the ordering columns either in order or in reverse order.
	hasLimitOrderingSeq, reverse := indexHasOrderingSequence(

[nit] hasOrderingSeq


pkg/sql/opt/xform/rules/groupby.opt, line 163 at r2 (raw file):

)
=>
((OpName) $unionScans $aggs $private)

After this rule runs, we are relying on the existing exploration rules to generate a streaming GroupBy, correct?

@rytaft rytaft force-pushed the efficient-validate branch from a1b3ca6 to 7b9ad1f Compare May 18, 2021 15:18
Copy link
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner)


pkg/sql/opt/xform/limit_funcs.go, line 215 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] returns a UnionAll [..] if the scans can provide the given ordering, and if the statistics suggest that splitting the scan could be beneficial.

Done.


pkg/sql/opt/xform/limit_funcs.go, line 263 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] I would put all the code that is limit specific inside an if limit > 0. It's more or less luck that it all works out with 0, better to not have to think about it.

Done.


pkg/sql/opt/xform/limit_funcs.go, line 272 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] hasOrderingSeq

Done.


pkg/sql/opt/xform/rules/groupby.opt, line 163 at r2 (raw file):

Previously, RaduBerinde wrote…

After this rule runs, we are relying on the existing exploration rules to generate a streaming GroupBy, correct?

Yep! Added a comment.

@rytaft rytaft requested a review from DrewKimball May 18, 2021 15:39
Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] the PR description is missing the second commit

Looks great! :lgtm:

Reviewed 5 of 7 files at r2, 2 of 2 files at r3, 3 of 3 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @rytaft)


pkg/sql/opt/xform/groupby_funcs.go, line 213 at r3 (raw file):

	scan memo.RelExpr, sp *memo.ScanPrivate, private *memo.GroupingPrivate,
) (_ memo.RelExpr, ok bool) {
	cons, ok := c.getKnownScanConstraint(sp)

[nit] move the definition of getKnownScanConstraint from limit_funcs.go to general_funcs.go (probably in a separate commit)


pkg/sql/opt/xform/rules/groupby.opt, line 185 at r3 (raw file):

# inefficient and memory-intensive hash group by operation. Note that the
# previous rule, SplitGroupByScanIntoUnionScans, is also needed since it would
# apply in cases where a and b are not nullable.

[nit] It might be nice to expand this sentence a bit more—"are not nullable" confused me a bit when I read it because I thought it should read "are nullable". Something like "it would apply in cases where a and b are not nullable (the IS NOT NULL filters would be eliminated during normalization)". (Or is it that the IS NOT NULL filters are not part of the query to begin with?)

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool!

It might be nice to have a comment explaining why zero is passed as the limit to SplitScanIntoUnionScans.

rytaft added 2 commits May 19, 2021 08:42
This commit adds two new exploration rules: SplitGroupByScanIntoUnionScans
and SplitGroupByFilteredScanIntoUnionScans.

SplitGroupByScanIntoUnionScans splits a non-inverted scan under a GroupBy,
DistinctOn, or EnsureUpsertDistinctOn into a union-all of scans, where each
scan is ordered on the grouping columns. This ordering is then maintained by
the union-all operation and passed on to the grouping operation. Ordering on
the grouping columns is important since it enables the grouping operation to
execute in a streaming fashion, which is more efficient. Example:

   CREATE TABLE tab (
     region STRING NOT NULL CHECK (region IN ('ASIA', 'EUROPE')),
     data INT NOT NULL,
     INDEX (region, data)
   );

   SELECT DISTINCT data
   FROM tab;

   =>

   SELECT DISTINCT data
   FROM (SELECT * FROM tab WHERE region='ASIA')
   UNION ALL (SELECT * FROM tab WHERE region='EUROPE');

This rule does not actually build the streaming grouping operation, but it
allows another rule, GenerateStreamingGroupBy, to fire and use the new
interesting orderings provided by the UnionAll of scans to build a streaming
operation.

SplitGroupByFilteredScanIntoUnionScans is like SplitGroupByScanIntoUnionScans,
but the scan is wrapped in a Select.

These transformations are important for ensuring that validation of the
unique constraint in an implicitly-partitioned unique index is efficient. The
validation query to verify that (a, b) is UNIQUE on table tbl looks like this:

   SELECT a, b
   FROM tbl
   WHERE a IS NOT NULL AND b IS NOT NULL
   GROUP BY a, b
   HAVING count(*) > 1
   LIMIT 1;

Without SplitGroupByFilteredScanIntoUnionScans, this query would require an
inefficient and memory-intensive hash group by operation. Note that the
previous rule, SplitGroupByScanIntoUnionScans, is also needed since it would
apply in cases where a and b are not nullable.

Fixes cockroachdb#56201

Release note (performance improvement): Validation of a new UNIQUE index in a
REGIONAL BY ROW table no longer requires an inefficient and memory-intensive
hash aggregation query. The optimizer can now plan the validation query so
that it uses all streaming operations, which are much more efficient.
This commit moves splitScanIntoUnionScans and all helper functions
into general_funcs.go since it's now used in both limit_funcs.go and
groupby_funcs.go. This is a separate commit for ease of reviewing.

Release note: None
@rytaft rytaft force-pushed the efficient-validate branch from 7b9ad1f to 35b1ec6 Compare May 19, 2021 13:43
Copy link
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

[nit] the PR description is missing the second commit

Done.

It might be nice to have a comment explaining why zero is passed as the limit to SplitScanIntoUnionScans.

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner)


pkg/sql/opt/xform/groupby_funcs.go, line 213 at r3 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

[nit] move the definition of getKnownScanConstraint from limit_funcs.go to general_funcs.go (probably in a separate commit)

It was moved in the second commit.


pkg/sql/opt/xform/rules/groupby.opt, line 185 at r3 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

[nit] It might be nice to expand this sentence a bit more—"are not nullable" confused me a bit when I read it because I thought it should read "are nullable". Something like "it would apply in cases where a and b are not nullable (the IS NOT NULL filters would be eliminated during normalization)". (Or is it that the IS NOT NULL filters are not part of the query to begin with?)

Done.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r5, 3 of 3 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)


pkg/sql/opt/xform/groupby_funcs.go, line 213 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

It was moved in the second commit.

Ahh I missed that...

Copy link
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)

@craig
Copy link
Contributor

craig bot commented May 19, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql, opt: validating the unique constraint must be efficient for partitioned unique indexes
5 participants