Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: allow join-elimination rules to apply in more cases #105214

Merged
merged 2 commits into from
Jun 26, 2023

Conversation

DrewKimball
Copy link
Collaborator

opt: allow join elimination rules to remap columns

When it can be proven that a join does not add rows to or remove them
from one of its inputs, the other input can often be removed, eliminating
the join. However, this can only be done if the columns from the
eliminated side are not needed.

This patch allows the join elimination rules to remap columns from the
eliminated side to the preserved side of the join, using the join's
functional dependencies. For example:

CREATE TABLE xy (x INT PRIMARY KEY, y INT);
CREATE TABLE fk (k INT PRIMARY KEY, v INT NOT NULL, FOREIGN KEY (v) REFERENCES xy (x));

SELECT x, k, v FROM fk INNER JOIN xy ON v = x;

In the example above, the join could not previously be eliminated because
the x column is required in the output. Now, the x column is remapped
to the equivalent v column, allowing the join to be removed.

Fixes #102614

Release note (performance improvement): The optimizer can now eliminate
joins in more cases.

opt: infer equality filters for self joins

When a table is joined to itself with an equality that forms a key
over both join inputs, it is possible to infer equality filters
between each pair of columns at the same ordinal position in the base
table. This patch improves the logical props builder to infer these
self-join equalities in a join's FuncDepSet. This improves the quality
of information available to optimization rules, and in particular, join
elimination rules.

Informs #102614

Release note: None

@DrewKimball DrewKimball requested a review from a team as a code owner June 20, 2023 18:05
@DrewKimball DrewKimball requested a review from yuzefovich June 20, 2023 18:05
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich yuzefovich requested review from mgartner and msirek and removed request for yuzefovich June 20, 2023 18:07
Copy link
Contributor

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice performance improvements!
:lgtm_strong:

Reviewed 17 of 17 files at r1, 15 of 15 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner)

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Just left some minor comments/questions.

Reviewed 17 of 17 files at r1, 15 of 15 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)


pkg/sql/opt/norm/project_funcs.go line 862 at r1 (raw file):

	for col, ok := from.Next(0); ok; col, ok = from.Next(col + 1) {
		if !to.Contains(col) {
			candidates := fds.ComputeEquivGroup(col)

nit: add TODO(mgartner): We don't need to compute the entire equivalence group, we only need to find the first equivalent column that's in the to set.


pkg/sql/opt/norm/project_funcs.go line 883 at r1 (raw file):

) memo.ProjectionsExpr {
	getReplacement := func(col opt.ColumnID) opt.ColumnID {
		candidates := fds.ComputeEquivGroup(col)

nit: add TODO(mgartner): We don't need to compute the entire equivalence group, we only need to find the first equivalent column that's in the to set.


pkg/sql/opt/norm/rules/project.opt line 36 at r1 (raw file):

(Project
    $left
    (MergeProjections

Are you convinced we don't need to call CanMergeProjections before MergeProjections? I think it's fine because the outer projections, (RemapProjectionCols $projections $leftCols $fds), will only contain references to columns in $leftCols, and the inner projections, (ProjectRemappedCols $passthrough $leftCols $fds), will only produce columns in $passthrough.


pkg/sql/opt/testutils/opttester/forcing_opt.go line 78 at r1 (raw file):

		if tester.Flags.DisableRules.Contains(int(ruleName)) {
			return false
		}

Good catch!


pkg/sql/opt/exec/execbuilder/testdata/subquery line 182 at r1 (raw file):

# IN expression transformed into semi-join.
query T
EXPLAIN (VERBOSE) SELECT a FROM abc WHERE a IN (SELECT a FROM abc WHERE b < 0)

The comment doesn't make sense anymore. Maybe change the as to bs here and the semi-join won't be eliminated.


pkg/sql/opt/memo/testdata/logprops/join line 3012 at r2 (raw file):

 ├── multiplicity: left-rows(zero-or-one), right-rows(zero-or-one)
 ├── key: (6)
 ├── fd: ()-->(2,7), (1)-->(3), (6)-->(8), (1)==(6), (6)==(1), (2)==(7), (7)==(2), (3)==(8), (8)==(3)

Walk me through this test case. We can infer equality of 2==7 and 3==8 because 2 and 7 are held constant? Any rows passing the xyz.x = foo.x filter and one of the constant equality filters would make 3 funcationally equivalent to 8, even though this is contradiction that no rows can satisfy?


pkg/sql/opt/norm/testdata/rules/join line 1842 at r2 (raw file):


# --------------------------------------------------
# SimplifyLeftJoin + SimplifyRightJoin

Do we want to prevent these tests from changing since their suppose to show the left-join => inner-join transformation? I suppose we need to self-join in order to make that transformation, and that also enables the join to be eliminated entirely, so maybe it's fine to leave as-is?

Copy link
Collaborator Author

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/norm/project_funcs.go line 862 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: add TODO(mgartner): We don't need to compute the entire equivalence group, we only need to find the first equivalent column that's in the to set.

Done.


pkg/sql/opt/norm/project_funcs.go line 883 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: add TODO(mgartner): We don't need to compute the entire equivalence group, we only need to find the first equivalent column that's in the to set.

Done.


pkg/sql/opt/norm/rules/project.opt line 36 at r1 (raw file):

Are you convinced we don't need to call CanMergeProjections before MergeProjections? I think it's fine because the outer projections, (RemapProjectionCols $projections $leftCols $fds), will only contain references to columns in $leftCols, and the inner projections, (ProjectRemappedCols $passthrough $leftCols $fds), will only produce columns in $passthrough.

Yes, RemapProjectionCols projections only reference left columns, while ProjectRemappedCols only projects right columns. Added a comment.


pkg/sql/opt/testutils/opttester/forcing_opt.go line 78 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Good catch!

Thanks!


pkg/sql/opt/exec/execbuilder/testdata/subquery line 182 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

The comment doesn't make sense anymore. Maybe change the as to bs here and the semi-join won't be eliminated.

It didn't want to use the semi-join when I used b instead, but I got around it by just adding a duplicate table abc2.


pkg/sql/opt/memo/testdata/logprops/join line 3012 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Walk me through this test case. We can infer equality of 2==7 and 3==8 because 2 and 7 are held constant? Any rows passing the xyz.x = foo.x filter and one of the constant equality filters would make 3 funcationally equivalent to 8, even though this is contradiction that no rows can satisfy?

x implies y (and the same x implies the same y for both tables because it's a self-join), so if the x values between xyz and foo are the same, the y values must also be the same in order for there to be any matches. It's a contradiction in this case, but the optimizer doesn't see it because the filters get pushed down before the join's properties are calculated.


pkg/sql/opt/norm/testdata/rules/join line 1842 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Do we want to prevent these tests from changing since their suppose to show the left-join => inner-join transformation? I suppose we need to self-join in order to make that transformation, and that also enables the join to be eliminated entirely, so maybe it's fine to leave as-is?

I think it's ok since the expected rules are still firing, but I'm happy to prevent the elimination if you think that'd be better.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 20 of 22 files at r3, 15 of 15 files at r4, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball)


pkg/sql/opt/exec/execbuilder/testdata/subquery line 182 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

It didn't want to use the semi-join when I used b instead, but I got around it by just adding a duplicate table abc2.

👍


pkg/sql/opt/memo/testdata/logprops/join line 3012 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

x implies y (and the same x implies the same y for both tables because it's a self-join), so if the x values between xyz and foo are the same, the y values must also be the same in order for there to be any matches. It's a contradiction in this case, but the optimizer doesn't see it because the filters get pushed down before the join's properties are calculated.

👍


pkg/sql/opt/norm/testdata/rules/join line 1842 at r2 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

I think it's ok since the expected rules are still firing, but I'm happy to prevent the elimination if you think that'd be better.

No, I think this is fine as-is.

@DrewKimball DrewKimball requested review from a team as code owners June 23, 2023 19:17
@DrewKimball
Copy link
Collaborator Author

Fixed a few test failures, I think everything should be passing now.

@DrewKimball DrewKimball removed request for a team June 23, 2023 19:17
@DrewKimball
Copy link
Collaborator Author

TFTRs!

bors r+

craig bot pushed a commit that referenced this pull request Jun 24, 2023
105214: opt: allow join-elimination rules to apply in more cases r=DrewKimball a=DrewKimball

#### opt: allow join elimination rules to remap columns

When it can be proven that a join does not add rows to or remove them
from one of its inputs, the other input can often be removed, eliminating
the join. However, this can only be done if the columns from the
eliminated side are not needed.

This patch allows the join elimination rules to remap columns from the
eliminated side to the preserved side of the join, using the join's
functional dependencies. For example:
```
CREATE TABLE xy (x INT PRIMARY KEY, y INT);
CREATE TABLE fk (k INT PRIMARY KEY, v INT NOT NULL, FOREIGN KEY (v) REFERENCES xy (x));

SELECT x, k, v FROM fk INNER JOIN xy ON v = x;
```
In the example above, the join could not previously be eliminated because
the `x` column is required in the output. Now, the `x` column is remapped
to the equivalent `v` column, allowing the join to be removed.

Fixes #102614

Release note (performance improvement): The optimizer can now eliminate
joins in more cases.

#### opt: infer equality filters for self joins

When a table is joined to itself with an equality that forms a key
over both join inputs, it is possible to infer equality filters
between each pair of columns at the same ordinal position in the base
table. This patch improves the logical props builder to infer these
self-join equalities in a join's FuncDepSet. This improves the quality
of information available to optimization rules, and in particular, join
elimination rules.

Informs #102614

Release note: None

Co-authored-by: Drew Kimball <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jun 24, 2023

Build failed:

@DrewKimball
Copy link
Collaborator Author

bors retry

@craig
Copy link
Contributor

craig bot commented Jun 26, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jun 26, 2023

Build failed:

@DrewKimball
Copy link
Collaborator Author

Looks like TestExecBuild_upsert is flaking, will investigate.

When it can be proven that a join does not add rows to or remove them
from one of its inputs, the other input can often be removed, eliminating
the join. However, this can only be done if the columns from the
eliminated side are not needed.

This patch allows the join elimination rules to remap columns from the
eliminated side to the preserved side of the join, using the join's
functional dependencies. For example:
```
CREATE TABLE xy (x INT PRIMARY KEY, y INT);
CREATE TABLE fk (k INT PRIMARY KEY, v INT NOT NULL, FOREIGN KEY (v) REFERENCES xy (x));

SELECT x, k, v FROM fk INNER JOIN xy ON v = x;
```
In the example above, the join could not previously be eliminated because
the `x` column is required in the output. Now, the `x` column is remapped
to the equivalent `v` column, allowing the join to be removed.

Fixes cockroachdb#102614

Release note (performance improvement): The optimizer can now eliminate
joins in more cases.
When a table is joined to itself with an equality that forms a key
over both join inputs, it is possible to infer equality filters
between each pair of columns at the same ordinal position in the base
table. This patch improves the logical props builder to infer these
self-join equalities in a join's FuncDepSet. This improves the quality
of information available to optimization rules, and in particular, join
elimination rules.

Informs cockroachdb#102614

Release note: None
@DrewKimball
Copy link
Collaborator Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 26, 2023

Build succeeded:

@craig craig bot merged commit 9bc0a0b into cockroachdb:master Jun 26, 2023
@DrewKimball DrewKimball deleted the eliminate-join branch June 26, 2023 19:36
DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Jun 29, 2023
PR cockroachdb#105214 added logic to infer equivalences for a join that is performing
a self-join with an equality between the same columns, and those columns
form a key over both inputs. This missed the fact that, for nullable columns
from the self-join table, the inferred equivalences are actually *lax*, as
opposed to strict.

A strict equivalence captures the behavior of the SQL `=` operator; strictly
equivalent columns have the same value for every row and no NULL values.
A lax equivalence is similar, but without the restriction on NULL values.
If one column is NULL for a given row, the other must still be NULL at that
row.

Adding all the inferred equivalences as strict could cause other rules that
depend on functional dependencies to perform incorrect transformations; e.g.
adding an invalid filter. This patch moves the equivalence-inference logic
to the join-elimination rules. This is valid because column remapping only
cares about the "col1 and col2 have the same values for every row" semantics
of lax equivalences. If some of those values are NULL, the remapping is
still valid. This ensures that the inferred lax equivalences are only seen
by the join-elimination rules, which know how to handle them.

Fixes cockroachdb#105608

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Jun 30, 2023
The `JoinOrderBuilder` is able to infer equality filters from the transitive
closure of equalities in the reordered join tree for use in building new
joins. However, it isn't valid to infer this equality for nullable columns.
This is because the SQL `=` operator rejects NULL values, while functional
dependency equivalences allow them. It was not previously possible to
encounter this bug because inference of an "interesting" equality required
an existing equality filter on both of the involved columns, which rejects
NULLs anyway. However, cockroachdb#105214 added the ability to infer equivalences
between self-join columns, which *do* allow nulls. This allowed the bug
to manifest.

This patch adds a check to the equality filter inference logic to ensure
that at least one of the candidate columns is non-null. This works because
if one column is non-null, all equivalent columns are also non-null. This
ensures that a null-rejecting filter isn't incorrectly added to a nullable
column.

Fixes cockroachdb#105608

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Jul 6, 2023
The `JoinOrderBuilder` is able to infer equality filters from the transitive
closure of equalities in the reordered join tree for use in building new
joins. However, it isn't valid to infer this equality for nullable columns.
This is because the SQL `=` operator rejects NULL values, while functional
dependency equivalences allow them. It was not previously possible to
encounter this bug because inference of an "interesting" equality required
an existing equality filter on both of the involved columns, which rejects
NULLs anyway. However, cockroachdb#105214 added the ability to infer equivalences
between self-join columns, which *do* allow nulls. This allowed the bug
to manifest.

This patch adds a check to the equality filter inference logic to ensure
that at least one of the candidate columns is non-null. This works because
if one column is non-null, all equivalent columns are also non-null. This
ensures that a null-rejecting filter isn't incorrectly added to a nullable
column.

Fixes cockroachdb#105608

Release note: None
craig bot pushed a commit that referenced this pull request Jul 6, 2023
105866: opt: don't infer join equalities from equivalences on nullable columns r=DrewKimball a=DrewKimball

The `JoinOrderBuilder` is able to infer equality filters from the transitive closure of equalities in the reordered join tree for use in building new joins. However, it isn't valid to infer this equality for nullable columns. This is because the SQL `=` operator rejects NULL values, while functional dependency equivalences allow them. It was not previously possible to encounter this bug because inference of an "interesting" equality required an existing equality filter on both of the involved columns, which rejects NULLs anyway. However, #105214 added the ability to infer equivalences between self-join columns, which *do* allow nulls. This allowed the bug to manifest.

This patch adds a check to the equality filter inference logic to ensure that at least one of the candidate columns is non-null. This works because if one column is non-null, all equivalent columns are also non-null. This ensures that a null-rejecting filter isn't incorrectly added to a nullable column.

Fixes #105608, Fixes #105598

Release note: None

Co-authored-by: Drew Kimball <[email protected]>
DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Jul 8, 2023
Self-join equality inference was added by cockroachdb#105214, so that the `FuncDeps`
for a self-join would include equalities between *every* pair of columns
at the same ordinal position in the base table if there was an equality
between key columns (also at the same ordinal position). However, the
key column check was performed using the FDs of the join inputs rather
than the base table's FDs. This could lead to incorrectly adding self-join
equality filters. For example, consider the following:
```
CREATE TABLE xy (x INT NOT NULL, y INT NOT NULL);
INSERT INTO xy VALUES (1, 1), (1, 2);

SELECT * FROM xy a JOIN xy b ON a.x = b.x;

SELECT * FROM (SELECT * FROM xy LIMIT 1) a
JOIN (SELECT * FROM xy LIMIT 1) b ON a.x = b.x;
```
In the first query above, `a.x = b.x` does not consitute joining on
key columns. But in the second query, both inputs have one row and
so *any* set of columns is a key. However, there are no guarantees
which row is being joined from each table - if `a` is the `(1, 1)`
row and `b` is the `(1, 2)` row, inferring a `a.y = b.y` filter
will incorrectly cause the join to return no rows.

This patch fixes the problem by requiring the initial self-join
equalities to form a key over the *base* table, not just the inputs
of the join.

Fixes cockroachdb#106371

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Jul 18, 2023
Self-join equality inference was added by cockroachdb#105214, so that the `FuncDeps`
for a self-join would include equalities between *every* pair of columns
at the same ordinal position in the base table if there was an equality
between key columns (also at the same ordinal position). However, the
key column check was performed using the FDs of the join inputs rather
than the base table's FDs. This could lead to incorrectly adding self-join
equality filters. For example, consider the following:
```
CREATE TABLE t106371 (x INT NOT NULL, y INT NOT NULL);
INSERT INTO t106371 VALUES (1, 1), (1, 2);

SELECT * FROM t106371 a JOIN t106371 b ON a.x = b.x;

SELECT * FROM (SELECT * FROM t106371 ORDER BY y DESC LIMIT 1) a
JOIN (SELECT DISTINCT ON (x) * FROM t106371) b ON a.x = b.x;
```
In the first query above, `a.x = b.x` does not consitute joining on
key columns. But in the second query, one input has one row and the
other de-duplicated by the `x` column and so `x` is a key over both
inputs. However, the query as written will select different rows for
each input - `a` will return the `(1, 2)` row, while `b` will return
the `(1, 1)` row. Inferring a `a.y = b.y` filter will incorrectly cause
the join to return no rows.

This patch fixes the problem by requiring the initial self-join
equalities to form a key over the *base* table, not just the inputs
of the join.

Fixes cockroachdb#106371

Release note: None
craig bot pushed a commit that referenced this pull request Jul 18, 2023
106485: opt: only infer self-join equality with a key over the base table r=DrewKimball a=DrewKimball

#### opt: only infer self-join equality with a key over the base table

Self-join equality inference was added by #105214, so that the `FuncDeps`
for a self-join would include equalities between *every* pair of columns
at the same ordinal position in the base table if there was an equality
between key columns (also at the same ordinal position). However, the
key column check was performed using the FDs of the join inputs rather
than the base table's FDs. This could lead to incorrectly adding self-join
equality filters. For example, consider the following:
```
CREATE TABLE t106371 (x INT NOT NULL, y INT NOT NULL);
INSERT INTO t106371 VALUES (1, 1), (1, 2);

SELECT * FROM t106371 a JOIN t106371 b ON a.x = b.x;

SELECT * FROM (SELECT * FROM t106371 ORDER BY y DESC LIMIT 1) a
JOIN (SELECT DISTINCT ON (x) * FROM t106371) b ON a.x = b.x;
```
In the first query above, `a.x = b.x` does not consitute joining on
key columns. But in the second query, one input has one row and the
other de-duplicated by the `x` column and so `x` is a key over both
inputs. However, the query as written will select different rows for
each input - `a` will return the `(1, 2)` row, while `b` will return
the `(1, 1)` row. Inferring a `a.y = b.y` filter will incorrectly cause
the join to return no rows.

This patch fixes the problem by requiring the initial self-join
equalities to form a key over the *base* table, not just the inputs
of the join.

Fixes #106371

Release note: None

106779: acceptance: Fix DockerCSharp Test r=rimadeodhar a=rimadeodhar

This PR updates the .net framework within the csproj setup to .net 6. 
Additionally, it also fixes the CS program that we run as
a part of this test with the new npgsql DateTime
changes made as a part of npgsql 6 version update. The old datetime 
types have been deprecated and needed to be removed. 
With these fixes, the test is running successfully and can be unskipped.

Epic: none
Fixes: #86852
Release note: none

107104: cli: fix debug zip with new columns for cluster settings r=maryliag a=maryliag

The addition of new columns on cluster_settings view done on #104449 were causing debug zip fails to add the information from that table, since the query used to gather the information was doing a join and not considering the new columns.

This commit updates the query to use the explicit columns, so even if new columns are added it won't be a problem in the future. It also adds tests for all custom querys used to generate the debug zip, so this type of issue would have been caught.

The file `crdb_internal.cluster_settings.txt` in debug zips was
empty due to this bug on v23.1.5 (only version affected by this bug).

Fixes #107103

Release note (bug fix): Debug zip now are properly showing the information from cluster_settings. The file `crdb_internal.cluster_settings.txt` in debug zips was
empty due to this bug on v23.1.5 (only version affected by this bug).

Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: rimadeodhar <[email protected]>
Co-authored-by: maryliag <[email protected]>
@mgartner
Copy link
Collaborator

mgartner commented Aug 1, 2023

🚨 If we backport this, we'll need to also backport #105723, #107319, and #107995. 🚨

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

opt: TryRemapJoinOuterColsRight prevents optimal plan
4 participants