Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: only infer self-join equality with a key over the base table #106485

Merged
merged 1 commit into from
Jul 18, 2023

Conversation

DrewKimball
Copy link
Collaborator

@DrewKimball DrewKimball commented Jul 8, 2023

opt: only infer self-join equality with a key over the base table

Self-join equality inference was added by #105214, so that the FuncDeps
for a self-join would include equalities between every pair of columns
at the same ordinal position in the base table if there was an equality
between key columns (also at the same ordinal position). However, the
key column check was performed using the FDs of the join inputs rather
than the base table's FDs. This could lead to incorrectly adding self-join
equality filters. For example, consider the following:

CREATE TABLE t106371 (x INT NOT NULL, y INT NOT NULL);
INSERT INTO t106371 VALUES (1, 1), (1, 2);

SELECT * FROM t106371 a JOIN t106371 b ON a.x = b.x;

SELECT * FROM (SELECT * FROM t106371 ORDER BY y DESC LIMIT 1) a
JOIN (SELECT DISTINCT ON (x) * FROM t106371) b ON a.x = b.x;

In the first query above, a.x = b.x does not consitute joining on
key columns. But in the second query, one input has one row and the
other de-duplicated by the x column and so x is a key over both
inputs. However, the query as written will select different rows for
each input - a will return the (1, 2) row, while b will return
the (1, 1) row. Inferring a a.y = b.y filter will incorrectly cause
the join to return no rows.

This patch fixes the problem by requiring the initial self-join
equalities to form a key over the base table, not just the inputs
of the join.

Fixes #106371

Release note: None

@DrewKimball DrewKimball requested a review from mgartner July 8, 2023 21:22
@DrewKimball DrewKimball requested a review from a team as a code owner July 8, 2023 21:22
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just need to fix some tests.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball)


pkg/sql/opt/memo/logical_props_builder.go line 2553 at r1 (raw file):

	}
	for leftTable, leftTableOrds := range leftTables {
		var baseTabFDs *props.FuncDepSet

The "resetting" of baseTabFDs for every iteration of the outer loop is subtle and might be easy for someone to mess up in the future. Consider adding a comment to highlight this or a test that will fail if it's not reset.


pkg/sql/logictest/testdata/logic_test/join line 1210 at r1 (raw file):


# Regression test for #106371 - only infer self-join equalities if the original
# equality columns form a key over the *base* table, not just the join inputs.

Can you simplify this to the example in the commit message, or does it the reproduction require all this?


pkg/sql/logictest/testdata/logic_test/join line 1234 at r1 (raw file):


statement ok
CREATE TABLE queries_item_tags (                                                                      

nit: remove trailing spaces

Copy link
Collaborator Author

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/memo/logical_props_builder.go line 2553 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

The "resetting" of baseTabFDs for every iteration of the outer loop is subtle and might be easy for someone to mess up in the future. Consider adding a comment to highlight this or a test that will fail if it's not reset.

Since MakeTableFuncDep caches the FDs anyway, I've just changed it to unconditionally initialize baseTabFDs in the outer loop, and leave it untouched in the inner loop.


pkg/sql/logictest/testdata/logic_test/join line 1210 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Can you simplify this to the example in the commit message, or does it the reproduction require all this?

Good point, a modified version of the commit example will work for a logic test. Done.


pkg/sql/logictest/testdata/logic_test/join line 1234 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: remove trailing spaces

Replaced the test case as mentioned above.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice find!

Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)


pkg/sql/opt/memo/logical_props_builder.go line 2553 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Since MakeTableFuncDep caches the FDs anyway, I've just changed it to unconditionally initialize baseTabFDs in the outer loop, and leave it untouched in the inner loop.

Nice solution!


pkg/sql/opt/memo/testdata/logprops/join line 3010 at r2 (raw file):

# from each input.
norm
SELECT * FROM xyz INNER JOIN xyz foo ON xyz.x = foo.x AND xyz.y = 1 AND foo.y = 2

We already had a test for this, but none of us realized the output was invalid... 🤷

Self-join equality inference was added by cockroachdb#105214, so that the `FuncDeps`
for a self-join would include equalities between *every* pair of columns
at the same ordinal position in the base table if there was an equality
between key columns (also at the same ordinal position). However, the
key column check was performed using the FDs of the join inputs rather
than the base table's FDs. This could lead to incorrectly adding self-join
equality filters. For example, consider the following:
```
CREATE TABLE t106371 (x INT NOT NULL, y INT NOT NULL);
INSERT INTO t106371 VALUES (1, 1), (1, 2);

SELECT * FROM t106371 a JOIN t106371 b ON a.x = b.x;

SELECT * FROM (SELECT * FROM t106371 ORDER BY y DESC LIMIT 1) a
JOIN (SELECT DISTINCT ON (x) * FROM t106371) b ON a.x = b.x;
```
In the first query above, `a.x = b.x` does not consitute joining on
key columns. But in the second query, one input has one row and the
other de-duplicated by the `x` column and so `x` is a key over both
inputs. However, the query as written will select different rows for
each input - `a` will return the `(1, 2)` row, while `b` will return
the `(1, 1)` row. Inferring a `a.y = b.y` filter will incorrectly cause
the join to return no rows.

This patch fixes the problem by requiring the initial self-join
equalities to form a key over the *base* table, not just the inputs
of the join.

Fixes cockroachdb#106371

Release note: None
@DrewKimball
Copy link
Collaborator Author

TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Jul 18, 2023

Timed out.

@DrewKimball
Copy link
Collaborator Author

bors retry

@DrewKimball
Copy link
Collaborator Author

Looks like the bors failure was transient, due to something in the test infra.

@craig
Copy link
Contributor

craig bot commented Jul 18, 2023

Build succeeded:

@craig craig bot merged commit 06b5ba7 into cockroachdb:master Jul 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: django failed
3 participants