
xform: ignore derived hash bucket lookup join cols for selectivity estimate #86622

Merged
merged 1 commit into cockroachdb:master on Sep 5, 2022

Conversation

msirek
Contributor

@msirek msirek commented Aug 22, 2022

Fixes #85353

Previously, a lookup join involving the first column of a hash-sharded
index and a column from another table would use a derived equality
condition between the invisible hash bucket column and an expression on
the other table's column for selectivity estimation purposes. Since the
derived join condition does not actually reduce the number of qualified
rows, the optimizer would underestimate the join's row count and select
the lookup join even when lower-cost join methods existed.
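
For intuition, here is a minimal standalone Go sketch (not CockroachDB code;
the bucket expression is modeled on the mod(fnv32(...), bucket_count) form
used by hash-sharded indexes, and the bucket count and values are illustrative)
showing why the derived condition filters nothing:

package main

import (
	"fmt"
	"hash/fnv"
)

// bucket mimics the shape of a hash-sharded index's computed shard column,
// i.e. mod(fnv32(<key column>), <bucket count>).
func bucket(v int64, bucketCount uint32) uint32 {
	h := fnv.New32()
	fmt.Fprintf(h, "%d", v)
	return h.Sum32() % bucketCount
}

func main() {
	const bucketCount = 16 // illustrative
	tB, uB := int64(42), int64(42)
	// Whenever t.b = u.b, the derived condition bucket(t.b) = bucket(u.b)
	// is automatically true, so it removes no additional rows.
	fmt.Println(tB == uB, bucket(tB, bucketCount) == bucket(uB, bucketCount)) // true true
}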

To address this, this patch remembers which left table key columns in
the lookup join are synthesized for derived equijoin conditions on the
lookup table hash bucket column, and ignores them when building the
filter functional dependencies which are later used in
selectivityFromEquivalencies() for calculating the join selectivity.

Release justification: Low risk fix for costly lookup joins on
hash-sharded indexes.

Release note (bug fix): This patch fixes a bug in lookup join
selectivity estimation involving hash-sharded indexes, which could
cause the optimizer to select lookup joins in cases where other join
methods are less expensive.

memo: This commit adds redundantEquivCols to FuncDepSet. This ColSet
tracks the columns in FD set equivalences which should not be used
for selectivity estimation.

Release note: none

@msirek msirek requested review from mgartner and michae2 August 22, 2022 22:52
@msirek msirek requested a review from a team as a code owner August 22, 2022 22:52
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@mgartner mgartner left a comment


Nice find!

Reviewed 1 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @msirek)


pkg/sql/opt/lookupjoin/constraint_builder.go line 267 at r1 (raw file):

			projection := b.f.ConstructProjectionsItem(b.f.RemapCols(expr, b.eqColMap), compEqCol)
			inputProjections = append(inputProjections, projection)
			addEqualityColumns(compEqCol, idxCol, compEqCol)

nit: might be simpler to add directly to derivedKeyCols here


pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):

			h.filterNotNullCols.Add(indexColID)
			if !join.DerivedKeyCols.Contains(colID) {
				h.filtersFD.AddEquivalency(colID, indexColID)

I think omitting this equivalency could prevent some optimizations from applying, though it might be unlikely (maybe only when the user is filtering/grouping/ordering by the hash-sharded column).

But I am curious - it looks like there's already some logic in the statistics builder that tries not to double-count selectivity of key columns when they functionally determine each other, see tryReduceJoinCols. I'm curious why that's not working in this case. The hash-sharded column is functionally determined by the other key columns, so it seems like this mechanism should be handling this case.

Contributor Author

@msirek msirek left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/opt/lookupjoin/constraint_builder.go line 267 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: might be simpler to add directly to derivedKeyCols here

Made the change

Code quote:

addEqualityColumns(compEqCol, idxCol, compEqCol)

pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):

But I am curious - it looks like there's already some logic in the statistics builder that tries not to double-count selectivity of key columns when they functionally determine each other, see tryReduceJoinCols.
I'm curious why that's not working in this case.

My understanding of tryReduceJoinCols is that it tries to determine which ON clause constraint columns to use for selectivity estimation, where a constraint is a predicate involving a column and a constant, so it is not directly applicable here. This case is about join selectivity, in particular the selectivity coming from equality join conditions between columns.

But you bring up a good point. Can we use the functional dependencies which are present to determine if the selectivity from one of the join conditions can be ignored? I don't think we can. For example, instead of having computed columns with identical functions, what if the functions differed slightly (or were totally different):

CREATE TABLE t (b INT,
      c INT8 NOT NULL AS (mod(fnv32(crdb_internal.datums_to_bytes(b)), 15:::INT8)) STORED);

CREATE TABLE u (
      a INT8 NULL,
      b INT8 NULL,
      c INT8 NOT NULL AS (mod(fnv32(crdb_internal.datums_to_bytes(b)), 16:::INT8)) STORED,
      INDEX u_b_idx (c, b ASC)
  );

EXPLAIN (OPT,VERBOSE) SELECT * FROM t JOIN u@u_b_idx USING (c,b) WHERE a < 10;

Columns t.c and u.c are determined by t.b and u.b, as in the hash-sharded case, but we don't have the property that t.c = u.c holds iff t.b = u.b holds. So applying the predicate t.c = u.c would further reduce the number of qualified rows, and the selectivity of this term should not be ignored. If we were to take a functional-dependency approach, we would probably need to store the actual function (expression) used in the functional dependency and look for identical functions. That could also handle cases like the one above when the expressions used are identical. I don't know how important this case is, but one could imagine a customer defining computed columns using the same expression in more than one table and using join conditions on those columns. This seems like a more generalized solution, though. Maybe it deserves a separate issue.
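
A quick standalone Go check of that point (not CockroachDB code; fnv32 over the
integer's decimal string stands in for fnv32(crdb_internal.datums_to_bytes(b))):

package main

import (
	"fmt"
	"hash/fnv"
)

func fnv32(v int64) uint32 {
	h := fnv.New32()
	fmt.Fprintf(h, "%d", v)
	return h.Sum32()
}

func main() {
	matches := 0
	for b := int64(0); b < 1000; b++ {
		// t.c = mod(fnv32(b), 15) and u.c = mod(fnv32(b), 16): even with
		// t.b = u.b (both equal to b here), the computed columns usually
		// differ, so the predicate t.c = u.c genuinely filters rows and
		// its selectivity cannot be ignored.
		if fnv32(b)%15 == fnv32(b)%16 {
			matches++
		}
	}
	fmt.Printf("t.c = u.c held for %d of 1000 rows with t.b = u.b\n", matches)
}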

I think omitting this equivalency could prevent some optimizations from applying...

Great point! We don't want to limit our current or future potential to do rewrites by excluding some functional dependencies. I have pushed a new commit which changes the fix to add a new field to FuncDepSet specifying the equivalency columns to ignore for selectivity estimation purposes. Maybe this will also make it easier to handle cases other than lookup join in the future.
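
The shape of that change, as a toy standalone model (the field and method names
below are illustrative, not the actual FuncDepSet API):

package main

import "fmt"

// Toy model of the approach described above: equivalences stay in the FD set
// (so rewrites can still use them), but columns whose equivalences were only
// derived from a hash bucket expression are remembered in a separate set that
// selectivity estimation consults.
type funcDepSet struct {
	equivs             [][2]int     // pairs of equivalent column IDs
	redundantEquivCols map[int]bool // columns to ignore for selectivity
}

func (fd *funcDepSet) addEquivalency(a, b int, redundant bool) {
	fd.equivs = append(fd.equivs, [2]int{a, b})
	if redundant {
		if fd.redundantEquivCols == nil {
			fd.redundantEquivCols = map[int]bool{}
		}
		fd.redundantEquivCols[a] = true
		fd.redundantEquivCols[b] = true
	}
}

// selectivityFromEquivalencies multiplies in a per-equivalence factor, but
// skips equivalences made entirely of redundant derived columns.
func (fd *funcDepSet) selectivityFromEquivalencies(perEquivFactor float64) float64 {
	sel := 1.0
	for _, eq := range fd.equivs {
		if fd.redundantEquivCols[eq[0]] && fd.redundantEquivCols[eq[1]] {
			continue // derived bucket equality contributes no extra filtering
		}
		sel *= perEquivFactor
	}
	return sel
}

func main() {
	var fd funcDepSet
	fd.addEquivalency(1, 3, false) // t.b = u.b
	fd.addEquivalency(2, 4, true)  // derived: bucket column equality
	fmt.Println(fd.selectivityFromEquivalencies(0.1)) // 0.1, not 0.01
}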

@msirek msirek force-pushed the 85353_2 branch 2 times, most recently from 32e42ef to e7d2c9a Compare August 29, 2022 02:53
Collaborator

@DrewKimball DrewKimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…


I'm not sure FuncDepSet is the right place to store this information - knowing which equivalencies are correlated with one another for selectivity estimation seems orthogonal to tracking which equivalencies are valid over the expression. My feeling is that the "right" place to correct this issue would be in selectivityFromEquivalency, using information stored on either the LookupJoinExpr itself, or its memo group. The idea would be to immediately return props.OneSelectivity iff the equivGroup is known to be redundant, with no extra columns. (If the group has equal columns from a redundant computed filter but also has additional equivalent columns, I think we'd want to leave the calculation unchanged).
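
A sketch of where that check might live, as standalone Go (names such as
derivedEquivCols are placeholders for whatever the lookup join expression would
carry, not the statistics builder's real API):

package main

import "fmt"

// selectivityFromEquivalency returns 1 (no extra filtering) only when every
// column in the equivalence group comes from a derived computed-column
// equality; if the group contains any other columns, the normal calculation
// is left unchanged, as suggested above.
func selectivityFromEquivalency(equivGroup []int, derivedEquivCols map[int]bool, normal float64) float64 {
	if len(equivGroup) == 0 {
		return normal
	}
	for _, col := range equivGroup {
		if !derivedEquivCols[col] {
			return normal // extra equivalent columns: keep the usual estimate
		}
	}
	return 1.0 // fully redundant derived equality
}

func main() {
	derived := map[int]bool{2: true, 4: true}
	fmt.Println(selectivityFromEquivalency([]int{2, 4}, derived, 0.1))    // 1
	fmt.Println(selectivityFromEquivalency([]int{2, 4, 7}, derived, 0.1)) // 0.1
}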

Contributor Author

@msirek msirek left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…


Storing the info directly in FuncDepSet is an object-oriented approach to make sure the information about the equivalency is always carried alongside it and kept up to date, whether that equivalency is copied and used elsewhere or has its columns remapped and propagated to a new operation (via FuncDepSet.RemapFrom, for example, which is used by set operations like INTERSECT or EXCEPT that may consume a nested loop join). It should just work automatically, instead of having to explicitly propagate redundancy information for each operation type in current or future uses of the FuncDepSet for estimating selectivity.

Collaborator

@mgartner mgartner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @mgartner, @michae2, and @msirek)


pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):
I agree with @DrewKimball. I'd prefer not to extend FDs for this edge case. They are a fundamental tool used for many different things throughout the optimizer, so I'd prefer that their scope of responsibility remain as narrow and focused as possible.

The set of derived columns is available in the LookupJoinPrivate, so can the statistics builder fetch them directly from there instead of plumbing through an FD?


Storing the info directly in FuncDepSet is an object-oriented approach

Object-oriented patterns are generally discouraged in Go. I think one of the reasons is that they can make it harder to maximize performance of software on modern hardware. (If you're interested, here are some talks that were eye-opening for me about this; they aren't Go-specific: https://www.youtube.com/watch?v=WDIkqP4JbkE, https://vimeo.com/649009599, https://www.youtube.com/watch?v=rX0ItVEVjHc.) In this case, a larger FD struct would decrease the number of FDs that fit in CPU caches and cause more cache misses. I suspect the performance impact would be significant, because optimizer profiles consistently show that we spend a big chunk (I've seen 10-20% recently) of optimization time performing computations with FDs, though we'd have to measure to know for sure.
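
A rough way to see the size concern, as a standalone toy (the struct layouts
below are stand-ins, not the real FuncDepSet or opt.ColSet):

package main

import (
	"fmt"
	"unsafe"
)

// Stand-in types only; the real FuncDepSet and opt.ColSet layouts differ.
type colSet struct {
	small    uint64
	overflow []uint64
}

type fdWithoutExtra struct {
	deps []uint64
}

type fdWithExtra struct {
	deps               []uint64
	redundantEquivCols colSet
}

func main() {
	// Every FD set allocated during optimization would carry the extra
	// field, so fewer of them fit in the CPU caches.
	fmt.Println(unsafe.Sizeof(fdWithoutExtra{}), unsafe.Sizeof(fdWithExtra{}))
}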

@DrewKimball
Collaborator

If we want something a bit more general / widely applicable, it might be interesting to consider keeping additional information about the "origin" of columns (maybe stored on the metadata or something). This would let us make assertions about the domain of the column, since we maintain the invariant that a given column ID has at most been duplicated, filtered and/or null-extended. We already use this sort of information in calculating join multiplicities, for example, and there might be other places it would be useful too.

Contributor Author

@msirek msirek left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…


OK, I understand. We can defer a general solution until we have more concrete use cases. I've removed the FuncDepSet changes and am handling just selectivityFromEquivalency for nested loop join now.

Contributor Author

@msirek msirek left a comment


Nice idea. For now I'll just fix the nested loop join case to keep it simple.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)

Collaborator

@mgartner mgartner left a comment


Nicely done! :lgtm:

Reviewed 3 of 12 files at r3, 3 of 6 files at r4, 12 of 12 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @msirek)


pkg/sql/opt/memo/statistics_builder.go line 4332 at r5 (raw file):

	var derivedEquivCols opt.ColSet
	if lookupJoinExpr, ok := e.(*LookupJoinExpr); ok {
		if !lookupJoinExpr.DerivedEquivCols.Empty() {

nit: the check for empty isn't necessary, you can just always set derivedEquivCols if it's a lookup join.

Contributor Author

@msirek msirek left a comment


TFTR!
bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @michae2)


pkg/sql/opt/memo/statistics_builder.go line 4332 at r5 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: the check for empty isn't necessary, you can just always set derivedEquivCols if it's a lookup join.

Alright, changed it; I was just trying to avoid an unnecessary assignment here, since derivedEquivCols is already initialized to the empty set.

Code quote:

if !lookupJoinExpr.DerivedEquivCols.Empty() {

@craig
Contributor

craig bot commented Sep 5, 2022

Build succeeded:

@craig craig bot merged commit 818a6b8 into cockroachdb:master Sep 5, 2022
@blathers-crl

blathers-crl bot commented Sep 5, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ca7ee14 to blathers/backport-release-22.1-86622: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Successfully merging this pull request may close these issues.

sql/opt: selectivity is wrong for joins with hash-sharded indexes