
xform: ignore derived hash bucket lookup join cols for selectivity estimate #86622

Merged
merged 1 commit into cockroachdb:master on Sep 5, 2022

Conversation

msirek
Contributor

@msirek msirek commented Aug 22, 2022

Fixes #85353

Previously, a lookup join involving the first column of a hash-sharded
index and a column from another table would use a derived equality
condition between the invisible hash bucket column and an expression on
the other table's column for selectivity estimation purposes. Since the
derived join condition does not actually reduce the number of qualified
rows, the optimizer would underestimate the join's row count and select
the lookup join even when lower-cost join methods existed.
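
For intuition, here is a minimal standalone Go sketch (not CockroachDB code;
the bucket expression is modeled on the mod(fnv32(...), bucket_count) form
used by hash-sharded indexes, and the bucket count and values are illustrative)
showing why the derived condition filters nothing:

package main

import (
	"fmt"
	"hash/fnv"
)

// bucket mimics the shape of a hash-sharded index's computed shard column,
// i.e. mod(fnv32(<key column>), <bucket count>).
func bucket(v int64, bucketCount uint32) uint32 {
	h := fnv.New32()
	fmt.Fprintf(h, "%d", v)
	return h.Sum32() % bucketCount
}

func main() {
	const bucketCount = 16 // illustrative
	tB, uB := int64(42), int64(42)
	// Whenever t.b = u.b, the derived condition bucket(t.b) = bucket(u.b)
	// is automatically true, so it removes no additional rows.
	fmt.Println(tB == uB, bucket(tB, bucketCount) == bucket(uB, bucketCount)) // true true
}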

To address this, this patch remembers which left table key columns in
the lookup join are synthesized for derived equijoin conditions on the
lookup table hash bucket column, and ignores them when building the
filter functional dependencies which are later used in
selectivityFromEquivalencies() for calculating the join selectivity.

Release justification: Low risk fix for costly lookup joins on
hash-sharded indexes.

Release note (bug fix): This patch fixes a bug in lookup join
selectivity estimation involving hash-sharded indexes, which could
cause the optimizer to select lookup joins in cases where other join
methods are less expensive.

memo: This commit adds redundantEquivCols to FuncDepSet. This ColSet
tracks the columns in FD set equivalences which should not be used
for selectivity estimation.

Release note: none

@msirek msirek requested review from mgartner and michae2 August 22, 2022 22:52
@msirek msirek requested a review from a team as a code owner August 22, 2022 22:52
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@mgartner mgartner left a comment


Nice find!

Reviewed 1 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @msirek)


pkg/sql/opt/lookupjoin/constraint_builder.go line 267 at r1 (raw file):

			projection := b.f.ConstructProjectionsItem(b.f.RemapCols(expr, b.eqColMap), compEqCol)
			inputProjections = append(inputProjections, projection)
			addEqualityColumns(compEqCol, idxCol, compEqCol)

nit: might be simpler to add directly to derivedKeyCols here


pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):

			h.filterNotNullCols.Add(indexColID)
			if !join.DerivedKeyCols.Contains(colID) {
				h.filtersFD.AddEquivalency(colID, indexColID)

I think omitting this equivalency could prevent some optimizations from applying, though it might be unlikely (maybe only when the user is filtering/grouping/ordering by the hash-sharded column).

But I am curious - it looks like there's already some logic in the statistics builder that tries not to double-count selectivity of key columns when they functionally determine each other, see tryReduceJoinCols. I'm curious why that's not working in this case. The hash-sharded column is functionally determined by the other key columns, so it seems like this mechanism should be handling this case.

Contributor Author

@msirek msirek left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/opt/lookupjoin/constraint_builder.go line 267 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: might be simpler to add directly to derivedKeyCols here

Made the change

Code quote:

addEqualityColumns(compEqCol, idxCol, compEqCol)

pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):

But I am curious - it looks like there's already some logic in the statistics builder that tries not to double-count selectivity of key columns when they functionally determine each other, see tryReduceJoinCols.
I'm curious why that's not working in this case.

My understanding of tryReduceJoinCols is that it tries to determine which ON clause constraint columns to use for selectivity estimation, where a constraint is a predicate involving a column and a constant, so it is not directly applicable here. This case is about join selectivity, in particular the selectivity coming from equality join conditions between columns.

But you bring up a good point. Can we use the functional dependencies which are present to determine if the selectivity from one of the join conditions can be ignored? I don't think we can. For example, instead of having computed columns with identical functions, what if the functions differed slightly (or were totally different):

CREATE TABLE t (b INT,
      c INT8 NOT NULL AS (mod(fnv32(crdb_internal.datums_to_bytes(b)), 15:::INT8)) STORED);

CREATE TABLE u (
      a INT8 NULL,
      b INT8 NULL,
      c INT8 NOT NULL AS (mod(fnv32(crdb_internal.datums_to_bytes(b)), 16:::INT8)) STORED,
      INDEX u_b_idx (c, b ASC)
  );

EXPLAIN (OPT,VERBOSE) SELECT * FROM t JOIN u@u_b_idx USING (c,b) WHERE a < 10;

Columns t.c and u.c are determined by t.b and u.b, as in the hash-sharded case, but we don't have the property that t.c = u.c holds iff t.b = u.b holds. So applying the predicate t.c = u.c would further reduce the number of qualified rows, and the selectivity of this term should not be ignored. If we were to take a functional-dependency approach, we would probably need to store the actual function (expression) used in the functional dependency and look for identical functions. That could also handle cases like the one above when the expressions used are identical. I don't know how important this case is, but one could imagine a customer defining computed columns using the same expression in more than one table and using join conditions on those columns. This seems like a more generalized solution, though. Maybe it deserves a separate issue.
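
A quick standalone Go check of that point (not CockroachDB code; fnv32 over the
integer's decimal string stands in for fnv32(crdb_internal.datums_to_bytes(b))):

package main

import (
	"fmt"
	"hash/fnv"
)

func fnv32(v int64) uint32 {
	h := fnv.New32()
	fmt.Fprintf(h, "%d", v)
	return h.Sum32()
}

func main() {
	matches := 0
	for b := int64(0); b < 1000; b++ {
		// t.c = mod(fnv32(b), 15) and u.c = mod(fnv32(b), 16): even with
		// t.b = u.b (both equal to b here), the computed columns usually
		// differ, so the predicate t.c = u.c genuinely filters rows and
		// its selectivity cannot be ignored.
		if fnv32(b)%15 == fnv32(b)%16 {
			matches++
		}
	}
	fmt.Printf("t.c = u.c held for %d of 1000 rows with t.b = u.b\n", matches)
}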

I think omitting this equivalency could prevent some optimizations from applying...

Great point! We don't want to limit our current or future potential to do rewrites by excluding some functional dependencies. I have pushed a new commit which changes the fix to add a new field to FuncDepSet specifying the equivalency columns to ignore for selectivity estimation purposes. Maybe this will also make it easier to handle cases other than lookup join in the future.
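
The shape of that change, as a toy standalone model (the field and method names
below are illustrative, not the actual FuncDepSet API):

package main

import "fmt"

// Toy model of the approach described above: equivalences stay in the FD set
// (so rewrites can still use them), but columns whose equivalences were only
// derived from a hash bucket expression are remembered in a separate set that
// selectivity estimation consults.
type funcDepSet struct {
	equivs             [][2]int     // pairs of equivalent column IDs
	redundantEquivCols map[int]bool // columns to ignore for selectivity
}

func (fd *funcDepSet) addEquivalency(a, b int, redundant bool) {
	fd.equivs = append(fd.equivs, [2]int{a, b})
	if redundant {
		if fd.redundantEquivCols == nil {
			fd.redundantEquivCols = map[int]bool{}
		}
		fd.redundantEquivCols[a] = true
		fd.redundantEquivCols[b] = true
	}
}

// selectivityFromEquivalencies multiplies in a per-equivalence factor, but
// skips equivalences made entirely of redundant derived columns.
func (fd *funcDepSet) selectivityFromEquivalencies(perEquivFactor float64) float64 {
	sel := 1.0
	for _, eq := range fd.equivs {
		if fd.redundantEquivCols[eq[0]] && fd.redundantEquivCols[eq[1]] {
			continue // derived bucket equality contributes no extra filtering
		}
		sel *= perEquivFactor
	}
	return sel
}

func main() {
	var fd funcDepSet
	fd.addEquivalency(1, 3, false) // t.b = u.b
	fd.addEquivalency(2, 4, true)  // derived: bucket column equality
	fmt.Println(fd.selectivityFromEquivalencies(0.1)) // 0.1, not 0.01
}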

@msirek msirek force-pushed the 85353_2 branch 2 times, most recently from 32e42ef to e7d2c9a Compare August 29, 2022 02:53
Collaborator

@DrewKimball DrewKimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…


I'm not sure FuncDepSet is the right place to store this information - knowing which equivalencies are correlated with one another for selectivity estimation seems orthogonal to tracking which equivalencies are valid over the expression. My feeling is that the "right" place to correct this issue would be in selectivityFromEquivalency, using information stored on either the LookupJoinExpr itself, or its memo group. The idea would be to immediately return props.OneSelectivity iff the equivGroup is known to be redundant, with no extra columns. (If the group has equal columns from a redundant computed filter but also has additional equivalent columns, I think we'd want to leave the calculation unchanged).
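
A sketch of where that check might live, as standalone Go (names such as
derivedEquivCols are placeholders for whatever the lookup join expression would
carry, not the statistics builder's real API):

package main

import "fmt"

// selectivityFromEquivalency returns 1 (no extra filtering) only when every
// column in the equivalence group comes from a derived computed-column
// equality; if the group contains any other columns, the normal calculation
// is left unchanged, as suggested above.
func selectivityFromEquivalency(equivGroup []int, derivedEquivCols map[int]bool, normal float64) float64 {
	if len(equivGroup) == 0 {
		return normal
	}
	for _, col := range equivGroup {
		if !derivedEquivCols[col] {
			return normal // extra equivalent columns: keep the usual estimate
		}
	}
	return 1.0 // fully redundant derived equality
}

func main() {
	derived := map[int]bool{2: true, 4: true}
	fmt.Println(selectivityFromEquivalency([]int{2, 4}, derived, 0.1))    // 1
	fmt.Println(selectivityFromEquivalency([]int{2, 4, 7}, derived, 0.1)) // 0.1
}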

Contributor Author

@msirek msirek left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…


Storing the info directly in FuncDepSet is an object-oriented approach to make sure the information about the equivalency is always carried alongside it and kept up to date, whether that equivalency is copied and used elsewhere or has its columns remapped and propagated to a new operation (via FuncDepSet.RemapFrom, for example, which is used by set operations like INTERSECT or EXCEPT that may consume a nested loop join). It should just work automatically, instead of having to explicitly propagate redundancy information for each operation type in current or future uses of the FuncDepSet for estimating selectivity.

Collaborator

@mgartner mgartner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @mgartner, @michae2, and @msirek)


pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):
I agree with @DrewKimball. I'd prefer not to extend FDs for this edge case. They are a fundamental tool used for many different things throughout the optimizer, so I'd prefer that their scope of responsibility remain as narrow and focused as possible.

The set of derived columns is available in the LookupJoinPrivate, so can the statistics builder fetch them directly from there instead of plumbing through an FD?


Storing the info directly in FuncDepSet is an object-oriented approach

Object-oriented patterns are generally discouraged in Go. I think one of the reasons is that they can make it harder to maximize performance of software on modern hardware. (If you're interested, here are some talks that were eye-opening for me about this; they aren't Go-specific: https://www.youtube.com/watch?v=WDIkqP4JbkE, https://vimeo.com/649009599, https://www.youtube.com/watch?v=rX0ItVEVjHc.) In this case, a larger FD struct would decrease the number of FDs that fit in CPU caches and cause more cache misses. I suspect the performance impact would be significant, because optimizer profiles consistently show that we spend a big chunk (I've seen 10-20% recently) of optimization time performing computations with FDs, though we'd have to measure to know for sure.
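
A rough way to see the size concern, as a standalone toy (the struct layouts
below are stand-ins, not the real FuncDepSet or opt.ColSet):

package main

import (
	"fmt"
	"unsafe"
)

// Stand-in types only; the real FuncDepSet and opt.ColSet layouts differ.
type colSet struct {
	small    uint64
	overflow []uint64
}

type fdWithoutExtra struct {
	deps []uint64
}

type fdWithExtra struct {
	deps               []uint64
	redundantEquivCols colSet
}

func main() {
	// Every FD set allocated during optimization would carry the extra
	// field, so fewer of them fit in the CPU caches.
	fmt.Println(unsafe.Sizeof(fdWithoutExtra{}), unsafe.Sizeof(fdWithExtra{}))
}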

@DrewKimball
Collaborator

If we want something a bit more general / widely applicable, it might be interesting to consider keeping additional information about the "origin" of columns (maybe stored on the metadata or something). This would let us make assertions about the domain of the column, since we maintain the invariant that a given column ID has at most been duplicated, filtered and/or null-extended. We already use this sort of information in calculating join multiplicities, for example, and there might be other places it would be useful too.

Contributor Author

@msirek msirek left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/opt/memo/logical_props_builder.go line 2272 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…


OK, I understand. We can defer a general solution until we have more concrete use cases. I've removed the FuncDepSet changes and am handling just selectivityFromEquivalency for nested loop join now.

Contributor Author

@msirek msirek left a comment


Nice idea. For now I'll just fix the nested loop join case to keep it simple.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)

Collaborator

@mgartner mgartner left a comment


Nicely done! :lgtm:

Reviewed 3 of 12 files at r3, 3 of 6 files at r4, 12 of 12 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @msirek)


pkg/sql/opt/memo/statistics_builder.go line 4332 at r5 (raw file):

	var derivedEquivCols opt.ColSet
	if lookupJoinExpr, ok := e.(*LookupJoinExpr); ok {
		if !lookupJoinExpr.DerivedEquivCols.Empty() {

nit: the check for empty isn't necessary, you can just always set derivedEquivCols if it's a lookup join.

Contributor Author

@msirek msirek left a comment


TFTR!
bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @michae2)


pkg/sql/opt/memo/statistics_builder.go line 4332 at r5 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: the check for empty isn't necessary, you can just always set derivedEquivCols if it's a lookup join.

Alright, changed it; I was just trying to avoid an unnecessary assignment here, since derivedEquivCols is already initialized to the empty set.

Code quote:

if !lookupJoinExpr.DerivedEquivCols.Empty() {

@craig
Contributor

craig bot commented Sep 5, 2022

Build succeeded:

@craig craig bot merged commit 818a6b8 into cockroachdb:master Sep 5, 2022
@blathers-crl

blathers-crl bot commented Sep 5, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ca7ee14 to blathers/backport-release-22.1-86622: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Successfully merging this pull request may close these issues.

sql/opt: selectivity is wrong for joins with hash-sharded indexes