xform: ignore derived hash bucket lookup join cols for selectivity estimate #86622
Conversation
Nice find!
Reviewed 1 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @msirek)
pkg/sql/opt/lookupjoin/constraint_builder.go
line 267 at r1 (raw file):
projection := b.f.ConstructProjectionsItem(b.f.RemapCols(expr, b.eqColMap), compEqCol)
inputProjections = append(inputProjections, projection)
addEqualityColumns(compEqCol, idxCol, compEqCol)
nit: might be simpler to add directly to derivedKeyCols here
pkg/sql/opt/memo/logical_props_builder.go
line 2272 at r1 (raw file):
h.filterNotNullCols.Add(indexColID)
if !join.DerivedKeyCols.Contains(colID) {
	h.filtersFD.AddEquivalency(colID, indexColID)
I think omitting this equivalency could prevent some optimizations from applying, though it might be unlikely (maybe only when the user is filtering/grouping/ordering by the hash-sharded column).
But I am curious - it looks like there's already some logic in the statistics builder that tries not to double-count selectivity of key columns when they functionally determine each other, see tryReduceJoinCols. I'm curious why that's not working in this case. The hash-sharded column is functionally determined by the other key columns, so it seems like this mechanism should be handling this case.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)
pkg/sql/opt/lookupjoin/constraint_builder.go
line 267 at r1 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: might be simpler to add directly to derivedKeyCols here
Made the change
Code quote:
addEqualityColumns(compEqCol, idxCol, compEqCol)
pkg/sql/opt/memo/logical_props_builder.go
line 2272 at r1 (raw file):
But I am curious - it looks like there's already some logic in the statistics builder that tries not to double-count selectivity of key columns when they functionally determine each other, see tryReduceJoinCols.
I'm curious why that's not working in this case.
My understanding of tryReduceJoinCols is that it is trying to figure out the ON clause constraint columns to use for selectivity estimation, where a constraint is considered to be a predicate involving a column and a constant, so it is not directly applicable to this case. This case is more about join selectivity, and in particular selectivity coming from equality join conditions between columns.
But you bring up a good point. Can we use the functional dependencies which are present to determine if the selectivity from one of the join conditions can be ignored? I don't think we can. For example, instead of having computed columns with identical functions, what if the functions differed slightly (or were totally different):
CREATE TABLE t (b INT,
c INT8 NOT NULL AS (mod(fnv32(crdb_internal.datums_to_bytes(b)), 15:::INT8)) STORED);
CREATE TABLE u (
a INT8 NULL,
b INT8 NULL,
c INT8 NOT NULL AS (mod(fnv32(crdb_internal.datums_to_bytes(b)), 16:::INT8)) STORED,
INDEX u_b_idx (c, b ASC)
);
EXPLAIN (OPT,VERBOSE) SELECT * FROM t JOIN u@u_b_idx USING (c,b) WHERE a < 10;
Columns t.c and u.c are determined by t.b and u.b, like in the hash-sharded case, but we don't have the property that t.c = u.c is true iff t.b = u.b is true. So, application of the predicate t.c = u.c would further reduce the number of qualified rows, and the selectivity of this term should not be ignored. We would probably need to store the actual function (expression) used in the functional dependency, and look for identical functions, if we were to take a functional dependency approach. This could also handle cases like the above, if the functions (expressions) used were identical. I don't know how important this case is, but one could imagine a customer defining computed columns using the same expression in more than one table, and using join conditions on those columns. This seems like a more generalized solution, though. Maybe it deserves a separate issue.
I think omitting this equivalency could prevent some optimizations from applying...
Great point! We don't want to limit our current or future potential to do rewrites by excluding some functional dependencies. I have pushed a new commit which changes the fix to add a new field to FuncDepSet specifying the equivalency columns to ignore for selectivity estimation purposes. Maybe this will also make it easier to handle cases other than lookup join in the future.
Force-pushed 32e42ef to e7d2c9a
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)
pkg/sql/opt/memo/logical_props_builder.go
line 2272 at r1 (raw file):
Previously, msirek (Mark Sirek) wrote…
I'm not sure FuncDepSet is the right place to store this information - knowing which equivalencies are correlated with one another for selectivity estimation seems orthogonal to tracking which equivalencies are valid over the expression. My feeling is that the "right" place to correct this issue would be in selectivityFromEquivalency, using information stored on either the LookupJoinExpr itself, or its memo group. The idea would be to immediately return props.OneSelectivity iff the equivGroup is known to be redundant, with no extra columns. (If the group has equal columns from a redundant computed filter but also has additional equivalent columns, I think we'd want to leave the calculation unchanged.)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)
pkg/sql/opt/memo/logical_props_builder.go
line 2272 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Storing the info directly in FuncDepSet is an object-oriented approach to make sure the information about the equivalency is always carried alongside it, and always kept up-to-date, whether that equivalency is copied and used elsewhere, or has columns remapped and propagated to a new operation (via FuncDepSet.RemapFrom, for example, used by set operations like INTERSECT or EXCEPT which may consume a nested loop join). It should just automatically work, instead of having to explicitly propagate redundancy information for each operation type in current or future uses of the FuncDepSet for estimating selectivity.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @mgartner, @michae2, and @msirek)
pkg/sql/opt/memo/logical_props_builder.go
line 2272 at r1 (raw file):
I agree with @DrewKimball. I'd prefer not to extend FDs for this edge case. They are a fundamental tool used for many different things throughout the optimizer, so I'd prefer that their scope of responsibility remain as narrow and focused as possible.
The set of derived columns is available in the LookupJoinPrivate, so can the statistics builder fetch them directly from there instead of plumbing through an FD?
Storing the info directly in FuncDepSet is an object-oriented approach
Object-oriented patterns are generally discouraged in Go. I think one of the reasons they are discouraged is that they can make it harder to maximize performance of software on modern hardware. (If you're interested, here are some talks that were eye-opening for me about this (they aren't Go-specific): https://www.youtube.com/watch?v=WDIkqP4JbkE, https://vimeo.com/649009599, https://www.youtube.com/watch?v=rX0ItVEVjHc.) In this case, a larger FD struct would decrease the number of FDs that fit in CPU caches, and cause more cache misses. I suspect the performance impact would be significant, because optimizer profiles consistently show that we spend a big chunk (I've seen 10-20% recently) of optimization time performing computations with FDs, though we'd have to measure to know for sure.
If we want something a bit more general / widely applicable, it might be interesting to consider keeping additional information about the "origin" of columns (maybe stored on the metadata or something). This would let us make assertions about the domain of the column, since we maintain the invariant that a given column ID has at most been duplicated, filtered and/or null-extended. We already use this sort of information in calculating join multiplicities, for example, and there might be other places it would be useful too.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)
pkg/sql/opt/memo/logical_props_builder.go
line 2272 at r1 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
OK, I understand. We can defer a general solution until we have more concrete use cases. I've removed the FuncDepSet changes and am handling just selectivityFromEquivalency for nested loop join now.
Nice idea. For now I'll just fix the nested loop join case to keep it simple.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)
Reviewed 3 of 12 files at r3, 3 of 6 files at r4, 12 of 12 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @msirek)
pkg/sql/opt/memo/statistics_builder.go
line 4332 at r5 (raw file):
var derivedEquivCols opt.ColSet
if lookupJoinExpr, ok := e.(*LookupJoinExpr); ok {
	if !lookupJoinExpr.DerivedEquivCols.Empty() {
nit: the check for empty isn't necessary, you can just always set derivedEquivCols if it's a lookup join.
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @michae2)
pkg/sql/opt/memo/statistics_builder.go
line 4332 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: the check for empty isn't necessary, you can just always set derivedEquivCols if it's a lookup join.
Alright, changed it; I was just trying to avoid an unnecessary assignment here, since derivedEquivCols is already initialized to the empty set.
Code quote:
if !lookupJoinExpr.DerivedEquivCols.Empty() {
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from ca7ee14 to blathers/backport-release-22.1-86622: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []. You may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Fixes #85353
Previously, a lookup join involving the first column of a hash-sharded
index and a column from another table would use a derived equality
condition between the invisible hash bucket column and an expression on
the other table's column for selectivity estimation purposes. Since the
derived join condition does not actually reduce the number of qualified
rows, the optimizer would end up with an underestimated row count
estimate for the join, and end up selecting it, when lower cost join
methods existed.
To address this, this patch remembers which left table key columns in
the lookup join are synthesized for derived equijoin conditions on the
lookup table hash bucket column, and ignores them when building the
filter functional dependencies which are later used in
selectivityFromEquivalencies() for calculating the join selectivity.

Release justification: Low risk fix for costly lookup joins on
hash-sharded indexes.
Release note (bug fix): This patch fixes a bug in lookup join
selectivity estimation involving hash-sharded indexes which may cause
lookup joins to be selected by the optimizer in cases where other join
methods are less expensive.
memo: This commit adds redundantEquivCols to FuncDepSet. This ColSet
tracks the columns in FD set equivalences which should not be used
for selectivity estimation.
Release note: none