-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
opt: create inverted or lookup joins for left/anti joins in more cases #59649
Comments
I'm not sure this will work because
In a case like below, I think we'd incorrectly de-duplicate the
But it might work if we ensured there was a strict key in the input? In that case, would a distinct-on above the lookup join with all columns as grouping column be equivalent? |
Yea, I agree that we have to ensure that a strict key exists on the input. I think we have functions that do this by either adding the primary key back to a scan or wrapping everything in an ordinality operator.
Do we need the distinct-on if we use a EDIT: Oh I see -- yea, your suggestion might be better than a union. |
Something still feels fishy about a distinct-on or a UNION to me. A left-join can produce multiple rows from the left with the same PK, so a distinct-on or UNION don't seem valid. For example:
The presence of the PK |
Hmm good point. Another issue is that there can be some null-extended rows that appear for rows that do have a match. Here's an example with the buggy code from before the fix merged:
Notice the null-extended row starting with 1 in addition to the null-extended rows starting with 2000. We have to make sure that the null-extended row starting with 1 doesn't appear at all. |
Here's something that I think would work: For anti joins, we would use the existing cross-join approach and simply wrap it with a group by + having clause. Using the example from above with 3 possible values in the check constraint:
The following two queries are equivalent:
We would need to ensure that the input has a strict key and the count in the having clause is equal to the number of values in the right side of the cross join. For a left join, we could simply take the output of the anti join, null-extend it, and It's not ideal, but I haven't been able to think of any other correct solution yet that doesn't require execution changes. I also haven't looked at how hard it would be to make the execution changes -- maybe it would be better to just bite the bullet and do that. |
Probably a good idea to get @yuzefovich's opinion on this. @yuzefovich, how hard do you think it would be to support looking up several spans in the lookup joiner, instead of just a single equality span? For example, if the index has columns |
I think @asubiotto knows the joinReader code better, but I'd guess it won't be that hard of a change. |
Mechanically I don't think it'll be that difficult (looks like we can just put the |
Ah ok perfect. What probably makes sense is for the optimizer to provide the new implementation for the |
This commit adds support for lookup joins that scan multiple spans in the index per input row. This is needed for cases when a prefix of the index columns are constrained to multiple constant values, such as the values in a CHECK constraint or an ENUM type. This is implemented by adding a new field to the LookupJoinPrivate in the optimizer (and the lookupJoiner in the execution engine). The new field is called LookupExpr, and similar to KeyCols in the LookupJoinPrivate (lookupCols in the lookupJoiner), LookupExpr represents the part of the join condition used to perform the lookup into the index. It should only be set when KeyCols is empty. LookupExpr is used instead of KeyCols when the lookup condition is more complicated than a simple equality between input columns and index columns. In this case, LookupExpr specifies the filter conditions that will be used to construct the spans for each lookup. Currently, the only filter conditions that are allowed are equality and IN expressions, representing constant filters and equi-join conditions. The advantage of using an expression, however, is that it is easy to extend in the future if we want to support other expressions (e.g. like OR to support a disjunction on input columns). Informs cockroachdb#59649 Release note (performance improvement): Added support for left and anti lookup joins in cases where the equi-join condition is not on the first column in the index, but the first column(s) are constrained to a small number of constant values using a CHECK constraint or an ENUM type. Planning a lookup join for these cases can significantly improve performance if the left input to the join is much smaller than the right-hand side.
This commit adds support for lookup joins that scan multiple spans in the index per input row. This is needed for cases when a prefix of the index columns are constrained to multiple constant values, such as the values in a CHECK constraint or an ENUM type. This is implemented by adding a new field to the LookupJoinPrivate in the optimizer (and the lookupJoiner in the execution engine). The new field is called LookupExpr, and similar to KeyCols in the LookupJoinPrivate (lookupCols in the lookupJoiner), LookupExpr represents the part of the join condition used to perform the lookup into the index. It should only be set when KeyCols is empty. LookupExpr is used instead of KeyCols when the lookup condition is more complicated than a simple equality between input columns and index columns. In this case, LookupExpr specifies the filter conditions that will be used to construct the spans for each lookup. Currently, the only filter conditions that are allowed are equality and IN expressions, representing constant filters and equi-join conditions. The advantage of using an expression, however, is that it is easy to extend in the future if we want to support other expressions (e.g. like OR to support a disjunction on input columns). Informs cockroachdb#59649 Release note (performance improvement): Added support for left and anti lookup joins in cases where the equi-join condition is not on the first column in the index, but the first column(s) are constrained to a small number of constant values using a CHECK constraint or an ENUM type. Planning a lookup join for these cases can significantly improve performance if the left input to the join is much smaller than the right-hand side.
Reading this issue and #59615, it seems very similar to the problem that paired joins is solving in other contexts: the rows generated by the join that applies the check constraint are grouped corresponding to the original left row. In the case that nothing in the group matches in the lookup join, only one row should be output (for left outer join and left semi join). It would need the cross join to output groups with the continuation column, which seems straightforward. And this should also distribute cleanly since each group will be produced in one place. |
I did briefly consider using the paired joins machinery for this, and I think we could probably make it work, but I've already opened a PR for looking up multiple spans in lookup joins. Feel free to take a look: #60302. The advantage of the multi-span approach is that we should be able to extend it to support other use cases described in #51576. |
This commit adds support for lookup joins that scan multiple spans in the index per input row. This is needed for cases when a prefix of the index columns are constrained to multiple constant values, such as the values in a CHECK constraint or an ENUM type. This is implemented by adding a new field to the LookupJoinPrivate in the optimizer (and the lookupJoiner in the execution engine). The new field is called LookupExpr, and similar to KeyCols in the LookupJoinPrivate (lookupCols in the lookupJoiner), LookupExpr represents the part of the join condition used to perform the lookup into the index. It should only be set when KeyCols is empty. LookupExpr is used instead of KeyCols when the lookup condition is more complicated than a simple equality between input columns and index columns. In this case, LookupExpr specifies the filter conditions that will be used to construct the spans for each lookup. Currently, the only filter conditions that are allowed are equality and IN expressions, representing constant filters and equi-join conditions. The advantage of using an expression, however, is that it is easy to extend in the future if we want to support other expressions (e.g. like OR to support a disjunction on input columns). Informs cockroachdb#59649 Release note (performance improvement): Added support for left and anti lookup joins in cases where the equi-join condition is not on the first column in the index, but the first column(s) are constrained to a small number of constant values using a CHECK constraint or an ENUM type. Planning a lookup join for these cases can significantly improve performance if the left input to the join is much smaller than the right-hand side.
This commit adds support for lookup joins that scan multiple spans in the index per input row. This is needed for cases when a prefix of the index columns are constrained to multiple constant values, such as the values in a CHECK constraint or an ENUM type. This is implemented by adding a new field to the LookupJoinPrivate in the optimizer (and the lookupJoiner in the execution engine). The new field is called LookupExpr, and similar to KeyCols in the LookupJoinPrivate (lookupCols in the lookupJoiner), LookupExpr represents the part of the join condition used to perform the lookup into the index. It should only be set when KeyCols is empty. LookupExpr is used instead of KeyCols when the lookup condition is more complicated than a simple equality between input columns and index columns. In this case, LookupExpr specifies the filter conditions that will be used to construct the spans for each lookup. Currently, the only filter conditions that are allowed are equality and IN expressions, representing constant filters and equi-join conditions. The advantage of using an expression, however, is that it is easy to extend in the future if we want to support other expressions (e.g. like OR to support a disjunction on input columns). Informs cockroachdb#59649 Release note (performance improvement): Added support for left and anti lookup joins in cases where the equi-join condition is not on the first column in the index, but the first column(s) are constrained to a small number of constant values using a CHECK constraint or an ENUM type. Planning a lookup join for these cases can significantly improve performance if the left input to the join is much smaller than the right-hand side.
This commit adds support for lookup joins that scan multiple spans in the index per input row. This is needed for cases when a prefix of the index columns are constrained to multiple constant values, such as the values in a CHECK constraint or an ENUM type. This is implemented by adding a new field to the LookupJoinPrivate in the optimizer (and the lookupJoiner in the execution engine). The new field is called LookupExpr, and similar to KeyCols in the LookupJoinPrivate (lookupCols in the lookupJoiner), LookupExpr represents the part of the join condition used to perform the lookup into the index. It should only be set when KeyCols is empty. LookupExpr is used instead of KeyCols when the lookup condition is more complicated than a simple equality between input columns and index columns. In this case, LookupExpr specifies the filter conditions that will be used to construct the spans for each lookup. Currently, the only filter conditions that are allowed are equality and IN expressions, representing constant filters and equi-join conditions. The advantage of using an expression, however, is that it is easy to extend in the future if we want to support other expressions (e.g. like OR to support a disjunction on input columns). Informs cockroachdb#59649 Release note (performance improvement): Added support for left and anti lookup joins in cases where the equi-join condition is not on the first column in the index, but the first column(s) are constrained to a small number of constant values using a CHECK constraint or an ENUM type. Planning a lookup join for these cases can significantly improve performance if the left input to the join is much smaller than the right-hand side.
This commit adds support for lookup joins that scan multiple spans in the index per input row. This is needed for cases when a prefix of the index columns are constrained to multiple constant values, such as the values in a CHECK constraint or an ENUM type. This is implemented by adding a new field to the LookupJoinPrivate in the optimizer (and the lookupJoiner in the execution engine). The new field is called LookupExpr, and similar to KeyCols in the LookupJoinPrivate (lookupCols in the lookupJoiner), LookupExpr represents the part of the join condition used to perform the lookup into the index. It should only be set when KeyCols is empty. LookupExpr is used instead of KeyCols when the lookup condition is more complicated than a simple equality between input columns and index columns. In this case, LookupExpr specifies the filter conditions that will be used to construct the spans for each lookup. Currently, the only filter conditions that are allowed are equality and IN expressions, representing constant filters and equi-join conditions. The advantage of using an expression, however, is that it is easy to extend in the future if we want to support other expressions (e.g. like OR to support a disjunction on input columns). Informs cockroachdb#59649 Release note (performance improvement): Added support for left and anti lookup joins in cases where the equi-join condition is not on the first column in the index, but the first column(s) are constrained to a small number of constant values using a CHECK constraint or an ENUM type. Planning a lookup join for these cases can significantly improve performance if the left input to the join is much smaller than the right-hand side.
60302: sql,opt: add support for lookup joins with multiple spans r=rytaft a=rytaft This commit adds support for lookup joins that scan multiple spans in the index per input row. This is needed for cases when a prefix of the index columns are constrained to multiple constant values, such as the values in a `CHECK` constraint or an `ENUM` type. This is implemented by adding a new field to the `LookupJoinPrivate` in the optimizer (and the `lookupJoiner` in the execution engine). The new field is called `LookupExpr`, and similar to `KeyCols` in the `LookupJoinPrivate` (`lookupCols` in the `lookupJoiner`), `LookupExpr` represents the part of the join condition used to perform the lookup into the index. It should only be set when `KeyCols` is empty. `LookupExpr` is used instead of `KeyCols` when the lookup condition is more complicated than a simple equality between input columns and index columns. In this case, `LookupExpr` specifies the filter conditions that will be used to construct the spans for each lookup. Currently, the only filter conditions that are allowed are equality and `IN` expressions, representing constant filters and equi-join conditions. The advantage of using an expression, however, is that it is easy to extend in the future if we want to support other expressions (e.g. like `OR` to support a disjunction on input columns). Informs #59649 Release note (performance improvement): Added support for left and anti lookup joins in cases where the equi-join condition is not on the first column in the index, but the first column(s) are constrained to a small number of constant values using a CHECK constraint or an ENUM type. Planning a lookup join for these cases can significantly improve performance if the left input to the join is much smaller than the right-hand side. 60600: sql: prohibit DROP COLUMN on REGIONAL BY ROW column reference r=ajstorm a=otan The foundation of REGIONAL BY ROW is to use a column which dictates the location to which the row is stored. As such, we should prevent the column from being used as a DROP COLUMN reference. Resolves #59117 Release note: None Co-authored-by: Rebecca Taft <[email protected]> Co-authored-by: Oliver Tan <[email protected]>
#60302 will solve this issue for lookup joins, however we still need to solve this issue for inverted joins. I'm guessing the solution will look similar, but it will still require a non-trivial amount of work to update both the optimizer and execution sides. Therefore this is very unlikely to make it into the 21.1 release. I'm adding the docs-known-limitation tag (cc @ianjevans, @taroface) in order to document this limitation: any left or anti join involving JSON, Array, or spatial columns with a multi-column or partitioned inverted index will not be able to take advantage of the index if the prefix columns are unconstrained or constrained to multiple constant values. Instead, we will need to plan a cross join with the filter applied after the fact, which may be significantly slower than an inverted join. Possible workarounds: ensure that the prefix columns are constrained to single constant values or equated with input columns. For example, consider the following database:
If we try to perform a left join between these two tables only on the geometry columns, we will not be able to plan an inverted join due to the limitation described above:
However, if we can constrain the crdb_region column to a single value, we can plan an inverted join:
If the user doesn't know which region to use, the workaround is to construct a UNION ALL as follows:
|
We have marked this issue as stale because it has been inactive for |
As described in #59615, we can no longer plan left/anti lookup/inverted joins for certain types of queries where the prefix columns in the index are constrained to multiple constant values, since doing so could result in too many rows in the output. However, the alternative of requiring a hash or merge join is suboptimal for performance, especially for mutation queries that rely on using left/anti joins for
UPSERT
/INSERT ... ON CONFLICT
execution and foreign key checks.The most efficient way to support these joins safely would require execution support to allow performing each lookup with multiple spans instead of a single equality condition. However, as a stopgap, we could support them in the optimizer by constructing a
UNION
of several lookup joins. We may also be able to use the infrastructure of the paired joins.As described in #59615, this has implications for multi-region support, because these types of queries will show up often due to the
ENUM
typecrdb_region
column.cc @mgartner
Jira issue: CRDB-3262
The text was updated successfully, but these errors were encountered: