Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: hoist uncorrelated equality subqueries #100881

Merged
merged 1 commit into from
Apr 17, 2023

Conversation

mgartner
Copy link
Collaborator

@mgartner mgartner commented Apr 6, 2023

opt: hoist uncorrelated equality subqueries

Subqueries that are in equality expressions with a variable are now
hoisted. When these expressions exist in a filter, hoisting the subquery
can allow the main query to plan a lookup join, rather than an
inefficient full-table scan.

For example, consider the table and query:

CREATE TABLE t (
  a INT,
  INDEX (a)
);

SELECT * FROM t WHERE a = (SELECT max(a) FROM t);

Prior to this commit, the query plan for this query required a full
table scan:

select
 ├── columns: a:1
 ├── scan t@t_a_idx
 │    ├── columns: a:1
 │    └── constraint: /1/2: (/NULL - ]
 └── filters
      └── eq
           ├── a:1
           └── subquery
                └── scalar-group-by
                     ├── columns: max:9
                     ├── scan t@t_a_idx,rev
                     │    ├── columns: a:5
                     │    ├── constraint: /5/6: (/NULL - ]
                     │    └── limit: 1(rev)
                     └── aggregations
                          └── const-agg [as=max:9, outer=(5)]
                               └── a:5

By hoisting the subquery, the full table scan is replaced with a lookup
join:

project
 ├── columns: a:1
 └── inner-join (lookup t@t_a_idx)
      ├── columns: a:1 max:9
      ├── key columns: [9] = [1]
      ├── scalar-group-by
      │    ├── columns: max:9
      │    ├── scan t@t_a_idx,rev
      │    │    ├── columns: a:5
      │    │    ├── constraint: /5/6: (/NULL - ]
      │    │    └── limit: 1(rev)
      │    └── aggregations
      │         └── const-agg [as=max:9, outer=(5)]
      │              └── a:5
      └── filters (true)

This hoisting is enabled by default, but can be disabled by setting the
optimizer_hoist_uncorrelated_equality_subqueries session setting to
false.

Fixes #83392
Informs #51820
Informs #93829
Informs #100855

Release note (performance improvement): Queries that have subqueries in
equality expressions are now more efficiently planned by the optimizer.

@mgartner mgartner added backport-22.2.x backport-23.1.x Flags PRs that need to be backported to 23.1 labels Apr 6, 2023
@mgartner mgartner requested review from rytaft and DrewKimball April 6, 2023 22:01
@mgartner mgartner requested a review from a team as a code owner April 6, 2023 22:01
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@mgartner
Copy link
Collaborator Author

mgartner commented Apr 6, 2023

In the backports for this I'll default this session setting to false.

Here's some analysis of the performance improvements for a simple query that benefits from this change: https://gist.github.com/mgartner/9aa764ecbdb6d879d71863eb6b32b4c2

@mgartner
Copy link
Collaborator Author

mgartner commented Apr 7, 2023

Hold off on reviewing for now - there's some interesting test failures this has caused.

@mgartner mgartner force-pushed the 83392-hoist-equality-subquery branch from c93a876 to a82051e Compare April 7, 2023 14:30
@mgartner
Copy link
Collaborator Author

mgartner commented Apr 7, 2023

This is currently blocked on #100915.

@mgartner mgartner force-pushed the 83392-hoist-equality-subquery branch from a82051e to abab506 Compare April 7, 2023 16:37
@mgartner
Copy link
Collaborator Author

mgartner commented Apr 7, 2023

The first three commits are from #100926 which should have fixed the broken builtin_function logic test, but it's still failing. No need to review just yet.

@mgartner mgartner mentioned this pull request Apr 7, 2023
@mgartner mgartner force-pushed the 83392-hoist-equality-subquery branch from abab506 to 209a9e0 Compare April 12, 2023 15:33
@mgartner
Copy link
Collaborator Author

The first three commits are from #101324 and #101327.

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Great idea! Do you have benchmark results for tpch Q15?

Reviewed 3 of 3 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 19 of 19 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice!

Do you have benchmark results for tpch Q15?

+1

Reviewed 3 of 3 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 19 of 19 files at r4, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @mgartner)

@mgartner mgartner force-pushed the 83392-hoist-equality-subquery branch from 209a9e0 to 0d0ea8c Compare April 17, 2023 14:21
Subqueries that are in equality expressions with a variable are now
hoisted. When these expressions exist in a filter, hoisting the subquery
can allow the main query to plan a lookup join, rather than an
inefficient full-table scan.

For example, consider the table and query:

    CREATE TABLE t (
      a INT,
      INDEX (a)
    );

    SELECT * FROM t WHERE a = (SELECT max(a) FROM t);

Prior to this commit, the query plan for this query required a full
table scan:

    select
     ├── columns: a:1
     ├── scan t@t_a_idx
     │    ├── columns: a:1
     │    └── constraint: /1/2: (/NULL - ]
     └── filters
          └── eq
               ├── a:1
               └── subquery
                    └── scalar-group-by
                         ├── columns: max:9
                         ├── scan t@t_a_idx,rev
                         │    ├── columns: a:5
                         │    ├── constraint: /5/6: (/NULL - ]
                         │    └── limit: 1(rev)
                         └── aggregations
                              └── const-agg [as=max:9, outer=(5)]
                                   └── a:5

By hoisting the subquery, the full table scan is replaced with a lookup
join:

    project
     ├── columns: a:1
     └── inner-join (lookup t@t_a_idx)
          ├── columns: a:1 max:9
          ├── key columns: [9] = [1]
          ├── scalar-group-by
          │    ├── columns: max:9
          │    ├── scan t@t_a_idx,rev
          │    │    ├── columns: a:5
          │    │    ├── constraint: /5/6: (/NULL - ]
          │    │    └── limit: 1(rev)
          │    └── aggregations
          │         └── const-agg [as=max:9, outer=(5)]
          │              └── a:5
          └── filters (true)

This hoisting is enabled by default, but can be disabled by setting the
`optimizer_hoist_uncorrelated_equality_subqueries` session setting to
`false`.

Fixes cockroachdb#83392
Informs cockroachdb#51820
Informs cockroachdb#93829
Informs cockroachdb#100855

Release note (performance improvement): Queries that have subqueries in
equality expressions are now more efficiently planned by the optimizer.
@mgartner mgartner force-pushed the 83392-hoist-equality-subquery branch from 0d0ea8c to 41fc214 Compare April 17, 2023 14:21
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have benchmark results for tpch Q15?

How do I run the TPC-H benchmark?

I ran q15 manually in a demo on my GCE worker, and its ~5% slower than before this commit. Before it took ~6.3s and not it takes ~6.6s. It's not super obvious from the statement bundles why it's slower - it seems like it should be faster based on the query plan. And most of the time in both cases is spent on index-joins - so q15 would be best optimized with a STORING index to eliminate them. That'd have a significantly greater impact than this 5% regression. So I think this is fine. What do you all think?

Master stmt-bundle: https://gist.github.com/mgartner/3ce71bb0d8861011820e05b11bf5391e
Hoist subquery stmt-bundle: https://gist.github.com/mgartner/f9170879a8bc1a24a82f44979c3be4cd

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball and @rytaft)

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running queries manually is how I normally check for TPC-H regressions. How many times did you run the query? 5% seems small enough that it could just be noise. In any case, I'm not too concerned.

Regarding a STORING index -- I'm not sure what's allowed according to the spec, but if it's legal, I'm all for it (in a separate PR, obviously).

Reviewed 8 of 8 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner)

@mgartner
Copy link
Collaborator Author

I don't think it was noise because it was consistently ~5% slower when run multiple times in multiple demos. I'm not too concerned about the 5% either.

@mgartner
Copy link
Collaborator Author

I'm going to go ahead and merge. TFTRs!

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 17, 2023

Build succeeded:

@craig craig bot merged commit cd86f41 into cockroachdb:master Apr 17, 2023
@blathers-crl
Copy link

blathers-crl bot commented Apr 17, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 41fc214 to blathers/backport-release-22.2-100881: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


error creating merge commit from 41fc214 to blathers/backport-release-23.1.0-100881: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.0 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@mgartner mgartner deleted the 83392-hoist-equality-subquery branch April 18, 2023 14:07
mgartner added a commit to mgartner/cockroach that referenced this pull request Nov 28, 2023
Since cockroachdb#100881, the optimizer has hoisted uncorrelated subqueries used in
an equality expression (only when the
`optimizer_hoist_uncorrelated_equality_subqueries` session setting is
enabled). This can cause problems when the hoisted subquery has been
duplicated in the expression tree, e.g., when pushing a filter into both
sides of a join.

A subquery is a scalar expression, so the columns of its child
expression are never emitted from the subquery. This makes it safe
duplicate a subquery in an expression tree. However, when a subquery is
hoisted, it is transformed into a join which can produce the columns of
the child expression. Hoisting the same subquery multiple times can
produce query plans with duplicate column IDs in two logically different
expressions. This can lead to incorrect query plans (see the comment for
`opt.Metadata`), as well as produce expressions with children that have
intersecting column IDs (after additional normalization rules fire).

To avoid these dangers, this commit ensures that each unique subquery is
hoisted at most once. This will prevent bad plans, but it may not
inhibit the optimizer from finding optimal plans. In the future, it may
be possible to lift this restriction by generating new column IDs for
uncorrelated subqueries each time they are hoisted.

Fixes cockroachdb#114703

There is no release note because the session setting enabling this bug
is disabled by default, and because the possible correctness bug is
theoretical - we have not found a reproduction of a correctness bug, but
it could exist in theory.

Release note: None
craig bot pushed a commit that referenced this pull request Nov 29, 2023
115142: opt: hoist uncorrelated subqueries at most once r=mgartner a=mgartner

Since #100881, the optimizer has hoisted uncorrelated subqueries used in
an equality expression (only when the
`optimizer_hoist_uncorrelated_equality_subqueries` session setting is
enabled). This can cause problems when the hoisted subquery has been
duplicated in the expression tree, e.g., when pushing a filter into both
sides of a join.

A subquery is a scalar expression, so the columns of its child
expression are never emitted from the subquery. This makes it safe
duplicate a subquery in an expression tree. However, when a subquery is
hoisted, it is transformed into a join which can produce the columns of
the child expression. Hoisting the same subquery multiple times can
produce query plans with duplicate column IDs in two logically different
expressions. This can lead to incorrect query plans (see the comment for
`opt.Metadata`), as well as produce expressions with children that have
intersecting column IDs (after additional normalization rules fire).

To avoid these dangers, this commit ensures that each unique subquery is
hoisted at most once. This will prevent bad plans, but it may not
inhibit the optimizer from finding optimal plans. In the future, it may
be possible to lift this restriction by generating new column IDs for
uncorrelated subqueries each time they are hoisted.

Fixes #114703

There is no release note because the session setting enabling this bug
is disabled by default, and because the possible correctness bug is
theoretical - we have not found a reproduction of a correctness bug, but
it could exist in theory.

Release note: None


Co-authored-by: Marcus Gartner <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Nov 29, 2023
Since #100881, the optimizer has hoisted uncorrelated subqueries used in
an equality expression (only when the
`optimizer_hoist_uncorrelated_equality_subqueries` session setting is
enabled). This can cause problems when the hoisted subquery has been
duplicated in the expression tree, e.g., when pushing a filter into both
sides of a join.

A subquery is a scalar expression, so the columns of its child
expression are never emitted from the subquery. This makes it safe
duplicate a subquery in an expression tree. However, when a subquery is
hoisted, it is transformed into a join which can produce the columns of
the child expression. Hoisting the same subquery multiple times can
produce query plans with duplicate column IDs in two logically different
expressions. This can lead to incorrect query plans (see the comment for
`opt.Metadata`), as well as produce expressions with children that have
intersecting column IDs (after additional normalization rules fire).

To avoid these dangers, this commit ensures that each unique subquery is
hoisted at most once. This will prevent bad plans, but it may not
inhibit the optimizer from finding optimal plans. In the future, it may
be possible to lift this restriction by generating new column IDs for
uncorrelated subqueries each time they are hoisted.

Fixes #114703

There is no release note because the session setting enabling this bug
is disabled by default, and because the possible correctness bug is
theoretical - we have not found a reproduction of a correctness bug, but
it could exist in theory.

Release note: None
blathers-crl bot pushed a commit that referenced this pull request Nov 29, 2023
Since #100881, the optimizer has hoisted uncorrelated subqueries used in
an equality expression (only when the
`optimizer_hoist_uncorrelated_equality_subqueries` session setting is
enabled). This can cause problems when the hoisted subquery has been
duplicated in the expression tree, e.g., when pushing a filter into both
sides of a join.

A subquery is a scalar expression, so the columns of its child
expression are never emitted from the subquery. This makes it safe
duplicate a subquery in an expression tree. However, when a subquery is
hoisted, it is transformed into a join which can produce the columns of
the child expression. Hoisting the same subquery multiple times can
produce query plans with duplicate column IDs in two logically different
expressions. This can lead to incorrect query plans (see the comment for
`opt.Metadata`), as well as produce expressions with children that have
intersecting column IDs (after additional normalization rules fire).

To avoid these dangers, this commit ensures that each unique subquery is
hoisted at most once. This will prevent bad plans, but it may not
inhibit the optimizer from finding optimal plans. In the future, it may
be possible to lift this restriction by generating new column IDs for
uncorrelated subqueries each time they are hoisted.

Fixes #114703

There is no release note because the session setting enabling this bug
is disabled by default, and because the possible correctness bug is
theoretical - we have not found a reproduction of a correctness bug, but
it could exist in theory.

Release note: None
mgartner added a commit that referenced this pull request Nov 30, 2023
Since #100881, the optimizer has hoisted uncorrelated subqueries used in
an equality expression (only when the
`optimizer_hoist_uncorrelated_equality_subqueries` session setting is
enabled). This can cause problems when the hoisted subquery has been
duplicated in the expression tree, e.g., when pushing a filter into both
sides of a join.

A subquery is a scalar expression, so the columns of its child
expression are never emitted from the subquery. This makes it safe
duplicate a subquery in an expression tree. However, when a subquery is
hoisted, it is transformed into a join which can produce the columns of
the child expression. Hoisting the same subquery multiple times can
produce query plans with duplicate column IDs in two logically different
expressions. This can lead to incorrect query plans (see the comment for
`opt.Metadata`), as well as produce expressions with children that have
intersecting column IDs (after additional normalization rules fire).

To avoid these dangers, this commit ensures that each unique subquery is
hoisted at most once. This will prevent bad plans, but it may not
inhibit the optimizer from finding optimal plans. In the future, it may
be possible to lift this restriction by generating new column IDs for
uncorrelated subqueries each time they are hoisted.

Fixes #114703

There is no release note because the session setting enabling this bug
is disabled by default, and because the possible correctness bug is
theoretical - we have not found a reproduction of a correctness bug, but
it could exist in theory.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

opt: hoist uncorrelated scalar subquery with equality
4 participants