Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: tuple IN (subquery) cannot be converted to lookup semi-join #43198

Closed
RaduBerinde opened this issue Dec 16, 2019 · 14 comments · Fixed by #63780
Closed

opt: tuple IN (subquery) cannot be converted to lookup semi-join #43198

RaduBerinde opened this issue Dec 16, 2019 · 14 comments · Fixed by #63780
Assignees

Comments

@RaduBerinde
Copy link
Member

See example below. Best plan is a lookup join, but we don't get that with the semi-join. The reason is that each side is projecting a tuple, and we are constraining those to be equal. A rule that detects this projection and converts to multiple equalities should fix this.

exec-ddl
CREATE TABLE ab (
  a INT,
  b INT,
  PRIMARY KEY (a,b)
)
----


exec-ddl
CREATE TABLE cd (
  c INT,
  d INT
)
----

exec-ddl
ALTER TABLE cd INJECT STATISTICS '[
    {
        "columns": [ "c" ],
        "created_at": "2019-12-13 16:17:43.033228+00:00",
        "distinct_count": 4,
        "row_count": 4
    }
]'
----

opt
SELECT * FROM ab, cd WHERE a=c AND b=d
----
inner-join (lookup ab)
 ├── columns: a:1(int!null) b:2(int!null) c:3(int!null) d:4(int!null)
 ├── key columns: [3 4] = [1 2]
 ├── lookup columns are key
 ├── fd: (1)==(3), (3)==(1), (2)==(4), (4)==(2)
 ├── scan cd
 │    └── columns: c:3(int) d:4(int)
 └── filters (true)

opt
SELECT * FROM ab WHERE (a, b) IN (SELECT c, d FROM cd)
----
project
 ├── columns: a:1(int!null) b:2(int!null)
 ├── key: (1,2)
 └── semi-join (hash)
      ├── columns: a:1(int!null) b:2(int!null) column7:7(tuple{int, int})
      ├── key: (1,2)
      ├── fd: (1,2)-->(7)
      ├── project
      │    ├── columns: column7:7(tuple{int, int}) a:1(int!null) b:2(int!null)
      │    ├── key: (1,2)
      │    ├── fd: (1,2)-->(7)
      │    ├── scan ab
      │    │    ├── columns: a:1(int!null) b:2(int!null)
      │    │    └── key: (1,2)
      │    └── projections
      │         └── (a, b) [type=tuple{int, int}, outer=(1,2)]
      ├── project
      │    ├── columns: column6:6(tuple{int, int})
      │    ├── scan cd
      │    │    └── columns: c:3(int) d:4(int)
      │    └── projections
      │         └── (c, d) [type=tuple{int, int}, outer=(3,4)]
      └── filters
           └── column7 = column6 [type=bool, outer=(6,7), constraints=(/6: (/NULL - ]; /7: (/NULL - ]), fd=(6)==(7), (7)==(6)]
@techytoes
Copy link
Contributor

Hi @RaduBerinde , I am new to the organization. Can I take this issue?

@RaduBerinde
Copy link
Member Author

Sure. I would start by understanding some existing normalization rules, e.g. in opt/norm/rules/join.opt.

@RoryBlevins
Copy link

@RaduBerinde I'm a little confused about what the ask is here - I thought it was to add an optimisation rule that rewrites IN queries using tuples to pull the tuples up into equalities, so
SELECT * FROM ab WHERE (a, b) IN (SELECT c, d FROM cd)
becomes
SELECT a,b FROM ab, cd WHERE a=c AND b=d
, which would then use a lookup join - however, if I look at the query plan for the rewritten query:

EXPLAIN SELECT a,b FROM ab, cd WHERE a=c AND b=d;
       tree      |       field       |   description
    -----------------+-------------------+------------------
                 | distributed       | true
                 | vectorized        | false
  render         |                   |
   └── hash-join |                   |
        │        | type              | inner
        │        | equality          | (a, b) = (c, d)
        │        | left cols are key |
        ├── scan |                   |
        │        | table             | ab@primary
        │        | spans             | ALL
        └── scan |                   |
                 | table             | cd@primary
                 | spans             | ALL

it seems that the rewritten query would still be using a hash join - have I misunderstood what the rewritten query should look like?

@RaduBerinde
Copy link
Member Author

The query plan depends on the table statistics. If both tables are about the same size, there is no reason to choose a lookup join. You'd have to make the ab table a lot bigger, or disable the automated statistics and inject some statistics. The easiest is to just do it in opt tests, you can paste the stuff above in opt/xform/testdata/external/foo and run the xform tests.

RoryBlevins added a commit to RoryBlevins/cockroach that referenced this issue Mar 3, 2020
…roachdb#43198)

Initial WIP progress commit for sanity check of proposed changes

Release note (<category, see below>): <what> <show> <why>
@RoryBlevins
Copy link

@RaduBerinde Ok, I think I've worked out what's going on - the IN statement is converted to an equality on a projected variable, which we can't currently detect and simply, resulting in the extra projections. I've written a new rule to detect such projections and try to convert the tuple equalities into multiple single-valued equalities, and hoist the projection above the join. I'm not sure whether it's the correct approach, though, as it produces some quite specific (and currently quite messy) code, so I'd appreciate some guidance that I'm on the right track before I go any further - would you be able to look at the above pull request and give me some feedback?

@RoryBlevins
Copy link

Initial pull request was to the main repo, which it's definitely not ready for, so I've closed it - forked pull request is here RoryBlevins#1

RoryBlevins added a commit to RoryBlevins/cockroach that referenced this issue Mar 15, 2020
…roachdb#43198)

Initial WIP progress commit for sanity check of proposed changes

Release note (<category, see below>): <what> <show> <why>
RoryBlevins added a commit to RoryBlevins/cockroach that referenced this issue Mar 15, 2020
@RoryBlevins
Copy link

@RaduBerinde new updated pull request at RoryBlevins#2 , further feedback would be useful, thanks

@RaduBerinde
Copy link
Member Author

This looks great overall, thanks! I think you can post up a PR against the main repo for a full review.

@RoryBlevins
Copy link

Great, thanks for your help

@RoryBlevins
Copy link

@RaduBerinde Before I submit a proper PR, it looks like I have one failing data test - to me that looks like an improvement (at any rate, it seems considerably simpler as it's eliminated two projections), but this is my first time in the world of SQL optimisation, so I'd like some reassurance that it is actually an improvement before I rewrite someone else's test:

    --- FAIL: TestNormRules/scalar (0.01s)
        datadriven.go:165: 
            testdata/rules/scalar:1047: SELECT k FROM a WHERE (k, i) IN (SELECT b, a FROM (VALUES (1, 1), (2, 2), (3, 3)) AS v(a,b))
            expected:
            project
             ├── columns: k:1!null
             ├── key: (1)
             └── semi-join (hash)
                  ├── columns: k:1!null column10:10
                  ├── key: (1)
                  ├── fd: (1)-->(10)
                  ├── project
                  │    ├── columns: column10:10 k:1!null
                  │    ├── key: (1)
                  │    ├── fd: (1)-->(10)
                  │    ├── scan a
                  │    │    ├── columns: k:1!null i:2
                  │    │    ├── key: (1)
                  │    │    └── fd: (1)-->(2)
                  │    └── projections
                  │         └── (k:1, i:2) [as=column10:10, outer=(1,2)]
                  ├── project
                  │    ├── columns: column9:9!null
                  │    ├── cardinality: [3 - 3]
                  │    ├── values
                  │    │    ├── columns: column1:7!null column2:8!null
                  │    │    ├── cardinality: [3 - 3]
                  │    │    ├── (1, 1)
                  │    │    ├── (2, 2)
                  │    │    └── (3, 3)
                  │    └── projections
                  │         └── (column2:8, column1:7) [as=column9:9, outer=(7,8)]
                  └── filters
                       └── column10:10 = column9:9 [outer=(9,10), constraints=(/9: (/NULL - ]; /10: (/NULL - ]), fd=(9)==(10), (10)==(9)]
            
            found:
            project
             ├── columns: k:1!null
             ├── key: (1)
             └── semi-join (hash)
                  ├── columns: k:1!null i:2
                  ├── key: (1)
                  ├── fd: (1)-->(2)
                  ├── scan a
                  │    ├── columns: k:1!null i:2
                  │    ├── key: (1)
                  │    └── fd: (1)-->(2)
                  ├── values
                  │    ├── columns: column1:7!null column2:8!null
                  │    ├── cardinality: [3 - 3]
                  │    ├── (1, 1)
                  │    ├── (2, 2)
                  │    └── (3, 3)
                  └── filters
                       ├── column2:8 = k:1 [outer=(1,8), constraints=(/1: (/NULL - ]; /8: (/NULL - ]), fd=(1)==(8), (8)==(1)]
                       └── column1:7 = i:2 [outer=(2,7), constraints=(/2: (/NULL - ]; /7: (/NULL - ]), fd=(2)==(7), (7)==(2)]

@RaduBerinde
Copy link
Member Author

The plan is definitely an improvement. One other thing to consider is if the test is still testing what it was meant to test. In this case it looks like it's for the InlineAnyValuesMultiCol rule and it does look like it's still testing the same thing (the InlineAnyValuesMultiCol would have fired before our new rule would have gotten a chance to run).

RoryBlevins added a commit to RoryBlevins/cockroach that referenced this issue Mar 17, 2020
This commit adds a new rule to join normalisation that detects joins on tuple equalities between a tuple and a variable that is a projected tuple and splits them out into multiple equalities on the underlying columns, and hoists the project above the join. This allows us to simplify a number of queries where previously the projected tuple was preventing us merging projections.

Resolves: cockroachdb#43198

Release justification: low-risk performance improvement
RoryBlevins added a commit to RoryBlevins/cockroach that referenced this issue Mar 18, 2020
This commit adds a new rule to join normalisation that detects joins on tuple equalities between a tuple and a variable that is a projected tuple and splits them out into multiple equalities on the underlying columns, and hoists the project above the join. This allows us to simplify a number of queries where previously the projected tuple was preventing us merging projections.

Resolves: cockroachdb#43198

Release justification: low-risk performance improvement
@longnguyennn
Copy link

Hi @RaduBerinde @RoryBlevins , it seems like the PR for this was abandoned. Can I take over this issue?

@RoryBlevins
Copy link

Please do - unfortunately I accepted a new job which prevents me from continuing to work on this - I would be happy to see someone finish this.

@RaduBerinde
Copy link
Member Author

@longnguyennn Sorry, but we ran into a case where this transformation would have helped a lot so I am starting to work on it ASAP.

I will first try out @andy-kimball's suggestion from the old PR to try removing the tuples earlier, at the Select level.

ethanadams added a commit to storj/storj that referenced this issue Apr 15, 2021
…e issue.

Performance issue with previous delete statement is related to cockroachdb/cockroach#43198

Change-Id: I4a301526f209ef3fcaeca560d522c02aa28c20f8
ethanadams added a commit to storj/storj that referenced this issue Apr 15, 2021
…ormance issue.

Performance issue with previous delete statement is related to cockroachdb/cockroach#43198

Change-Id: I4a301526f209ef3fcaeca560d522c02aa28c20f8
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Apr 17, 2021
This change adds a rule that handles a case which prevents Exists
subqueries from becoming lookup joins.

Fixes cockroachdb#43198.

Release note (performance improvement): certain queries containing
`<tuple> IN (<subquery>)` conditions may run significantly faster.
craig bot pushed a commit that referenced this issue Apr 19, 2021
63706: flowinfra: remove remnants of RunSyncFlow DistSQL RPC r=yuzefovich a=yuzefovich

We recently removed `RunSyncFlow` RPC call of DistSQL, but a couple of
things were missed. This commit cleans up the outbox based on that
removal, removes `SetupSyncFlow` method as well as adds some comments.

Release note: None

63780: opt: allow IN subquery to be converted to lookup join r=RaduBerinde a=RaduBerinde

#### opt: add opttester facility to test placeholder assignment

We like to assume that the result of building a memo with placeholders
followed by AssignPlaceholders is equivalent to building the query
with the values directly. This is not necessarily the case - it is
possible that some normalization rules act on a higher part of the
tree in a way that would not happen if we had fully normalized a lower
part of the tree.

This commit adds two new opttester directives:
`assign-placeholders-norm` and `assign-placeholders-opt`. These take a
query that has placeholders and simulates the prepared query planning
path.

We use these facilities to add some tests that reproduce a customer
issue.

Release note: None

#### opt: allow IN subquery to be converted to lookup join

This change adds a rule that handles a case which prevents Exists
subqueries from becoming lookup joins.

Fixes #43198.

Release note (performance improvement): certain queries containing
`<tuple> IN (<subquery>)` conditions may run significantly faster.

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Apr 19, 2021
This change adds a rule that handles a case which prevents Exists
subqueries from becoming lookup joins.

Fixes cockroachdb#43198.

Release note (performance improvement): certain queries containing
`<tuple> IN (<subquery>)` conditions may run significantly faster.
@craig craig bot closed this as completed in e6fd187 Apr 19, 2021
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue May 13, 2021
This change adds a rule that handles a case which prevents Exists
subqueries from becoming lookup joins.

Fixes cockroachdb#43198.

Release note (performance improvement): certain queries containing
`<tuple> IN (<subquery>)` conditions may run significantly faster.
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue May 13, 2021
This change adds a rule that handles a case which prevents Exists
subqueries from becoming lookup joins.

Fixes cockroachdb#43198.

Release note (performance improvement): certain queries containing
`<tuple> IN (<subquery>)` conditions may run significantly faster.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants