opt: tuple IN (subquery) cannot be converted to lookup semi-join #43198

RaduBerinde · 2019-12-16T18:05:54Z

See example below. Best plan is a lookup join, but we don't get that with the semi-join. The reason is that each side is projecting a tuple, and we are constraining those to be equal. A rule that detects this projection and converts to multiple equalities should fix this.

exec-ddl
CREATE TABLE ab (
  a INT,
  b INT,
  PRIMARY KEY (a,b)
)
----


exec-ddl
CREATE TABLE cd (
  c INT,
  d INT
)
----

exec-ddl
ALTER TABLE cd INJECT STATISTICS '[
    {
        "columns": [ "c" ],
        "created_at": "2019-12-13 16:17:43.033228+00:00",
        "distinct_count": 4,
        "row_count": 4
    }
]'
----

opt
SELECT * FROM ab, cd WHERE a=c AND b=d
----
inner-join (lookup ab)
 ├── columns: a:1(int!null) b:2(int!null) c:3(int!null) d:4(int!null)
 ├── key columns: [3 4] = [1 2]
 ├── lookup columns are key
 ├── fd: (1)==(3), (3)==(1), (2)==(4), (4)==(2)
 ├── scan cd
 │    └── columns: c:3(int) d:4(int)
 └── filters (true)

opt
SELECT * FROM ab WHERE (a, b) IN (SELECT c, d FROM cd)
----
project
 ├── columns: a:1(int!null) b:2(int!null)
 ├── key: (1,2)
 └── semi-join (hash)
      ├── columns: a:1(int!null) b:2(int!null) column7:7(tuple{int, int})
      ├── key: (1,2)
      ├── fd: (1,2)-->(7)
      ├── project
      │    ├── columns: column7:7(tuple{int, int}) a:1(int!null) b:2(int!null)
      │    ├── key: (1,2)
      │    ├── fd: (1,2)-->(7)
      │    ├── scan ab
      │    │    ├── columns: a:1(int!null) b:2(int!null)
      │    │    └── key: (1,2)
      │    └── projections
      │         └── (a, b) [type=tuple{int, int}, outer=(1,2)]
      ├── project
      │    ├── columns: column6:6(tuple{int, int})
      │    ├── scan cd
      │    │    └── columns: c:3(int) d:4(int)
      │    └── projections
      │         └── (c, d) [type=tuple{int, int}, outer=(3,4)]
      └── filters
           └── column7 = column6 [type=bool, outer=(6,7), constraints=(/6: (/NULL - ]; /7: (/NULL - ]), fd=(6)==(7), (7)==(6)]

The text was updated successfully, but these errors were encountered:

techytoes · 2020-01-07T07:10:02Z

Hi @RaduBerinde , I am new to the organization. Can I take this issue?

RaduBerinde · 2020-01-07T07:20:24Z

Sure. I would start by understanding some existing normalization rules, e.g. in opt/norm/rules/join.opt.

RoryBlevins · 2020-02-17T13:37:19Z

@RaduBerinde I'm a little confused about what the ask is here - I thought it was to add an optimisation rule that rewrites IN queries using tuples to pull the tuples up into equalities, so
SELECT * FROM ab WHERE (a, b) IN (SELECT c, d FROM cd)
becomes
SELECT a,b FROM ab, cd WHERE a=c AND b=d
, which would then use a lookup join - however, if I look at the query plan for the rewritten query:

EXPLAIN SELECT a,b FROM ab, cd WHERE a=c AND b=d;
       tree      |       field       |   description
    -----------------+-------------------+------------------
                 | distributed       | true
                 | vectorized        | false
  render         |                   |
   └── hash-join |                   |
        │        | type              | inner
        │        | equality          | (a, b) = (c, d)
        │        | left cols are key |
        ├── scan |                   |
        │        | table             | ab@primary
        │        | spans             | ALL
        └── scan |                   |
                 | table             | cd@primary
                 | spans             | ALL

it seems that the rewritten query would still be using a hash join - have I misunderstood what the rewritten query should look like?

RaduBerinde · 2020-02-17T16:09:20Z

The query plan depends on the table statistics. If both tables are about the same size, there is no reason to choose a lookup join. You'd have to make the ab table a lot bigger, or disable the automated statistics and inject some statistics. The easiest is to just do it in opt tests, you can paste the stuff above in opt/xform/testdata/external/foo and run the xform tests.

…roachdb#43198) Initial WIP progress commit for sanity check of proposed changes Release note (<category, see below>): <what> <show> <why>

RoryBlevins · 2020-03-03T16:56:04Z

@RaduBerinde Ok, I think I've worked out what's going on - the IN statement is converted to an equality on a projected variable, which we can't currently detect and simply, resulting in the extra projections. I've written a new rule to detect such projections and try to convert the tuple equalities into multiple single-valued equalities, and hoist the projection above the join. I'm not sure whether it's the correct approach, though, as it produces some quite specific (and currently quite messy) code, so I'd appreciate some guidance that I'm on the right track before I go any further - would you be able to look at the above pull request and give me some feedback?

RoryBlevins · 2020-03-03T17:08:25Z

Initial pull request was to the main repo, which it's definitely not ready for, so I've closed it - forked pull request is here RoryBlevins#1

…roachdb#43198) Initial WIP progress commit for sanity check of proposed changes Release note (<category, see below>): <what> <show> <why>

…roachdb#43198) Changes in response to code review

RoryBlevins · 2020-03-15T15:42:19Z

@RaduBerinde new updated pull request at RoryBlevins#2 , further feedback would be useful, thanks

RaduBerinde · 2020-03-16T16:55:11Z

This looks great overall, thanks! I think you can post up a PR against the main repo for a full review.

RoryBlevins · 2020-03-16T20:38:39Z

Great, thanks for your help

RoryBlevins · 2020-03-16T21:33:31Z

@RaduBerinde Before I submit a proper PR, it looks like I have one failing data test - to me that looks like an improvement (at any rate, it seems considerably simpler as it's eliminated two projections), but this is my first time in the world of SQL optimisation, so I'd like some reassurance that it is actually an improvement before I rewrite someone else's test:

    --- FAIL: TestNormRules/scalar (0.01s)
        datadriven.go:165: 
            testdata/rules/scalar:1047: SELECT k FROM a WHERE (k, i) IN (SELECT b, a FROM (VALUES (1, 1), (2, 2), (3, 3)) AS v(a,b))
            expected:
            project
             ├── columns: k:1!null
             ├── key: (1)
             └── semi-join (hash)
                  ├── columns: k:1!null column10:10
                  ├── key: (1)
                  ├── fd: (1)-->(10)
                  ├── project
                  │    ├── columns: column10:10 k:1!null
                  │    ├── key: (1)
                  │    ├── fd: (1)-->(10)
                  │    ├── scan a
                  │    │    ├── columns: k:1!null i:2
                  │    │    ├── key: (1)
                  │    │    └── fd: (1)-->(2)
                  │    └── projections
                  │         └── (k:1, i:2) [as=column10:10, outer=(1,2)]
                  ├── project
                  │    ├── columns: column9:9!null
                  │    ├── cardinality: [3 - 3]
                  │    ├── values
                  │    │    ├── columns: column1:7!null column2:8!null
                  │    │    ├── cardinality: [3 - 3]
                  │    │    ├── (1, 1)
                  │    │    ├── (2, 2)
                  │    │    └── (3, 3)
                  │    └── projections
                  │         └── (column2:8, column1:7) [as=column9:9, outer=(7,8)]
                  └── filters
                       └── column10:10 = column9:9 [outer=(9,10), constraints=(/9: (/NULL - ]; /10: (/NULL - ]), fd=(9)==(10), (10)==(9)]
            
            found:
            project
             ├── columns: k:1!null
             ├── key: (1)
             └── semi-join (hash)
                  ├── columns: k:1!null i:2
                  ├── key: (1)
                  ├── fd: (1)-->(2)
                  ├── scan a
                  │    ├── columns: k:1!null i:2
                  │    ├── key: (1)
                  │    └── fd: (1)-->(2)
                  ├── values
                  │    ├── columns: column1:7!null column2:8!null
                  │    ├── cardinality: [3 - 3]
                  │    ├── (1, 1)
                  │    ├── (2, 2)
                  │    └── (3, 3)
                  └── filters
                       ├── column2:8 = k:1 [outer=(1,8), constraints=(/1: (/NULL - ]; /8: (/NULL - ]), fd=(1)==(8), (8)==(1)]
                       └── column1:7 = i:2 [outer=(2,7), constraints=(/2: (/NULL - ]; /7: (/NULL - ]), fd=(2)==(7), (7)==(2)]

RaduBerinde · 2020-03-16T21:45:47Z

The plan is definitely an improvement. One other thing to consider is if the test is still testing what it was meant to test. In this case it looks like it's for the InlineAnyValuesMultiCol rule and it does look like it's still testing the same thing (the InlineAnyValuesMultiCol would have fired before our new rule would have gotten a chance to run).

This commit adds a new rule to join normalisation that detects joins on tuple equalities between a tuple and a variable that is a projected tuple and splits them out into multiple equalities on the underlying columns, and hoists the project above the join. This allows us to simplify a number of queries where previously the projected tuple was preventing us merging projections. Resolves: cockroachdb#43198 Release justification: low-risk performance improvement

longnguyennn · 2021-03-13T06:49:48Z

Hi @RaduBerinde @RoryBlevins , it seems like the PR for this was abandoned. Can I take over this issue?

RoryBlevins · 2021-03-13T08:13:36Z

Please do - unfortunately I accepted a new job which prevents me from continuing to work on this - I would be happy to see someone finish this.

RaduBerinde · 2021-04-15T19:45:43Z

@longnguyennn Sorry, but we ran into a case where this transformation would have helped a lot so I am starting to work on it ASAP.

I will first try out @andy-kimball's suggestion from the old PR to try removing the tuples earlier, at the Select level.

…e issue. Performance issue with previous delete statement is related to cockroachdb/cockroach#43198 Change-Id: I4a301526f209ef3fcaeca560d522c02aa28c20f8

…ormance issue. Performance issue with previous delete statement is related to cockroachdb/cockroach#43198 Change-Id: I4a301526f209ef3fcaeca560d522c02aa28c20f8

This change adds a rule that handles a case which prevents Exists subqueries from becoming lookup joins. Fixes cockroachdb#43198. Release note (performance improvement): certain queries containing `<tuple> IN (<subquery>)` conditions may run significantly faster.

63706: flowinfra: remove remnants of RunSyncFlow DistSQL RPC r=yuzefovich a=yuzefovich We recently removed `RunSyncFlow` RPC call of DistSQL, but a couple of things were missed. This commit cleans up the outbox based on that removal, removes `SetupSyncFlow` method as well as adds some comments. Release note: None 63780: opt: allow IN subquery to be converted to lookup join r=RaduBerinde a=RaduBerinde #### opt: add opttester facility to test placeholder assignment We like to assume that the result of building a memo with placeholders followed by AssignPlaceholders is equivalent to building the query with the values directly. This is not necessarily the case - it is possible that some normalization rules act on a higher part of the tree in a way that would not happen if we had fully normalized a lower part of the tree. This commit adds two new opttester directives: `assign-placeholders-norm` and `assign-placeholders-opt`. These take a query that has placeholders and simulates the prepared query planning path. We use these facilities to add some tests that reproduce a customer issue. Release note: None #### opt: allow IN subquery to be converted to lookup join This change adds a rule that handles a case which prevents Exists subqueries from becoming lookup joins. Fixes #43198. Release note (performance improvement): certain queries containing `<tuple> IN (<subquery>)` conditions may run significantly faster. Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Radu Berinde <[email protected]>

This change adds a rule that handles a case which prevents Exists subqueries from becoming lookup joins. Fixes cockroachdb#43198. Release note (performance improvement): certain queries containing `<tuple> IN (<subquery>)` conditions may run significantly faster.

RaduBerinde added the good first issue label Dec 16, 2019

RoryBlevins mentioned this issue Mar 3, 2020

opt:tuple IN (subquery) cannot be converted to lookup semi-join (#43198) #45653

Closed

RoryBlevins added a commit to RoryBlevins/cockroach that referenced this issue Mar 15, 2020

opt:tuple IN (subquery) cannot be converted to lookup semi-join (cock…

f0dcb13

…roachdb#43198) Changes in response to code review

RoryBlevins mentioned this issue Mar 17, 2020

opt: detect and extract projected tuple equalities #46186

Closed

georgysavva mentioned this issue Aug 10, 2020

Subquery doesn't get resolved in lookup join #52557

Closed

RaduBerinde self-assigned this Apr 15, 2021

RaduBerinde removed the good first issue label Apr 15, 2021

RaduBerinde mentioned this issue Apr 16, 2021

opt: allow IN subquery to be converted to lookup join #63780

Merged

RaduBerinde mentioned this issue Apr 19, 2021

release-21.1: opt: allow IN subquery to be converted to lookup join #63866

Merged

RaduBerinde mentioned this issue Apr 19, 2021

release-20.2: opt: allow IN subquery to be converted to lookup join #63869

Merged

craig bot closed this as completed in e6fd187 Apr 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

opt: tuple IN (subquery) cannot be converted to lookup semi-join #43198

opt: tuple IN (subquery) cannot be converted to lookup semi-join #43198

RaduBerinde commented Dec 16, 2019

techytoes commented Jan 7, 2020

RaduBerinde commented Jan 7, 2020

RoryBlevins commented Feb 17, 2020

RaduBerinde commented Feb 17, 2020

RoryBlevins commented Mar 3, 2020

RoryBlevins commented Mar 3, 2020

RoryBlevins commented Mar 15, 2020

RaduBerinde commented Mar 16, 2020

RoryBlevins commented Mar 16, 2020

RoryBlevins commented Mar 16, 2020

RaduBerinde commented Mar 16, 2020

longnguyennn commented Mar 13, 2021

RoryBlevins commented Mar 13, 2021

RaduBerinde commented Apr 15, 2021

opt: tuple IN (subquery) cannot be converted to lookup semi-join #43198

opt: tuple IN (subquery) cannot be converted to lookup semi-join #43198

Comments

RaduBerinde commented Dec 16, 2019

techytoes commented Jan 7, 2020

RaduBerinde commented Jan 7, 2020

RoryBlevins commented Feb 17, 2020

RaduBerinde commented Feb 17, 2020

RoryBlevins commented Mar 3, 2020

RoryBlevins commented Mar 3, 2020

RoryBlevins commented Mar 15, 2020

RaduBerinde commented Mar 16, 2020

RoryBlevins commented Mar 16, 2020

RoryBlevins commented Mar 16, 2020

RaduBerinde commented Mar 16, 2020

longnguyennn commented Mar 13, 2021

RoryBlevins commented Mar 13, 2021

RaduBerinde commented Apr 15, 2021