Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT: suboptimal behavior related to predicate evaluation #58283

Closed
xinyuliu12 opened this issue Dec 27, 2020 · 4 comments · Fixed by #58321
Closed

OPT: suboptimal behavior related to predicate evaluation #58283

xinyuliu12 opened this issue Dec 27, 2020 · 4 comments · Fixed by #58321
Labels
O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Comments

@xinyuliu12
Copy link

Hello everyone!

First off, I am sorry for gestures vaguely all of this. Second, I think we may have found a suboptimal behavior related to predicate evaluation.

Here is a pair of TPC-H queries that exhibit this behavior.

SELECT "s_suppkey", 
       "s_comment", 
       "s_nationkey" 
FROM   "supplier" 
WHERE  "s_comment" NOT LIKE '. slyly regular pinto bea' 
       AND "s_comment" = '. slyly regular pinto bea' IS TRUE; 

SELECT "s_suppkey", 
       "s_comment", 
       "s_nationkey" 
FROM   "supplier" 
WHERE  "s_comment" NOT LIKE '. slyly regular pinto bea' 
       AND "s_comment" = '. slyly regular pinto bea'; 

Actual Behavior
The first query takes 49 milliseconds on the latest release v20.2.0-alpha.1, while the second query only takes 1 milliseconds.
The first query's EXPLAIN ANALYZE (DEBUG) result is
stmt-bundle-firstquery.zip

The second query's EXPLAIN ANALYZE (DEBUG) result is
stmt-bundle-secondquery.zip

Expected Behavior
I would have expected the database system run these two queries with similar execution time, given that their filters look almost identical to me. However, I am not certain whether these two filters have the same semantics in this case.
Here are the steps for reproducing our observations:

Test environment

  • Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic add weighted reservoir sampling (AE-S) #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"
  • Cockroachdb v20.2.0-alpha.1
  • Database: TPC-H benchmark (with scale factor 5)

Reproduce Bug

  1. Install cockroachdb v20.2.0-alpha.1
    $ mkdir -p $(go env GOPATH)/src/github.com/cockroachdb
    $ cd $(go env GOPATH)/src/github.com/cockroachdb
    $ git clone https://github.com/cockroachdb/cockroach
    $ cd cockroach
    $ git checkout eaa939c
    $ make build
    $ sudo make install
  • start a cockroachdb node in your preferred working directory
    $ cockroach start --insecure --store=node1 --listen-addr=localhost:26257 --http-addr=localhost:8080 --join=localhost:26257,localhost:26258,localhost:26259 --background
    $ cockroach init --insecure --host=localhost:26257
  1. Set up TPC-H test benchmark (if you already have a TPC-H benchmark set up, you can skip step2; starting from step2, all commands should run in the directory ./tpch5 after downloading and extracting it)
    $ tar xzvf tpch5_setup.tar.gz
    $ cd tpch5
    $ chmod +x setup.sh
  • Create DB and TPC-H schema
    $ cockroach sql --insecure --host=localhost:26257
    In CockroachDB's built-in SQL client to create database:
    $ create database tpch5;
    $ exit
  • Create TPC-H Schema
    $ cockroach sql --insecure --host=localhost:26257 -d tpch5 < dss.ddl
  • Set up benchmark
    $ ./setup.sh
    $ cockroach sql --insecure --host=localhost:26257 -d tpch5 < dss.ri

  1. Test SQL query that exhibits performance issue
  • Execute the queries after logging to the testing benchmark using the following command
    $ cockroach sql --insecure --host=localhost:26257 -d tpch5
@blathers-crl
Copy link

blathers-crl bot commented Dec 27, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Dec 27, 2020
@rafiss
Copy link
Collaborator

rafiss commented Dec 28, 2020

One thing to note is that in general, with a boolean variable b, the expression b is not always the same as b IS TRUE. If b is a nullable variable, then these expressions can evaluate differently (the second one forces it to become non-null).

However, I'm not sure if there is any difference between the two expressions when it is used as a filter condition like this. For example, these two queries return the same thing.

create table tb(a int primary key, b boolean);
insert into tb values(1,true),(2,false),(3,null);

-- these two queries return the same results
select * from tb where b;
select * from tb where b is true;

@yuzefovich
Copy link
Member

In the fast case the optimizer is able to prove that the filter is always false, so we don't need to read anything to evaluate the query whereas in the slow case we do need to execute some operations:

root@:26257/tpch> EXPLAIN
  SELECT
    s_suppkey, s_comment, s_nationkey
  FROM
    supplier
  WHERE
    s_comment NOT LIKE '. slyly regular pinto bea'
    AND s_comment = '. slyly regular pinto bea';
         info
-----------------------
  distribution: local
  vectorized: true

  • norows
(4 rows)

Time: 1ms total (execution 0ms / network 0ms)

root@:26257/tpch> EXPLAIN
  SELECT
    s_suppkey, s_comment, s_nationkey
  FROM
    supplier
  WHERE
    s_comment NOT LIKE '. slyly regular pinto bea'
    AND s_comment = '. slyly regular pinto bea' IS true;
                                                         info
----------------------------------------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • filter
  │ filter: (s_comment NOT LIKE '. slyly regular pinto bea') AND ((s_comment = '. slyly regular pinto bea') IS true)
  │
  └── • scan
        estimated row count: 10,000
        table: supplier@primary
        spans: FULL SCAN
(10 rows)

Time: 1ms total (execution 0ms / network 0ms)

I'm curious to know why the optimizer isn't able to deduce the same thing in the latter case.

@DrewKimball
Copy link
Collaborator

Looks like a missing rule. I think there could be a rule for selects and joins that normalizes expression IS true to expression and expression IS false to NOT expression, so long as the IS is connected to the top-level filter by a series of ANDs and ORs

DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 29, 2020
Previously, a filter in the condition of a `Select` or `Join` could
not simplify an expression of the following form:
`<expr> IS (True | False)`
This could prevent other rules from firing - contradiction detection,
for example. This patch allows the SimplifyFilters logic to effect the
following transformation on `IS` expressions in the filters of a `Select`
or `Join`:
```
<expr> IS True => <expr>
<expr> IS False => NOT <expr>
```
Note that the original `IS` expression would return `False` on a null
input, whereas the replacement expression could return `Null`. In the
case when the `IS` expression is a `Select` or `Join` condition, this
does not matter, because `False` and `Null` conditions are treated the
same way (no rows returned).

However, the same logic does not apply to `IS NOT`, because then `True`
values would become `Null`, since `Null IS NOT True` returns `True`.
This would result in less rows being returned.

Currently, only IS expressions that are top-level conjuncts of the
filters are simplified for the sake of simplicity.

Fixes cockroachdb#58283

Release note: None
craig bot pushed a commit that referenced this issue Dec 30, 2020
58321: opt: simplify IS filters for selects and joins r=rytaft a=DrewKimball

Previously, a filter in the condition of a `Select` or `Join` could
not simplify an expression of the following form:
`<expr> IS (True | False)`
This could prevent other rules from firing - contradiction detection,
for example. This patch allows the SimplifyFilters logic to effect the
following transformation on `IS` expressions in the filters of a `Select`
or `Join`:
```
<expr> IS True => <expr>
<expr> IS False => NOT <expr>
```
Note that the original `IS` expression would return `False` on a null
input, whereas the replacement expression could return `Null`. In the
case when the `IS` expression is a `Select` or `Join` condition, this
does not matter, because `False` and `Null` conditions are treated the
same way (no rows returned).

However, the same logic does not apply to `IS NOT`, because then `True`
values would become `Null`, since `Null IS NOT True` returns `True`.
This would result in less rows being returned.

Currently, only IS expressions that are top-level conjuncts of the
filters are simplified for the sake of simplicity.

Fixes #58283

Release note: None

Co-authored-by: Andrew Kimball <[email protected]>
@craig craig bot closed this as completed in 1faee2d Dec 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-community Originated from the community X-blathers-triaged blathers was able to find an owner
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants