Relax the restriction on predicate pushdown over projection #8451

weiatwork · 2021-07-01T17:57:51Z

prestodb/presto#10860 fixed potential issues of exponential processing time and memory usage in PredicatePushDown. However, the new heuristics (expressions are trivial or appear only once) are so strict that it prevented pushdown opportunities for some common cases. For example,

-- Fails to push down, with a complex expression and more than 1 predicates in outer query
WITH cte AS (
    SELECT CAST(o_orderdate AS VARCHAR) AS dt
    FROM orders
)
SELECT *
FROM cte
WHERE dt = '1995-05-05' OR dt = '1996-06-06';

Similar issues have been reported by others:
prestodb/presto#11265
prestodb/presto#12538
#1114

Although a later PR #1515 did relax the heuristics a little bit, and possibly fixed issue #1114, but it's still not enough to generalize the optimization to more use cases.

This PR is proposing to make the heuristics more relaxed, while keeping the original safeguard to avoid exponential expansion.

sopel39 · 2021-07-01T19:54:36Z

core/trino-main/src/main/java/io/trino/sql/planner/optimizations/PredicatePushDown.java

@@ -327,8 +328,9 @@ private boolean isInliningCandidate(Expression expression, ProjectNode node)
            verify(AstUtils.preOrder(expression).noneMatch(TryExpression.class::isInstance));

            // candidate symbols for inlining are
-            //   1. references to simple constants or symbol references


you probably want to change io.trino.sql.planner.iterative.rule.InlineProjections#extractInliningTargets too

Good call. I understand these two rules have similarities in terms of qualifying inlining candidates, but I would prefer to handle that rule in a separate PR since this one is dealing with predicate pushdown issue

sopel39 · 2021-07-01T20:07:01Z

core/trino-main/src/main/java/io/trino/sql/planner/optimizations/PredicatePushDown.java

+            // Qualify the expression that refers to at most a single symbol
+            // but also avoid excessive duplication of long expression
+            return SymbolsExtractor.extractAll(expr).size() <= 1 &&
+                    expr.toString().length() * count <= MAX_LINEAR_EXPANSION_LENGTH;


I would rather count some term of complexity instead of string length. Otherwise depending on symbol name you would get different expansion factors.

Any suggestion on other criteria? It could be an arbitrary expression here, not sure if there's a generic way to tell the complexity

What case are you trying to solve? Maybe we could count addition/subtraction/multipliaction/OR terms?

I see some more special handling in that case :) I think "count" should be a pretty good indicator that we can use as part of the limiting heuristics

sopel39 · 2021-07-01T20:09:19Z

core/trino-main/src/main/java/io/trino/sql/planner/optimizations/PredicatePushDown.java

@@ -178,6 +178,7 @@ public PlanNode optimize(PlanNode plan, Session session, TypeProvider types, Sym
        private final TypeProvider types;
        private final ExpressionEquivalence expressionEquivalence;
        private final boolean dynamicFiltering;
+        private static final int MAX_LINEAR_EXPANSION_LENGTH = 1024;


this seems pretty high. It means I could explode:

x := x + x x := x + x x := x + x

to 1000 terms pretty quickly. What is a use case where it helps?

This constant can be negotiable, but at least we should be able to handle a dozen predicates with reasonable lengths, which is a common use case, e.g. where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-012, '%Y-%m-%d')...
Maybe we can put a limit on "count" (e.g. < 20) to prevent the case for your concern?

If where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-012, '%Y-%m-%d')...
is your use case, then I think we should simply translate that into IN instead of relaxing inlining conditions.
See #1118

Thanks for the pointer. It looks like #2698 has no update since last November. More importantly, I think that's orthogonal to this issue. A series of equality conditions is just one use case. There are other cases like where partition LIKE 'ab%' OR partition LIKE 'cd%'... or inequality conditions like where partition BETWEEN DATE '2000-01-01' AND '2000-12-31' OR partition BETWEEN '2020-01-01' AND '2020-12-31'... So I think it makes sense to relax this rule for all the common cases

@weiatwork is partition also a complex expression in your case?

where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-012, '%Y-%m-%d')...

this is just one complex expression. What are other projections (below) which prevent pushdown?

Yes, it's also a complex expression, and the fact that the single complex expression in the inner query is referred to more than once in the outer filter caused it ineligible to be pushed down.

SELECT * FROM ( SELECT date_parse(dt, '%Y-%m-%d') AS partition_date FROM t ) where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-02, '%Y-%m-%d')...

There are other cases like where partition LIKE 'ab%' OR partition LIKE 'cd%'... or inequality conditions like where partition BETWEEN DATE '2000-01-01' AND '2000-12-31' OR partition BETWEEN '2020-01-01' AND '2020-12-31'... So I think it makes sense to relax this rule for all the common cases

Do you actually have this cases. If not, I would suggest implementing #2698 first. It strict improvement compared to heuristics here that could cause regressions

Yes I've seen several use cases caused by this issue in our environment. And as I mentioned in the description of this PR, other people have reported this issue long time ago. So this is not uncommon. As we discussed earlier, we can add another limitation on "count" to avoid the possible expression explosion. I can make the update if you're ok with it.

As we discussed earlier, we can add another limitation on "count" to avoid the possible expression explosion

Yes, count limitation would be better than string expression length (this should be removed).

…ength

weiatwork · 2021-07-06T21:37:02Z

@sopel39 The 3 test failures from 2 UTs are not relevant. Not sure if I should make a dummy commit to re-trigger the checks.

:trino-tests

Error:  Failures: 
Error:    TestQueryTracker.testInterruptApplyFilter » ThreadTimeout Method io.trino.exec...
[INFO] 
Error:  Tests run: 2305, Failures: 1, Errors: 0, Skipped: 22

:trino-mysql

Error:  Failures: 
Error:    TestMySqlCaseInsensitiveMapping>BaseCaseInsensitiveMappingTest.testSchemaNameClashWithRuleMapping:213->BaseCaseInsensitiveMappingTest.withSchema:316 » Runtime
Error:    TestMySqlCaseInsensitiveMapping>BaseCaseInsensitiveMappingTest.testTableNameClashWithRuleMapping:274->BaseCaseInsensitiveMappingTest.withTable:336->BaseCaseInsensitiveMappingTest.withTable:323 » Runtime
[INFO] 
Error:  Tests run: 254, Failures: 2, Errors: 0, Skipped: 8

weiatwork · 2021-07-07T18:01:32Z

After rebase the only test failure is TestQueryTracker.testInterruptApplyFilter, which has been identified as a flaky test case by: #8432

weiatwork · 2021-07-08T18:56:13Z

@sopel39 Do you mind to take another look? Thanks.

sopel39 · 2021-07-12T09:33:10Z

@sopel39 Do you mind to take another look? Thanks.

I would like @martint to approve this approach. There was a practical query why the limit was there in first place.

weiatwork · 2021-07-21T04:28:41Z

@martint Could you please take a look? Thanks.

findepi · 2021-07-21T21:12:00Z

core/trino-main/src/main/java/io/trino/sql/planner/optimizations/PredicatePushDown.java

@@ -178,6 +178,7 @@ public PlanNode optimize(PlanNode plan, Session session, TypeProvider types, Sym
        private final TypeProvider types;
        private final ExpressionEquivalence expressionEquivalence;
        private final boolean dynamicFiltering;
+        private static final int MAX_EXPR_REFERENCE_COUNT = 50;


What's the reasoning behind 50?
Why not 10? Why not 100?

Thanks @findepi for the review. I have to admit there is no science behind that number :) It's based on common query patterns we observed. Normally queries won't have more than 50 expression references, and if they do, I think we should stop there, which is consistent to the original intention of prestodb/presto#10860

Would eg 5 be enough? (before the change we had limit of 1, right?)

(just askjing, i still don't fully appreciate the change)

I would say 5 is fine with 90% of the cases (some queries just have more filters than others). Yes you're right, without this change the limit is equivalent to 1.

IIRC, the limit of 1 was not incidental. I don't remember though, whether it was driven by a concern drawn from real life, or baed on technically possible pathological case. i wish we have had documented that.

@martint @sopel39 do you remember?

I hope the above context helps. Could any of you take another look? Thanks.

What's the impact of this change on the query listed in prestodb/presto#10455 ?

Sorry for the delay @martint. I'm not sure if other optimization kicks in, but the query from the above ticket produces a plan like this. Maybe this special case is no longer a problem?

Output[v] │ Layout: [expr_13:integer] │ v := expr_13 └─ Filter[filterPredicate = ("expr_13" = 0)] │ Layout: [expr_13:integer] └─ Values Layout: [expr_13:integer] (1)

Please let me know your thoughts on this @martint Thanks.

Apologies for the delay. I'm getting back to this PR now.

For the query above, yes, there are constant folding optimizations in recent version that get rid of all those intermediates. You can adjust the query to:

WITH t1 (v) AS (SELECT orderkey FROM orders), t2 AS( select if(v = 0, v, v) v from t1 ), t3 AS( select if(v = 0, v, v) v from t2 ), t4 AS( select if(v = 0, v, v) v from t3 ), t5 AS( select if(v = 0, v, v) v from t4 ), t6 AS( select if(v = 0, v, v) v from t5 ), t7 AS( select if(v = 0, v, v) v from t6 ), t8 AS( select if(v = 0, v, v) v from t7 ) select * from t8 where v = 0;

However, I realized why it doesn't have the exponential behavior I was worried about. It's because you're checking for SymbolsExtractor.extractAll(expr).size() <= 1. It would have been a different story if it had been extractUnique, instead.

The downside I see with this PR is a case like the following:

EXPLAIN WITH t0 AS (SELECT expensive_function(c) x FROM t), SELECT * FROM t0 WHERE x = 1 OR x = 2 OR x = 3 OR x = 4 OR x = 5 OR x = 6 ....

In that scenario, the expensive function would be evaluated up to 50 times per row, instead of just once. (This trivial case could clearly be optimized to x IN (1, 2, 3, ...), though).

Also, if I'm reading the code correctly, it wouldn't work if the call to were expensive_function(c, c).

colebow · 2022-10-27T03:47:06Z

👋 @weiatwork - this PR has drawbacks that don’t align with our preferred approach, so we’re not going to proceed with this implementation. If you'd like to continue the discussion on this at any point in the future, feel free to re-open and chime in.

Relax the restriction on predicate pushdown over projection

f255977

cla-bot bot added the cla-signed label Jul 1, 2021

hashhar requested review from martint, raunaqmorarka and sopel39 July 1, 2021 17:59

sopel39 reviewed Jul 1, 2021

View reviewed changes

Rely on expression reference count instead of combined terms string l…

e0f812a

…ength

Merge branch 'trinodb:master' into inlinemore

2c00ab8

findepi reviewed Jul 21, 2021

View reviewed changes

findepi force-pushed the master branch from 8538e49 to 1f896ea Compare July 30, 2021 22:13

colebow closed this Oct 27, 2022

JunhyungSong mentioned this pull request Nov 3, 2022

Have a list of expensive functions to relax the restriction on predicate pushdown over projection #14879

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Relax the restriction on predicate pushdown over projection #8451

Relax the restriction on predicate pushdown over projection #8451

weiatwork commented Jul 1, 2021

sopel39 Jul 1, 2021

weiatwork Jul 2, 2021

sopel39 Jul 1, 2021

weiatwork Jul 2, 2021

sopel39 Jul 2, 2021 •

edited

Loading

weiatwork Jul 2, 2021

sopel39 Jul 1, 2021

weiatwork Jul 2, 2021

sopel39 Jul 2, 2021

weiatwork Jul 2, 2021

sopel39 Jul 3, 2021 •

edited

Loading

weiatwork Jul 4, 2021

sopel39 Jul 5, 2021

weiatwork Jul 5, 2021

sopel39 Jul 6, 2021

weiatwork commented Jul 6, 2021

weiatwork commented Jul 7, 2021

weiatwork commented Jul 8, 2021

sopel39 commented Jul 12, 2021

weiatwork commented Jul 21, 2021

findepi Jul 21, 2021

weiatwork Jul 22, 2021

findepi Jul 22, 2021

weiatwork Jul 23, 2021

findepi Jul 23, 2021

weiatwork Jul 26, 2021

martint Jul 26, 2021

weiatwork Jul 30, 2021

weiatwork Aug 2, 2021

martint Nov 5, 2021

colebow commented Oct 27, 2022

Relax the restriction on predicate pushdown over projection #8451

Relax the restriction on predicate pushdown over projection #8451

Conversation

weiatwork commented Jul 1, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sopel39 Jul 2, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sopel39 Jul 3, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

weiatwork commented Jul 6, 2021

weiatwork commented Jul 7, 2021

weiatwork commented Jul 8, 2021

sopel39 commented Jul 12, 2021

weiatwork commented Jul 21, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

colebow commented Oct 27, 2022

sopel39 Jul 2, 2021 •

edited

Loading

sopel39 Jul 3, 2021 •

edited

Loading