
Relax the restriction on predicate pushdown over projection #8451

Closed
weiatwork wants to merge 3 commits

Conversation

weiatwork (Contributor)

prestodb/presto#10860 fixed potential issues of exponential processing time and memory usage in PredicatePushDown. However, the new heuristics (expressions are trivial or appear only once) are so strict that they prevent pushdown opportunities in some common cases. For example,

-- Fails to push down: a complex expression referenced by more than one predicate in the outer query
WITH cte AS (
    SELECT CAST(o_orderdate AS VARCHAR) AS dt
    FROM orders
)
SELECT *
FROM cte
WHERE dt = '1995-05-05' OR dt = '1996-06-06';

Similar issues have been reported by others:
prestodb/presto#11265
prestodb/presto#12538
#1114

A later PR, #1515, did relax the heuristics a little and possibly fixed issue #1114, but it's still not enough to generalize the optimization to more use cases.

This PR proposes to relax the heuristics further, while keeping the original safeguard against exponential expansion.
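The old and proposed heuristics can be sketched as standalone predicates (hypothetical names and a simplified shape, not the actual Trino code; the real check is in PredicatePushDown.isInliningCandidate):

```java
// Simplified model of the inlining heuristics discussed in this PR
// (hypothetical names, not the actual Trino code).
public class InliningHeuristicSketch
{
    static final int MAX_LINEAR_EXPANSION_LENGTH = 1024;

    // Old heuristic: inline only trivial expressions or expressions referenced once.
    static boolean oldRule(boolean trivial, int referenceCount)
    {
        return trivial || referenceCount == 1;
    }

    // Proposed heuristic: additionally inline single-symbol expressions whose
    // total duplicated length stays under a linear bound.
    static boolean proposedRule(boolean trivial, int referenceCount, int symbolCount, int expressionLength)
    {
        return trivial
                || referenceCount == 1
                || (symbolCount <= 1 && expressionLength * referenceCount <= MAX_LINEAR_EXPANSION_LENGTH);
    }

    public static void main(String[] args)
    {
        // CAST(o_orderdate AS VARCHAR) (length 28, one symbol) referenced twice
        // by "dt = ... OR dt = ...": rejected by the old rule, accepted by the new one.
        System.out.println(oldRule(false, 2));             // false
        System.out.println(proposedRule(false, 2, 1, 28)); // true
    }
}
```

Under this model, the CTE example above becomes eligible for pushdown while expressions over multiple symbols stay blocked.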

@@ -327,8 +328,9 @@ private boolean isInliningCandidate(Expression expression, ProjectNode node)
verify(AstUtils.preOrder(expression).noneMatch(TryExpression.class::isInstance));

// candidate symbols for inlining are
// 1. references to simple constants or symbol references
Member

you probably want to change io.trino.sql.planner.iterative.rule.InlineProjections#extractInliningTargets too

Contributor Author

Good call. I understand these two rules have similarities in how they qualify inlining candidates, but I would prefer to handle that rule in a separate PR, since this one deals with the predicate pushdown issue.

// Qualify expressions that refer to at most a single symbol,
// but also avoid excessive duplication of long expressions
return SymbolsExtractor.extractAll(expr).size() <= 1 &&
        expr.toString().length() * count <= MAX_LINEAR_EXPANSION_LENGTH;
Member

I would rather count some term of complexity instead of string length. Otherwise depending on symbol name you would get different expansion factors.
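The objection can be shown with a toy calculation (illustrative names and numbers only): under a length-times-count budget, the same expression shape gets a different allowed reference count depending purely on identifier length.

```java
// Two semantically identical projections; only the column name differs.
// A string-length budget treats them differently; a term count would not.
public class LengthVsTerms
{
    public static void main(String[] args)
    {
        String shortName = "CAST(d AS VARCHAR)";                    // length 18
        String longName = "CAST(order_creation_date AS VARCHAR)";   // length 36
        int budget = 1024;
        // Maximum reference count each expression is allowed under a
        // length * count <= budget rule:
        System.out.println(budget / shortName.length()); // 56
        System.out.println(budget / longName.length());  // 28
    }
}
```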

Contributor Author

Any suggestions for other criteria? It could be an arbitrary expression here; I'm not sure there's a generic way to tell the complexity.

sopel39 (Member), Jul 2, 2021

What case are you trying to solve? Maybe we could count addition/subtraction/multiplication/OR terms?

Contributor Author

I see some more special handling in that case :) I think "count" should be a pretty good indicator that we can use as part of the limiting heuristics

@@ -178,6 +178,7 @@ public PlanNode optimize(PlanNode plan, Session session, TypeProvider types, Sym
private final TypeProvider types;
private final ExpressionEquivalence expressionEquivalence;
private final boolean dynamicFiltering;
private static final int MAX_LINEAR_EXPANSION_LENGTH = 1024;
Member

this seems pretty high. It means I could explode:

x := x + x
x := x + x
x := x + x

to 1000 terms pretty quickly. What is a use case where it helps?
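A minimal simulation of that blow-up (plain Java string substitution, not Trino code) shows how quickly the terms multiply: each layer of x := x + x doubles the number of references, so ten layers already produce 1024.

```java
// Simulate repeated inlining of the projection x := x + x: each substitution
// doubles the number of references to x in the consuming expression.
public class ExpansionDemo
{
    static long expandedTerms(int levels)
    {
        String expr = "x";
        for (int i = 0; i < levels; i++) {
            // String.replace scans left to right and does not re-expand the
            // replacement text, so every existing "x" doubles exactly once.
            expr = expr.replace("x", "(x + x)");
        }
        return expr.chars().filter(c -> c == 'x').count();
    }

    public static void main(String[] args)
    {
        System.out.println(expandedTerms(10)); // 1024
    }
}
```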

Contributor Author

This constant is negotiable, but at least we should be able to handle a dozen predicates of reasonable length, which is a common use case, e.g. where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-02', '%Y-%m-%d')...
Maybe we can put a limit on "count" (e.g. < 20) to prevent the case you're concerned about?

Member

If where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-02', '%Y-%m-%d')...
is your use case, then I think we should simply translate that into IN instead of relaxing inlining conditions.
See #1118

Contributor Author

Thanks for the pointer. It looks like #2698 has had no update since last November. More importantly, I think that's orthogonal to this issue. A series of equality conditions is just one use case. There are other cases like where partition LIKE 'ab%' OR partition LIKE 'cd%'... or range conditions like where partition BETWEEN DATE '2000-01-01' AND DATE '2000-12-31' OR partition BETWEEN DATE '2020-01-01' AND DATE '2020-12-31'... So I think it makes sense to relax this rule for all the common cases.

sopel39 (Member), Jul 3, 2021

@weiatwork is partition also a complex expression in your case?

where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-02', '%Y-%m-%d')...

this is just one complex expression. What are other projections (below) which prevent pushdown?

Contributor Author

Yes, it's also a complex expression, and the fact that the single complex expression in the inner query is referenced more than once in the outer filter made it ineligible to be pushed down.

SELECT * FROM (
  SELECT date_parse(dt, '%Y-%m-%d') AS partition_date FROM t
)
where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-02', '%Y-%m-%d')...

Member

There are other cases like where partition LIKE 'ab%' OR partition LIKE 'cd%'... or range conditions like where partition BETWEEN DATE '2000-01-01' AND DATE '2000-12-31' OR partition BETWEEN DATE '2020-01-01' AND DATE '2020-12-31'... So I think it makes sense to relax this rule for all the common cases.

Do you actually have these cases? If not, I would suggest implementing #2698 first. It's a strict improvement compared to the heuristics here, which could cause regressions.

Contributor Author

Yes, I've seen several use cases hit by this issue in our environment. And as I mentioned in the description of this PR, other people reported this issue a long time ago, so it is not uncommon. As we discussed earlier, we can add another limitation on "count" to avoid the possible expression explosion. I can make the update if you're OK with it.

Member

As we discussed earlier, we can add another limitation on "count" to avoid the possible expression explosion

Yes, a count limitation would be better than string expression length (the latter should be removed).

weiatwork (Contributor Author)

@sopel39 The 3 test failures from 2 UTs are not relevant. Not sure if I should make a dummy commit to re-trigger the checks.

:trino-tests

Error:  Failures: 
Error:    TestQueryTracker.testInterruptApplyFilter » ThreadTimeout Method io.trino.exec...
[INFO] 
Error:  Tests run: 2305, Failures: 1, Errors: 0, Skipped: 22

:trino-mysql

Error:  Failures: 
Error:    TestMySqlCaseInsensitiveMapping>BaseCaseInsensitiveMappingTest.testSchemaNameClashWithRuleMapping:213->BaseCaseInsensitiveMappingTest.withSchema:316 » Runtime
Error:    TestMySqlCaseInsensitiveMapping>BaseCaseInsensitiveMappingTest.testTableNameClashWithRuleMapping:274->BaseCaseInsensitiveMappingTest.withTable:336->BaseCaseInsensitiveMappingTest.withTable:323 » Runtime
[INFO] 
Error:  Tests run: 254, Failures: 2, Errors: 0, Skipped: 8

weiatwork (Contributor Author)

After rebasing, the only test failure is TestQueryTracker.testInterruptApplyFilter, which has been identified as a flaky test in #8432.

weiatwork (Contributor Author)

@sopel39 Would you mind taking another look? Thanks.

sopel39 (Member) commented Jul 12, 2021

@sopel39 Would you mind taking another look? Thanks.

I would like @martint to approve this approach. There was a practical query that motivated the limit in the first place.

weiatwork (Contributor Author)

@martint Could you please take a look? Thanks.

@@ -178,6 +178,7 @@ public PlanNode optimize(PlanNode plan, Session session, TypeProvider types, Sym
private final TypeProvider types;
private final ExpressionEquivalence expressionEquivalence;
private final boolean dynamicFiltering;
private static final int MAX_EXPR_REFERENCE_COUNT = 50;
Member

What's the reasoning behind 50?
Why not 10? Why not 100?

Contributor Author

Thanks @findepi for the review. I have to admit there is no science behind that number :) It's based on common query patterns we have observed. Normally queries won't have more than 50 expression references, and if they do, I think we should stop there, which is consistent with the original intention of prestodb/presto#10860.

Member

Would e.g. 5 be enough? (Before the change we had a limit of 1, right?)

(Just asking, I still don't fully appreciate the change.)

Contributor Author

I would say 5 is fine for 90% of the cases (some queries just have more filters than others). Yes, you're right, without this change the limit is equivalent to 1.

Member

IIRC, the limit of 1 was not incidental. I don't remember, though, whether it was driven by a concern drawn from real life or based on a technically possible pathological case. I wish we had documented that.

@martint @sopel39 do you remember?

Contributor Author

I hope the above context helps. Could any of you take another look? Thanks.

Member

What's the impact of this change on the query listed in prestodb/presto#10455 ?

Contributor Author

Sorry for the delay @martint. I'm not sure if another optimization kicks in, but the query from the above ticket produces a plan like this. Maybe this special case is no longer a problem?

Output[v]
│   Layout: [expr_13:integer]
│   v := expr_13
└─ Filter[filterPredicate = ("expr_13" = 0)]
   │   Layout: [expr_13:integer]
   └─ Values
          Layout: [expr_13:integer]
          (1)

Contributor Author

Please let me know your thoughts on this @martint Thanks.

Member

Apologies for the delay. I'm getting back to this PR now.

For the query above, yes, there are constant folding optimizations in recent versions that get rid of all those intermediates. You can adjust the query to:

WITH
t1 (v) AS (SELECT orderkey FROM orders),
t2 AS( select if(v = 0, v, v) v from t1 ),
t3 AS( select if(v = 0, v, v) v from t2 ),
t4 AS( select if(v = 0, v, v) v from t3 ),
t5 AS( select if(v = 0, v, v) v from t4 ),
t6 AS( select if(v = 0, v, v) v from t5 ),
t7 AS( select if(v = 0, v, v) v from t6 ),
t8 AS( select if(v = 0, v, v) v from t7 )
select *
from t8
where v = 0;

However, I realized why it doesn't have the exponential behavior I was worried about. It's because you're checking for SymbolsExtractor.extractAll(expr).size() <= 1. It would have been a different story if it had been extractUnique, instead.
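The extractAll vs. extractUnique distinction can be shown with a toy extractor (a hypothetical model; the real SymbolsExtractor lives in the Trino planner): extractAll counts every reference, so a self-referential expression like if(v = 0, v, v) fails the size() <= 1 check, while a distinct-symbol count would have let it through.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of SymbolsExtractor: extractAll keeps duplicate references,
// extractUnique collapses them. Not the actual Trino implementation.
public class ExtractDemo
{
    static List<String> extractAll(String expr)
    {
        List<String> symbols = new ArrayList<>();
        Matcher m = Pattern.compile("[a-z_]+").matcher(expr);
        while (m.find()) {
            symbols.add(m.group());
        }
        return symbols;
    }

    static Set<String> extractUnique(String expr)
    {
        return new LinkedHashSet<>(extractAll(expr));
    }

    public static void main(String[] args)
    {
        // if(v = 0, v, v) references v three times but names one distinct symbol.
        String expr = "v = 0, v, v"; // simplified: identifiers only
        System.out.println(extractAll(expr).size());    // 3 -> fails the <= 1 check
        System.out.println(extractUnique(expr).size()); // 1 -> would have passed
    }
}
```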

The downside I see with this PR is a case like the following:

EXPLAIN WITH
    t0 AS (SELECT expensive_function(c) x FROM t)
SELECT *
FROM t0
WHERE x = 1 OR x = 2 OR x = 3 OR x = 4 OR x = 5 OR x = 6 ....

In that scenario, the expensive function would be evaluated up to 50 times per row, instead of just once. (This trivial case could clearly be optimized to x IN (1, 2, 3, ...), though).

Also, if I'm reading the code correctly, it wouldn't work if the call were expensive_function(c, c).
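That repeated-evaluation cost can be modeled with a call counter (hypothetical stand-in code, not Trino; expensive_function is assumed, not a real function):

```java
// Model of the downside described above: once the projection
// x := expensiveFunction(c) is inlined into every disjunct of the filter,
// the function runs once per disjunct per row instead of once per row.
public class RepeatedEvalDemo
{
    static int calls;

    static int expensiveFunction(int c)
    {
        calls++;      // count invocations
        return c * 2; // stand-in for real work
    }

    // Projection kept: evaluate once, test the cached symbol in each disjunct.
    static int callsWithProjection(int c, int disjuncts)
    {
        calls = 0;
        int x = expensiveFunction(c);
        boolean match = false;
        for (int k = 1; k <= disjuncts; k++) {
            match |= (x == k);
        }
        return calls;
    }

    // Projection inlined: the function is re-evaluated in every disjunct.
    static int callsInlined(int c, int disjuncts)
    {
        calls = 0;
        boolean match = false;
        for (int k = 1; k <= disjuncts; k++) {
            match |= (expensiveFunction(c) == k);
        }
        return calls;
    }

    public static void main(String[] args)
    {
        System.out.println(callsWithProjection(21, 50)); // 1
        System.out.println(callsInlined(21, 50));        // 50
    }
}
```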

colebow (Member) commented Oct 27, 2022

👋 @weiatwork - this PR has drawbacks that don’t align with our preferred approach, so we’re not going to proceed with this implementation. If you'd like to continue the discussion on this at any point in the future, feel free to re-open and chime in.
