Relax the restriction on predicate pushdown over projection #8451
Conversation
@@ -327,8 +328,9 @@ private boolean isInliningCandidate(Expression expression, ProjectNode node)
        verify(AstUtils.preOrder(expression).noneMatch(TryExpression.class::isInstance));

        // candidate symbols for inlining are
        // 1. references to simple constants or symbol references
you probably want to change io.trino.sql.planner.iterative.rule.InlineProjections#extractInliningTargets too
Good call. I understand these two rules have similarities in terms of qualifying inlining candidates, but I would prefer to handle that rule in a separate PR, since this one is dealing with the predicate pushdown issue.
// Qualify the expression that refers to at most a single symbol
// but also avoid excessive duplication of long expressions
return SymbolsExtractor.extractAll(expr).size() <= 1 &&
        expr.toString().length() * count <= MAX_LINEAR_EXPANSION_LENGTH;
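As a toy illustration of how this guard behaves (hypothetical standalone Java, not the actual Trino code; the values are made up):

public class GuardDemo {
    private static final int MAX_LINEAR_EXPANSION_LENGTH = 1024;

    public static void main(String[] args) {
        String expr = "date_parse(dt, '%Y-%m-%d')"; // one symbol reference: dt
        int count = 50; // referenced 50 times in the filter
        // 26 chars * 50 = 1300 > 1024, so inlining is rejected purely on length
        System.out.println(expr.length() * count <= MAX_LINEAR_EXPANSION_LENGTH);
    }
}

Note how the outcome depends on the textual length of the expression, which is exactly the objection raised next.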
I would rather count some measure of complexity instead of string length. Otherwise, depending on symbol names, you would get different expansion factors.
Any suggestions on other criteria? It could be an arbitrary expression here; I'm not sure there's a generic way to measure its complexity.
What case are you trying to solve? Maybe we could count addition/subtraction/multiplication/OR terms?
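For illustration, a rough standalone sketch of such term counting (plain Java with made-up Expr, SymbolRef, and Call types, not Trino's actual AST or visitor API):

import java.util.List;

interface Expr {}
record SymbolRef(String name) implements Expr {}
record Call(String op, List<Expr> args) implements Expr {}

public class TermCounter {
    static int countTerms(Expr expr) {
        if (expr instanceof SymbolRef) {
            return 1;
        }
        Call call = (Call) expr;
        int terms = 1; // count the operator itself as one term
        for (Expr arg : call.args()) {
            terms += countTerms(arg);
        }
        return terms;
    }

    public static void main(String[] args) {
        // "x + x" has 3 terms no matter how long the symbol name is
        Expr sum = new Call("+", List.of(new SymbolRef("x"), new SymbolRef("x")));
        System.out.println(countTerms(sum)); // 3
    }
}

A metric like this would give x + x the same weight as some_very_long_name + some_very_long_name, addressing the expansion-factor concern.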
That sounds like even more special handling in that case :) I think "count" should be a pretty good indicator that we can use as part of the limiting heuristics.
@@ -178,6 +178,7 @@ public PlanNode optimize(PlanNode plan, Session session, TypeProvider types, Sym
    private final TypeProvider types;
    private final ExpressionEquivalence expressionEquivalence;
    private final boolean dynamicFiltering;
    private static final int MAX_LINEAR_EXPANSION_LENGTH = 1024;
this seems pretty high. It means I could explode:
x := x + x
x := x + x
x := x + x
to 1000 terms pretty quickly. What is a use case where it helps?
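To make the concern concrete, here is a small self-contained demo (hypothetical code, not from Trino) showing that n rounds of inlining x := x + x produce 2^n symbol references:

public class ExpansionDemo {
    public static void main(String[] args) {
        String expr = "x";
        for (int level = 1; level <= 10; level++) {
            // inline x := x + x once more at each level
            expr = expr.replace("x", "(x + x)");
            long terms = expr.chars().filter(c -> c == 'x').count();
            System.out.printf("level %d: %d terms%n", level, terms);
        }
        // level 10 already has 1024 references
    }
}

Since the inlined expression at each step is only five characters long, a per-step length check does not trip until the reference count is already in the hundreds, which is how the predicate can balloon "pretty quickly".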
This constant is negotiable, but at least we should be able to handle a dozen predicates with reasonable lengths, which is a common use case, e.g. where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-02', '%Y-%m-%d')...
Maybe we can put a limit on "count" (e.g. < 20) to prevent the case you're concerned about?
If where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-02', '%Y-%m-%d')... is your use case, then I think we should simply translate that into IN instead of relaxing inlining conditions.
See #1118
Thanks for the pointer. It looks like #2698 has had no updates since last November. More importantly, I think that's orthogonal to this issue. A series of equality conditions is just one use case. There are other cases like where partition LIKE 'ab%' OR partition LIKE 'cd%'... or inequality conditions like where partition BETWEEN DATE '2000-01-01' AND DATE '2000-12-31' OR partition BETWEEN DATE '2020-01-01' AND DATE '2020-12-31'... So I think it makes sense to relax this rule for all the common cases.
@weiatwork is partition also a complex expression in your case?
where partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-02', '%Y-%m-%d')...
this is just one complex expression. What are the other projections (below) which prevent pushdown?
Yes, it's also a complex expression, and the fact that the single complex expression in the inner query is referred to more than once in the outer filter made it ineligible for pushdown.
SELECT * FROM (
    SELECT date_parse(dt, '%Y-%m-%d') AS partition_date FROM t
)
WHERE partition_date = DATE_PARSE('2021-07-01', '%Y-%m-%d') OR partition_date = DATE_PARSE('2021-07-02', '%Y-%m-%d')...
There are other cases like where partition LIKE 'ab%' OR partition LIKE 'cd%'... or inequality conditions like where partition BETWEEN DATE '2000-01-01' AND DATE '2000-12-31' OR partition BETWEEN DATE '2020-01-01' AND DATE '2020-12-31'... So I think it makes sense to relax this rule for all the common cases
Do you actually have these cases? If not, I would suggest implementing #2698 first. It's a strict improvement compared to the heuristics here, which could cause regressions.
Yes, I've seen several use cases caused by this issue in our environment. And as I mentioned in the description of this PR, other people reported this issue a long time ago, so it's not uncommon. As we discussed earlier, we can add another limitation on "count" to avoid the possible expression explosion. I can make the update if you're ok with it.
As we discussed earlier, we can add another limitation on "count" to avoid the possible expression explosion
Yes, a limitation on count would be better than the string expression length (which should be removed).
@sopel39 The 3 test failures from 2 UTs are not relevant. Not sure if I should make a dummy commit to re-trigger the checks.
After rebase the only test failure is
@sopel39 Do you mind taking another look? Thanks.
@martint Could you please take a look? Thanks.
@@ -178,6 +178,7 @@ public PlanNode optimize(PlanNode plan, Session session, TypeProvider types, Sym
    private final TypeProvider types;
    private final ExpressionEquivalence expressionEquivalence;
    private final boolean dynamicFiltering;
    private static final int MAX_EXPR_REFERENCE_COUNT = 50;
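Taking the two iterations of this PR together, a hedged sketch of the resulting candidate check (simplified to plain ints; the actual method works on Trino's Expression and plan nodes, so the signature here is an assumption):

public class InliningCheckSketch {
    private static final int MAX_EXPR_REFERENCE_COUNT = 50;

    // symbolReferences: what something like SymbolsExtractor.extractAll(expr).size() returns
    // referenceCount:   how many times the filter refers to the projected expression
    static boolean isInliningCandidate(int symbolReferences, int referenceCount) {
        return symbolReferences <= 1 && referenceCount <= MAX_EXPR_REFERENCE_COUNT;
    }

    public static void main(String[] args) {
        System.out.println(isInliningCandidate(1, 12)); // true: a dozen OR'ed predicates
        System.out.println(isInliningCandidate(1, 80)); // false: over the reference cap
        System.out.println(isInliningCandidate(2, 3));  // false: expression like x + x
    }
}

The cap only bounds duplication; the exact constant is the subject of the discussion below.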
What's the reasoning behind 50?
Why not 10? Why not 100?
Thanks @findepi for the review. I have to admit there is no science behind that number :) It's based on common query patterns we've observed. Normally queries won't have more than 50 expression references, and if they do, I think we should stop there, which is consistent with the original intention of prestodb/presto#10860
Would e.g. 5 be enough? (Before the change we had a limit of 1, right?)
(Just asking, I still don't fully appreciate the change.)
I would say 5 is fine for 90% of the cases (some queries just have more filters than others). Yes, you're right, without this change the limit is equivalent to 1.
I hope the above context helps. Could any of you take another look? Thanks.
What's the impact of this change on the query listed in prestodb/presto#10455 ?
Sorry for the delay @martint. I'm not sure if another optimization kicks in, but the query from the above ticket produces a plan like this. Maybe this special case is no longer a problem?
Output[v]
│ Layout: [expr_13:integer]
│ v := expr_13
└─ Filter[filterPredicate = ("expr_13" = 0)]
│ Layout: [expr_13:integer]
└─ Values
Layout: [expr_13:integer]
(1)
Please let me know your thoughts on this, @martint. Thanks.
Apologies for the delay. I'm getting back to this PR now.
For the query above, yes, there are constant folding optimizations in recent versions that get rid of all those intermediates. You can adjust the query to:
WITH
    t1 (v) AS (SELECT orderkey FROM orders),
    t2 AS (SELECT if(v = 0, v, v) v FROM t1),
    t3 AS (SELECT if(v = 0, v, v) v FROM t2),
    t4 AS (SELECT if(v = 0, v, v) v FROM t3),
    t5 AS (SELECT if(v = 0, v, v) v FROM t4),
    t6 AS (SELECT if(v = 0, v, v) v FROM t5),
    t7 AS (SELECT if(v = 0, v, v) v FROM t6),
    t8 AS (SELECT if(v = 0, v, v) v FROM t7)
SELECT *
FROM t8
WHERE v = 0;
However, I realized why it doesn't have the exponential behavior I was worried about. It's because you're checking for SymbolsExtractor.extractAll(expr).size() <= 1. It would have been a different story if it had been extractUnique instead.
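To spell out that distinction, a toy stand-in using strings (assuming, as the method names suggest, that extractAll keeps duplicate symbol references while extractUnique deduplicates):

import java.util.List;
import java.util.Set;

public class ExtractDemo {
    public static void main(String[] args) {
        // Symbol references in the expression "x + x"
        List<String> all = List.of("x", "x");  // extractAll-style: duplicates kept
        Set<String> unique = Set.copyOf(all);  // extractUnique-style: deduplicated

        System.out.println(all.size() <= 1);    // false: "x + x" is rejected as a candidate
        System.out.println(unique.size() <= 1); // true: this check would have admitted it,
                                                // re-enabling exponential expansion
    }
}

With extractAll, an expression such as x + x can never be an inlining candidate, which is what blocks the exponential case.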
The downside I see with this PR is a case like the following:
EXPLAIN WITH
    t0 AS (SELECT expensive_function(c) x FROM t)
SELECT *
FROM t0
WHERE x = 1 OR x = 2 OR x = 3 OR x = 4 OR x = 5 OR x = 6 ....
In that scenario, the expensive function would be evaluated up to 50 times per row, instead of just once. (This trivial case could clearly be optimized to x IN (1, 2, 3, ...), though.)
Also, if I'm reading the code correctly, it wouldn't work if the call were expensive_function(c, c).
👋 @weiatwork - this PR has drawbacks that don’t align with our preferred approach, so we’re not going to proceed with this implementation. If you'd like to continue the discussion on this at any point in the future, feel free to re-open and chime in.
prestodb/presto#10860 fixed potential issues of exponential processing time and memory usage in PredicatePushDown. However, the new heuristics (expressions are trivial or appear only once) are so strict that they prevent pushdown opportunities for some common cases, for example an outer filter that references a projected complex expression such as date_parse(dt, '%Y-%m-%d') more than once, as discussed in the conversation above.
Similar issues have been reported by others:
prestodb/presto#11265
prestodb/presto#12538
#1114
A later PR #1515 did relax the heuristics a little and possibly fixed issue #1114, but it's still not enough to generalize the optimization to more use cases.
This PR is proposing to make the heuristics more relaxed, while keeping the original safeguard to avoid exponential expansion.