Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support simplification that requires multiple applications of constant folding / simplification #1160

Closed
alamb opened this issue Oct 21, 2021 · 7 comments · Fixed by #10358
Closed
Labels
datafusion Changes in the datafusion crate enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Oct 21, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The constant folding and constant evaluation in constant_folding.rs(and added to in #1153) can not fold certain types of expressions where it takes

For example there is an expression like this in #1153

now() < cast(to_timestamp(...) as int) + 5000000000

It is entirely evaluateable at query time, however, as formulated in #1153 it will not be folded because it requires the constant evaluator to run after the simplifier (because the simplifier can fill in now()` and then the evaluator will fill in the rest).

However, there are other types of exprs such as (true or false) != col which need to have the constant evaluator run before the simplifier to be fully simplified.

It is a more general problem where running either the Simplifier or ConstEvaluator could allow a second pass of the other to proceed

Describe the solution you'd like
In general this is typically handled with some sort of 'fixed point' algorithm where there is a loop that runs the two passes until the Expr is not changed. It may also be possible to combine the Simplifier pass with the constant rewriter.

Describe alternatives you've considered
I think there are two possible approaches I can think if:

  1. applying constant eval / simplifier in a loop until no changes are detected
  2. Combining the ConstantEval and Simplify steps into a single pass

Additional context
#1153

@alamb alamb added enhancement New feature or request datafusion Changes in the datafusion crate labels Oct 21, 2021
@alamb alamb self-assigned this Oct 21, 2021
@rdettai
Copy link
Contributor

rdettai commented Oct 27, 2021

Is this still relevant after #1175 (PR #1176) ? If you think it is, do you have an example in mind ? I can't come up with one 😃

@alamb
Copy link
Contributor Author

alamb commented Oct 27, 2021

@rdettai I think it is still relevant (though the now() example will work after #1175)

One (silly) example

(expr != NULL) OR (5 > 10)

We need to apply at least one pass twice to fully simplify

(expr != NULL) OR (5 > 10)
(expr != NULL) OR FALSE # after constant evaluation
NULL OR FALSE           # after simplify
NULL                    # after second constant evaluation

In this this case if we applied simplify first and then constant evaluation the expression would be folded completely

However, for other examples if you apply simplification first that is insufficient

expr != (5 > NULL)
expr != (5 > NULL) # after simplification
expr != NULL       # after constant evaluation

(would need a second call to simplify now)

@rdettai
Copy link
Contributor

rdettai commented Oct 27, 2021

Great thanks! This is very clear now. So if I understand correctly, this problem has two solutions:

  • walk the tree once, and at each node apply either expression folding or constant evaluation
  • walk the tree multiple times, alternating between expression folding and constant evaluation

The solution 1 (similar to what @pjmore proposed in #1128) would likely be more expensive in terms of space because more state needs no be tracked in the subtrees, and more critically would result in a much more complex algorithm. So we have opted for solution 2, which will be easier to implement and test, and the time overhead of walking the tree multiple times will likely be negligible as expression trees are usually rather small (and will grow even smaller at each iteration).

(I think this is already what you mentioned in #1128 (review), but I thought a recap wouldn't hurt 😄)

@alamb
Copy link
Contributor Author

alamb commented Oct 27, 2021

(I think this is already what you mentioned in #1128 (review), but I thought a recap wouldn't hurt 😄)

👍 it is a nice summary;

@devinjdangelo
Copy link
Contributor

I am interested in working on this and #3770 as I have a use case for a capability to simplify deeply nested expressions.

E.g.

(((col - 10) + 10) *100) / 100 

Simplifies to just

col

Also something like

WHERE ((x<1 or y<2) and z<3) and false

is just

WHERE false

Given the age of these discussions, I am wondering if any of the context or recommended approaches may have changed. @alamb any new thoughts on this?

@alamb
Copy link
Contributor Author

alamb commented Mar 11, 2024

Given the age of these discussions, I am wondering if any of the context or recommended approaches may have changed. @alamb any new thoughts on this?

I think algorithms for such order dependent simplifications look something like:

do {
  expr = simplify(expr)
} until expr is unchanged OR some fixed number of iterations is reached

Perhaps we can use the API that @peter-toth added in #8891 to check if/when the expr is unchanged.

The rewrite API for a long time did not have a way to know when the expression was actually rewritten (and thus when to terminate the iteration). I think we have it now

@erratic-pattern
Copy link
Contributor

PR for this is ready for review #10358.

Maybe as a follow up feature we should expose the maximum number of iterations as a configuration parameter?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
4 participants