Consolidate interval analyses from `Interval` and `PruningPredicate`
#7887
Comments
@ozankabak and @tustvold I swear we have talked about this topic before, but I could not find an existing ticket or discussion. Do you have any other pointers to past discussions?
I am not sure this is what you were searching for, but there was an issue #5535. Actually, I have tried to apply the cp_solver strategy to prune row groups. But we observed a performance degradation, since this method sacrifices vectorized computing power: the process needs to be run once for each set of statistics, so efficiency decreases as the number of sets increases. I will think again about how to integrate the Interval library there without sacrificing performance.
Thank you -- this is exactly what I was looking for.
I may be able to help with this too. One way is to use [...]
I just updated this ticket's description with a more coherent story and examples that @appletreeisyellow and I have hit recently while working on #9171. We were talking today, and I think @appletreeisyellow may try to prototype what this solution could look like, if she has time, to move the conversation forward.
Thank you @alamb for updating the description
Is your feature request related to a problem or challenge?
We now have two ways to do range / interval analysis in DataFusion.
Having two representations is challenging because we have to implement the same logic in two places. For example:
- The `NULL` handling that @appletreeisyellow is working on will only affect `PruningPredicate`s
- Supporting `LIKE` or `substr` would require different code in different places
- Support for `IN` lists added to `PruningPredicate` was not added to Interval arithmetic
The rewrite used by the `PruningPredicate` logic is tricky to understand and only handles very specific predicate forms (see #9184 for an essay on the topic and #9230 for an example of getting it wrong). Thus it is hard to extend the number of functions / types of predicates that are supported.
The existing range analyses are:
`Interval` based analysis
The `ExprIntervalGraph` library is used for cardinality estimation and range analysis for the symmetric hash join, and it:
- [...] `a < b`, not just constants
- [...] `a < 5`)
Pruning Predicate
Pruning Predicate is used to prune row groups based on min/max values which:
- [...] `col <op> constant` (such as `a = 5`, or `a < 100`), and conjunctions of them
- [...] `a < b`)
Describe the solution you'd like
I would like to rewrite `PruningPredicate` to use `ExprIntervalGraph`, measuring and possibly improving the performance of `ExprIntervalGraph`.
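To make the idea concrete, here is a minimal sketch of how interval-style evaluation could decide whether a container (for example a row group) can be skipped. It is not DataFusion's API: `Bounds`, `Truth`, `keep_container`, and `prune_lt` are hypothetical names, values are plain `i64`, and NULLs are ignored entirely.

```rust
/// Closed interval [lo, hi] of the values a column (or constant) can take.
/// Hypothetical toy type, not DataFusion's `Interval`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Bounds {
    pub lo: i64,
    pub hi: i64,
}

/// Result of evaluating a predicate over all values allowed by the bounds.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Truth {
    AlwaysTrue,
    AlwaysFalse,
    Maybe,
}

impl Bounds {
    /// A constant is an interval with identical endpoints.
    pub fn constant(v: i64) -> Self {
        Bounds { lo: v, hi: v }
    }

    /// Evaluate `self < other` over every possible pair of values.
    pub fn lt(self, other: Bounds) -> Truth {
        if self.hi < other.lo {
            // Largest left value is still below the smallest right value.
            Truth::AlwaysTrue
        } else if self.lo >= other.hi {
            // Smallest left value is not below the largest right value.
            Truth::AlwaysFalse
        } else {
            Truth::Maybe
        }
    }
}

/// A container must be kept unless the predicate is provably false for it.
pub fn keep_container(t: Truth) -> bool {
    t != Truth::AlwaysFalse
}

/// Evaluate `lhs < rhs` once per container; `rhs` may hold constants
/// (as in `a < 5`) or the bounds of another column (as in `a < b`).
pub fn prune_lt(lhs: &[Bounds], rhs: &[Bounds]) -> Vec<bool> {
    lhs.iter()
        .zip(rhs)
        .map(|(l, r)| keep_container(l.lt(*r)))
        .collect()
}
```

The same `lt` covers both `a < 5` and `a < b`, because a constant is just a degenerate interval. Note that this sketch evaluates one container at a time, which is the per-statistics cost raised in the comments above; a real implementation would need to measure that.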
The benefits would be:
Describe alternatives you've considered
Doing so would likely require extending the interval analysis to support more operators (like `IN` lists) to reach feature parity with the current `PruningPredicate` rewrite.
Additional context
There was a lot of discussion of this topic on the PR that originally introduced `Interval`s: #5322 (comment) between @ozankabak, @metegenez, and myself.
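As an illustration of the `IN` list extension mentioned under the alternatives, the check against min/max statistics could look roughly like the sketch below. `MinMax` and `in_list_may_match` are hypothetical names, values are plain `i64`, and NULL semantics are left out, so this shows only the shape of the logic rather than how either existing analysis implements it.

```rust
/// Inclusive min/max statistics for one container (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// Returns true if `col IN (values)` could match at least one row of the
/// container, i.e. some listed value lies within [min, max].
/// A `false` result means the container can be pruned.
pub fn in_list_may_match(stats: MinMax, values: &[i64]) -> bool {
    values.iter().any(|v| stats.min <= *v && *v <= stats.max)
}
```

A container with statistics `[10, 20]` could be skipped for `col IN (1, 5, 30)` but must be kept for `col IN (1, 15)`.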