Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate interval analysies from Interval and PruningPredicate #7887

Open
Tracked by #7882
alamb opened this issue Oct 20, 2023 · 5 comments
Open
Tracked by #7882

Consolidate interval analysies from Interval and PruningPredicate #7887

alamb opened this issue Oct 20, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Oct 20, 2023

Is your feature request related to a problem or challenge?

We now have two ways to do range / interval analysis in DataFusion.

Having two representations is challenging because we have to implement the same logic in two places. For example,

  • supporting columns known to be NULL that @appletreeisyellow is working on will only affect PruningPredicates
  • Supporting LIKE or substr would require different code in different places
  • Support for IN lists added to PruningPredicate was not added to Interval arithmetic

The rewrite used by the PruningPrediate logic is tricky to understand and only handles very specific predicate forms (see #9184 for an essay on the topic and #9230 for an example of getting it wrong). Thus it is hard to extend the number of functions / types of predicates that are supported.

The existing range analysis are:

Interval based analysis

The ExprIntervalGraph library is used for cardinality estimation and range analysis for the symmetric hash join, and it:

  1. Can handle arbitrary expressions (such as a < b, not just constants a < 5)
  2. Has a story for how it would support user defined functions (each function would implement a range analysis)

Pruning Predicate Pruning Predicate is used to prune row groups based on min/max values which:

  1. Is vectorized (is efficient to evaluate over 1000s of containers)
  2. Handles the common cases of col <op> constant (such as a = 5, or a < 100), and conjunctions of them
  3. It is not clear (to me at least) how the rewrite that is used can handle arbitrary expressions (e.g. a < b)

Describe the solution you'd like

I would like to rewrite PruningPredicate to use ExprIntervalGraph, measuring and possibly improving the performance of ExprIntervalGraph

The benefits would be:

  1. We would have a single code path to extend and maintain
  2. We would have a clear path for handling arbitrary predicates to prune row groups

Describe alternatives you've considered

Doing so would likely require extending the interval analysis to support more operators (like IN lists) to reach feature parity with the current PruningPredicate rewrite

Additional context

There was a lot of discussion of this topic on the PR that originally introduced Intervals: #5322 (comment) between @ozankabak @metegenez and myself

@alamb
Copy link
Contributor Author

alamb commented Oct 20, 2023

@ozankabak and @tustvold I swear we have talked about this topic before but I could not find an existing ticket or discussion. Do you have any other pointers to past discussions?

@berkaysynnada
Copy link
Contributor

I am not sure this is what you searched for but there was an issue #5535.

Actually, I have tried to apply cp_solver strategy to prune row groups. But we observed a performance degradation since this method sacrifices vectorized computing power, meaning that the process needs to be run for each set of statistics. As the number of sets increases, the efficiency decreases.

I will again think about how to insert Interval library there without sacrificing performance.

@alamb
Copy link
Contributor Author

alamb commented Oct 21, 2023

I am not sure this is what you searched for but there was an issue #5535.

Thank you -- this is exactly what I was looking for.

I will again think about how to insert Interval library there without sacrificing performance.

I may be able to help with this too. One way is use Datum (from arrow) (rather than ScalarValue)in theInterval` representation -- that way each can store a single value or multiple.

@alamb
Copy link
Contributor Author

alamb commented Feb 14, 2024

I just updated this ticket's description with a more coherent story and examples that @appletreeisyellow and I have hit recently while working on in #9171

We were talking today and I think @appletreeisyellow may try to prototype what this solution could look like, if she has time, to move the conversation forward.

@appletreeisyellow
Copy link
Contributor

Thank you @alamb for the updating the description

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants