fix guarantees in allways_true of PruningPredicate #8732

my-vegetable-has-exploded · 2024-01-03T09:54:56Z

Which issue does this PR close?

Rationale for this change

If predicate_expr is unhandled, allways_true returns true:

https://github.com/apache/arrow-datafusion/blob/9a6cc889a40e4740bfc859557a9ca9c8d043891e/datafusion/core/src/physical_optimizer/pruning.rs#L914

https://github.com/apache/arrow-datafusion/blob/9a6cc889a40e4740bfc859557a9ca9c8d043891e/datafusion/core/src/physical_optimizer/pruning.rs#L299-L302

The PruningPredicate is then filtered out here:

https://github.com/apache/arrow-datafusion/blob/9a6cc889a40e4740bfc859557a9ca9c8d043891e/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L122-L135

If an expr can't be handled as a predicate_expr (for example, an in_list with more than 20 elements) but still has some literal guarantees, the PruningPredicate may also be filtered out.

So we also need to check the literal guarantees of the PruningPredicate in the allways_true function.
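The reasoning above can be sketched in a few lines of Rust. This is a minimal illustration, not the actual DataFusion code: `Expr`, `LiteralGuarantee`, `allways_true_before`, and `keep_useful` are simplified, hypothetical stand-ins, and the sketch assumes that an unhandled predicate ends up as the literal `true`.

```rust
/// Simplified stand-in for a physical expression (hypothetical, not DataFusion's type).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A boolean literal; this sketch assumes an unhandled predicate becomes `Literal(true)`.
    Literal(bool),
    /// Any predicate that can be evaluated against min/max statistics.
    Other(String),
}

/// Simplified stand-in for a literal guarantee such as `a IN (1, 2, 3)`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct LiteralGuarantee {
    column: String,
    values: Vec<i64>,
}

/// Simplified stand-in for the pruning predicate discussed in this PR.
struct PruningPredicate {
    predicate_expr: Expr,
    literal_guarantees: Vec<LiteralGuarantee>,
}

impl PruningPredicate {
    /// Behaviour described as the bug: only the rewritten expression is checked.
    fn allways_true_before(&self) -> bool {
        self.predicate_expr == Expr::Literal(true)
    }

    /// Behaviour described as the fix: the predicate is only "always true"
    /// when there are also no literal guarantees left to check.
    fn allways_true(&self) -> bool {
        self.predicate_expr == Expr::Literal(true) && self.literal_guarantees.is_empty()
    }
}

/// Mimics a caller that drops predicates which can never prune anything.
fn keep_useful(predicates: Vec<PruningPredicate>) -> Vec<PruningPredicate> {
    predicates.into_iter().filter(|p| !p.allways_true()).collect()
}
```

With this shape, a predicate whose expression is the literal `true` but which still carries guarantees (the large in_list case) survives `keep_useful`, so a later bloom filter check still has something to work with.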

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

my-vegetable-has-exploded · 2024-01-03T10:22:44Z

Maybe it would be better to add some integration tests? But I don't know how to check whether the bloom filter works with the current code. I checked it using a metric I added myself (like https://github.com/apache/arrow-datafusion/compare/main...my-vegetable-has-exploded:arrow-datafusion:metric-sbbf?expand=1). Should I add it to this branch?

alamb · 2024-01-03T11:29:36Z

Thank you @my-vegetable-has-exploded -- I will review this later today.

This also seems somewhat related to #7869, which I plan to work on this week. It is another case of not pruning even when there is sufficient information to do so.

alamb

Thank you again @my-vegetable-has-exploded and @domyway -- I reviewed the code carefully and I have pushed a test to this branch.

I also reviewed the initial PR that added this functionality, #4280 (which I added, it seems 🤔).

The idea seems to be to entirely skip statistics for pruning predicates that will never prune anything out, which seems reasonable. But when I looked at the code, it seems like we will always decode the parquet statistics anyway, and the pruning predicate only creates predicates on demand. Thus, I don't think skipping the entire pruning predicate saves much work at all.

I'll double check in a follow on PR, but I think we might just be able to remove that code entirely.

alamb · 2024-01-03T21:20:50Z

> Maybe it would be better to add some integration tests? But I don't know how to check whether the bloom filter works with the current code. I checked it using a metric I added myself (like https://github.com/apache/arrow-datafusion/compare/main...my-vegetable-has-exploded:arrow-datafusion:metric-sbbf?expand=1). Should I add it to this branch?

I think an integration test as well as the bloom filter metrics would be good.

Here are my recommended follow-on steps:

1. One PR to add the new metrics to distinguish filtering on bloom filters vs statistics
2. One PR with some integration tests to verify bloom filters are actually pruning (maybe following how it is done in https://github.com/apache/arrow-datafusion/blob/1179a76567892b259c88f08243ee01f05c4c3d5c/datafusion/core/tests/parquet/row_group_pruning.rs#L42)
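As a rough illustration of why separate metrics would make such a test possible, here is a small Rust sketch. Everything in it is hypothetical: `RowGroupPruningMetrics`, `PruneReason`, `RowGroup`, and the counter names are invented for illustration and are not DataFusion's API, and the bloom filter is replaced by an exact list of values (a real bloom filter can report false positives).

```rust
/// Hypothetical counters; the names are illustrative, not DataFusion's metrics.
#[derive(Debug, Default, PartialEq)]
struct RowGroupPruningMetrics {
    pruned_by_statistics: usize,
    pruned_by_bloom_filter: usize,
}

/// Which check ruled out a row group.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PruneReason {
    Statistics,
    BloomFilter,
}

/// A row group summarised by its min/max statistics and the values it holds.
/// `values` is an exact stand-in for a bloom filter, so it never gives false positives.
struct RowGroup {
    min: i64,
    max: i64,
    values: Vec<i64>,
}

/// Decide whether a row group can be skipped for the predicate `col = target`.
fn prune_reason(row_group: &RowGroup, target: i64) -> Option<PruneReason> {
    if target < row_group.min || target > row_group.max {
        // Min/max statistics already prove the value is absent.
        return Some(PruneReason::Statistics);
    }
    if !row_group.values.contains(&target) {
        // Statistics could not rule it out, but the (stand-in) bloom filter can.
        return Some(PruneReason::BloomFilter);
    }
    None
}

/// Count pruned row groups per reason so a test can assert on each counter separately.
fn count_pruned(row_groups: &[RowGroup], target: i64) -> RowGroupPruningMetrics {
    let mut metrics = RowGroupPruningMetrics::default();
    for row_group in row_groups {
        match prune_reason(row_group, target) {
            Some(PruneReason::Statistics) => metrics.pruned_by_statistics += 1,
            Some(PruneReason::BloomFilter) => metrics.pruned_by_bloom_filter += 1,
            None => {}
        }
    }
    metrics
}
```

With a single combined counter, a test could not tell whether a row group was skipped by statistics or by the bloom filter; with two counters, an integration test can assert that the bloom filter counter is non-zero.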

my-vegetable-has-exploded · 2024-01-04T02:56:05Z

> I think an integration test as well as the bloom filter metrics would be good.

Thanks @alamb, I will handle it later.

alamb · 2024-01-05T20:47:31Z

> One PR to add the new metrics to distinguish filtering on bloom filters vs statistics
>
> One PR with some integration tests to verify bloom filters are actually pruning (maybe following how it is done in https://github.com/apache/arrow-datafusion/blob/1179a76567892b259c88f08243ee01f05c4c3d5c/datafusion/core/tests/parquet/row_group_pruning.rs#L42)

filed #8767 and #8768 to track this

fix: check guarantees in allways_true

ac4ca35

github-actions bot added the core Core DataFusion crate label Jan 3, 2024

my-vegetable-has-exploded changed the title ~~Minor: fix check guarantees in allways_true of PruningPredicate~~ Minor: fix guarantees in allways_true of PruningPredicate Jan 3, 2024

alamb mentioned this pull request Jan 3, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 1, 2024 #8704

Closed

9 tasks

alamb mentioned this pull request Jan 3, 2024

Regression: bloom filters are not being used in Parquet queries #8685

Closed

alamb changed the title ~~Minor: fix guarantees in allways_true of PruningPredicate~~ fix guarantees in allways_true of PruningPredicate Jan 3, 2024

alamb added 3 commits January 3, 2024 16:14

Add test for allways_true

fb1256f

refine comment

30f6ea0

Merge remote-tracking branch 'apache/main' into always-true

4ced6c7

alamb approved these changes Jan 3, 2024

alamb merged commit ad4b7b7 into apache:main Jan 3, 2024
22 checks passed

my-vegetable-has-exploded deleted the always-true branch January 9, 2024 08:56

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed
