Add support for boolean columns in pruning logic #500

Merged: alamb merged 4 commits into apache:master from alamb/prune_bool on Jun 4, 2021

alamb
Contributor

@alamb alamb commented Jun 3, 2021

Closes #490

This PR adds pruning support for boolean predicates such as flag_col and NOT flag_col, so that they can be used, alone or combined with other predicates, to prune row groups from Parquet files.

It does not add code to handle flag_col = true or flag_col != false. Both forms currently return an error and continue to do so after this PR, because they are simplified in the ConstantEvaluation pass.

This ended up being a larger change than I wanted, because the logic that creates the col_min and col_max references was intertwined with PruningExpressionBuilder.
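To make the boolean rules concrete, here is a minimal standalone sketch. It is not the DataFusion code: `BoolStats`, `may_match_col`, `may_match_not_col`, and `keep_row_groups` are hypothetical names, and it assumes that min/max statistics are always present and that nulls can be ignored. It relies only on the ordering `false < true`: a row group can contain a true value only if its max is true, and can contain a false value only if its min is false.

```rust
/// Min/max statistics of a boolean column for one row group (hypothetical type).
/// Because `false < true`, `min` is true only when every value is true,
/// and `max` is false only when every value is false.
#[derive(Debug, Clone, Copy)]
pub struct BoolStats {
    pub min: bool,
    pub max: bool,
}

/// Returns true if the row group might contain a row where `flag_col` is true.
pub fn may_match_col(stats: BoolStats) -> bool {
    // The group can be skipped only when min and max are both false:
    // !(!min && !max) is the same as min || max.
    stats.min || stats.max
}

/// Returns true if the row group might contain a row where `NOT flag_col` is true.
pub fn may_match_not_col(stats: BoolStats) -> bool {
    // The group can be skipped only when min and max are both true.
    !(stats.min && stats.max)
}

/// Returns the indices of the row groups that must still be read.
/// `negated` selects between the predicates `flag_col` and `NOT flag_col`.
pub fn keep_row_groups(stats: &[BoolStats], negated: bool) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| {
            if negated {
                may_match_not_col(**s)
            } else {
                may_match_col(**s)
            }
        })
        .map(|(i, _)| i)
        .collect()
}
```

With three row groups whose statistics are all-false, mixed, and all-true, the predicate flag_col lets the all-false group be skipped, and NOT flag_col lets the all-true group be skipped. The mixed group is kept in both cases.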

Rationale for this change

See #490

What changes are included in this PR?

Major changes:

  1. Encapsulate stat_column_req into a new RequiredStatColumns struct
  2. Move expression reference and rewriting logic into RequiredStatColumns
  3. Add rules for boolean columns
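The encapsulation described in the first two items can be pictured with a rough sketch like the one below. It is an illustration under assumptions, not the struct from this PR: `RequiredStats`, `StatKind`, and every method name are hypothetical, and plain strings stand in for expressions. The idea shown is that one object both records which (column, statistic) pairs the rewritten predicate needs and hands out the matching col_min / col_max style names, so callers do not repeat the bookkeeping.

```rust
/// Which statistic of a column is needed (hypothetical; only min and max here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatKind {
    Min,
    Max,
}

/// Records the (column, statistic) pairs that a rewritten predicate refers to.
#[derive(Debug, Default)]
pub struct RequiredStats {
    columns: Vec<(String, StatKind)>,
}

impl RequiredStats {
    pub fn new() -> Self {
        Self::default()
    }

    /// True if this (column, statistic) pair has not been requested yet.
    fn is_missing(&self, column: &str, kind: StatKind) -> bool {
        !self.columns.iter().any(|(c, k)| c == column && *k == kind)
    }

    /// Registers the pair if it is new and returns the statistics column name.
    fn stat_column_name(&mut self, column: &str, kind: StatKind) -> String {
        if self.is_missing(column, kind) {
            self.columns.push((column.to_string(), kind));
        }
        let suffix = match kind {
            StatKind::Min => "min",
            StatKind::Max => "max",
        };
        format!("{}_{}", column, suffix)
    }

    /// Name of the column holding the per-container minimum, e.g. `flag_col_min`.
    pub fn min_column(&mut self, column: &str) -> String {
        self.stat_column_name(column, StatKind::Min)
    }

    /// Name of the column holding the per-container maximum, e.g. `flag_col_max`.
    pub fn max_column(&mut self, column: &str) -> String {
        self.stat_column_name(column, StatKind::Max)
    }

    /// The statistics that must be supplied to evaluate the rewritten predicate.
    pub fn required(&self) -> &[(String, StatKind)] {
        &self.columns
    }
}
```

Requesting the same statistic twice returns the same name and records it once, which is what lets both the binary-expression path and the new boolean-column path share one collector.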

Are there any user-facing changes?

Additional predicates (boolean columns and their negations) can now be used for pruning.

@alamb alamb changed the title Add support for boolean columns in pruning logic d10273a Add support for boolean columns in pruning logic Jun 3, 2021
@@ -324,42 +417,20 @@ impl<'a> PruningExpressionBuilder<'a> {
self.scalar_expr
}

fn is_stat_column_missing(&self, statistics_type: StatisticsType) -> bool {
Contributor Author

This logic was just refactored into RequiredStatColumns so I could reuse it

use crate::logical_plan;
let field = schema.field_with_name(column_name).ok()?;

if matches!(field.data_type(), &DataType::Boolean) {
Contributor Author

Here is the actual logic / rules

let result = p.prune(&statistics).unwrap_err();
assert!(
result.to_string().contains(
"Data type Boolean not supported for scalar operation on dyn array"
Contributor Author

These aren't great messages, but they are what happens on master today, and I figured I would document them for posterity (and maybe inspire people to help fix them).

// predicate expression can only be a binary expression
let (left, op, right) = match expr {
Expr::BinaryExpr { left, op, right } => (left, *op, right),
Expr::Column(name) => {
if let Some(expr) =
build_single_column_expr(&name, schema, required_columns, false)
Contributor

This kind of pattern probably can be written a bit shorter with some combinators. Something like:

build_single_column_expr(&name, schema, required_columns).ok().or(Ok(unhandled))

Contributor Author

An excellent idea -- I will do so
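As a sketch of the combinator idea discussed above: the surrounding code matches the helper's result with `if let Some(expr)`, which suggests it returns an `Option`, so one way to shorten the pattern is `unwrap_or` with the fallback expression. Everything below is a hypothetical stand-in: strings replace expressions, the simplified `build_single_column_expr` signature is not the real one, and the `"true"` fallback for `unhandled` (meaning "cannot prune, keep everything") is an assumption.

```rust
/// Hypothetical stand-in for a helper that can only rewrite boolean columns.
/// Returns `None` when the column cannot be handled.
pub fn build_single_column_expr(name: &str, is_boolean: bool) -> Option<String> {
    if is_boolean {
        Some(format!("{}_min OR {}_max", name, name))
    } else {
        None
    }
}

/// Rewrites a bare column predicate, falling back to an "unhandled"
/// expression instead of branching with `if let ... else`.
pub fn rewrite_column(name: &str, is_boolean: bool) -> Result<String, String> {
    // Assumed fallback: a predicate that is always true, so nothing is pruned.
    let unhandled = String::from("true");
    Ok(build_single_column_expr(name, is_boolean).unwrap_or(unhandled))
}
```

The single `unwrap_or` call replaces the explicit `if let Some(expr) = ... { return Ok(expr) } else { return Ok(unhandled) }` shape while keeping the same behavior in both branches.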

codecov-commenter commented Jun 3, 2021

Codecov Report

Merging #500 (5ceb541) into master (28b0dad) will increase coverage by 0.09%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master     #500      +/-   ##
==========================================
+ Coverage   75.92%   76.02%   +0.09%     
==========================================
  Files         154      154              
  Lines       26195    26421     +226     
==========================================
+ Hits        19889    20087     +198     
- Misses       6306     6334      +28     
Impacted Files Coverage Δ
datafusion/src/physical_optimizer/pruning.rs 91.52% <93.75%> (+1.44%) ⬆️
datafusion/src/optimizer/utils.rs 48.22% <0.00%> (-1.78%) ⬇️
...ta/rust/core/src/serde/physical_plan/from_proto.rs 38.79% <0.00%> (-0.85%) ⬇️
...sta/rust/core/src/serde/logical_plan/from_proto.rs 35.96% <0.00%> (-0.22%) ⬇️
datafusion/src/logical_plan/builder.rs 90.04% <0.00%> (-0.05%) ⬇️
datafusion/src/physical_plan/planner.rs 80.32% <0.00%> (ø)
datafusion/src/optimizer/projection_push_down.rs 98.46% <0.00%> (+<0.01%) ⬆️
datafusion/src/logical_plan/expr.rs 84.60% <0.00%> (+0.07%) ⬆️
...lista/rust/core/src/serde/logical_plan/to_proto.rs 62.48% <0.00%> (+0.15%) ⬆️
datafusion/src/sql/planner.rs 84.37% <0.00%> (+0.26%) ⬆️
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@Dandandan Dandandan left a comment

Nice feature + great refactoring

@alamb
Contributor Author

alamb commented Jun 4, 2021

I rebased this PR and added a few more tests. The code is unchanged.

Member

@jorgecarleitao jorgecarleitao left a comment

Looks great to me. 👍 Very well documented also 💯

@alamb alamb added the datafusion Changes in the datafusion crate label Jun 4, 2021
@alamb alamb merged commit 964f494 into apache:master Jun 4, 2021
@houqp houqp added the enhancement New feature or request label Jul 31, 2021
@alamb alamb deleted the alamb/prune_bool branch October 6, 2022 18:30
Labels: datafusion (Changes in the datafusion crate), enhancement (New feature or request)
Linked issue: Support pruning for boolean columns (#490)
5 participants