Support =, <, <=, >, >=, !=, is distinct from, is not distinct from for BooleanArray #1163

Merged on Nov 20, 2021 (5 commits)

@alamb alamb commented Oct 21, 2021

Which issue does this PR close?

Resolves #1159

PR is mostly tests

Sorry for what looks like a large PR :( I blame the tests

Rationale for this change

See #1159

This is mostly interesting so that during constant folding / simplification we can simplify degenerate expressions like true = false
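
To illustrate the kind of simplification meant here, the following is a minimal sketch of folding a comparison between two boolean literals. The `Expr` enum and the `fold` function are hypothetical stand-ins invented for this example, not DataFusion's actual expression or optimizer types.

```rust
/// A toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A boolean literal; `None` models SQL NULL.
    Literal(Option<bool>),
    /// A reference to a boolean column, by name.
    Column(String),
    /// `left = right`
    Eq(Box<Expr>, Box<Expr>),
}

/// Fold `literal = literal` into a single literal, working bottom-up.
/// Anything that involves a column is left as it is.
fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::Eq(l, r) => match (fold(*l), fold(*r)) {
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(match (a, b) {
                (Some(x), Some(y)) => Some(x == y),
                // SQL `=` yields NULL when either side is NULL.
                _ => None,
            }),
            (l, r) => Expr::Eq(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

With this sketch, `true = false` folds to the literal `false`, while `c1 = true` is returned unchanged because its value depends on the data.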

Now that @jimexist added apache/arrow-rs#860 and @Dandandan added apache/arrow-rs#844 in arrow, this PR hooks them up.

Also, it has the nice side effect that parquet row group pruning is now supported for boolean columns as well 🎉

What changes are included in this PR?

  1. Update to arrow 6.2.0
  2. Support =, <, <=, >, >=, !=, is distinct from, is not distinct from for BooleanArray (aka for boolean columns)
  3. Simple implementations of *_scalar_bool
  4. Many tests
  5. Update the pruning tests to reflect the fact that boolean pruning now happens
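
As a rough illustration of the semantics these operators are expected to have on nullable booleans, here is a sketch that uses `Option<bool>` with `None` standing in for NULL. The function names are made up for this example and are not the arrow-rs kernels. The sketch assumes the usual SQL rules (a comparison with NULL yields NULL, while IS DISTINCT FROM treats NULL as an ordinary comparable value) and that `false` orders before `true`.

```rust
/// `None` models SQL NULL.
type Nullable = Option<bool>;

/// SQL `=`: NULL if either input is NULL.
fn sql_eq(a: Nullable, b: Nullable) -> Nullable {
    let (a, b) = (a?, b?);
    Some(a == b)
}

/// SQL `<`: NULL if either input is NULL; assumes false < true.
fn sql_lt(a: Nullable, b: Nullable) -> Nullable {
    let (a, b) = (a?, b?);
    Some(a < b)
}

/// `IS DISTINCT FROM`: never NULL; two NULLs are not distinct.
fn is_distinct_from(a: Nullable, b: Nullable) -> bool {
    a != b
}

/// `IS NOT DISTINCT FROM`: the negation of the above.
fn is_not_distinct_from(a: Nullable, b: Nullable) -> bool {
    a == b
}

/// Apply a nullable comparison element-wise to two equal-length columns.
fn compare(
    left: &[Nullable],
    right: &[Nullable],
    op: fn(Nullable, Nullable) -> Nullable,
) -> Vec<Nullable> {
    left.iter().zip(right).map(|(&l, &r)| op(l, r)).collect()
}
```

The remaining operators (`<=`, `>`, `>=`, `!=`) would follow the same pattern as `sql_lt` in this sketch, changing only the comparison applied to the two non-null values.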

Are there any user-facing changes?

Fewer errors

@@ -814,14 +870,68 @@ pub fn binary(
Ok(Arc::new(BinaryExpr::new(l, op, r)))
}

// TODO file a ticket with arrow-rs to include these kernels
Contributor Author

Suggested change
// TODO file a ticket with arrow-rs to include these kernels
// When arrow-rs has these kernels, can remove this implementation
// see https://github.com/apache/arrow-rs/issues/842

Filed ticket in arrow apache/arrow-rs#842

Contributor Author

FWIW @Dandandan has added these kernels upstream to Arrow so we can use 6.1.0 when that comes out (in a week or so): apache/arrow-rs#844

Contributor Author

@jimexist has actually implemented operations like bool_lt etc in apache/arrow-rs#860 so when that is available in datafusion (next week) I will update this PR to include those operations as well

@alamb alamb marked this pull request as draft October 26, 2021 10:51
@alamb alamb changed the title Support <bool col> = <bool col> and <bool col> != <bool col> (WIP) Support <bool col> = <bool col> and <bool col> != <bool col> Oct 26, 2021

alamb commented Nov 2, 2021

This one is waiting on arrow-rs 6.1 to be released, and then I should be able to clean it up and get it ready for a proper review


alamb commented Nov 5, 2021

Turns out that we forgot to make the required functions public 🤦 . Will wait for arrow 6.2 to include apache/arrow-rs#913

@alamb alamb changed the title (WIP) Support <bool col> = <bool col> and <bool col> != <bool col> Support <bool col> = <bool col> and <bool col> != <bool col> Nov 15, 2021
@alamb alamb marked this pull request as ready for review November 15, 2021 19:19
@alamb alamb changed the title Support <bool col> = <bool col> and <bool col> != <bool col> Support =, <, <=, >, >=, != for BooleanArray Nov 15, 2021
@alamb alamb changed the title Support =, <, <=, >, >=, != for BooleanArray Support =, <, <=, >, >=, !=, is distinct from, is not distinct from for BooleanArray Nov 15, 2021
@alamb alamb left a comment

result
)
let result = p.prune(&statistics).unwrap();
assert_eq!(result, expected_true);
Contributor Author

pruning works for boolean columns now
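
For readers unfamiliar with the idea, here is a minimal sketch of how min/max statistics could be used to skip row groups for a predicate like `col = true`. The `BoolStats` struct and both functions are hypothetical and much simpler than DataFusion's pruning code; the sketch assumes `false` orders before `true` and that a row group with missing statistics must always be kept.

```rust
/// Hypothetical min/max statistics for a boolean column in one row group.
/// `None` means the statistic is unavailable.
struct BoolStats {
    min: Option<bool>,
    max: Option<bool>,
}

/// Returns true if the row group might contain a row where `col = value`.
fn may_match_eq(stats: &BoolStats, value: bool) -> bool {
    match (stats.min, stats.max) {
        // The value can only occur if it lies within [min, max].
        (Some(min), Some(max)) => min <= value && value <= max,
        // Missing statistics: we cannot rule the row group out.
        _ => true,
    }
}

/// One entry per row group: true = keep (must be read), false = pruned.
fn prune(row_groups: &[BoolStats], value: bool) -> Vec<bool> {
    row_groups.iter().map(|s| may_match_eq(s, value)).collect()
}
```

In this sketch, a row group whose min and max are both `false` is pruned for `col = true`, whereas a group with mixed values or unknown statistics is kept.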

datafusion/src/physical_plan/expressions/binary.rs (outdated)
@@ -276,6 +377,7 @@ macro_rules! binary_array_op_scalar {
DataType::Date64 => {
compute_op_scalar!($LEFT, $RIGHT, $OP, Date64Array)
}
DataType::Boolean => compute_bool_op_scalar!($LEFT, $RIGHT, $OP, BooleanArray),
Contributor Author

adding this line and the one below it adds all the new support, which is kind of cool! It is terrifying how many functions end up being called :)

// where a null array is generated for some statistics columns
// int > 1 and bool = true => c1_max > 1 and null
let expr = col("c1").gt(lit(15)).and(col("c2").eq(lit(true)));
// test row group predicate with an unknown (Null) expr
Contributor Author

now bool stats don't result in null columns, so I needed to use a constant to get the same effect

@alamb alamb requested review from Dandandan and jimexist November 15, 2021 22:06

alamb commented Nov 18, 2021

This PR is ready for review / analysis if/when you get a chance @jimexist / @Dandandan / @houqp / @rdettai. It looks much bigger than it is because of the tests. It is mostly about hooking up some more arrow compute kernels

There are many PRs flying in DataFusion recently 😅 fun times

@rdettai rdettai left a comment

Great addition @alamb! Thanks!

@alamb alamb left a comment

Thanks for the review @rdettai -- I'll plan to merge this one in tomorrow and file arrow-rs tickets if there are no other comments.

.expect("compute_op failed to downcast array");
// generate the scalar function name, such as lt_scalar, from the $OP parameter
// (which could have a value of lt) and the suffix _scalar
Ok(Arc::new(paste::expr! {[<$OP _bool_scalar>]}(
Member

TIL

Contributor Author

For the record this pattern is used elsewhere in this file, I was just following it :)

@alamb alamb merged commit 00850a4 into apache:master Nov 20, 2021
@alamb alamb deleted the alamb/bool_expr branch November 20, 2021 12:59
@alamb alamb added the enhancement New feature or request label Feb 10, 2022
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Support boolean == boolean and boolean != boolean operators
3 participants