Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43187: [C++] Support basic is_in predicate simplification #43761

Merged
merged 4 commits into from
Sep 10, 2024

Conversation

larry98
Copy link
Contributor

@larry98 larry98 commented Aug 19, 2024

Rationale for this change

Prior to #43256, this PR adds a basic implementation that does a linear scan filter over the value set on each guarantee. This isolates the correctness/semantics of is_in predicate simplification from the binary search performance optimization.

What changes are included in this PR?

SimplifyWithGuarantee now handles is_in expressions.

Are these changes tested?

A new unit test was added to arrow-compute-expression-test testing this change.

Are there any user-facing changes?

No.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great, thanks!

.WithGuarantee(
or_(equal(field_ref("i32"), literal(3)), is_null(field_ref("i32"))))
.ExpectUnchanged();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect this to pass:

Suggested change
Simplify{is_in(field_ref("i32"), int32(), "[1,3,null]", SetLookupOptions::MATCH)}
.WithGuarantee(
or_(equal(field_ref("i32"), literal(3)), is_null(field_ref("i32"))))
.Expect(is_in(field_ref("i32"), int32(), "[3,null]", SetLookupOptions::MATCH));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original intention was to not support simplification if nulls in the value set cannot be dropped (either the guarantee is nullable or the null matching behavior is INCONCLUSIVE). This is because in the optimized implementation where we binary search and slice the value set array, slicing the front would drop nulls (assuming they are placed at the end) so we would have to reallocate a new array for the simplified value set.

Do you think we ought to support nulls in the value set, and if so any thoughts on how we'd continue to support this with the binary search/slice implementation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filtering approach should be able to support arbitrary value sets, so it could serve as a fallback for the binary search/slice implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that sounds reasonable to me. I added new tests for nullable guarantees and nulls in the value set for all of the different null matching behaviors.

cpp/src/arrow/compute/expression_test.cc Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Aug 21, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Aug 23, 2024
cpp/src/arrow/compute/expression_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/expression_test.cc Show resolved Hide resolved
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 28, 2024
@larry98
Copy link
Contributor Author

larry98 commented Aug 29, 2024

@bkietz any ideas on the failing checks? The errors don't appear to be relevant, but lmk if I should look into them.

@larry98 larry98 requested a review from bkietz August 29, 2024 16:27
@mapleFU
Copy link
Member

mapleFU commented Aug 29, 2024

Failed CI unrelated

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for adding this!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Aug 29, 2024
@bkietz bkietz merged commit 44b72d5 into apache:main Sep 10, 2024
37 of 40 checks passed
@bkietz bkietz removed the awaiting merge Awaiting merge label Sep 10, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 44b72d5.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 26 possible false positives for unstable benchmarks that are known to sometimes produce them.

khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…pache#43761)

### Rationale for this change

Prior to apache#43256, this PR adds a basic implementation that does a linear scan filter over the value set on each guarantee. This isolates the correctness/semantics of `is_in` predicate simplification from the binary search performance optimization.

### What changes are included in this PR?

`SimplifyWithGuarantee` now handles `is_in` expressions.

### Are these changes tested?

A new unit test was added to arrow-compute-expression-test testing this change.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#43187

Lead-authored-by: Larry Wang <[email protected]>
Co-authored-by: larry98 <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
zeroshade pushed a commit to zeroshade/arrow that referenced this pull request Sep 30, 2024
…pache#43761)

### Rationale for this change

Prior to apache#43256, this PR adds a basic implementation that does a linear scan filter over the value set on each guarantee. This isolates the correctness/semantics of `is_in` predicate simplification from the binary search performance optimization.

### What changes are included in this PR?

`SimplifyWithGuarantee` now handles `is_in` expressions.

### Are these changes tested?

A new unit test was added to arrow-compute-expression-test testing this change.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#43187

Lead-authored-by: Larry Wang <[email protected]>
Co-authored-by: larry98 <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants