Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36420: [C++] Add An Enum Option For SetLookup Options #36739

Merged
merged 46 commits into from
Sep 22, 2023

Conversation

R-JunmingChen
Copy link
Contributor

@R-JunmingChen R-JunmingChen commented Jul 18, 2023

Rationale for this change

As #36420 says, we want add an sql-compatible is_in variant, which has a different logic handling Null. After a dicussion with @ianmcook and @bkietz, we decide to support an enum option null_matching_behavior for SetLookup, which actually adds two semantics of null handling for is_in and doesn't add an new behavior for index_in.

The enum option null_matching_behavior will replace skip_nulls in the future.

What changes are included in this PR?

Add an enum parameter null_matching_behavior for SetLookupOptions.

Are these changes tested?

Two kinds of tests are implemented

  • Replace default parameter with null_matching_behavior instead of skip_nulls for is_in and index_in tests
  • Add tests for NullMatchingBehavior::EMIT_NULL and NullMatchingBehavior::INCONCLUSIVE for is_in

Besides, since the skip_nulls is not deprecated now, I still preserve the old tests with skip_nulls. When the skip_nulls is totally deprecated, we can replace the test parameter skip_nulls=false with null_matching_behavior=MATCH and skip_nulls=true with null_matching_behavior=SKIP for these old tests.

Are there any user-facing changes?

No. Currently we support backward compatibility. In the future, we plan to replace skip_nulls with null_matching_behavior completely.

@pitrou
Copy link
Member

pitrou commented Jul 19, 2023

Why not improve SetLookupOptions instead to allow selecting the desired semantics?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 19, 2023
@ianmcook
Copy link
Member

Why not improve SetLookupOptions instead to allow selecting the desired semantics?

+1. When possible, we typically prefer to add new options to existing functions instead of adding new functions.

@R-JunmingChen
Copy link
Contributor Author

Why not improve SetLookupOptions instead to allow selecting the desired semantics?

Hi, here are my two considerations:

  1. The null_handling of all the kernels of is_in is NullHandling::OUTPUT_NOT_NULL. While a sql-compatible is_in can have NULL outputs.
  2. It's a bit weird to add an option like sql_compatible specific for is_in in SetLookupOptions. Since SetLookupOptions is used by is_in and index_in.

But these two considerations are not strict obstacles for using an option instead. For 1, we can set null_handling to NullHandling::COMPUTED_PREALLOCATE and init the null bitmap to true in exec for un-sql-compatible is_in. And the new option could be named as sql_compatible, which is only used by is_in.

If you think the solution is ok or have a better solution, I will change code to an new option version soon.

@R-JunmingChen
Copy link
Contributor Author

Since no further feedback, I plan to transfer to an new option version as I have commented before. Hi, @pitrou @ianmcook Do you think it's ok?

@ianmcook
Copy link
Member

ianmcook commented Aug 2, 2023

@R-JunmingChen see my comments at #36420 (comment).

Based on that, here is what I recommend:

In SetLookupOptions, replace the bool option skip_nulls with an enum option:

enum NullMatchingBehavior
How to match null values.
Values:

  • enumerator MATCH
    Any null in value_set is successfully matched to null values in the lookup set.
  • enumerator SKIP
    Any null in value_set is ignored and produces a null (IndexIn) or false (IsIn) in the output.
  • enumerator EMIT_NULL
    Any null in value_set produces a null in the output.
  • enumerator INCONCLUSIVE
    Any null in value_set produces a null in the output. Any null in the lookup set is treated as an unknown value that cannot conclusively match values in value_set and any value in value_set that cannot be conclusively matched produces a null in the output. This is consistent with the behavior of IN in SQL.

Alternatively we could name it UNKNOWN instead of INCONCLUSIVE if people think that is clearer.

For backward compatibility:

  • Default to MATCH (consistent with the current default skip_nulls = false)
  • Continue to accept the existing boolean argument skip_nulls and respect it if the user does not use the new enum-based option. false translates to MATCH and true translates to SKIP.

I believe this should work with both is_in and index_in.

@westonpace does this look OK to you?

@R-JunmingChen
Copy link
Contributor Author

R-JunmingChen commented Aug 14, 2023

Hi, @ianmcook. I think it's a good solution to use an enum option, but there are still somethings I think should be discussed.

  1. Why do we want to support EMIT_NULL? MATCH, SKIP and INCONCLUSIVE seem sufficient.
  2. Since we want support skip_nulls for backward compatibility, we need an enumerator UNSET for NullMatchingBehavior, which means we only use skip_nulls instead of NullMatchingBehavior, and make it default for the enum option. Cause we can't translate when user set the skip_nulls via the way like setLookupOptions.skip_nulls = false.

@ianmcook
Copy link
Member

Thanks @R-JunmingChen

  1. Why do we want to support EMIT_NULL? MATCH, SKIP and INCONCLUSIVE seem sufficient.

It is nice to have EMIT_NULL for completeness. There are some cases where the user might want that behavior—for example, to match the behavior of DuckDB's list_contains function like I describe here: #36420 (comment)

  1. Since we want support skip_nulls for backward compatibility, we need an enumerator UNSET for NullMatchingBehavior, which means we only use skip_nulls instead of NullMatchingBehavior, and make it default for the enum option. Cause we can't translate when user set the skip_nulls via the way like setLookupOptions.skip_nulls = false.

I think it is possible to define 2 explicit public functions of the class SetLookupOptions:

  1. SetLookupOptions(Datum value_set, bool skip_nulls = false) (same as currently)
  2. SetLookupOptions(Datum value_set, NullMatchingBehavior null_matching_behavior = NullMatchingBehavior::MATCH)

I think that will give backward compatibility.

@R-JunmingChen
Copy link
Contributor Author

R-JunmingChen commented Aug 15, 2023

I think it is possible to define 2 explicit public functions of the class SetLookupOptions:

SetLookupOptions(Datum value_set, bool skip_nulls = false) (same as currently)
SetLookupOptions(Datum value_set, NullMatchingBehavior null_matching_behavior = NullMatchingBehavior::MATCH)

Hi @ianmcook, do you mean we remove the member variable skip_nulls and make it a parameter in constructor function and then translate the skip_nulls to NullMatchingBehavior ?

I think this way will affect some users who set skip_nulls directly via member variable. Does it ok?

If we don't remove member variable `skip_nulls, the translation problem still exists. For example:

SetLookupOptions option(value_set);
option.skip_nulls = true;

@ianmcook
Copy link
Member

I think it is possible to define 2 explicit public functions of the class SetLookupOptions:
SetLookupOptions(Datum value_set, bool skip_nulls = false) (same as currently)
SetLookupOptions(Datum value_set, NullMatchingBehavior null_matching_behavior = NullMatchingBehavior::MATCH)

Hi @ianmcook, do you mean we remove the member variable skip_nulls and make it a parameter in constructor function and then translate the skip_nulls to NullMatchingBehavior ?

I think this way will affect some users who set skip_nulls directly via member variable. Does it ok?

Yes, I think this is OK. We can mark this as a breaking change if needed, but I think it should not affect most users. @westonpace or @bkietz can you please confirm?

@bkietz
Copy link
Member

bkietz commented Aug 16, 2023

I would not expect there to be many direct users of the skip_nulls data member, but disentangling the corresponding cython will be annoying. Since it's a breaking change, we could compromise with:

/// Options for IsIn and IndexIn functions
class ARROW_EXPORT SetLookupOptions : public FunctionOptions {
 public:
  enum NullMatchingBehavior { MATCH, SKIP, EMIT_NULL, INCONCLUSIVE };

  explicit SetLookupOptions(Datum value_set, NullMatchingBehavior = MATCH);
  explicit SetLookupOptions(Datum value_set, bool skip_nulls);
  SetLookupOptions();
  static constexpr char const kTypeName[] = "SetLookupOptions";

  /// The set of values to look up input values into.
  Datum value_set;

  // will be overridden by skip_nulls if that is explicitly assigned
  NullMatchingBehavior null_matching_behavior;

  // DEPRECATED(use null_matching_behavior instead)
  std::optional<bool> skip_nulls;
};

with this, opts.skip_nulls = false will still work and operate as expected and (I think) cython changes will be minimal until we're ready fully to accommodate null_matching_behavior in python.

@R-JunmingChen
Copy link
Contributor Author

R-JunmingChen commented Aug 18, 2023

I will implement this version ASAP. Let me turn this PR to a draft temporarily

@R-JunmingChen R-JunmingChen marked this pull request as draft August 18, 2023 02:26
@github-actions github-actions bot added the awaiting changes Awaiting changes label Sep 20, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 20, 2023
Copy link
Member

@ianmcook ianmcook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me in terms of its API and docs. I defer to @bkietz to give a technical approval.

It looks like Python tests are failing. Can you troubleshoot that please? Thanks

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Sep 20, 2023
Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

CI failures seem unrelated:

Dev failure looks like #37803

Appveyor failure looks like #37790

I've restarted the python test (which seemed to fail inside pip). If that passes I think we can merge

@ianmcook
Copy link
Member

@ursabot please benchmark

@ursabot
Copy link

ursabot commented Sep 20, 2023

Benchmark runs are scheduled for commit 38c1a04. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow
Copy link

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 38c1a04.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

@R-JunmingChen
Copy link
Contributor Author

R-JunmingChen commented Sep 21, 2023

The Python failure looks like #37803 too. The error in its screenshot is similar with what we fails in our Python CI.

@ianmcook
Copy link
Member

I'm not sure why Conbench isn't working as expected here. The benchmarks should run after this merges and we can check for regressions there.

@austin3dickey
Copy link
Contributor

austin3dickey commented Sep 21, 2023

I'm not sure why Conbench isn't working as expected here.

Sorry, this looks to be a bug with the "ursabot please benchmark" workflow. Will fix today.

voltrondata-labs/benchmarks#150

@austin3dickey
Copy link
Contributor

That should be fixed. If you push an empty commit (voltrondata-labs/arrow-benchmarks-ci#111) and then re-run the command, you can try benchmarks again. Else, yeah, they will be run post-merge.

@ianmcook
Copy link
Member

@bkietz if you want to go ahead and merge when you're comfortable with this, we can just look at the benchmarks post-merge

@bkietz bkietz merged commit 772a01c into apache:main Sep 22, 2023
@bkietz bkietz removed the awaiting merge Awaiting merge label Sep 22, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 772a01c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

etseidl pushed a commit to etseidl/arrow that referenced this pull request Sep 28, 2023
…e#36739)

### Rationale for this change
As apache#36420 says, we want add an sql-compatible `is_in` variant, which has a different logic handling Null. After a dicussion with @ ianmcook and @ bkietz, we decide to support an enum option `null_matching_behavior` for SetLookup, which actually adds two semantics of null handling for `is_in` and doesn't add an new behavior for `index_in`.

The enum option `null_matching_behavior` will replace `skip_nulls` in the future.

### What changes are included in this PR?
Add an enum parameter `null_matching_behavior` for SetLookupOptions.

### Are these changes tested?
Two kinds of tests are implemented
- Replace default parameter with `null_matching_behavior` instead of `skip_nulls` for `is_in` and `index_in` tests 
- Add tests for `NullMatchingBehavior::EMIT_NULL` and `NullMatchingBehavior::INCONCLUSIVE` for `is_in`

Besides, since the `skip_nulls` is not deprecated now, I still preserve the old tests with `skip_nulls`. When the `skip_nulls` is totally deprecated, we can replace the test parameter `skip_nulls=false` with `null_matching_behavior=MATCH` and `skip_nulls=true` with `null_matching_behavior=SKIP` for these old tests.

### Are there any user-facing changes?
No. Currently we support backward compatibility. In the future, we plan to replace `skip_nulls` with `null_matching_behavior` completely.

* Closes: apache#36420

Lead-authored-by: Junming Chen <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…e#36739)

### Rationale for this change
As apache#36420 says, we want add an sql-compatible `is_in` variant, which has a different logic handling Null. After a dicussion with @ ianmcook and @ bkietz, we decide to support an enum option `null_matching_behavior` for SetLookup, which actually adds two semantics of null handling for `is_in` and doesn't add an new behavior for `index_in`.

The enum option `null_matching_behavior` will replace `skip_nulls` in the future.

### What changes are included in this PR?
Add an enum parameter `null_matching_behavior` for SetLookupOptions.

### Are these changes tested?
Two kinds of tests are implemented
- Replace default parameter with `null_matching_behavior` instead of `skip_nulls` for `is_in` and `index_in` tests 
- Add tests for `NullMatchingBehavior::EMIT_NULL` and `NullMatchingBehavior::INCONCLUSIVE` for `is_in`

Besides, since the `skip_nulls` is not deprecated now, I still preserve the old tests with `skip_nulls`. When the `skip_nulls` is totally deprecated, we can replace the test parameter `skip_nulls=false` with `null_matching_behavior=MATCH` and `skip_nulls=true` with `null_matching_behavior=SKIP` for these old tests.

### Are there any user-facing changes?
No. Currently we support backward compatibility. In the future, we plan to replace `skip_nulls` with `null_matching_behavior` completely.

* Closes: apache#36420

Lead-authored-by: Junming Chen <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…e#36739)

### Rationale for this change
As apache#36420 says, we want add an sql-compatible `is_in` variant, which has a different logic handling Null. After a dicussion with @ ianmcook and @ bkietz, we decide to support an enum option `null_matching_behavior` for SetLookup, which actually adds two semantics of null handling for `is_in` and doesn't add an new behavior for `index_in`.

The enum option `null_matching_behavior` will replace `skip_nulls` in the future.

### What changes are included in this PR?
Add an enum parameter `null_matching_behavior` for SetLookupOptions.

### Are these changes tested?
Two kinds of tests are implemented
- Replace default parameter with `null_matching_behavior` instead of `skip_nulls` for `is_in` and `index_in` tests 
- Add tests for `NullMatchingBehavior::EMIT_NULL` and `NullMatchingBehavior::INCONCLUSIVE` for `is_in`

Besides, since the `skip_nulls` is not deprecated now, I still preserve the old tests with `skip_nulls`. When the `skip_nulls` is totally deprecated, we can replace the test parameter `skip_nulls=false` with `null_matching_behavior=MATCH` and `skip_nulls=true` with `null_matching_behavior=SKIP` for these old tests.

### Are there any user-facing changes?
No. Currently we support backward compatibility. In the future, we plan to replace `skip_nulls` with `null_matching_behavior` completely.

* Closes: apache#36420

Lead-authored-by: Junming Chen <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…e#36739)

### Rationale for this change
As apache#36420 says, we want add an sql-compatible `is_in` variant, which has a different logic handling Null. After a dicussion with @ ianmcook and @ bkietz, we decide to support an enum option `null_matching_behavior` for SetLookup, which actually adds two semantics of null handling for `is_in` and doesn't add an new behavior for `index_in`.

The enum option `null_matching_behavior` will replace `skip_nulls` in the future.

### What changes are included in this PR?
Add an enum parameter `null_matching_behavior` for SetLookupOptions.

### Are these changes tested?
Two kinds of tests are implemented
- Replace default parameter with `null_matching_behavior` instead of `skip_nulls` for `is_in` and `index_in` tests 
- Add tests for `NullMatchingBehavior::EMIT_NULL` and `NullMatchingBehavior::INCONCLUSIVE` for `is_in`

Besides, since the `skip_nulls` is not deprecated now, I still preserve the old tests with `skip_nulls`. When the `skip_nulls` is totally deprecated, we can replace the test parameter `skip_nulls=false` with `null_matching_behavior=MATCH` and `skip_nulls=true` with `null_matching_behavior=SKIP` for these old tests.

### Are there any user-facing changes?
No. Currently we support backward compatibility. In the future, we plan to replace `skip_nulls` with `null_matching_behavior` completely.

* Closes: apache#36420

Lead-authored-by: Junming Chen <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Add an sql-compatible is_in variant
9 participants