Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support SQL-compliant NaN behavior on eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn #2570

Merged
merged 2 commits into from
Aug 26, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Aug 24, 2022

Which issue does this PR close?

Closes #2569.

Rationale for this change

These comparison kernels behaves different with SQL semantics on NaN handling. By definition, NaN is not equal to itself. But NaN is equal to NaN with SQL semantics and NaN is larger than any other numeric values.

Using current comparison kernels in SQL system leads to different behavior and generates incorrect results.

What changes are included in this PR?

Are there any user-facing changes?

@viirya
Copy link
Member Author

viirya commented Aug 24, 2022

cc @sunchao

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 24, 2022
@@ -2386,7 +2408,30 @@ pub fn eq_dyn(left: &dyn Array, right: &dyn Array) -> Result<BooleanArray> {
_ if matches!(right.data_type(), DataType::Dictionary(_, _)) => {
typed_cmp_dict_non_dict!(right, left, |a, b| a == b, |a, b| a == b)
Copy link
Member Author

@viirya viirya Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For dictionary/non_dictionary comparison, it should be updated too. I will add it in follow-up PR. The PR is quite large.

@viirya viirya force-pushed the sql_comparison branch 3 times, most recently from f4b0ae8 to 1a6ccbf Compare August 24, 2022 06:43
@tustvold
Copy link
Contributor

Have we considered just making this the default behaviour? If we don't want to do that, I think we should name the feature flag something like ordered_nan or something to make clear it controls nan ordering and not something else?

@tustvold tustvold changed the title Support SQL-compliant behavior on eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn Support SQL-compliant NaN behavior on eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn Aug 24, 2022
Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the current behavior for NaN? maybe worth adding some context in the PR description? #264 is a good reference on this topic too. Note there is no SQL standard for NaN, and different engines may have different behaviors. For instance, in PostgresSQL NaN is only considered equal to NaN in sort, but not other cases. Therefore, I think we should clearly document the behavior introduced here.

Similar question as @tustvold : should we aim to make all the compute kernels SQL compliant? in that case we should no longer need a flag like this.

Also, could we have some tests for this too?

@viirya
Copy link
Member Author

viirya commented Aug 24, 2022

I'm open to the feature flag naming. sql_compliant is used is because this is to follow SQL semantics regarding NaN handling. Maybe it sounds too broad as currently it only deals with NaN specially.

Have we considered just making this the default behaviour?

I'm not sure that this would make sense for other usecases other than SQL.

Note there is no SQL standard for NaN, and different engines may have different behaviors. For instance, in PostgresSQL NaN is only considered equal to NaN in sort, but not other cases. Therefore, I think we should clearly document the behavior introduced here.

I think that PostgresSQL also treats NaN equal to NaN, as Spark does.

Quote from https://www.postgresql.org/docs/current/datatype-numeric.html:

In order to allow numeric values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.

Just did a test SELECT double precision 'NaN' = double precision 'NaN'; in PostgresSQL, it returns true.

I agree that we should document it clearly. I will update the document.

Similar question as @tustvold : should we aim to make all the compute kernels SQL compliant? in that case we should no longer need a flag like this.

As I answered above for @tustvold's question, I think we need both behaviors of compute kernels. For non-SQL usecases, current behavior is correct. But for SQL semantics, NaN not equal to NaN will cause practical issue when processing data, so we need different behavior with it. That's said, I think that we cannot just change all compute kernels to follow SQL semantics.

Also, could we have some tests for this too?

I have some tests for this already.

@sunchao
Copy link
Member

sunchao commented Aug 24, 2022

I think that PostgresSQL also treats NaN equal to NaN, as Spark does.

Actually Vertica is a better example there.

I have some tests for this already.

Oops didn't see them.

@viirya
Copy link
Member Author

viirya commented Aug 25, 2022

Renamed the feature flag and added documentation about it on these kernels.

@viirya
Copy link
Member Author

viirya commented Aug 25, 2022

More comments? @sunchao @tustvold

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (non-binding)

@viirya viirya merged commit 63afe25 into apache:master Aug 26, 2022
@viirya
Copy link
Member Author

viirya commented Aug 26, 2022

Thank you for review.

@viirya
Copy link
Member Author

viirya commented Aug 26, 2022

I will submit the missing pieces (dictionary array with non dictionary array, etc.) later.

@ursabot
Copy link

ursabot commented Aug 26, 2022

Benchmark runs are scheduled for baseline = a685c5f and contender = 63afe25. 63afe25 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support SQL-compliant behavior on eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn
5 participants