Support SQL-compliant NaN behavior on eq_dyn, neq_dyn, lt_dyn, lt_eq_dyn, gt_dyn, gt_eq_dyn #2570
cc @sunchao
@@ -2386,7 +2408,30 @@ pub fn eq_dyn(left: &dyn Array, right: &dyn Array) -> Result<BooleanArray> {
_ if matches!(right.data_type(), DataType::Dictionary(_, _)) => {
typed_cmp_dict_non_dict!(right, left, |a, b| a == b, |a, b| a == b)
The dictionary/non-dictionary comparison should be updated too. I will add that in a follow-up PR, since this PR is already quite large.
Have we considered just making this the default behaviour? If we don't want to do that, I think we should name the feature flag something like ordered_nan, to make clear that it controls NaN ordering and not something else.
What is the current behavior for NaN? It may be worth adding some context to the PR description. #264 is a good reference on this topic too. Note that there is no SQL standard for NaN, and different engines may behave differently. For instance, in PostgreSQL NaN is only considered equal to NaN in sorting, but not in other cases. Therefore, I think we should clearly document the behavior introduced here.
Similar question to @tustvold's: should we aim to make all the compute kernels SQL compliant? In that case we would no longer need a flag like this.
Also, could we have some tests for this too?
I'm open to the feature flag naming.
I'm not sure that this would make sense for use cases other than SQL.
I think that PostgreSQL also treats NaN as equal to NaN, as Spark does. Quote from https://www.postgresql.org/docs/current/datatype-numeric.html:
Just did a test. I agree that we should document it clearly. I will update the documentation.
As I answered above for @tustvold's question, I think we need both behaviors in the compute kernels. For non-SQL use cases, the current behavior is correct. But under SQL semantics, NaN not being equal to NaN causes practical issues when processing data, so we need different behavior there. That said, I don't think we can simply change all compute kernels to follow SQL semantics.
I have some tests for this already.
Actually Vertica is a better example there.
Oops, didn't see them.
Renamed the feature flag and added documentation about it on these kernels.
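For readers unfamiliar with how a Cargo feature can switch comparison semantics, here is a minimal sketch. It is not the code from this PR: the feature name `ordered_nan` is only the name suggested earlier in this thread (the final name is not stated here), and `float_eq` is a hypothetical helper.

```rust
/// Hypothetical helper, not part of arrow-rs.
/// Assumes a Cargo feature named "ordered_nan" (a placeholder name taken
/// from the suggestion in this thread, not the flag's actual final name).
pub fn float_eq(a: f64, b: f64) -> bool {
    if cfg!(feature = "ordered_nan") {
        // Feature enabled: two NaNs compare as equal.
        (a.is_nan() && b.is_nan()) || a == b
    } else {
        // Feature disabled: plain IEEE 754 equality, so NaN != NaN.
        a == b
    }
}
```

With the feature disabled, `float_eq(f64::NAN, f64::NAN)` returns `false`; with it enabled, the same call returns `true`. Non-NaN inputs behave identically in both configurations.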
LGTM
LGTM (non-binding)
Thank you for the review.
I will submit the missing pieces (dictionary array with non-dictionary array, etc.) later.
Benchmark runs are scheduled for baseline = a685c5f and contender = 63afe25. 63afe25 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #2569.
Rationale for this change
These comparison kernels behave differently from SQL semantics in NaN handling. By the floating-point definition, NaN is not equal to itself. Under SQL semantics, however, NaN is equal to NaN, and NaN is larger than any other numeric value.
Using the current comparison kernels in a SQL system therefore leads to different behavior and generates incorrect results.
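The difference can be illustrated with scalar values. The sketch below is only an illustration of the semantics described above, not the kernel implementation in this PR; `sql_eq` and `sql_cmp` are hypothetical helper names.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: equality where NaN is equal to NaN.
pub fn sql_eq(a: f64, b: f64) -> bool {
    if a.is_nan() || b.is_nan() {
        // Equal only when both sides are NaN.
        a.is_nan() && b.is_nan()
    } else {
        a == b
    }
}

/// Hypothetical helper: ordering where NaN is larger than any other value.
pub fn sql_cmp(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN, so partial_cmp always returns Some.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}
```

Under plain floating-point comparison, `f64::NAN == f64::NAN` and `f64::NAN > 1.0` are both `false`, whereas `sql_eq(f64::NAN, f64::NAN)` is `true` and `sql_cmp(f64::NAN, 1.0)` is `Ordering::Greater`.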
What changes are included in this PR?
Are there any user-facing changes?