Add Datum based arithmetic kernels (#3999) #4465

tustvold · 2023-06-29T13:29:15Z

Which issue does this PR close?

Closes #527
Closes #3999

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

divide_dyn will now return NaN for division by zero
Decimal arithmetic is always checked and sets the precision and scale automatically
Temporal arithmetic is always checked

arrow-arith/src/operation.rs

alamb

😍

arrow-arith/src/operation.rs

tustvold · 2023-07-04T12:20:38Z

arrow-arith/src/arithmetic.rs

-        ]);
-
-        // unchecked
-        let result = subtract_dyn(&a, &b);


Temporal arithmetic is now always checked

tustvold · 2023-07-04T12:23:27Z

arrow-arith/src/arithmetic.rs

    fn test_f32_array_modulus_dyn_by_zero() {
        let a = Float32Array::from(vec![1.5]);
        let b = Float32Array::from(vec![0.0]);
-        modulus_dyn(&a, &b).unwrap();
+        let result = modulus_dyn(&a, &b).unwrap();
+        assert!(result.as_primitive::<Float32Type>().value(0).is_nan());


Floating point arithmetic now follows the IEEE 754 standard. My research showed that databases handle division by zero very inconsistently, some returning null and some an error. Broadly speaking it seems peculiar to special-case division by zero, and not any of the other cases that can lead to NaN. Much like we do for total ordering of floats, I think we should just follow the floating point standard rather than trying to copy some subset of the databases in the wild. As a side benefit this is also significantly faster 😄
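
The IEEE 754 behaviour described above can be sketched with the standard library alone. The helpers `div_ieee` and `rem_ieee` are hypothetical stand-ins for an element-wise kernel over plain slices; they are not the arrow kernels themselves.

```rust
/// Element-wise `lhs / rhs` (hypothetical helper, not an arrow API).
/// Floats follow IEEE 754, so a zero divisor neither panics nor errors.
fn div_ieee(lhs: &[f32], rhs: &[f32]) -> Vec<f32> {
    lhs.iter().zip(rhs).map(|(a, b)| a / b).collect()
}

/// Element-wise `lhs % rhs` (hypothetical helper, not an arrow API).
/// A remainder by zero evaluates to NaN instead of raising an error.
fn rem_ieee(lhs: &[f32], rhs: &[f32]) -> Vec<f32> {
    lhs.iter().zip(rhs).map(|(a, b)| a % b).collect()
}
```

With these helpers, a nonzero value divided by zero is a signed infinity, `0.0 / 0.0` is NaN, and `1.5 % 0.0` is NaN, which is the case the updated test asserts.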

tustvold · 2023-07-04T12:40:35Z

arrow-arith/src/numeric.rs

+///
+/// Overflow or division by zero will result in an error, with exception to
+/// floating point numbers, which instead follow the IEEE 754 rules
+pub fn rem(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<ArrayRef, ArrowError> {


I opted for rem instead of mod to be consistent with the Rust nomenclature for this operation
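
For readers less familiar with the Rust naming, here is a small standard-library sketch of what "rem" means for integers and of the two failure cases the doc comment mentions. `remainder` and `rem_checked` are hypothetical helper names, and the error strings are made up for illustration.

```rust
use std::ops::Rem;

/// Generic remainder via the standard `Rem` trait (the `%` operator).
fn remainder<T: Rem<Output = T>>(a: T, b: T) -> T {
    a.rem(b)
}

/// Checked integer remainder (hypothetical helper): returns an error for a
/// zero divisor and for the one overflowing case, `i32::MIN % -1`.
fn rem_checked(a: i32, b: i32) -> Result<i32, String> {
    a.checked_rem(b).ok_or_else(|| {
        if b == 0 {
            "Divide by zero".to_string()
        } else {
            format!("Overflow: {} % {}", a, b)
        }
    })
}
```

Rust's `%` is a truncating remainder, so `remainder(-7, 3)` is `-1`, whereas a mathematical modulus (`rem_euclid`) would give `2`.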

alamb

I think the code looks (really) nice 👨‍🍳 👌

Tests?

I didn't see any new tests -- how well covered is this code? Maybe we can port the old tests from arithmetic.rs?

API change

This is a non-trivial API change, right? I like that you have added deprecation notices to the kernels in the arithmetic module

I think we should do some other things to help users:

Add (deprecated) backwards compatibility definitions (for example, define an add_wrapping that calls add) so they don't need to change all their code immediately
Consider writing up a "migration guide" that highlights the changes for users -- namely the new Datum abstraction and that the arithmetic kernels are now all dynamically dispatched (and renamed for consistency).
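
As an illustration of the backwards-compatibility idea in the first suggestion, a deprecated shim could look roughly like the following. This is only a sketch over plain slices: the signatures are simplified assumptions, and it is not what the PR itself does (the old generic kernels keep their own implementations).

```rust
/// New-style checked addition (simplified, hypothetical signature).
fn add(lhs: &[i32], rhs: &[i32]) -> Result<Vec<i32>, String> {
    lhs.iter()
        .zip(rhs)
        .map(|(a, b)| {
            a.checked_add(*b)
                .ok_or_else(|| format!("Overflow: {} + {}", a, b))
        })
        .collect()
}

/// Old name kept as a thin wrapper so existing callers still compile,
/// while the compiler warns them to migrate.
#[deprecated(note = "Use `add` instead")]
fn add_checked(lhs: &[i32], rhs: &[i32]) -> Result<Vec<i32>, String> {
    add(lhs, rhs)
}
```

Callers of `add_checked` get a deprecation warning rather than a compile error, so they can switch over at their own pace.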

alamb · 2023-07-05T19:02:53Z

arrow-arith/src/aggregate.rs

        // create an array that actually has non-zero values at the invalid indices
-        let c = add(&a, &b).unwrap();
+        let validity = NullBuffer::new((1..=100).map(|x| x % 3 == 0).collect());


I assume these are updated because they were testing the sum kernel for two arrays rather than the aggregate sum (which is what is defined in this module)?

It was to avoid using a now deprecated kernel
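
The point of the test setup above (non-zero values sitting at invalid indices) can be sketched without arrow: an aggregate sum has to consult the validity mask and ignore whatever is stored in null slots. `sum_valid` is a hypothetical helper over plain slices, not the arrow aggregate kernel.

```rust
/// Sum only the values whose validity flag is set (hypothetical helper).
/// Returns `None` when every slot is null.
fn sum_valid(values: &[i64], validity: &[bool]) -> Option<i64> {
    let mut total: Option<i64> = None;
    for (value, is_valid) in values.iter().zip(validity) {
        if *is_valid {
            // Values in null slots are never read into the total.
            total = Some(total.unwrap_or(0) + value);
        }
    }
    total
}
```

With values `1..=100` and the same `x % 3 == 0` mask as in the diff, only the multiples of three contribute to the result.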

alamb · 2023-07-05T19:04:10Z

arrow-arith/src/lib.rs

@@ -18,8 +18,10 @@
 //! Arrow arithmetic and aggregation kernels

 pub mod aggregate;
+#[doc(hidden)] // Kernels to be removed in a future release


I think this should be tracked in another ticket perhaps

alamb · 2023-07-05T19:04:35Z

arrow-arith/src/numeric.rs

+// specific language governing permissions and limitations
+// under the License.
+
+//! Defines numeric kernels on PrimitiveArray


Suggested change

-//! Defines numeric kernels on PrimitiveArray
+//! Defines numeric kernels on [`PrimitiveArray`] such as [`add`]

alamb · 2023-07-05T19:06:55Z

arrow-arith/src/numeric.rs

+    let l = l.as_primitive::<T>();
+    let r = r.as_primitive::<T>();
+    let array: PrimitiveArray<T> = match op {
+        Op::AddWrapping | Op::Add => op!(l, l_s, r, r_s, l.add_wrapping(r)),


Add and AddWrapping both call add_wrapping because there is no such thing as "float overflow", right (when values exceed the range they turn into NaN or inf)?
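
A quick standard-library check of the premise in this question, independent of the arrow code: float addition has no wrapping or error case to distinguish. `add_f32` is a hypothetical element-wise helper.

```rust
/// Element-wise float addition (hypothetical helper, not an arrow API).
/// There is no "checked" variant to choose: IEEE 754 addition never fails.
fn add_f32(lhs: &[f32], rhs: &[f32]) -> Vec<f32> {
    lhs.iter().zip(rhs).map(|(a, b)| a + b).collect()
}
```

Exceeding the representable range gives infinity (`f32::MAX + f32::MAX`), and NaN only appears for undefined cases such as `inf + (-inf)`.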

alamb · 2023-07-05T19:09:45Z

arrow-arith/src/numeric.rs

+    let array: PrimitiveArray<T> = match (op, r.data_type()) {
+        (Op::Sub | Op::SubWrapping, Timestamp(unit, _)) if unit == &T::UNIT => {
+            let r = r.as_primitive::<T>();
+            return Ok(try_op_ref!(T::Duration, l, l_s, r, r_s, l.sub_checked(r)));


Do I read this correctly: Op::SubWrapping will generate an error on underflow (only) for timestamp arithmetic?

Given the other kernels seem to use Op::SubWrapping and Op::Sub to distinguish between the non-erroring and erroring variants, is there a reason for the discrepancy in timestamp behavior?

If this behavior will stay, I think it should be documented in add, add_wrapping, etc

The docs on add_wrapping state that it only performs wrapping overflow for integers, i.e. not for temporal, decimal, etc. This is because the overflow behaviour is not very well defined, and at least in the temporal case has never existed. Let me know if the existing docs are insufficient

The docs say:

/// Perform lhs + rhs, wrapping on overflow for integers

I think it would improve the UX a lot if it explicitly called out that wrapping doesn't apply to temporal types (given they are stored as integers, one could imagine people like myself not realizing they don't wrap on overflow)

I will update to reference DataType::is_integer
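
The distinction discussed in this thread can be sketched as follows. This is a simplified model, assuming timestamps are represented as raw `i64` values; the `Op` enum here has only two variants and both functions are hypothetical, not the code in `numeric.rs`.

```rust
/// Simplified operation selector (the real enum has more variants).
#[derive(Clone, Copy)]
enum Op {
    Sub,
    SubWrapping,
}

/// Plain integers: `Sub` is checked, `SubWrapping` wraps around.
fn sub_integer(op: Op, l: i64, r: i64) -> Result<i64, String> {
    match op {
        Op::Sub => l
            .checked_sub(r)
            .ok_or_else(|| format!("Overflow: {} - {}", l, r)),
        Op::SubWrapping => Ok(l.wrapping_sub(r)),
    }
}

/// Timestamps (stored as i64 here): both variants take the checked path,
/// so the wrapping variant still reports overflow as an error.
fn sub_timestamp(op: Op, l: i64, r: i64) -> Result<i64, String> {
    match op {
        Op::Sub | Op::SubWrapping => l
            .checked_sub(r)
            .ok_or_else(|| format!("Overflow: {} - {}", l, r)),
    }
}
```

Because a timestamp is physically an integer, the difference only shows up at the extremes: `sub_integer(Op::SubWrapping, i64::MIN, 1)` wraps to `i64::MAX`, while the same inputs to `sub_timestamp` produce an error.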

arrow-arith/src/numeric.rs

alamb · 2023-07-05T19:17:21Z

arrow-arith/src/numeric.rs

+use crate::arity::{binary, try_binary};
+
+/// Perform `lhs + rhs`, returning an error on overflow
+pub fn add(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<ArrayRef, ArrowError> {


Rather than add maybe we could call this add_checked to:

Make it explicit the user is choosing the checked variant

Be consistent with https://docs.rs/num-traits/latest/num_traits/ops/wrapping/index.html ?

I don't feel super strongly about this

I thought it was less confusing to make wrapping the special case, as it only impacts integers

alamb · 2023-07-05T19:21:31Z

arrow-arith/src/arithmetic.rs

 pub fn add_checked<T: ArrowNumericType>(
    left: &PrimitiveArray<T>,
    right: &PrimitiveArray<T>,
 ) -> Result<PrimitiveArray<T>, ArrowError> {
-    math_checked_op(left, right, |a, b| a.add_checked(b))
+    try_binary(left, right, |a, b| a.add_checked(b))


Could this call the new add_checked kernel directly?

Theoretically yes, but I was trying to avoid changing the behaviour of the generic kernels

But aren't the tests in terms of the original kernels? If you don't call into the new kernels they aren't tested.

Or perhaps I am missing something

The dyn kernels now call through to the datum kernels
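
To make the call chain concrete, here is a heavily simplified, std-only model of the idea: a `Datum` is either an array or a scalar, the new kernel takes two of them, and an older array-only entry point simply forwards to it. The trait, `Scalar`, `add` and `add_dyn` below are hypothetical stand-ins over `i64` values and do not reproduce the arrow signatures.

```rust
/// Simplified stand-in for a Datum: the values plus an "is scalar" flag.
trait Datum {
    fn get(&self) -> (&[i64], bool);
}

impl Datum for Vec<i64> {
    fn get(&self) -> (&[i64], bool) {
        (self.as_slice(), false)
    }
}

/// A single value that is applied against every element of the other side.
struct Scalar(i64);

impl Datum for Scalar {
    fn get(&self) -> (&[i64], bool) {
        (std::slice::from_ref(&self.0), true)
    }
}

/// Datum-based checked addition: array + array, array + scalar, and so on.
fn add(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<Vec<i64>, String> {
    let (l, l_scalar) = lhs.get();
    let (r, r_scalar) = rhs.get();
    let len = match (l_scalar, r_scalar) {
        (true, true) => 1,
        (true, false) => r.len(),
        (false, true) => l.len(),
        (false, false) if l.len() == r.len() => l.len(),
        (false, false) => return Err("Length mismatch".to_string()),
    };
    (0..len)
        .map(|i| {
            let a = if l_scalar { l[0] } else { l[i] };
            let b = if r_scalar { r[0] } else { r[i] };
            a.checked_add(b)
                .ok_or_else(|| format!("Overflow: {} + {}", a, b))
        })
        .collect()
}

/// Older array-only entry point that forwards to the Datum-based kernel,
/// so tests written against it also exercise the new code path.
fn add_dyn(left: &Vec<i64>, right: &Vec<i64>) -> Result<Vec<i64>, String> {
    add(left, right)
}
```

In this model a test written against `add_dyn` covers `add` as well, which is the coverage argument being made in the reply above.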

tustvold · 2023-07-06T01:21:40Z

how well covered is this code

The existing tests of the dyn kernels which now call into this logic should give fairly good coverage, definitely could be improved though. Happy to do as a follow on

add_wrapping

This was an attempt to encourage the checked logic by default, I can change it back if you feel strongly

so they don't need to change all their code immediately

No changes are needed immediately, the old APIs are just deprecated. However, switching over will always require some changes as the types are different

alamb

This was an attempt to encourage the checked logic by default, I can change it back if you feel strongly

I don't feel strongly

No changes are needed immediately, the old APIs are just deprecated. However, switching over will always require some changes as the types are different

Other than the test coverage, I feel really good about this PR. Really nice work @tustvold

I suggest:

File a follow on ticket to track removing the old kernels
File a ticket to port the tests to the new kernels (will be required to remove the old kernels anyway)
Merge this PR

tustvold · 2023-07-07T12:08:36Z

Filed tickets #4480 #4481

github-actions bot added the arrow (Changes to the arrow crate) label Jun 29, 2023

tustvold mentioned this pull request Jun 29, 2023

Add Scalar/Datum abstraction (#1047) #4393

Merged

tustvold commented Jun 29, 2023

alamb reviewed Jun 29, 2023

This was referenced Jul 3, 2023

Incorrect Decimal Division Coercion apache/datafusion#6828

Closed

Update Arrow 45.0.0 And Datum Arithmetic, change Decimal Division semantics apache/datafusion#6832

Merged

Add Duration to ScalarValue apache/datafusion#6838

Merged

tustvold force-pushed the scalar-op branch from f402dd3 to 0395e3b Compare July 4, 2023 12:16

tustvold added the api-change (Changes to the arrow API) label Jul 4, 2023

tustvold commented Jul 4, 2023

Add Datum based arithmetic kernels (apache#3999)

9c461f7

tustvold force-pushed the scalar-op branch from 0395e3b to 9c461f7 Compare July 4, 2023 12:40

tustvold commented Jul 4, 2023

Fix benchmark

37eb45d

tustvold marked this pull request as ready for review July 4, 2023 14:10

This was referenced Jul 4, 2023

Use Specialization Instead of ScalarValue Binary Operations apache/datafusion#6842

Open

Support Date - Date kernel #4383

Closed

alamb reviewed Jul 5, 2023

alamb approved these changes Jul 6, 2023

This was referenced Jul 7, 2023

Port Tests from Deprecated Arithmetic Kernels #4480

Closed

Remove Deprecated Arithmetic Kernels #4481

Closed

Review feedback

ce42ed5

tustvold merged commit ee2c292 into apache:master Jul 8, 2023

tustvold mentioned this pull request Aug 22, 2023

Use Datum for string kernels #4632

Closed

tustvold mentioned this pull request Jan 1, 2024

Temporal Extract/Date Part Kernel #5266

Closed
