Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Datum based arithmetic kernels (#3999) #4465

Merged
merged 3 commits into from
Jul 8, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jun 29, 2023

Which issue does this PR close?

Closes #527
Closes #3999

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

  • divide_dyn will now return NaN for division by zero
  • Decimal arithmetic is always checked and sets the precision and scale automatically
  • Temporal arithmetic is always checked

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 29, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

arrow-arith/src/operation.rs Outdated Show resolved Hide resolved
arrow-arith/src/operation.rs Outdated Show resolved Hide resolved
arrow-arith/src/operation.rs Outdated Show resolved Hide resolved
]);

// unchecked
let result = subtract_dyn(&a, &b);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporal arithmetic is now always checked

fn test_f32_array_modulus_dyn_by_zero() {
let a = Float32Array::from(vec![1.5]);
let b = Float32Array::from(vec![0.0]);
modulus_dyn(&a, &b).unwrap();
let result = modulus_dyn(&a, &b).unwrap();
assert!(result.as_primitive::<Float32Type>().value(0).is_nan());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Floating point arithmetic now follows the IEEE 754 standard. My research showed databases to handle division by zero very inconsistently, some returning null and some an error. Broadly speaking it seems peculiar to special case division by zero, and not any of the other cases that can lead to Nan. Much like we do for total ordering of floats, I think we should just follow the floating point standard rather than trying to copy some subset of the databases in the wild. As a side benefit this is also significantly faster 😄

///
/// Overflow or division by zero will result in an error, with exception to
/// floating point numbers, which instead follow the IEEE 754 rules
pub fn rem(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<ArrayRef, ArrowError> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted for rem instead of mod to be consistent with the Rust nomenclature for this operation

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code looks (really) nice 👨‍🍳 👌

Tests?

  1. I didn't see any new tests -- how well covered is this code? Maybe we can port the old tests from arithmetic.rs?

API change

This is a non trivial API change, righ? I like that you have added deprecation notices to the kernels in arithmetic module

I think we should do some other things to help users:

  1. Add (deprecated) backwards compatibility definitions (for example, define an add_wrapping that calls add) so they don't need to change all their code immediately

  2. Consider writing up a "migration guide" that highlights the changes for users -- namely the new Datum abstraction and that the arithmetic kernels are now all dynamically dispatched (and renamed for consistency).

// create an array that actually has non-zero values at the invalid indices
let c = add(&a, &b).unwrap();
let validity = NullBuffer::new((1..=100).map(|x| x % 3 == 0).collect());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume these are updated because they were testing the sum kernel for two arrays rather than the aggregate sum (which is what is defined in this module)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to avoid using a now deprecated kernel

@@ -18,8 +18,10 @@
//! Arrow arithmetic and aggregation kernels

pub mod aggregate;
#[doc(hidden)] // Kernels to be removed in a future release
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be tracked in another ticket perhaps

// specific language governing permissions and limitations
// under the License.

//! Defines numeric kernels on PrimitiveArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! Defines numeric kernels on PrimitiveArray
//! Defines numeric kernels on [`PrimitiveArray`] such as [`add`]

let l = l.as_primitive::<T>();
let r = r.as_primitive::<T>();
let array: PrimitiveArray<T> = match op {
Op::AddWrapping | Op::Add => op!(l, l_s, r, r_s, l.add_wrapping(r)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add and AddWrapping call add_wrapping is because there is no such thing as "float overflow", right (when they exceed the range the turn into Nan or inf)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

let array: PrimitiveArray<T> = match (op, r.data_type()) {
(Op::Sub | Op::SubWrapping, Timestamp(unit, _)) if unit == &T::UNIT => {
let r = r.as_primitive::<T>();
return Ok(try_op_ref!(T::Duration, l, l_s, r, r_s, l.sub_checked(r)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do i read this as Op::SubWrapping will generate an error on underflow (only) for timestamp arithmetic?

Given the other kernels seem to use Op::SubWrapping and Op::Sub distinguish between non-erroring erroring variants, is there a reason for the discrepancy in timestamp behavior?

If this behavior will stay, I think it should be documented in add, add_wrapping, etc

Copy link
Contributor Author

@tustvold tustvold Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs on add_wrapping state that it only performs wrapping overflow for integers, i.e. not for termporal, decimal, etc... This is because the overflow behaviour is not very well defined, and at least in the temporal case has never existed. Let me know if the existing docs are insufficient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs say:

/// Perform lhs + rhs, wrapping on overflow for integers

I think it would improve the UX a lot of it explicitly calls out that wrapping doesn't apply to temporal (given they are stored as integers, we one could imagine people like myself not realizing they didn't wrap on overflow)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update to reference DataType::is_integer

arrow-arith/src/numeric.rs Show resolved Hide resolved
use crate::arity::{binary, try_binary};

/// Perform `lhs + rhs`, returning an error on overflow
pub fn add(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<ArrayRef, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than add maybe we could call this add_checked to:

  1. Make it explicit the user is choosing the checked variant
  2. Be consistent with https://docs.rs/num-traits/latest/num_traits/ops/wrapping/index.html ?

I don't feel super strongly about this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was less confusing to make wrapping the special case, as it only impacts integers

pub fn add_checked<T: ArrowNumericType>(
left: &PrimitiveArray<T>,
right: &PrimitiveArray<T>,
) -> Result<PrimitiveArray<T>, ArrowError> {
math_checked_op(left, right, |a, b| a.add_checked(b))
try_binary(left, right, |a, b| a.add_checked(b))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this call the new add_checked kernel directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically yes, but I was trying to avoid changing the behaviour of the generic kernels

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But aren't the tests in terms of the original kernels? If you don't call into the new kernels they aren't tested.

Or perhaps I am missing something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dyn kernels now call through to the datum kernels

@tustvold
Copy link
Contributor Author

tustvold commented Jul 6, 2023

how well covered is this code

The existing tests of the dyn kernels which now call into this logic should give fairly good coverage, definitely could be improved though. Happy to do as a follow on

add_wrapping

This was an attempt to encourage the checked logic by default, I can change it back if you feel strongly

so they don't need to change all their code immediately

No changes are needed immediately, the old APIs are just deprecated. However, switching over will always require some changes as the types are different

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an attempt to encourage the checked logic by default, I can change it back if you feel strongly

I don't feel strongly

No changes are needed immediately, the old APIs are just deprecated. However, switching over will always require some changes as the types are different

Other than the test coverage, I feel really good about this PR. Really nice work @tustvold

I suggest:

  1. File a follow on ticket to track removing the old kernels
  2. FIle a ticket to port the tests to the new kernels (will be required to remove the old kernels anyways)
  3. Merge this PR

@tustvold
Copy link
Contributor Author

tustvold commented Jul 7, 2023

Filed tickets #4480 #4481

@tustvold tustvold merged commit ee2c292 into apache:master Jul 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improved Temporal and Decimal Arithmetic Add temporal kernels for arithmetic with timestamps and durations
2 participants