-
Notifications
You must be signed in to change notification settings - Fork 807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Datum based arithmetic kernels (#3999) #4465
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
]); | ||
|
||
// unchecked | ||
let result = subtract_dyn(&a, &b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temporal arithmetic is now always checked
fn test_f32_array_modulus_dyn_by_zero() { | ||
let a = Float32Array::from(vec![1.5]); | ||
let b = Float32Array::from(vec![0.0]); | ||
modulus_dyn(&a, &b).unwrap(); | ||
let result = modulus_dyn(&a, &b).unwrap(); | ||
assert!(result.as_primitive::<Float32Type>().value(0).is_nan()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Floating point arithmetic now follows the IEEE 754 standard. My research showed databases to handle division by zero very inconsistently, some returning null and some an error. Broadly speaking it seems peculiar to special case division by zero, and not any of the other cases that can lead to Nan
. Much like we do for total ordering of floats, I think we should just follow the floating point standard rather than trying to copy some subset of the databases in the wild. As a side benefit this is also significantly faster 😄
/// | ||
/// Overflow or division by zero will result in an error, with exception to | ||
/// floating point numbers, which instead follow the IEEE 754 rules | ||
pub fn rem(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<ArrayRef, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opted for rem instead of mod to be consistent with the Rust nomenclature for this operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code looks (really) nice 👨🍳 👌
Tests?
- I didn't see any new tests -- how well covered is this code? Maybe we can port the old tests from arithmetic.rs?
API change
This is a non trivial API change, righ? I like that you have added deprecation notices to the kernels in arithmetic
module
I think we should do some other things to help users:
-
Add (deprecated) backwards compatibility definitions (for example, define an
add_wrapping
that callsadd)
so they don't need to change all their code immediately -
Consider writing up a "migration guide" that highlights the changes for users -- namely the new
Datum
abstraction and that the arithmetic kernels are now all dynamically dispatched (and renamed for consistency).
// create an array that actually has non-zero values at the invalid indices | ||
let c = add(&a, &b).unwrap(); | ||
let validity = NullBuffer::new((1..=100).map(|x| x % 3 == 0).collect()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume these are updated because they were testing the sum
kernel for two arrays rather than the aggregate sum
(which is what is defined in this module)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was to avoid using a now deprecated kernel
@@ -18,8 +18,10 @@ | |||
//! Arrow arithmetic and aggregation kernels | |||
|
|||
pub mod aggregate; | |||
#[doc(hidden)] // Kernels to be removed in a future release |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be tracked in another ticket perhaps
arrow-arith/src/numeric.rs
Outdated
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
//! Defines numeric kernels on PrimitiveArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! Defines numeric kernels on PrimitiveArray | |
//! Defines numeric kernels on [`PrimitiveArray`] such as [`add`] |
let l = l.as_primitive::<T>(); | ||
let r = r.as_primitive::<T>(); | ||
let array: PrimitiveArray<T> = match op { | ||
Op::AddWrapping | Op::Add => op!(l, l_s, r, r_s, l.add_wrapping(r)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add and AddWrapping call add_wrapping
is because there is no such thing as "float overflow", right (when they exceed the range the turn into Nan or inf)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
let array: PrimitiveArray<T> = match (op, r.data_type()) { | ||
(Op::Sub | Op::SubWrapping, Timestamp(unit, _)) if unit == &T::UNIT => { | ||
let r = r.as_primitive::<T>(); | ||
return Ok(try_op_ref!(T::Duration, l, l_s, r, r_s, l.sub_checked(r))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do i read this as Op::SubWrapping will generate an error on underflow (only) for timestamp arithmetic?
Given the other kernels seem to use Op::SubWrapping
and Op::Sub
distinguish between non-erroring erroring variants, is there a reason for the discrepancy in timestamp behavior?
If this behavior will stay, I think it should be documented in add
, add_wrapping
, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs on add_wrapping state that it only performs wrapping overflow for integers, i.e. not for termporal, decimal, etc... This is because the overflow behaviour is not very well defined, and at least in the temporal case has never existed. Let me know if the existing docs are insufficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs say:
/// Perform
lhs + rhs
, wrapping on overflow for integers
I think it would improve the UX a lot of it explicitly calls out that wrapping doesn't apply to temporal (given they are stored as integers, we one could imagine people like myself not realizing they didn't wrap on overflow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update to reference DataType::is_integer
use crate::arity::{binary, try_binary}; | ||
|
||
/// Perform `lhs + rhs`, returning an error on overflow | ||
pub fn add(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<ArrayRef, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than add
maybe we could call this add_checked
to:
- Make it explicit the user is choosing the checked variant
- Be consistent with https://docs.rs/num-traits/latest/num_traits/ops/wrapping/index.html ?
I don't feel super strongly about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was less confusing to make wrapping the special case, as it only impacts integers
pub fn add_checked<T: ArrowNumericType>( | ||
left: &PrimitiveArray<T>, | ||
right: &PrimitiveArray<T>, | ||
) -> Result<PrimitiveArray<T>, ArrowError> { | ||
math_checked_op(left, right, |a, b| a.add_checked(b)) | ||
try_binary(left, right, |a, b| a.add_checked(b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this call the new add_checked
kernel directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically yes, but I was trying to avoid changing the behaviour of the generic kernels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But aren't the tests in terms of the original kernels? If you don't call into the new kernels they aren't tested.
Or perhaps I am missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dyn kernels now call through to the datum kernels
The existing tests of the dyn kernels which now call into this logic should give fairly good coverage, definitely could be improved though. Happy to do as a follow on
This was an attempt to encourage the checked logic by default, I can change it back if you feel strongly
No changes are needed immediately, the old APIs are just deprecated. However, switching over will always require some changes as the types are different |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an attempt to encourage the checked logic by default, I can change it back if you feel strongly
I don't feel strongly
No changes are needed immediately, the old APIs are just deprecated. However, switching over will always require some changes as the types are different
Other than the test coverage, I feel really good about this PR. Really nice work @tustvold
I suggest:
- File a follow on ticket to track removing the old kernels
- FIle a ticket to port the tests to the new kernels (will be required to remove the old kernels anyways)
- Merge this PR
Which issue does this PR close?
Closes #527
Closes #3999
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?