Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(rust): Arithmetics refactor and cleanup (part 1) #7537

Closed
wants to merge 14 commits into from

Conversation

aldanor
Copy link
Contributor

@aldanor aldanor commented Mar 14, 2023

This is a preliminary refactor before starting to integrate decimal arithmetics in.

Currently, there's the PolarsNumericType which (unfortunately) plays way too many roles – from question like "is it a simple numerical thing we can just copy" to assumptions like "if it's a 'numeric' type, then we can just add and subtract the underlying physical values". This logic breaks for decimals, because i128 is a physical type, however PrimitiveArray<i128> does not use native arithmetics (moreover, we'd want to reimplement those arithmetics ops for it so as to allow some flexibility in maintaining scale/precision etc). So some rearrangement is unavoidable, and I figured I should push what I currently have in progress instead of keeping it so as not to suffer from constant rebasing.

A few changes:

  • NumOpsDispatch / NumOpsDispatchInner are removed completely
  • ArrayArithmetics has been removed (will implement respective traits for decimals manually at a later time)
  • There's a NativeArithmetics bound being used now for native arithmetics ops (may need to add our own 1:1 trait when implementing decimals due to potential shadowing problems)
  • Added Try<...> traits to unify fallible ops
  • Some unwraps/panics in fallible methods replaced with errors
  • Added PolarsFloatNative, this helps simplify trait bound copy/paste
  • Implementations of chunked array and series arithmetics are now unified via macros (all except checked div etc)
  • For chunked arrays, all array vs scalar ops will now use $op_scalar() methods from arrow2 for native arithmetics (wasn't the case previously)
  • For owned array vs array arithmetics, there was some env::var("ASSIGN") check which seemed to have been something temporary and has been removed
  • Renamed arithmetic method from SeriesTrait to series_$op() for clarity
  • Added ChunkedArray::with_name() helper
  • All macros in polars-error use $crate:: now
  • Downcasted iteration is now done via a trait instead of copy/paste

@github-actions github-actions bot added refactor Code improvements that do not change functionality rust Related to Rust Polars labels Mar 14, 2023
@aldanor aldanor force-pushed the feature/arithmetics-refactor branch from c854f06 to ca2bf80 Compare March 14, 2023 02:20
Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This touches a lot of old code that indeed could use some refactoring/cleaning up.

One important question: You did not change any semantics right? Especially the owned arithmetic is an optimization I don't want to break and only works if the ref count of Arc<ChunkedArray> is 1.

@@ -246,7 +246,7 @@ fn pivot_impl(
debug_assert_eq!(value_agg_phys.len(), row_locations.len());

let mut cols = if value_agg_phys.dtype().is_numeric() {
macro_rules! dispatch {
macro_rules! pivot {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should call this pivot as we don't pivot here. dispatch is clearer to me that we simply have a utility macro to dispatch the inner function.

Copy link
Contributor Author

@aldanor aldanor Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. This is purely for error reporting, it will include name of the macro if it fails for whatever reason. I can change it back to dispatch or pivot_dispatch or any better name if you have any ideas.

Btw (1): here's another case when you check is_numeric(). Is decimal numeric or not? :) For pivoting purposes - sure! For many other purposes, no... 🤦

Btw (2), it just occurred to me, as this is yet another point where you check if something is numeric in point 1 but use this implicit assumption in point 2, wouldn't it be cleaner to instead have

fn as_numeric(&self) -> Option<NumericChunked<'_>> {
    match self.dtype() { ... }
}
enum NumericChunked<'a> {
    UInt8(&'a UInt8Chunked),
    ...
}

Yet another case where we can potentially dodge implicit assumptions and unwraps by using type system? Then errors simply can't occur and no unwraps are needed because you can do

if let Some(ca) = agg_phys.as_numeric() {
    ... // do stuff, it can't fail now
}

let rhs = rhs.struct_().unwrap();
let s_fields = s.fields();
) -> PolarsResult<Series> {
let lhs = lhs.struct_().unwrap_or_else(|_| unreachable!());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of this over unwrap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was a marker for myself while reading this, basically saying this is only called with struct dtypes.

This is one of my biggest gripes with polars codebase 🙂 When staring at a particular piece of code you have no idea what assumptions are currently being used here, you see a bunch of unwraps and have no idea which of them CAN potentially trigger and which cannot ever happen, which are unhandled errors and which are completely unreachable. You have to scan all usages of a particular thing trying to infer it.

I think a lot of these cases you can convert to traits/errors or just restructure the code. E.g.:

  • This function should probably take &StructChunked, &StructChunked and then no unwraps are needed here
  • And upstream instead of matching (lhs.dtype(), rhs.dtype()) = (Struct(_), Struct(_)) you could instead do (lhs.struct_(), rhs.struct_()) = (Some(_), Some(_)), so no unwraps are needed again. No implicit assumptions either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Maybe just put an unwrap with a comment why we can unwrap or expect("impl error")?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of my biggest gripes with polars codebase

I agree we can do better here. All unwraps should be considered asserts, but it isn't always the case yet.

@aldanor
Copy link
Contributor Author

aldanor commented Mar 14, 2023

Thanks. This touches a lot of old code that indeed could use some refactoring/cleaning up.

One important question: You did not change any semantics right? Especially the owned arithmetic is an optimization I don't want to break and only works if the ref count of Arc<ChunkedArray> is 1.

I don't think I've changed any semantics there. As you can see, all of the owned arithmetic logic. Here's a gist of recursive expansion of Add / TryAdd macro for ChunkedArray so you can see more easily what's going on: https://gist.github.com/aldanor/0f612c7ee267bf6cb70b9d6d952a1997

A few potential changes:

  • There was some check for ASSIGN env var when doing owned vs owned ops; I think it was temporary as it was only defined for a few ops and not all of them, and there were no other references in the codebase so it has been removed.
  • Ops for references were implemented as Add for &ChunkedArray<T>. This is TOO restrictive because it can be unwrapped as:
Add for &ChunkedArray<T>
Add<Self> for &ChunkedArray<T>
Add<&'a ChunkedArray<T>> for &'a ChunkedArray<T>

which is basically requiring matching lifetimes on LHS and RHS of all arithmetic ops if you're dealing with references. This is overly restrictive and a better way is to explicitly defined 'lhs, 'rhs lifetimes.

  • Some operations were using arrow2::$op_scalar, some weren't. Now all of them do.
  • Another thing is this: for "array $op scalar" operations where array is owned, there's two options (the currently used one is option 2, but could specialize to 1 for div/rem I guess):
    • use $op_scalar() – which would be faster for div/rem, but requires an allocation
    • use normal $op but no allocation (that's the way it currently does it); will be slower for div/rem but always avoid an allocation)

@ritchie46
Copy link
Member

Another thing is this: for "array $op scalar" operations where array is owned, there's two options (the currently used one is option 2, but could specialize to 1 for div/rem I guess):

Could you point to the snippet you are referring?

@ritchie46
Copy link
Member

ritchie46 commented Mar 14, 2023

Ops for references were implemented as Add for &ChunkedArray<T>. This is TOO restrictive because it can be unwrapped as:

I don't think this is the case. This compiles on master. I believe the longer living lifetime can always match the shortest one.

fn try_different_lifetimes<'a>(a: &'a Int32Chunked, b: &'static Int32Chunked) -> Int32Chunked {
    a + b
}
fn try_different_lifetimes2<'a, 'b>(a: &'a Int32Chunked, b: &'b Int32Chunked) -> Int32Chunked {
    a + b
}

fn try_different_lifetimes3<'a>(a: &'a Int32Chunked) -> Int32Chunked {
    // this lifteme is always shorter than 'a as it is local
    let ca = &Int32Chunked::full_null("", 1);
    ca + a
}

@ritchie46
Copy link
Member

There was some check for ASSIGN env var when doing owned vs owned ops; I think it was temporary as it was only defined for a few ops and not all of them, and there were no other references in the codebase so it has been removed.

I think this was a remnant of earlier testing. This can be removed indeed. 👍

@ritchie46
Copy link
Member

Any updates @aldanor? I think there is some great stuff in here.

@aldanor
Copy link
Contributor Author

aldanor commented Mar 23, 2023

@ritchie46 apologies, was busy with work, will get back re: this PR tomorrow to address the comments and rebase

@aldanor
Copy link
Contributor Author

aldanor commented Jun 1, 2023

Ok. My fault for not taking enough effort to get this merged :/ (although it was 99% there I thi) Now that #9123 has been merged, I'm not entirely sure it's rebase-able...

However, if there's still interest in pushing it through (I think it does clean up the whole array-arithmetics mess by quite a bit), I can try a rebase locally to see if it's possible at all.

@ritchie46
Copy link
Member

I'm not entirely sure it's rebase-able...

Me neither. :/ I recall cleaning up a bit on the arithmetics as well.

@svaningelgem
Copy link
Contributor

@stinodego > As you requested to be pinged.
This goes a bit above my head, but a cleanup is always nice. Though as @ritchie46 said: it's probably not rebase-able.
@aldanor : what do you think? When you look over the codebase, is there still as much klunkiness in there as in March (ie: does it still warrant the refactoring?)

@aldanor
Copy link
Contributor Author

aldanor commented Oct 11, 2023

@svaningelgem

I think it's pretty sad it wasn't merged in back in the day - because it did clean up tons of inconsistencies and weird points in the codebase (also simplifying NumDispatch etc, unifying fallible ops, unifying downcasting, converting some unwraps to proper errors etc). This was a preliminary step to improving arithmetics on decimals, so there's that.

But - I don't think any of the above has been fully resolved (maybe a few things here and there), perhaps it would make sense to break this apart into more sizeable PRs and resubmit them, idk. Rebasing would be pretty tough indeed due to some refactoring that happened inbetween...

(I'll double check a few things in the coming days, re: the current state of the codebase)

@stinodego
Copy link
Member

I don't think there's much use in keeping this PR open any longer. I am not sure what the current state of this part of the code is, but if you feel like (some of) the improvements made here would still be applicable, you're very welcome to open a new PR! Indeed it might help to split it up in smaller parts if at all possible.

@stinodego stinodego closed this Jan 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Code improvements that do not change functionality rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants