Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust, python): Add ddof option to rolling_var and rolling_std #8957

Merged
merged 20 commits into from
Jun 5, 2023

Conversation

magarick
Copy link
Contributor

This is a first attempt at adding a ddof parameter to rolling_var/rolling_std as per #7999
To do so, it introduces rolling function parameters that get passed around. I'm not sure this is the greatest way to do it...

@@ -12,9 +12,9 @@ impl<
T: NativeType + IsFloat + std::iter::Sum + AddAssign + SubAssign + Div<Output = T> + NumCast,
> RollingAggWindowNoNulls<'a, T> for MeanWindow<'a, T>
{
fn new(slice: &'a [T], start: usize, end: usize) -> Self {
fn new(slice: &'a [T], start: usize, end: usize, params: Option<RollingFnParams>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a Box<dyn Any>. Then every function can downcast to its specific arguments type. In this PR's case no function will do that except for var and std.

If other functions need different arguments they can pass their own type.

Copy link
Contributor Author

@magarick magarick May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking a look. I'm relatively new to Rust so what makes that better than doing it this way?
To be extra clear, the type definition is

pub enum RollingFnParams {
    RollingVarParams { ddof: u8 },
}

and I figured other variants could be added later but that explicitly defining what each function required was clearer and could be handled by a match where appropriate while preventing arbitrary input from going to these functions.
With the way you're suggesting, would I make RollingVarParams and any future param sets their own types? Mostly I'm curious about what issues you see here so I can learn something. Is it just pointless to define a type this vague?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On mobile:

Ddof doesn't make any sense to pass to other rolling function like a rolling sum. That's why I propose using dyn Any. Then every function can only accept its own parameter type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for replying so quickly.
In my limited understanding, dyn Any would have the same problems as what I'm doing but even worse. Nothing would stop someone from passing anything at all to a rolling function. With the way it is now, at least there are defined enum variants for what kind of parameters each rolling function should accept and they have to be set ahead of time. Yeah, you could pass MeanParams (if they existed) to variance or VarParams to mean but it's up to the individual kernel to decide what to do just like with dyn Any. Is it better because I can then give each window an associated type through it's trait and cast the Any to that directly beforehand?

The only alternative I can think of that doesn't leave it up to the window itself is to have a separate type for parameters to each rolling function and put a generic type parameter on RollingOptions. But even then I think the function parameter types would have to be related somehow so they can pass through rolling_agg and have the same signature. And I don't know if this would work and it seems complicated.

I'm not trying to give you a hard time here. Only trying to learn a little. It doesn't seem like that big of a change to do it with Any so I'm happy to make the change if that's what you want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that the Box<dyn Any> is in our private API. Not what we expose publicly for the user facing API.

The users will pass RollingVarArgs or a RollingMeanArgs and we will convert it to a Box<Any>. As I said, I don't want arguments needed for variance being passed to rolling_mean that doesn't make much sense, so we need some dynamic typing, hence the dyn Any. Think of it as **kwargs in python, where a specific implemenation of a function knows which keyword arguments to take out of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like Box<dyn Any> is even going to work here, at least if we're passing the params down to the actual implementation through RollingOptions and friends because those types derive Clone and the size of a Box<dyn Any> isn't known at compile time. At least that's the error I'm getting. Maybe you can get around it with Rc or Arc but after looking into it I think ultimately it confuses the issue and still ends up being worse than what I'm proposing.

  1. We have to get the additional function parameters all the way down many layers of indirection. Since it seems to make the most sense to put these in RollingOptions they have to be wrapped in something so everything typechecks. There's also various dispatchers like rolling_agg which expect specific signatures of the functions they are passed.
  2. RollingAggWindow also needs to accept the same type for each operation, so whatever parameters can only be unwrapped here. Whether you have an any or a generic type representing rolling args, this is the only place you can really check that you have what you need. And it's easier to check that you have the right enum variant than messing around with Any.
  3. We do know ahead of time what sets of arguments are valid for which rolling operations. The enum forces us to be clear about it. Yes, it's internal, and that's why the user-level functions will just take individual arguments, construct the appropriate variant and pass it along.
  4. It's even recommended in the Rust book to use this pattern to group types together when you know all of the possibilities ahead of time.

So far I haven't seen anything to make me think that what you're proposing is better, or even easier to implement.
But if you really want to do it with run-time dynamic typing rather than just defining a few types of our own, I can give it a try because I do want to help and I believe in this project.

Finally, as a concrete example of how this would work with multiple types, you could have the following, which I believe would let the rolling quantile functions no longer need a special case.

pub enum RollingFnParams {
    RollingVarParams { ddof: u8 },
    RollingQuantileParams { quantile: f64, interp: QuantileInterpolOptions}
}

Then all you need to do is have new for each window type do a match for the appropriate variant. If you wanted to allow the rolling windows to take Any then the handling of different parameters for each rolling function would become far more ad-hoc and error-prone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rolling variance args shouldn't be in an enum with rolling quantile args.

What I mean is this:

use std::any::Any;
use std::sync::Arc;

/// Public type
pub struct RollingVarOptions{
    ddof: u8
}

/// Public type
pub struct RollingMeanOptions{
    mean_args: usize
}

/// Public function
pub fn rolling_var(args: RollingVarOptions) {
    let args = Arc::new(args) as Arc<dyn Any>;
    rolling_var_private(args)
}

/// Public function
pub fn rolling_mean(args: RollingMeanOptions) {
    let args = Arc::new(args) as Arc<dyn Any>;
    rolling_var_private(args)
}

/// Private functions
fn rolling_mean_private(args: Arc<dyn Any>) {
    let args = args.downcast_ref::<RollingMeanOptions>().unwrap();
    // implementation work
}

/// Private functions
fn rolling_var_private(args: Arc<dyn Any>) {
    let args = args.downcast_ref::<RollingVarOptions>().unwrap();
    // implementation work
}

This should work perfectly and have clear single function structs for the end user. It also reduces binary bloat which is important in this crate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it took a little fiddling around with Any but it seems to do what it's supposed to now.
A few tests started failing due to numerical precision issues. They're being compared exactly, seemingly as a hack around pytest not being able to handle None in numeric outputs (though it supports NaN...). I could either change the values to what they are now or change the tests to check None and numerical precision of floats separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it took a little fiddling around with Any but it seems to do what it's supposed to now.

Nice!

I could either change the values to what they are now or change the tests to check None and numerical precision of floats separately.

No preference on my side. It is up to you. :)

@magarick magarick force-pushed the add-rollvar-ddof branch from faecd9a to 37e4011 Compare May 31, 2023 08:08
Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks great @magarick. I have a small nit that could make the function signature a little bit easier to grok.

@@ -12,9 +12,9 @@ impl<
T: NativeType + IsFloat + std::iter::Sum + AddAssign + SubAssign + Div<Output = T> + NumCast,
> RollingAggWindowNoNulls<'a, T> for MeanWindow<'a, T>
{
fn new(slice: &'a [T], start: usize, end: usize) -> Self {
fn new(slice: &'a [T], start: usize, end: usize, params: Option<Arc<dyn Any + Sync + Send>>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can clean up the signature a little bit by defining:

type DynArgs = Option<Arc<dyn Any + Sync + Send>>

@magarick magarick force-pushed the add-rollvar-ddof branch from 2bf1854 to bc375ac Compare May 31, 2023 09:50
@magarick
Copy link
Contributor Author

OK. I added the new type and tidied things up. I'm not sure what to do about the doctests failing. The output doesn't look wrong and I didn't change that. Also how about the "too many arguments" complaints from rust. I can't really do anything about that since this PR was meant to add new arguments...

@ritchie46
Copy link
Member

ritchie46 commented May 31, 2023

Also how about the "too many arguments" complaints from rust.

You can mark the violating function with #[allow(clippy::too_many_arguments)].

I'm not sure what to do about the doctests failing

These are fixed in main. A rebase should be sufficient.

@magarick magarick changed the title Add rollvar ddof feat(rust, python): Add ddof option to rolling_var and rolling_std May 31, 2023
@magarick magarick marked this pull request as ready for review May 31, 2023 18:55
@github-actions github-actions bot added the enhancement New feature or an improvement of an existing feature label May 31, 2023
@magarick magarick requested a review from MarcoGorelli as a code owner May 31, 2023 18:55
@github-actions github-actions bot added python Related to Python Polars rust Related to Rust Polars labels May 31, 2023
@magarick
Copy link
Contributor Author

There's also the question of what to do if ddof > 1 and the window size or count of non-nulls is too small. Unlikely, but it's a possible annoying edge case especially for nulls. In numpy, if ddof >= the array length it returns inf. Polars series will return inf if ddof == len(s) and None if ddof > len(s) (at least in Python).
Making it infinite in either case seems reasonable, but rolling var/sd aren't constrained to taking float types (though in practice it seems they are).

@magarick
Copy link
Contributor Author

magarick commented Jun 1, 2023

Other than the above question about how to handle edge cases where people end up with windows shorter than ddof I think this is ready. Any opinions on how to handle or should we just ignore it for now, assuming this is unlikely to be an issue?

@ritchie46
Copy link
Member

Can we follow numpy and also return Inf in that case?

@magarick
Copy link
Contributor Author

magarick commented Jun 2, 2023

Can we follow numpy and also return Inf in that case?

If we guarantee that the functions can only take floats then yes. It looks like int inputs get cast to floats anyway. So if you're ok adding the Float trait as a type constraint then I'll do that.

@ritchie46
Copy link
Member

If we guarantee that the functions can only take floats then yes. It looks like int inputs get cast to floats anyway. So if you're ok adding the Float trait as a type constraint then I'll do that.

Yes, that's ok.

@magarick
Copy link
Contributor Author

magarick commented Jun 2, 2023

In that case, I think this will do it.

@magarick magarick force-pushed the add-rollvar-ddof branch from f2f7399 to bf279ef Compare June 3, 2023 22:44
Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @magarick. Looks great. 👍

@ritchie46 ritchie46 merged commit 0dda45f into pola-rs:main Jun 5, 2023
@magarick magarick deleted the add-rollvar-ddof branch June 5, 2023 05:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants