Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust, python): rolling_groupy was returning incorrect results when offset was positive #9082

Merged
merged 6 commits into from
May 29, 2023

Conversation

MarcoGorelli
Copy link
Collaborator

@MarcoGorelli MarcoGorelli commented May 28, 2023

There's room for optimisation, but this at least closes #9077 (besides, rolling windows looking into the future don't strike me as a common use-case)

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels May 28, 2023
assert_eq!(groups[0], [0, 5]);
assert_eq!(groups[1], [1, 5]);
assert_eq!(groups[2], [2, 5]);
assert_eq!(groups[3], [3, 5]);
assert_eq!(groups[4], [4, 5]);
assert_eq!(groups[5], [5, 4]);
assert_eq!(groups[6], [6, 3]);
assert_eq!(groups[7], [7, 2]);
assert_eq!(groups[8], [8, 0]);
assert_eq!(groups[0], [1, 4]); // (00:00, 02:00]
Copy link
Collaborator Author

@MarcoGorelli MarcoGorelli May 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this test was right to begin with

if the dates are:
[00:00, 00:30, ..., 3:30, 4:00]
and the period is 2h and the offset is 0h, then the first window is (00:00, 02:00]. So that includes points: [00:30, 01:00, 01:30, 02:00], i.e. the starting offset is 1, and the length is 4

@MarcoGorelli MarcoGorelli marked this pull request as ready for review May 28, 2023 19:34
@@ -355,7 +355,7 @@ pub(crate) fn groupby_values_iter_partial_lookbehind<'a>(
let b = Bounds::new(lower, upper);

for &t in &time[lagging_offset..] {
if b.is_member(t, closed_window) || lagging_offset == i {
if b.is_member(t, closed_window) || (offset.negative && lagging_offset == i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we branch on offset.negative outside the loop?

Then we have two loops that compute the lagging_offset. The loops are super small, so almost no code bloat.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we actually get here if offset.negative isn't set? 🤔

Copy link
Collaborator Author

@MarcoGorelli MarcoGorelli May 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I can tell, only in this test

let offset = Duration::parse("0h");
let g0 = groupby_values_iter_full_lookahead(
period,
offset,
&dates,
closed_window,
tu,
NO_TIMEZONE.copied(),
0,
None,
)
.collect::<PolarsResult<Vec<_>>>()
.unwrap();
let g1 = groupby_values_iter_partial_lookbehind(
period,
offset,
&dates,
closed_window,
tu,
NO_TIMEZONE.copied(),
)
.collect::<PolarsResult<Vec<_>>>()
.unwrap();
assert_eq!(g0, g1);

maybe it's better to just remove that test then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this part of the test, reverted the change to groupby_values_iter_partial_lookbehind, and expanded the tests on the python side by doing quite a bit more parametrisation

@@ -400,7 +400,16 @@ pub(crate) fn groupby_values_iter_full_lookahead<'a>(

let b = Bounds::new(lower, upper);

debug_assert!(i < time.len());
// find starting point of window
for &t in &time[i..] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we have a nested loop now.

Can we maybe keep the old behavior for offset= 0?

We can branch on the offset=? case and have a loop for both cases.

Or don't we hit this path on the offset=0 case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, thanks - yes, but only if the window is closed on the left (if it's open on the left, then time[i] shouldn't be included in the group)

@@ -398,7 +404,21 @@ pub(crate) fn groupby_values_iter_full_lookahead(

let b = Bounds::new(lower, upper);

debug_assert!(i < time.len());
// find starting point of window
match inner_loop_needed {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still adds a branch.

Let's do the branch completely before iter().enumerate(). That makes all paths optimal.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so something like

    match inner_loop_needed {
        true => {
            time[start_offset..upper_bound]
                .iter()
                .enumerate()
                .map(move |(mut i, lower)| {
                    i += start_offset;
                    let lower = add(&offset, *lower, tz.as_ref())?;
                    let upper = add(&period, lower, tz.as_ref())?;

                    let b = Bounds::new(lower, upper);

                    // find starting point of window
                    for &t in &time[i..] {
                        if b.is_member(t, closed_window) {
                            break;
                        }
                        i += 1;
                    }
                    if i >= time.len() {
                        return Ok((i as IdxSize, 0));
                    }
                    let slice = unsafe { time.get_unchecked(i..) };
                    let len = slice.partition_point(|v| b.is_member(*v, closed_window));

                    Ok((i as IdxSize, len as IdxSize))
                })
        }
        false => time[start_offset..upper_bound]
            .iter()
            .enumerate()
            .map(move |(mut i, lower)| {
                i += start_offset;
                let lower = add(&offset, *lower, tz.as_ref())?;
                let upper = add(&period, lower, tz.as_ref())?;

                let b = Bounds::new(lower, upper);

                debug_assert!(i < time.len());
                let slice = unsafe { time.get_unchecked(i..) };
                let len = slice.partition_point(|v| b.is_member(*v, closed_window));

                Ok((i as IdxSize, len as IdxSize))
            }),
    }
}

?

This would give

    = note: expected struct `std::iter::Map<std::iter::Enumerate<std::slice::Iter<'_, _>>, [closure@/home/marcogorelli/polars-dev/polars/polars-time/src/windows/groupby.rs:401:22: 401:43]>`
               found struct `std::iter::Map<std::iter::Enumerate<std::slice::Iter<'_, _>>, [closure@/home/marcogorelli/polars-dev/polars/polars-time/src/windows/groupby.rs:427:18: 427:39]>`
    = note: no two closures, even if identical, have the same type
    = help: consider boxing your closure and/or using it as a trait object
help: you could change the return type to be a boxed trait object
    |
381 | ) -> Box<dyn Iterator<Item = PolarsResult<(IdxSize, IdxSize)>> + TrustedLen + '_> {
    |      ~~~~~~~                                                                    +
help: if you change the return type to expect trait objects, box the returned expressions
    |
398 ~             Box::new(time[start_offset..upper_bound]
399 |                 .iter()
  ...
421 |                     Ok((i as IdxSize, len as IdxSize))
422 ~                 }))
423 |         }
424 ~         false => Box::new(time[start_offset..upper_bound]
425 |             .iter()
  ...
438 |                 Ok((i as IdxSize, len as IdxSize))
439 ~             })),
    |

, is that OK? Wouldn't the perf implications of using Box be worse than just matching a boolean?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, then we can do the branch on the caller location. Meaning we split into 2 functions. We already have done this earlier. We have a full_lookahead, a partial_lookbehind and a full_lookbehind. So now we need 1 extra, I believe?

@ritchie46
Copy link
Member

Thanks @MarcoGorelli. Looking great now. 👍

@MarcoGorelli
Copy link
Collaborator Author

cool, thanks for your review!

@ritchie46 ritchie46 merged commit e03183f into pola-rs:main May 29, 2023
c-peters pushed a commit to c-peters/polars that referenced this pull request Jul 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby_rolling bug for positive values of offset
2 participants