fix(rust, python): groupby rolling with negative offset #9428

Merged

Conversation

MarcoGorelli
Collaborator

@MarcoGorelli MarcoGorelli commented Jun 18, 2023

closes #9250

Adding

  • a big parametric test, checking against a pure Python implementation (simple but slow)
  • a unit test with hand-calculated (pen and paper) results
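A slow reference implementation of the kind described above could be sketched as follows. This is not the PR's test code: the function name, the sample data, and the assumption that the window for a row at time t is (t + offset, t + offset + period] when closed='right' are all illustrative.

```python
from datetime import datetime, timedelta


def naive_rolling_sum(times, values, period, offset, closed="right"):
    """O(n^2) reference: for each t, sum the values whose timestamp is in the window.

    Assumption: the window starts at t + offset and spans `period`.
    """
    out = []
    for t in times:
        start = t + offset
        end = start + period
        total = 0
        for ts, v in zip(times, values):
            # Endpoint inclusion depends on which sides are closed.
            left_ok = ts >= start if closed in ("left", "both") else ts > start
            right_ok = ts <= end if closed in ("right", "both") else ts < end
            if left_ok and right_ok:
                total += v
        out.append(total)
    return out


# Hypothetical data: one row per day, 2020-01-01 through 2020-01-05.
times = [datetime(2020, 1, d) for d in range(1, 6)]
values = [1, 2, 3, 4, 5]

# The case from the linked issue: period=2d, offset=-3d.
issue_case = naive_rolling_sum(
    times, values, period=timedelta(days=2), offset=timedelta(days=-3)
)
# The default case: offset == -period.
default_case = naive_rolling_sum(
    times, values, period=timedelta(days=2), offset=timedelta(days=-2)
)
```

Because the sketch checks every row against every window, it is easy to verify by hand, which is what makes it useful as an oracle for a parametric test.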

Perf implications:

  • for offset == period (the default), nothing changes
  • for offset > period, groupby_values_iter_window_behind_t is now used instead of groupby_values_iter_full_lookbehind. It is documented as slower, but at least it is correct, and this strikes me as a rarer use case than the default

@MarcoGorelli MarcoGorelli force-pushed the fix-groupby-rolling-with-offset branch from eb42eb5 to 66dd292 Compare June 18, 2023 18:21
@@ -525,49 +525,47 @@ pub fn groupby_values(

// we have a (partial) lookbehind window
if offset.negative {
    if offset.duration_ns() >= period.duration_ns() {
Collaborator Author

@MarcoGorelli MarcoGorelli Jun 18, 2023

Here, previously there were three paths:

  • offset >= period, offset < period * 2: groupby_values_iter_full_lookbehind
  • offset >= period, offset >= period * 2: groupby_values_iter_window_behind_t
  • offset < period: groupby_values_iter_partial_lookbehind

I don't get why there's the < period * 2 check. Looks like it comes from https://github.com/pola-rs/polars/pull/4010/files, but I don't see why

Anyway, groupby_values_iter_full_lookbehind assumes t is at the end of the window (i.e. period == offset), so changing the logic to

  • offset == period: groupby_values_iter_full_lookbehind
  • offset > period: groupby_values_iter_window_behind_t (slower, but this is quite unusual anyway?)
  • offset < period: groupby_values_iter_partial_lookbehind

fixes all the test cases
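The two sets of rules listed above can be compared side by side. The sketch below is a Python transcription of the rules as described in this comment, not the Rust source; the helper names `old_path` and `new_path` are hypothetical, and `offset_ns` stands for the magnitude of a negative offset.

```python
DAY_NS = 86_400 * 10**9  # nanoseconds in one day


def old_path(period_ns: int, offset_ns: int) -> str:
    """Path selection before the change, as described in the comment."""
    if offset_ns >= period_ns:
        if offset_ns < period_ns * 2:
            return "groupby_values_iter_full_lookbehind"
        return "groupby_values_iter_window_behind_t"
    return "groupby_values_iter_partial_lookbehind"


def new_path(period_ns: int, offset_ns: int) -> str:
    """Path selection after the change, as described in the comment."""
    if offset_ns == period_ns:
        return "groupby_values_iter_full_lookbehind"
    if offset_ns > period_ns:
        return "groupby_values_iter_window_behind_t"
    return "groupby_values_iter_partial_lookbehind"


# The case from the linked issue: period=2d, offset=-3d.
before = old_path(2 * DAY_NS, 3 * DAY_NS)
after = new_path(2 * DAY_NS, 3 * DAY_NS)
```

Under these rules the only inputs whose path changes are those with period < offset < period * 2, which is where the issue's example falls.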

Member

I believe groupby_values_iter_full_lookbehind assumes that t is completely behind the window. So there are more cases where that holds besides period == offset.

I will have to dig into which cases those were again. Do you know off the top of your head which predicate would include all cases where t is a full lookbehind?

This is beneficial as in that case we can parallelize over t and then look from that point backwards in the slice to find the window.
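The idea of starting at t and scanning backwards through the slice might look roughly like the sketch below. It is an illustration only, not the Rust implementation: it assumes strictly increasing integer timestamps, closed='right', and offset == -period, and the function name is hypothetical.

```python
def lookbehind_bounds(times, i, period):
    """Return (start, length) of the slice covering (times[i] - period, times[i]].

    Assumes `times` is sorted and strictly increasing, so nothing after
    index i can fall inside the window.
    """
    lower = times[i] - period
    start = i
    # Walk backwards from t while the previous timestamp is still inside the window.
    while start > 0 and times[start - 1] > lower:
        start -= 1
    return start, i - start + 1


times = [1, 2, 3, 5, 8]
# Each index is handled independently of the others, which is what
# would allow the work to be split across threads.
bounds = [lookbehind_bounds(times, i, period=2) for i in range(len(times))]
```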

Collaborator Author

@MarcoGorelli MarcoGorelli Jun 20, 2023

> assumes that t is completely behind the window

If period == offset and closed == 'right', then t is indeed included in the window (it's the right endpoint). For example, the window could be (2020-01-01, 2020-01-02] and t could be 2020-01-02.

From testing, that function only works if offset == period. There's an explicit check for when closed == 'right', i.e. when it's not a full lookbehind:

if matches!(closed_window, ClosedWindow::Right | ClosedWindow::Both) {
    len += 1;
}

For offset > period, it's incorrect for any value of closed: #9250

It may be possible to change it so it handles the case when offset > period. But for now, I'm suggesting to:

  • rename it, since when t is the right endpoint and closed='right', it's not a full lookbehind
  • only use it when offset == period (so at least the results are correct)
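The endpoint argument above can be checked with a small sketch. It assumes the window for a row at time t is bounded by t + offset and t + offset + period, with `closed` deciding which endpoints are included; the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def window_for(t, period, offset):
    """Window endpoints for a row at time t (assumed: start = t + offset)."""
    start = t + offset
    return start, start + period


def contains(ts, start, end, closed):
    """Whether ts lies in the window, honouring which sides are closed."""
    left_ok = ts >= start if closed in ("left", "both") else ts > start
    right_ok = ts <= end if closed in ("right", "both") else ts < end
    return left_ok and right_ok


t = datetime(2020, 1, 2)

# offset == period: the window is (2020-01-01, 2020-01-02], so t is the right endpoint.
start, end = window_for(t, timedelta(days=1), timedelta(days=-1))
t_in_right_closed = contains(t, start, end, "right")
t_in_left_closed = contains(t, start, end, "left")

# offset > period (period=2d, offset=-3d): the window ends a day before t.
start2, end2 = window_for(t, timedelta(days=2), timedelta(days=-3))
t_in_any = any(contains(t, start2, end2, c) for c in ("left", "right", "both", "none"))
```

With offset == period, whether t belongs to its own window depends on `closed`; with offset > period it never does.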

Member

Right, let's first make it correct. We can try to find fast paths later if needed. 👍

Thanks!

@MarcoGorelli MarcoGorelli marked this pull request as ready for review June 19, 2023 13:16
@MarcoGorelli MarcoGorelli force-pushed the fix-groupby-rolling-with-offset branch from 093d9eb to a0ba32c Compare June 19, 2023 13:51

@given(
    period=st.timedeltas(min_value=timedelta(microseconds=0)),
    offset=st.timedeltas(),

@honno honno Jun 20, 2023

FYI Hypothesis has a handy map() method for strategies. Currently the type hints of period/offset are wrong in the signature anywho, which this would also fix.

e.g. for offset (you can also do this for period I think)

Suggested change
- offset=st.timedeltas(),
+ offset=st.timedeltas().map(_timedelta_to_pl_duration),

Although FWIW, sometimes you want to generate the "core" object (i.e. datetime.timedelta) anyway as you might do additional tests with it.

(had this thought and was going to message you on slack, but thought I'd check open PRs first 😅)
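For readers unfamiliar with the pattern, the function passed to map() is applied to every generated value, so the test receives the converted object directly. The body of `_timedelta_to_pl_duration` is not shown in this thread; the stand-in below is a hypothetical converter, and its output format (a string such as "-3d" or "1d2s3us") is an assumption about what a Polars duration string looks like.

```python
from datetime import timedelta


def timedelta_to_duration_string(td: timedelta) -> str:
    """Hypothetical stand-in for a timedelta-to-duration-string helper.

    Assumption: a leading "-" negates the whole duration.
    """
    sign = "-" if td < timedelta(0) else ""
    td = abs(td)  # work with the magnitude, then re-attach the sign
    parts = []
    if td.days:
        parts.append(f"{td.days}d")
    if td.seconds:
        parts.append(f"{td.seconds}s")
    if td.microseconds:
        parts.append(f"{td.microseconds}us")
    return sign + ("".join(parts) or "0us")


# With Hypothesis installed, the converter would be attached to the strategy:
#     offset=st.timedeltas().map(timedelta_to_duration_string)
# so the test body receives a string rather than a datetime.timedelta.
example = timedelta_to_duration_string(timedelta(days=-3))
```

As the comment notes, generating the plain datetime.timedelta and converting inside the test remains useful when the test also needs the original object, for instance to feed a pure Python reference implementation.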

Collaborator Author

nice, good one, thanks!

@ritchie46 ritchie46 changed the title Fix groupby rolling with negative offset fix(rust, python): groupby rolling with negative offset Jun 20, 2023
@github-actions github-actions bot added the fix (Bug fix), python (Related to Python Polars), and rust (Related to Rust Polars) labels Jun 20, 2023
@MarcoGorelli
Collaborator Author

thanks for your review, I'll think about how to improve performance in the offset > period case

and thanks Alex and Matt for reviewing + helping with hypothesis!

@MarcoGorelli MarcoGorelli merged commit d3779ae into pola-rs:main Jun 20, 2023
c-peters pushed a commit to c-peters/polars that referenced this pull request Jul 14, 2023
Development

Successfully merging this pull request may close these issues.

Wrong groupby_rolling result when period=2d and offset=-3d
3 participants