-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(rust, python): groupby rolling with negative offset #9428
fix(rust, python): groupby rolling with negative offset #9428
Conversation
eb42eb5
to
66dd292
Compare
@@ -525,49 +525,47 @@ pub fn groupby_values( | |||
|
|||
// we have a (partial) lookbehind window | |||
if offset.negative { | |||
if offset.duration_ns() >= period.duration_ns() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, previously there were two paths:
offset
>=period
,offset
<period * 2
: groupby_values_iter_full_lookbehindoffset
>=period
,offset
>=period * 2
: groupby_values_iter_window_behind_toffset
<period
: groupby_values_iter_partial_lookbehind
I don't get why there's the < period * 2
check. Looks like it comes from https://github.com/pola-rs/polars/pull/4010/files, but I don't see why
Anyway, groupby_values_iter_full_lookbehind
assumes t
is at the end of the window (i.e. period == offset
), so changing the logic to
offset
==period
: groupby_values_iter_full_lookbehindoffset
>period
: groupby_values_iter_window_behind_t (slower, but this is quite unusual anyway?)offset
<period
: groupby_values_iter_partial_lookbehind
fixes all the test cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I belive groupby_values_iter_full_lookbehind
assumes that t
is completely behind the window. So there are more cases where we have that besides period == offset
.
I will have to dive into it which cases it were again. Do you have on top of mind which predicate would inlcude all cases where t
is full lookbehind?
This is beneficial as in that case we can parallelize over t
and then look from that point backwards in the slice to find the window.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assumes that t is completely behind the window
If period == offset
and closed =='right'
, then t
is indeed included in the window (it's the right endpoint). For example the window could be (2020-01-01, 2020-01-02]
and t
could be 2020-01-02
.
From testing, that function only works if offset== period
. There's an explicit check for when closed=='right'
, i.e. when it's not a full lookbehind:
polars/polars/polars-time/src/windows/groupby.rs
Lines 275 to 277 in 12c4d9a
if matches!(closed_window, ClosedWindow::Right | ClosedWindow::Both) { | |
len += 1; | |
} |
For offset > period
, then it's incorrect for any value of closed
: #9250
It may be possible to change it so it handles the case when offset > period
. But for now, I'm suggesting to:
- rename it, as when
t
is the right endpoint then ifclosed='right'
then it's not a full lookbehind - only use it when
offset == period
(so at least the results are correct)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, let's first make it correct. We can try to find fast paths later if needed. 👍
Thanks!
…n-default negative offsets
093d9eb
to
a0ba32c
Compare
|
||
@given( | ||
period=st.timedeltas(min_value=timedelta(microseconds=0)), | ||
offset=st.timedeltas(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI Hypothesis has a handy map()
method for strategies. Currently the type hints of period
/offset
are wrong in the signature anywho, which this would also fix.
e.g. for offset
(you can also do this for period
I think)
offset=st.timedeltas(), | |
offset=st.timedeltas().map(_timedelta_to_pl_duration), |
Although FWIW, sometimes you want to generate the "core" object (i.e. datetime.timedelta
) anyway as you might do additional tests with it.
(had this thought and was going to message you on slack, but thought I'd check open PRs first 😅)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, good one, thanks!
thanks for your review, I'll think about how to improve performance in the offset > period case and thanks Alex and Matt for reviewing + helping with hypothesis! |
closes #9250
Adding
Perf implications:
period
==offset
(the default), nothing, this stays the same as beforeperiod > offset
,groupby_values_iter_window_behind_t
is used instead ofgroupby_values_iter_full_lookbehind
. This is documented to be slower, but at least it's correct. In any case, this strikes me as a rarer use case than the default