Wrong groupby_rolling result when period=2d and offset=-3d #9250

Closed
2 tasks done
SaneBow opened this issue Jun 6, 2023 · 2 comments · Fixed by #9428
Labels
A-temporal Area: date/time functionality bug Something isn't working python Related to Python Polars

SaneBow commented Jun 6, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

When period=2d and offset=-3d, groupby_rolling gives an unexpected result, possibly due to incorrect boundary handling.
Strangely, with other offsets such as offset=-4d or offset=-5d, the results are correct.

Reproducible example

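import numpy as np
import polars as pl
from datetime import datetime
# Four daily timestamps, 2021-01-01 through 2021-01-04
dcol = pl.date_range(datetime(2021,1,1), datetime(2021,1,4), '1d', eager=True)
# Values 1..4, one per timestamp
vcol = np.arange(1, len(dcol)+1)
df = pl.DataFrame({'datetime': dcol, 'v': vcol})
df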
┌─────────────────────┬─────┐
│ datetime            ┆ v   │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2021-01-01 00:00:00 ┆ 1   │
│ 2021-01-02 00:00:00 ┆ 2   │
│ 2021-01-03 00:00:00 ┆ 3   │
│ 2021-01-04 00:00:00 ┆ 4   │
└─────────────────────┴─────┘
df.groupby_rolling('datetime', period='2d', offset='-3d').agg(pl.col('v'))

Wrong output:

┌─────────────────────┬───────────┐
│ datetime            ┆ v         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-01-01 00:00:00 ┆ [1]       │
│ 2021-01-02 00:00:00 ┆ [1, 2]    │
│ 2021-01-03 00:00:00 ┆ [1, 2, 3] │
│ 2021-01-04 00:00:00 ┆ [2, 3, 4] │
└─────────────────────┴───────────┘

Expected behavior

┌─────────────────────┬───────────┐
│ datetime            ┆ v         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-01-01 00:00:00 ┆ []        │
│ 2021-01-02 00:00:00 ┆ [1]       │
│ 2021-01-03 00:00:00 ┆ [1, 2]    │
│ 2021-01-04 00:00:00 ┆ [2, 3]    │
└─────────────────────┴───────────┘

Installed versions

--------Version info---------
Polars:      0.18.0
Index type:  UInt32
Platform:    Linux-5.4.0-149-generic-x86_64-with-glibc2.31
Python:      3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0]

----Optional dependencies----
numpy:       1.24.3
pandas:      1.5.3
pyarrow:     12.0.0
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      2023.5.0
matplotlib:  3.6.3
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>
@SaneBow SaneBow added bug Something isn't working python Related to Python Polars labels Jun 6, 2023

MarcoGorelli commented Jun 18, 2023

Thanks @SaneBow for the report

In this case, the windows will be

  • (2021-01-01 -3d, 2021-01-01 -3d + 2d]
  • (2021-01-02 -3d, 2021-01-02 -3d + 2d]
  • (2021-01-03 -3d, 2021-01-03 -3d + 2d]
  • (2021-01-04 -3d, 2021-01-04 -3d + 2d]

i.e.

  • (2020-12-29, 2020-12-31], values: []
  • (2020-12-30, 2021-01-01], values: [1]
  • (2020-12-31, 2021-01-02], values: [1, 2]
  • (2021-01-01, 2021-01-03], values: [2, 3]

So, I agree, the output does look incorrect. Will take a look
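
The window arithmetic above can be checked without Polars. The following is a minimal pure-Python sketch, assuming the default right-closed window (t + offset, t + offset + period] described in this comment. The helper name `rolling_windows` is hypothetical and is not part of the Polars API.

```python
from datetime import datetime, timedelta

# Sample data from the report: four daily timestamps with values 1..4.
times = [datetime(2021, 1, d) for d in range(1, 5)]
values = [1, 2, 3, 4]

def rolling_windows(times, values, period, offset):
    """Collect values inside the right-closed window (t + offset, t + offset + period].

    Hypothetical helper for illustration only; not a Polars function.
    """
    out = []
    for t in times:
        lo = t + offset          # exclusive lower bound
        hi = lo + period         # inclusive upper bound
        out.append([v for ts, v in zip(times, values) if lo < ts <= hi])
    return out

result = rolling_windows(times, values, timedelta(days=2), timedelta(days=-3))
print(result)  # [[], [1], [1, 2], [2, 3]]
```

This brute-force version reproduces the expected values listed above, which supports the reading that the reported output is off at the window boundaries.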

@MarcoGorelli MarcoGorelli added the A-temporal Area: date/time functionality label Jun 18, 2023
MarcoGorelli commented

The same example with closed='left' is also incorrect:

In [30]: import numpy as np
    ...: import polars as pl
    ...: from datetime import datetime
    ...: dcol = pl.date_range(datetime(2021,1,1), datetime(2021,1,4), '1d', eager=True)
    ...: vcol = np.arange(1, len(dcol)+1)
    ...: df = pl.DataFrame({'datetime': dcol, 'v': vcol})
    ...: df.groupby_rolling('datetime', period='2d', offset='-3d', closed='left').agg(pl.col('v'))
Out[30]:
shape: (4, 2)
┌─────────────────────┬───────────┐
│ datetime            ┆ v         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-01-01 00:00:00 ┆ []        │
│ 2021-01-02 00:00:00 ┆ []        │
│ 2021-01-03 00:00:00 ┆ []        │
│ 2021-01-04 00:00:00 ┆ []        │
└─────────────────────┴───────────┘

Was expecting:

shape: (4, 2)
┌─────────────────────┬───────────┐
│ datetime            ┆ v         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-01-01 00:00:00 ┆ []        │
│ 2021-01-02 00:00:00 ┆ []        │
│ 2021-01-03 00:00:00 ┆ [1]       │
│ 2021-01-04 00:00:00 ┆ [1, 2]    │
└─────────────────────┴───────────┘
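
The expected table above can be derived by hand as well. This is a rough pure-Python sketch, assuming closed='left' means the window [t + offset, t + offset + period) and closed='right' means (t + offset, t + offset + period]. The helper `expected_rolling` is a hypothetical name used for illustration and is not a Polars function.

```python
from datetime import datetime, timedelta

# Sample data from the report: four daily timestamps with values 1..4.
times = [datetime(2021, 1, d) for d in range(1, 5)]
values = [1, 2, 3, 4]

def expected_rolling(times, values, period, offset, closed="right"):
    """Brute-force reference for the per-row windows (hypothetical helper).

    closed="left"  -> [lo, hi)
    closed="right" -> (lo, hi]
    """
    if closed not in ("left", "right"):
        raise ValueError("this sketch only handles 'left' and 'right'")
    out = []
    for t in times:
        lo = t + offset
        hi = lo + period
        if closed == "left":
            out.append([v for ts, v in zip(times, values) if lo <= ts < hi])
        else:
            out.append([v for ts, v in zip(times, values) if lo < ts <= hi])
    return out

left = expected_rolling(times, values, timedelta(days=2), timedelta(days=-3), closed="left")
print(left)  # [[], [], [1], [1, 2]]
```

Under these assumptions the left-closed windows yield the values in the "Was expecting" table, not the all-empty lists shown in Out[30].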
