Grouping by frequency and resampling #9178

Merged (143 commits) on Nov 13, 2021

Conversation

@shwina shwina commented Sep 3, 2021

Closes #6255, #8416

This PR implements two related features:

  1. Grouping by a frequency via the freq= argument to cudf.Grouper
  2. Time-series resampling via the .resample() API

Either operation results in a _Resampler object that represents the data resampled into "bins" of a particular frequency. The following operations are supported on resampled data:

  1. Aggregations such as min() and max(), performed bin-wise
  2. ffill() and bfill(): forward and backward filling when upsampling data
  3. asfreq(): returns the resampled data as a Series or DataFrame

These are all best understood by example:

First, we create a time series with 1 minute intervals:

>>> index = cudf.date_range(start="2001-01-01", periods=10, freq="1T")
>>> sr = cudf.Series(range(10), index=index)
>>> sr
2001-01-01 00:00:00    0
2001-01-01 00:01:00    1
2001-01-01 00:02:00    2
2001-01-01 00:03:00    3
2001-01-01 00:04:00    4
2001-01-01 00:05:00    5
2001-01-01 00:06:00    6
2001-01-01 00:07:00    7
2001-01-01 00:08:00    8
2001-01-01 00:09:00    9
dtype: int64

Downsampling to 3 minute intervals, followed by a "sum" aggregation:

>>> sr.resample("3T").sum()  # equivalently, sr.groupby(cudf.Grouper(freq="3T")).sum()
2001-01-01 00:00:00     3
2001-01-01 00:03:00    12
2001-01-01 00:06:00    21
2001-01-01 00:09:00     9
dtype: int64
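
For readers who want to see the binning arithmetic spelled out, here is a minimal pure-Python sketch of the same downsampling. It is not the cuDF implementation: `downsample_sum` is a hypothetical helper, and it assumes left-closed, left-labelled bins anchored at midnight of the first timestamp's day.

```python
from datetime import datetime, timedelta


def downsample_sum(timestamps, values, bin_width):
    """Sum values into left-closed, left-labelled bins (illustrative only)."""
    # Assumption: bins are anchored at midnight of the first timestamp's day.
    origin = timestamps[0].replace(hour=0, minute=0, second=0, microsecond=0)
    sums = {}
    for ts, value in zip(timestamps, values):
        # timedelta // timedelta gives the integer index of the bin.
        bin_index = (ts - origin) // bin_width
        label = origin + bin_index * bin_width
        sums[label] = sums.get(label, 0) + value
    # Return the bins in chronological order.
    return dict(sorted(sums.items()))


start = datetime(2001, 1, 1)
timestamps = [start + timedelta(minutes=i) for i in range(10)]
values = list(range(10))

binned = downsample_sum(timestamps, values, timedelta(minutes=3))
for label, total in binned.items():
    print(label, total)
```

Unlike a real resample, this sketch only emits bins that contain at least one observation; empty bins are simply absent.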

Upsampling to 30 second intervals:

>>> sr.resample("30s").asfreq()
2001-01-01 00:00:00    0.0
2001-01-01 00:00:30    NaN
2001-01-01 00:01:00    1.0
2001-01-01 00:01:30    NaN
2001-01-01 00:02:00    2.0
2001-01-01 00:02:30    NaN
2001-01-01 00:03:00    3.0
2001-01-01 00:03:30    NaN
2001-01-01 00:04:00    4.0
2001-01-01 00:04:30    NaN
2001-01-01 00:05:00    5.0
2001-01-01 00:05:30    NaN
2001-01-01 00:06:00    6.0
2001-01-01 00:06:30    NaN
2001-01-01 00:07:00    7.0
2001-01-01 00:07:30    NaN
2001-01-01 00:08:00    8.0
2001-01-01 00:08:30    NaN
2001-01-01 00:09:00    9.0
Freq: 30S, dtype: float64
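
The NaN entries (and the switch to float64) come from placing the original values onto a denser time grid where most points have no observation. A minimal pure-Python sketch of that idea follows; `asfreq_sketch` is a hypothetical name used only for illustration and says nothing about how cuDF does this internally.

```python
import math
from datetime import datetime, timedelta


def asfreq_sketch(timestamps, values, step):
    """Place values on a regular grid; grid points without data become NaN."""
    lookup = dict(zip(timestamps, values))
    out = []
    t = timestamps[0]
    while t <= timestamps[-1]:
        # Values are converted to float so that NaN can mark missing points.
        out.append((t, float(lookup[t]) if t in lookup else math.nan))
        t += step
    return out


start = datetime(2001, 1, 1)
timestamps = [start + timedelta(minutes=i) for i in range(10)]

upsampled = asfreq_sketch(timestamps, list(range(10)), timedelta(seconds=30))
for t, v in upsampled[:4]:
    print(t, v)
```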

Upsampling to 30 second intervals, followed by a forward fill:

>>> sr.resample("30s").ffill()
2001-01-01 00:00:00    0
2001-01-01 00:00:30    0
2001-01-01 00:01:00    1
2001-01-01 00:01:30    1
2001-01-01 00:02:00    2
2001-01-01 00:02:30    2
2001-01-01 00:03:00    3
2001-01-01 00:03:30    3
2001-01-01 00:04:00    4
2001-01-01 00:04:30    4
2001-01-01 00:05:00    5
2001-01-01 00:05:30    5
2001-01-01 00:06:00    6
2001-01-01 00:06:30    6
2001-01-01 00:07:00    7
2001-01-01 00:07:30    7
2001-01-01 00:08:00    8
2001-01-01 00:08:30    8
2001-01-01 00:09:00    9
Freq: 30S, dtype: int64
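
Forward filling can be read as "for each point on the new grid, take the most recent original value at or before it". A minimal pure-Python sketch, assuming the input timestamps are sorted; `upsample_ffill` is a hypothetical helper, not part of cuDF:

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def upsample_ffill(timestamps, values, step):
    """Upsample onto a regular grid, carrying the last seen value forward."""
    out = []
    t = timestamps[0]
    while t <= timestamps[-1]:
        # Index of the last original timestamp that is <= t.
        i = bisect_right(timestamps, t) - 1
        out.append((t, values[i]))
        t += step
    return out


start = datetime(2001, 1, 1)
timestamps = [start + timedelta(minutes=i) for i in range(10)]

filled = upsample_ffill(timestamps, list(range(10)), timedelta(seconds=30))
for t, v in filled[:4]:
    print(t, v)
```

Because every grid point receives an existing value, no NaN is introduced and the integer dtype can be preserved, which matches the int64 result above.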

shwina and others added 30 commits July 27, 2021 16:35
…orner case when computed periods is negative or is 0

shwina commented Nov 11, 2021

rerun tests


isVoid commented Nov 12, 2021

Currently, when I run any operation, I get a FutureWarning because IndexedFrame.sort_values calls take. This is quite confusing to users:

In [10]: gdf.groupby(cudf.Grouper(key="Publish date", freq="1h", label="left", closed="left")).sum()
/raid/wangm/dev/rapids/cudf/python/cudf/cudf/core/frame.py:3076: FutureWarning: keep_index is deprecated and will be removed in the future.
  FutureWarning,
Out[10]: 
                       ID Price
Publish date                   
2000-01-01 12:00:00     1    30
2000-01-01 13:00:00  <NA>  <NA>
2000-01-01 14:00:00  <NA>  <NA>
...

This issue is not specific to this PR, so I'm raising a separate issue to track it instead: #9667

Comment on lines +118 to 119
    if ordered and labels is not None:
        if len(set(labels)) != len(labels):

This could all be a single conditional
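
As a rough illustration of that suggestion, the two nested checks can be folded into one condition. The body of the inner `if` is not shown in this thread, so the function name and the error message below are hypothetical stand-ins:

```python
def validate_labels(labels, ordered):
    """Hypothetical helper: reject duplicate labels when ordering is requested."""
    # Short-circuit evaluation means set(labels) is never reached
    # when labels is None or ordered is False.
    if ordered and labels is not None and len(set(labels)) != len(labels):
        # Illustrative message only; the real error text is not shown here.
        raise ValueError("labels must be unique when ordered=True")


validate_labels(["low", "mid", "high"], ordered=True)   # passes silently
validate_labels(None, ordered=True)                     # passes silently
validate_labels(["low", "low"], ordered=False)          # passes silently
```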


I deleted a few of the comments I made here. I already planned to do some refactoring of cut, and reading this I see quite a bit more that should be done, but it's out of scope for this PR. No need to change anything; we can revisit it at a later point.

@vyasr vyasr self-requested a review November 12, 2021 22:27
@quasiben

@gpucibot merge


shwina commented Nov 13, 2021

@gpucibot merge

Labels: feature request, non-breaking, Python
Development

Successfully merging this pull request may close these issues.

  - [FEA] Time series resampling (df.resample)
  - [FEA] Support time-frequency grouping for Grouper
8 participants