-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Grouper object design doc #8510
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,240 @@ | ||||||||||||
# Grouper Objects | ||||||||||||
**Author**: Deepak Cherian <[email protected]> | ||||||||||||
**Created**: Nov 21, 2023 | ||||||||||||
|
||||||||||||
## Abstract | ||||||||||||
|
||||||||||||
I propose the addition of Grouper objects to Xarray's public API so that | ||||||||||||
```python | ||||||||||||
Dataset.groupby(x=BinGrouper(bins=np.arange(10, 2)))) | ||||||||||||
``` | ||||||||||||
is identical to today's syntax: | ||||||||||||
```python | ||||||||||||
Dataset.groupby_bins("x", bins=np.arange(10, 2)) | ||||||||||||
``` | ||||||||||||
|
||||||||||||
## Motivation and scope | ||||||||||||
|
||||||||||||
Xarray's GroupBy API implements the split-apply-combine pattern (Wickham, 2011)[^1], which applies to a very large number of problems: histogramming, compositing, climatological averaging, resampling to a different time frequency, etc. | ||||||||||||
The pattern abstracts the following pseudocode: | ||||||||||||
```python | ||||||||||||
results = [] | ||||||||||||
for element in unique_labels: | ||||||||||||
subset = ds.sel(x=(ds.x == element)) # split | ||||||||||||
# subset = ds.where(ds.x == element, drop=True) # alternative | ||||||||||||
result = subset.mean() # apply | ||||||||||||
results.append(result) | ||||||||||||
|
||||||||||||
xr.concat(results) # combine | ||||||||||||
``` | ||||||||||||
|
||||||||||||
to | ||||||||||||
```python | ||||||||||||
ds.groupby('x').mean() # splits, applies, and combines | ||||||||||||
``` | ||||||||||||
|
||||||||||||
Efficient vectorized implementations of this pattern are implemented in numpy's [`ufunc.at`](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.at.html), [`ufunc.reduceat`](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.reduceat.html), [`numbagg.grouped`](https://github.com/numbagg/numbagg/blob/main/numbagg/grouped.py), [`numpy_groupies`](https://github.com/ml31415/numpy-groupies), and probably more. | ||||||||||||
These vectorized implementations *all* require, as input, an array of integer codes or labels that identify unique elements in the array being grouped over (`'x'` in the example above). | ||||||||||||
```python | ||||||||||||
import numpy as np | ||||||||||||
|
||||||||||||
# array to reduce | ||||||||||||
a = np.array([1, 1, 1, 1, 2]) | ||||||||||||
|
||||||||||||
# initial value for result | ||||||||||||
out = np.zeros((3,), dtype=int) | ||||||||||||
|
||||||||||||
# integer codes | ||||||||||||
labels = np.array([0, 0, 1, 2, 1]) | ||||||||||||
|
||||||||||||
# groupby-reduction | ||||||||||||
np.add.at(out, labels, a) | ||||||||||||
out # array([2, 3, 1]) | ||||||||||||
``` | ||||||||||||
|
||||||||||||
One can 'factorize' or construct such an array of integer codes using `pandas.factorize` or `numpy.unique(..., return_inverse=True)` for categorical arrays; `pandas.cut`, `pandas.qcut`, or `np.digitize` for discretizing continuous variables. | ||||||||||||
In practice, since `GroupBy` objects exist, much of complexity in applying the groupby paradigm stems from appropriately factorizing or generating labels for the operation. | ||||||||||||
Consider these two examples: | ||||||||||||
1. [Bins that vary in a dimension](https://flox.readthedocs.io/en/latest/user-stories/nD-bins.html) | ||||||||||||
2. [Overlapping groups](https://flox.readthedocs.io/en/latest/user-stories/overlaps.html) | ||||||||||||
3. [Rolling resampling](https://github.com/pydata/xarray/discussions/8361) | ||||||||||||
|
||||||||||||
Anecdotally, less experienced users commonly resort to the for-loopy implementation illustrated by the pseudocode above when the analysis at hand is not easily expressed using the API presented by Xarray's GroupBy object. | ||||||||||||
Xarray's GroupBy API today abstracts away the split, apply, and combine stages but not the "factorize" stage. | ||||||||||||
Grouper objects will close the gap. | ||||||||||||
|
||||||||||||
## Usage and impact | ||||||||||||
|
||||||||||||
|
||||||||||||
Grouper objects | ||||||||||||
1. Will abstract useful factorization algorithms, and | ||||||||||||
2. Present a natural way to extend GroupBy to grouping by multiple variables: `ds.groupby(x=BinGrouper(...), t=Resampler(freq="M", ...)).mean()`. | ||||||||||||
|
||||||||||||
In addition, Grouper objects provide a nice interface to add often-requested grouping functionality | ||||||||||||
1. A new `SpaceResampler` would allow specifying resampling spatial dimensions. ([issue](https://github.com/pydata/xarray/issues/4008)) | ||||||||||||
2. `RollingTimeResampler` would allow rolling-like functionality that understands timestamps ([issue](https://github.com/pydata/xarray/issues/3216)) | ||||||||||||
3. A `QuantileBinGrouper` to abstract away `pd.cut` ([issue](https://github.com/pydata/xarray/discussions/7110)) | ||||||||||||
4. A `SeasonGrouper` and `SeasonResampler` would abstract away common annoyances with such calculations today | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Prototype Implemented here: https://github.com/dcherian/xarray/blob/grouper-public/xarray/core/groupers.py |
||||||||||||
1. Support seasons that span a year-end. | ||||||||||||
2. Only include seasons with complete data coverage. | ||||||||||||
3. Allow grouping over seasons of unequal length | ||||||||||||
4. See [this xcdat discussion](https://github.com/xCDAT/xcdat/issues/416) for a `SeasonGrouper` like functionality: | ||||||||||||
5. Return results with seasons in a sensible order | ||||||||||||
5. Weighted grouping ([issue](https://github.com/pydata/xarray/issues/3937)) | ||||||||||||
1. Once `IntervalIndex` like objects are supported, `Resampler` groupers can account for interval lengths when resampling. | ||||||||||||
|
||||||||||||
## Backward Compatibility | ||||||||||||
|
||||||||||||
Xarray's existing grouping functionality will be exposed using two new Groupers: | ||||||||||||
1. `UniqueGrouper` which uses `pandas.factorize` | ||||||||||||
2. `BinGrouper` which uses `pandas.cut` | ||||||||||||
3. `TimeResampler` which mimics pandas' `.resample` | ||||||||||||
|
||||||||||||
Grouping by single variables will be unaffected so that `ds.groupby('x')` will be identical to `ds.groupby(x=UniqueGrouper())`. | ||||||||||||
Similarly, `ds.groupby_bins('x', bins=np.arange(10, 2))` will be unchanged and identical to `ds.groupby(x=BinGrouper(bins=np.arange(10, 2)))`. | ||||||||||||
|
||||||||||||
## Detailed description | ||||||||||||
|
||||||||||||
All Grouper objects will subclass from a Grouper object | ||||||||||||
```python | ||||||||||||
import abc | ||||||||||||
|
||||||||||||
class Grouper(abc.ABC): | ||||||||||||
@abc.abstractmethod | ||||||||||||
def factorize(self, by: DataArray): | ||||||||||||
raise NotImplementedError | ||||||||||||
|
||||||||||||
class CustomGrouper(Grouper): | ||||||||||||
def factorize(self, by: DataArray): | ||||||||||||
... | ||||||||||||
return codes, group_indices, unique_coord, full_index | ||||||||||||
|
||||||||||||
def weights(self, by: DataArray) -> DataArray: | ||||||||||||
... | ||||||||||||
return weights | ||||||||||||
``` | ||||||||||||
|
||||||||||||
### The `factorize` method | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a better name? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 for something involving "labels", as IIUC this method is about grouping coordinate "labels" into common groups There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
And maybe rename the method argument from class MyGrouper(Grouper):
def find_groups(self, var: Variable) -> Groups:
...
return Groups(...) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I kind of like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree. In flox I use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm a bit late, but throughout the document you're using the term "factorize" when you talk about constructing an array of integer indices for the "split" step, so maybe that means that using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. okay, that makes sense. However, since this is advanced API I think we can require people to learn the new meaning, which is not actually new since we're borrowing from What does other "split-apply-combine" software call this? From what I can tell:
|
||||||||||||
Today, the `factorize` method takes as input the group variable and returns 4 variables (I propose to clean this up below): | ||||||||||||
1. `codes`: An array of same shape as the `group` with int dtype. NaNs in `group` are coded by `-1` and ignored later. | ||||||||||||
2. `group_indices` is a list of index location of `group` elements that belong to a single group. | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this need to be part of the interface? I think it can be calculated from There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that it can be calculated. We'd lose an optimization where the current code uses slices for resampling, and unindexed coordinates. I think it's valuable to preserve the former at least. What do you think of this function returning a named tuple with an optional There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think returning a dataclass with some optional fields would indeed be a nice way to handle this. It would also be forward compatible with any future interface changes. |
||||||||||||
3. `unique_coord` is (usually) a `pandas.Index` object of all unique `group` members present in `group`. | ||||||||||||
4. `full_index` is a `pandas.Index` of all `group` members. This is different from `unique_coord` for binning and resampling, where not all groups in the output may be represented in the input `group`. For grouping by a categorical variable e.g. `['a', 'b', 'a', 'c']`, `full_index` and `unique_coord` are identical. | ||||||||||||
There is some redundancy here since `unique_coord` is always equal to or a subset of `full_index`. | ||||||||||||
We can clean this up (see Implementation below). | ||||||||||||
|
||||||||||||
### The `weights` method (?) | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "compute_weights"? "find_weights"? |
||||||||||||
|
||||||||||||
The proposed `weights` method is optional and unimplemented today. | ||||||||||||
Groupers with `weights` will allow composing `weighted` and `groupby` ([issue](https://github.com/pydata/xarray/issues/3937)). | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with your implementation point but the |
||||||||||||
The `weights` method should return an appropriate array of weights such that the following property is satisfied | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we want to allow these operations to be efficient, we should definitely allow and encourage returning weights as sparse arrays. |
||||||||||||
```python | ||||||||||||
gb_sum = ds.groupby(by).sum() | ||||||||||||
|
||||||||||||
weights = CustomGrouper.weights(by) | ||||||||||||
weighted_sum = xr.dot(ds, weights) | ||||||||||||
|
||||||||||||
assert_identical(gb_sum, weighted_sum) | ||||||||||||
``` | ||||||||||||
For example, the boolean weights for `group=np.array(['a', 'b', 'c', 'a', 'a'])` should be | ||||||||||||
``` | ||||||||||||
[[1, 0, 0, 1, 1], | ||||||||||||
[0, 1, 0, 0, 0], | ||||||||||||
[0, 0, 1, 0, 0]] | ||||||||||||
``` | ||||||||||||
This is the boolean "summarization matrix" referred to in the classic Iverson (1980, Section 4.3)[^2] and "nub sieve" in [various APLs](https://aplwiki.com/wiki/Nub_Sieve). | ||||||||||||
|
||||||||||||
> [!NOTE] | ||||||||||||
> We can always construct `weights` automatically using `group_indices` from `factorize`, so this is not a required method. | ||||||||||||
|
||||||||||||
For a rolling resampling, windowed weights are possible | ||||||||||||
``` | ||||||||||||
[[0.5, 1, 0.5, 0, 0], | ||||||||||||
[0, 0.25, 1, 1, 0], | ||||||||||||
[0, 0, 0, 1, 1]] | ||||||||||||
``` | ||||||||||||
|
||||||||||||
### The `preferred_chunks` method (?) | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Quite speculative but seems cool! |
||||||||||||
|
||||||||||||
Rechunking support is another optional extension point. | ||||||||||||
In `flox` I experimented some with automatically rechunking to make a groupby more parallel-friendly ([example 1](https://flox.readthedocs.io/en/latest/generated/flox.rechunk_for_blockwise.html), [example 2](https://flox.readthedocs.io/en/latest/generated/flox.rechunk_for_cohorts.html)). | ||||||||||||
A great example is for resampling-style groupby reductions, for which `codes` might look like | ||||||||||||
``` | ||||||||||||
0001|11122|3333 | ||||||||||||
``` | ||||||||||||
where `|` represents chunk boundaries. A simple rechunking to | ||||||||||||
``` | ||||||||||||
000|111122|3333 | ||||||||||||
``` | ||||||||||||
would make this resampling reduction an embarassingly parallel blockwise problem. | ||||||||||||
|
||||||||||||
Similarly consider monthly-mean climatologies for which the month numbers might be | ||||||||||||
``` | ||||||||||||
1 2 3 4 5 | 6 7 8 9 10 | 11 12 1 2 3 | 4 5 6 7 8 | 9 10 11 12 | | ||||||||||||
``` | ||||||||||||
A slight rechunking to | ||||||||||||
``` | ||||||||||||
1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 1 2 3 4 | 5 6 7 8 | 9 10 11 12 | | ||||||||||||
``` | ||||||||||||
allows us to reduce `1, 2, 3, 4` separately from `5,6,7,8` and `9, 10, 11, 12` while still being parallel friendly (see the [flox documentation](https://flox.readthedocs.io/en/latest/implementation.html#method-cohorts) for more). | ||||||||||||
|
||||||||||||
We could attempt to detect these patterns, or we could just have the Grouper take as input `chunks` and return a tuple of "nice" chunk sizes to rechunk to. | ||||||||||||
```python | ||||||||||||
def preferred_chunks(self, chunks: ChunksTuple) -> ChunksTuple: | ||||||||||||
pass | ||||||||||||
``` | ||||||||||||
For monthly means, since the period of repetition of labels is 12, the Grouper might choose possible chunk sizes of `((2,),(3,),(4,),(6,))`. | ||||||||||||
For resampling, the Grouper could choose to resample to a multiple or an even fraction of the resampling frequency. | ||||||||||||
|
||||||||||||
## Related work | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Very interested in what else is out there. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sklearn's clustering? |
||||||||||||
|
||||||||||||
Pandas has [Grouper objects](https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html#pandas-grouper) that represent the GroupBy instruction. | ||||||||||||
However, these objects do not appear to be extension points, unlike the Grouper objects proposed here. | ||||||||||||
Instead, Pandas' `ExtensionArray` has a [`factorize`](https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.factorize.html) method. | ||||||||||||
|
||||||||||||
Composing rolling with time resampling is a common workload: | ||||||||||||
1. Polars has [`group_by_dynamic`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by_dynamic.html) which appears to be like the proposed `RollingResampler`. | ||||||||||||
2. scikit-downscale provides [`PaddedDOYGrouper`]( | ||||||||||||
https://github.com/pangeo-data/scikit-downscale/blob/e16944a32b44f774980fa953ea18e29a628c71b8/skdownscale/pointwise_models/groupers.py#L19) | ||||||||||||
Comment on lines
+197
to
+198
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We would be quite happy to have this directly in xarray! We have are own implementation at xclim (https://github.com/Ouranosinc/xclim/blob/b905e5529f2757a49f4485b7713b0ef01a45df2c/xclim/sdba/base.py#L103) but it's kind of a mess. Grouping per time period with a window is very useful and common in model output statistics. |
||||||||||||
|
||||||||||||
## Implementation Proposal | ||||||||||||
|
||||||||||||
1. Get rid of `squeeze` [issue](https://github.com/pydata/xarray/issues/2157): [PR](https://github.com/pydata/xarray/pull/8506) | ||||||||||||
2. Merge existing two class implementation to a single Grouper class | ||||||||||||
1. This design was implemented in [this PR](https://github.com/pydata/xarray/pull/7206) to account for some annoying data dependencies. | ||||||||||||
2. See [PR](https://github.com/pydata/xarray/pull/8509) | ||||||||||||
3. Clean up what's returned by `factorize` methods. | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This needs input on what we expect custom There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it's probably best to try implementing the most important use cases from the lists you've been assembling: by doing so I believe we can figure out the most important ones. |
||||||||||||
1. A solution here might be to have `group_indices: Mapping[int, Sequence[int]]` be a mapping from group index in `full_index` to a sequence of integers. | ||||||||||||
2. Return a `namedtuple` or `dataclass` from existing Grouper factorize methods to facilitate API changes in the future. | ||||||||||||
4. Figure out what to pass to `factorize` | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could use some input here too, requires some internal cleanup. |
||||||||||||
1. Xarray eagerly reshapes nD variables to 1D. This is an implementation detail we need not expose. | ||||||||||||
2. When grouping by an unindexed variable Xarray passes a `_DummyGroup` object. This seems like something we don't want in the public interface. We could special case "internal" Groupers to preserve the optimizations in `UniqueGrouper`. | ||||||||||||
5. Grouper objects will exposed under the `xr.groupers` Namespace. At first these will include `UniqueGrouper`, `BinGrouper`, and `TimeResampler`. | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thoughts on the namespace? And what other "groupers" would be useful? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That seems fine to me. We'd need to figure out which groupers to declare out-of-scope for inclusion in |
||||||||||||
|
||||||||||||
## Alternatives | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We discussed this at a dev meeting a while ago, but would be good to rethink and commit to a path now. |
||||||||||||
|
||||||||||||
One major design choice made here was to adopt the syntax `ds.groupby(x=BinGrouper(...))` instead of `ds.groupby(BinGrouper('x', ...))`. | ||||||||||||
This allows reuse of Grouper objects, example | ||||||||||||
```python | ||||||||||||
grouper = BinGrouper(...) | ||||||||||||
ds.groupby(x=grouper, y=grouper) | ||||||||||||
``` | ||||||||||||
Comment on lines
+217
to
+221
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are there cases where reusing Groupers would actually be useful? I think this is mostly just about a syntax. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had one where I wanted to use the same bin specification for two variables. But I agree that its minor. Nominally every property on the grouper is independent of the data variable it is applied to. I think that motivated my suggestion to not encapsulate the data variable in the grouper. But every method on the grouper relies on the data variable as input. So 🤷🏾♂️ . Do you have any advice here on how to think about the coupling/decoupling? |
||||||||||||
but requires that all variables being grouped by (`x` and `y` above) are present in Dataset `ds`. This does not seem like a bad requirement. | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree, this is probably fine. The variables can also be coordinates (e.g., on a DataArray). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we go with that, how would we support variables that don't align with the dataset? I was under the impression that that currently works (not sure, though). If we require a new dimension name for the grouper variable that would be fine with me, but it's also something we should explicitly state. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
We don't. We use an "exact" join today: Lines 616 to 620 in 41d33f5
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. in that case sounds good to me, |
||||||||||||
Importantly `Grouper` instances will be copied internally so that they can safely cache state that might be shared between `factorize` and `weights`. | ||||||||||||
|
||||||||||||
Today, it is possible to `ds.groupby(DataArray, ...)`. This syntax will still be supported. | ||||||||||||
|
||||||||||||
## Discussion | ||||||||||||
|
||||||||||||
This proposal builds on these discussions: | ||||||||||||
1. https://github.com/xarray-contrib/flox/issues/191#issuecomment-1328898836 | ||||||||||||
2. https://github.com/pydata/xarray/issues/6610 | ||||||||||||
|
||||||||||||
## Copyright | ||||||||||||
|
||||||||||||
This document has been placed in the public domain. | ||||||||||||
|
||||||||||||
## References and footnotes | ||||||||||||
|
||||||||||||
[^1]: Wickham, H. (2011). The split-apply-combine strategy for data analysis. https://vita.had.co.nz/papers/plyr.html | ||||||||||||
[^2]: Iverson, K.E. (1980). Notation as a tool of thought. Commun. ACM 23, 8 (Aug. 1980), 444–465. https://doi.org/10.1145/358896.358899 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How fundamentally different are space and time resampling? If the difference is small enough (units / step size), would it be possible to merge both into a single
Resampler
Grouper object? Then it might also be possible to apply this to coordinates with physical dimensions other than space / time.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this could be done, they're the same, its just the user interface that could be different. For time it's nice to write
freq="M"
and for spacespacing=2.5
(?). The other arguments carry over but fortime
they can betimedelta
ordatetime
objects. So it feels nice to separate them for easy validation, and makes the user code more readable.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main point was that we might want to resample using (numeric?) coordinates that are neither time nor space, but "spacing" might be appropriate in that case, as well? In which case we'd only need a general "Resampler" and a specialized "TimeResampler" (4 in total:
Resampler
/RollingResampler
andTimeResampler
/RollingTimeResampler
). Unless I'm missing something?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, agree! The
Resampler
is really just a convenient wrapper aroundBinGrouper