From 9ce8d9c26ca459035da586978e88a7c17f060cf0 Mon Sep 17 00:00:00 2001 From: Deepak Cherian Date: Thu, 30 Nov 2023 16:54:40 -0700 Subject: [PATCH 1/4] Grouper object design doc --- design_notes/grouper_objects.md | 244 ++++++++++++++++++++++++++++++++ 1 file changed, 244 insertions(+) create mode 100644 design_notes/grouper_objects.md diff --git a/design_notes/grouper_objects.md b/design_notes/grouper_objects.md new file mode 100644 index 00000000000..b0b3b825107 --- /dev/null +++ b/design_notes/grouper_objects.md @@ -0,0 +1,244 @@ +# Grouper Objects +**Author**: Deepak Cherian +**Created**: Nov 21, 2023 + +## Abstract + +I propose the addition of Grouper objects to Xarray's public API so that +```python +Dataset.groupby(x=BinGrouper(bins=np.arange(10, 2)))) +``` +is identical to today's syntax: +```python +Dataset.groupby_bins("x", bins=np.arange(10, 2)) +``` + +## Motivation and scope + +Xarray's GroupBy API implements the split-apply-combine pattern (Wickham, 2011)[^1], which applies to a very large number of problems: histogramming, compositing, climatological averaging, resampling to a different time frequency, etc. +The pattern abstracts the following pseudocode: +```python +results = [] +for element in unique_labels: + subset = ds.sel(x=(ds.x == element)) # split + # subset = ds.where(ds.x == element, drop=True) # alternative + result = subset.mean() # apply + results.append(result) + +xr.concat(results) # combine +``` + +to +```python +ds.groupby('x').mean() # splits, applies, and combines +``` + +Efficient vectorized implementations of this pattern are implemented in numpy's [`ufunc.at`](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.at.html), [`ufunc.reduceat`](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.reduceat.html), [`numbagg.grouped`](https://github.com/numbagg/numbagg/blob/main/numbagg/grouped.py), [`numpy_groupies`](https://github.com/ml31415/numpy-groupies), and probably more. +These vectorized implementations *all* require, as input, an array of integer codes or labels that identify unique elements in the array being grouped over (`'x'` in the example above). +```python +import numpy as np + +# array to reduce +a = np.array([1, 1, 1, 1, 2]) + +# initial value for result +out = np.zeros((3,), dtype=int) + +# integer codes +labels = np.array([0, 0, 1, 2, 1]) + +# groupby-reduction +np.add.at(out, labels, a) +out # array([2, 3, 1]) +``` + +One can 'factorize' or construct such an array of integer codes using `pandas.factorize` or `numpy.unique(..., return_inverse=True)` for categorical arrays; `pandas.cut`, `pandas.qcut`, or `np.digitize` for discretizing continuous variables. +In practice, since `GroupBy` objects exist, much of complexity in applying the groupby paradigm stems from appropriately factorizing or generating labels for the operation. +Consider these two examples: +1. [Bins that vary in a dimension](https://flox.readthedocs.io/en/latest/user-stories/nD-bins.html) +2. [Overlapping groups](https://flox.readthedocs.io/en/latest/user-stories/overlaps.html) +3. [Rolling resampling](https://github.com/pydata/xarray/discussions/8361) + +Anecdotally, less sophisticated users commonly resort to the for-loopy implementation illustrated by the pseudocode above when the analysis at hand is not easily expressed using the API presented by Xarray's GroupBy object. +Xarray's GroupBy API today abstracts away the split, apply, and combine stages but not the "factorize" stage. +Grouper objects will close the gap. + +## Usage and impact + + + + + + + + + +Grouper objects +1. Will abstract useful factorization algorithms, and +2. Present a natural way to extend GroupBy to grouping by multiple variables: `ds.groupby(x=BinGrouper(...), t=Resampler(freq="M", ...)).mean()`. + +In addition, Grouper objects provide a nice interface to add often-requested grouping functionality +1. A new `SpaceResampler` would allow specifying resampling spatial dimensions. ([issue](https://github.com/pydata/xarray/issues/4008)) +2. `RollingTimeResampler` would allow rolling-like functionality that understands timestamps ([issue](https://github.com/pydata/xarray/issues/3216)) +3. A `QuantileBinGrouper` to abstract away `pd.cut` ([issue](https://github.com/pydata/xarray/discussions/7110)) +4. A `SeasonGrouper` and `SeasonResampler` would abstract away common annoyances with such calculations today + 1. Support seasons that span a year-end. + 2. Only include seasons with complete data coverage. + 3. Allow grouping over seasons of unequal length + 4. See [this xcdat discussion](https://github.com/xCDAT/xcdat/issues/416) for a `SeasonGrouper` like functionality: + 5. Return results with seasons in a sensible order +5. Weighted grouping ([issue](https://github.com/pydata/xarray/issues/3937)) + 1. Once `IntervalIndex` like objects are supported, `Resampler` groupers can account for interval lengths when resampling. + +## Backward Compatibility + +Xarray's existing grouping functionality will be exposed using two new Groupers: +1. `UniqueGrouper` which uses `pandas.factorize` +2. `BinGrouper` which uses `pandas.cut` +3. `TimeResampler` which mimics pandas' `.resample` + +Grouping by single variables will be unaffected so that `ds.groupby('x')` will be identical to `ds.groupby(x=UniqueGrouper())`. +Similarly, `ds.groupby_bins('x', bins=np.arange(10, 2))` will be unchanged and identical to `ds.groupby(x=BinGrouper(bins=np.arange(10, 2)))`. + +## Detailed description + +All Grouper objects will subclass from a Grouper object +```python +import abc + +class Grouper(abc.ABC): + @abc.abstractmethod + def factorize(self, group): + raise NotImplementedError + +class CustomGrouper(Grouper): + def factorize(self, group): + ... + return codes, group_indices, unique_coord, full_index + + def weights(self, group): + ... + return weights +``` + +### The `factorize` method +Today, the `factorize` method takes as input the group variable and returns 4 variables (I propose to clean this up below): +1. `codes`: An array of same shape as the `group` with int dtype. NaNs in `group` are coded by `-1` and ignored later. +2. `group_indices` is a list of index location of `group` elements that belong to a single group. +3. `unique_coord` is (usually) a `pandas.Index` object of all unique `group` members present in `group`. +4. `full_index` is a `pandas.Index` of all `group` members. This is different from `unique_coord` for binning and resampling, where not all groups in the output may be represented in the input `group`. For grouping by a categorical variable e.g. `['a', 'b', 'a', 'c']`, `full_index` and `unique_coord` are identical. +There is some redundancy here since `unique_coord` is always equal to or a subset of `full_index`. +We can clean this up (see Implementation below). + +### The `weights` method (?) + +The proposed `weights` method is optional and unimplemented today. +Groupers with `weights` will allow composing `weighted` and `groupby` ([issue](https://github.com/pydata/xarray/issues/3937)). +The `weights` method should return an appropriate array of weights such that the following property is satisfied +```python +gb_sum = ds.groupby(by).sum() + +weights = CustomGrouper.weights(by) +weighted_sum = xr.dot(ds, weights) + +assert_identical(gb_sum, weighted_sum) +``` +For example, the boolean weights for `group=np.array(['a', 'b', 'c', 'a', 'a'])` should be +``` +[[1, 0, 0, 1, 1], + [0, 1, 0, 0, 0], + [0, 0, 1, 0, 0]] +``` +This is the boolean "summarization matrix" referred to in the classic Iverson (1980, Section 4.3)[^2]. + +For a rolling resampling, windowed weights are possible +``` +[[0.5, 1, 0.5, 0, 0], + [0, 0.25, 1, 1, 0], + [0, 0, 0, 1, 1]] +``` + +### The `preferred_chunks` method (?) + +Rechunking support is another optional extension point. +In `flox` I experimented some with automatically rechunking to make a groupby more parallel-friendly ([example 1](https://flox.readthedocs.io/en/latest/generated/flox.rechunk_for_blockwise.html), [example 2](https://flox.readthedocs.io/en/latest/generated/flox.rechunk_for_cohorts.html)). +A great example is for resampling-style groupby reductions, for which `codes` might look like +``` +0001|11122|3333 +``` +where `|` represents chunk boundaries. A simple rechunking to +``` +000|111122|3333 +``` +would make this resampling reduction an embarassingly parallel blockwise problem. + +Similarly consider monthly-mean climatologies for which the month numbers might be +``` +1 2 3 4 5 | 6 7 8 9 10 | 11 12 1 2 3 | 4 5 6 7 8 | 9 10 11 12 | +``` +A slight rechunking to +``` +1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 1 2 3 4 | 5 6 7 8 | 9 10 11 12 | +``` +allows us to reduce `1, 2, 3, 4` separately from `5,6,7,8` and `9, 10, 11, 12` while still being parallel friendly (see the [flox documentation](https://flox.readthedocs.io/en/latest/implementation.html#method-cohorts) for more). + +We could attempt to detect these patterns, or we could just have the Grouper take as input `chunks` and return a tuple of "nice" chunk sizes to rechunk to. +```python +def preferred_chunks(self, chunks: ChunksTuple) -> ChunksTuple: + pass +``` +For monthly means, since the period of repetition of labels is 12, the Grouper might choose possible chunk sizes of `((2,),(3,),(4,),(6,))`. +For resampling, the Grouper could choose to resample to a multiple or an even fraction of the resampling frequency. + +## Related work + +Pandas has [Grouper objects](https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html#pandas-grouper) that represent the GroupBy instruction. +However, these objects do not appear to be extension points, unlike the Grouper objects proposed here. +Instead, Pandas' `ExtensionArray` has a [`factorize`](https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.factorize.html) method. + +Composing rolling with time resampling is a common workload: +1. Polars has [`group_by_dynamic`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by_dynamic.html) which appears to be like the proposed `RollingResampler`. +2. scikit-downscale provides [`PaddedDOYGrouper`]( +https://github.com/pangeo-data/scikit-downscale/blob/e16944a32b44f774980fa953ea18e29a628c71b8/skdownscale/pointwise_models/groupers.py#L19) + +## Implementation Proposal + +1. Get rid of `squeeze` [issue](https://github.com/pydata/xarray/issues/2157): [PR](https://github.com/pydata/xarray/pull/8506) +2. Merge existing two class implementation to a single Grouper class + 1. This design was implemented in [this PR](https://github.com/pydata/xarray/pull/7206) to account for some annoying data dependencies. + 2. See [PR](https://github.com/pydata/xarray/pull/8509) +3. Clean up what's returned by `factorize` methods. + 1. A solution here might be to have `group_indices: Mapping[int, Sequence[int]]` be a mapping from group index in `full_index` to a sequence of integers. + 2. Return a `namedtuple` or `dataclass` from existing Grouper factorize methods to facilitate API changes in the future. +4. Figure out what to pass to `factorize` + 1. Xarray eagerly reshapes nD variables to 1D. This is an implementation detail we need not expose. + 2. When grouping by an unindexed variable Xarray passes a `_DummyGroup` object. This seems like something we don't want in the public interface. We could special case "internal" Groupers to preserve the optimizations in `UniqueGrouper`. +5. Grouper objects will exposed under the `xr.groupers` Namespace. At first these will include `UniqueGrouper`, `BinGrouper`, and `TimeResampler`. + +## Alternatives + +One major design choice made here was to adopt the syntax `ds.groupby(x=BinGrouper(...))` instead of `ds.groupby(BinGrouper('x', ...))`. +This allows reuse of Grouper objects, example +```python +grouper = BinGrouper(...) +ds.groupby(x=grouper, y=grouper) +``` +but requires that all variables being grouped by (`x` and `y` above) are present in Dataset `ds`. This does not seem like a bad requirement. +Importantly `Grouper` instances will be copied internally so that they can safely cache state that might be shared between `factorize` and `weights`. + +Today, it is possible to `ds.groupby(DataArray, ...)`. This syntax will still be supported. + +## Discussion + +This proposal builds on these discussions: +1. https://github.com/xarray-contrib/flox/issues/191#issuecomment-1328898836 +2. https://github.com/pydata/xarray/issues/6610 + +## Copyright + +This document has been placed in the public domain. + +## References and footnotes + +[^1]: Wickham, H. (2011). The split-apply-combine strategy for data analysis. https://vita.had.co.nz/papers/plyr.html +[^2]: Iverson, K.E. (1980). Notation as a tool of thought. Commun. ACM 23, 8 (Aug. 1980), 444–465. https://doi.org/10.1145/358896.358899 From b543047c43cf59c4f9fe0c4a11ab07b4caa2900a Mon Sep 17 00:00:00 2001 From: Deepak Cherian Date: Fri, 1 Dec 2023 22:12:28 -0700 Subject: [PATCH 2/4] [skip-ci] Update design_notes/grouper_objects.md --- design_notes/grouper_objects.md | 7 ------- 1 file changed, 7 deletions(-) diff --git a/design_notes/grouper_objects.md b/design_notes/grouper_objects.md index b0b3b825107..0d42eb41054 100644 --- a/design_notes/grouper_objects.md +++ b/design_notes/grouper_objects.md @@ -65,13 +65,6 @@ Grouper objects will close the gap. ## Usage and impact - - - - - - - Grouper objects 1. Will abstract useful factorization algorithms, and From 39c5d25af009033e3b2d7073bc6d1180ce47e63c Mon Sep 17 00:00:00 2001 From: Deepak Cherian Date: Sat, 2 Dec 2023 21:22:06 -0700 Subject: [PATCH 3/4] [skip-ci] Minor updates --- design_notes/grouper_objects.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/design_notes/grouper_objects.md b/design_notes/grouper_objects.md index 0d42eb41054..5adbff8a721 100644 --- a/design_notes/grouper_objects.md +++ b/design_notes/grouper_objects.md @@ -101,15 +101,15 @@ import abc class Grouper(abc.ABC): @abc.abstractmethod - def factorize(self, group): + def factorize(self, by: DataArray): raise NotImplementedError class CustomGrouper(Grouper): - def factorize(self, group): + def factorize(self, by: DataArray): ... return codes, group_indices, unique_coord, full_index - def weights(self, group): + def weights(self, by: DataArray) -> DataArray: ... return weights ``` @@ -142,7 +142,10 @@ For example, the boolean weights for `group=np.array(['a', 'b', 'c', 'a', 'a'])` [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]] ``` -This is the boolean "summarization matrix" referred to in the classic Iverson (1980, Section 4.3)[^2]. +This is the boolean "summarization matrix" referred to in the classic Iverson (1980, Section 4.3)[^2] and "nub sieve" in [various APLs](https://aplwiki.com/wiki/Nub_Sieve). + +> [!NOTE] +> We can always construct `weights` automatically using `group_indices` from `factorize`, so this is not a required method. For a rolling resampling, windowed weights are possible ``` From b3944512c250504a6f1b3c6ea6be88238e119d54 Mon Sep 17 00:00:00 2001 From: Deepak Cherian Date: Tue, 5 Dec 2023 15:41:49 -0700 Subject: [PATCH 4/4] [skip-ci] Update design_notes/grouper_objects.md Co-authored-by: Mathias Hauser --- design_notes/grouper_objects.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/design_notes/grouper_objects.md b/design_notes/grouper_objects.md index 5adbff8a721..af42ef2f493 100644 --- a/design_notes/grouper_objects.md +++ b/design_notes/grouper_objects.md @@ -59,7 +59,7 @@ Consider these two examples: 2. [Overlapping groups](https://flox.readthedocs.io/en/latest/user-stories/overlaps.html) 3. [Rolling resampling](https://github.com/pydata/xarray/discussions/8361) -Anecdotally, less sophisticated users commonly resort to the for-loopy implementation illustrated by the pseudocode above when the analysis at hand is not easily expressed using the API presented by Xarray's GroupBy object. +Anecdotally, less experienced users commonly resort to the for-loopy implementation illustrated by the pseudocode above when the analysis at hand is not easily expressed using the API presented by Xarray's GroupBy object. Xarray's GroupBy API today abstracts away the split, apply, and combine stages but not the "factorize" stage. Grouper objects will close the gap.