Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupBy(multiple groupers) #9372

Merged
merged 12 commits into from
Aug 26, 2024
Merged

GroupBy(multiple groupers) #9372

merged 12 commits into from
Aug 26, 2024

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Aug 16, 2024

Builds on #9389


This PR takes an approach similar to that in #924.

The GroupBy(multiple array) problem can be reduced to the GroupBy(single nD variable) problem:

  1. First convert each by array to integer codes ("factorize"). I do this first because factorization is expensive, and so it's cheaper to do it before any possible broadcasting.
  2. Broadcast DataArrays of integer codes against each other.
  3. Then np.ravel_multi_index them to construct a single nD array of integer codes.
  4. At this point, we treat it as a GroupBy(single nD array) problem i.e. reshape down to 1D problem. This isn't great for dask/cubed but it is simple and matches what we do today. That means we can apply all the operations we do with GroupBy today.
    • Importantly UDFs receive "stacked arrays" as they do today when grouping by a nD array. This isn't the most elegant but it preserves dtype. For example, you can't index out a 2D array with group==1 without inserting a missing value if group is:
    [[1, 1, 1],
     [2, 1, 1],
     [2, 2, 1]]
    
  5. Once operations are complete, we assign a new MultiIndex and then unstack back to restore the needed dimensions.
    • I'm not sure this is the best at the moment - what if one of the arrays we group by is a multiIndexed array?

TODO:

  • lots more tests!

@dcherian
Copy link
Contributor Author

dcherian commented Aug 16, 2024

@max-sixty I could use some help with adding tests,docs,docstrings if you have time :)

xarray/core/groupby.py Outdated Show resolved Hide resolved
@dcherian dcherian force-pushed the multiple-groupers-2 branch 2 times, most recently from 8701e9b to aa52ce6 Compare August 17, 2024 03:03
dcherian added a commit to xarray-contrib/flox that referenced this pull request Aug 17, 2024
dcherian added a commit to xarray-contrib/flox that referenced this pull request Aug 17, 2024
xarray/groupers.py Outdated Show resolved Hide resolved
@@ -410,6 +382,58 @@ def _resolve_group(
return newgroup


@dataclass
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main new logic.

@dcherian dcherian added the run-benchmark Run the ASV benchmark workflow label Aug 17, 2024
xarray/core/groupby.py Outdated Show resolved Hide resolved
max-sixty added a commit to max-sixty/xarray that referenced this pull request Aug 17, 2024
While trying to help with pydata#9372, I realize the error message for this could be much better, and so putting this PR in as some penance for my tardiness in helping there
@max-sixty
Copy link
Collaborator

I had a look at where I add something — it already seems extremely good? gb - gb.mean() would be cool but this already seems like a big step forwards.

I missed previous discussions / can decide this later — but do we want to have da.groupby(['foo', 'bar']) as sugar for da.groupby(foo=UniqueGrouper(), bar=UniqueGrouper())?

dcherian pushed a commit that referenced this pull request Aug 17, 2024
* Improve error message on `ds['x', 'y']`

While trying to help with #9372, I realize the error message for this could be much better, and so putting this PR in as some penance for my tardiness in helping there

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@dcherian
Copy link
Contributor Author

Yes it is surprisingly complete.

A section in the narrative docs would be nice to add.

@dcherian
Copy link
Contributor Author

missed previous discussions / can decide this later — but do we want to have da.groupby(['foo', 'bar']) as sugar for da.groupby(foo=UniqueGrouper(), bar=UniqueGrouper())?

I agree. It's a bit too much typing at the moment

@dcherian dcherian force-pushed the multiple-groupers-2 branch 3 times, most recently from 414f95a to b28a3fa Compare August 20, 2024 16:12
@dcherian dcherian mentioned this pull request Aug 21, 2024
1 task
@dcherian dcherian force-pushed the multiple-groupers-2 branch from 2514e90 to da23871 Compare August 21, 2024 04:59
dcherian and others added 2 commits August 21, 2024 09:07
fix docs

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci
@dcherian dcherian force-pushed the multiple-groupers-2 branch from da23871 to 7ec2341 Compare August 21, 2024 15:08
@dcherian dcherian marked this pull request as ready for review August 21, 2024 16:27
@dcherian dcherian requested a review from max-sixty August 21, 2024 22:56
@dcherian
Copy link
Contributor Author

dcherian commented Aug 21, 2024

We should test grouping by a multi-index array and some other array but otherwise this should be good to go.

EDIT: Now raises NotImplementedError for this case.

doc/user-guide/groupby.rst Outdated Show resolved Hide resolved
Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is huge! Thank you very much @dcherian !

@dcherian
Copy link
Contributor Author

Merging so it's ready to try out!

@dcherian dcherian merged commit 19a0428 into pydata:main Aug 26, 2024
33 checks passed
@dcherian dcherian deleted the multiple-groupers-2 branch August 26, 2024 15:56
@max-sixty
Copy link
Collaborator

Huge success! Thank you @dcherian !

@shoyer
Copy link
Member

shoyer commented Aug 28, 2024

Wow, amazing work @dcherian! I am so excited to see this finally land.

hollymandel pushed a commit to hollymandel/xarray that referenced this pull request Aug 30, 2024
* GroupBy(multiple groupers)

* Add example to docs

fix docs

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix docs

* More docs

* fix doc

* fix doc again

* Fix bug.

* Add whats-new note

* edit

* Error on multi-variable groupby with MultiIndex

* Update doc/user-guide/groupby.rst

---------

Co-authored-by: Maximilian Roos <[email protected]>
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 30, 2024
* main:
  Accessibility: Add keyboard handling for XArray HTML view (pydata#9412)
  [pre-commit.ci] pre-commit autoupdate (pydata#9316)
  [skip-ci] Speed up docs build by limiting toctrees (pydata#9395)
  fix the failing `pre-commit.ci` runs (pydata#9411)
  Update benchmarks.yml (pydata#9406)
  GroupBy(multiple groupers) (pydata#9372)
  Encode/decode property tests use variables() (pydata#9401)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments run-benchmark Run the ASV benchmark workflow
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable multi-coord grouping from xarray groupby_bins along two dims simultaneously
3 participants