Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save groupby codes after factorizing, pass to flox #7206

Merged
merged 30 commits into from
Mar 29, 2023

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Oct 24, 2022

This is an alternative to #6689.

  • There I tried to avoid factorizing in the GroupBy constructor, and passed the by variable directly to flox. Most GroupBy methods however depend on various steps in __init__, so it became messy.
  • Here I instead preserve factorizing in the constructor and pass the factorized codes to flox. This simplifies things a lot. Since we'll want to preserve the "for loop over groups" approach for GroupBy.map, we'll need something like this anyway.

The large amount of deleted code in _flox_reduce here suggests to me that this is the better approach.

I think we could also use this to generalize to multiple groupers:

  • factorize each,
  • ravel_muti_index to generate a singe variable to groupby
  • apply
  • reshape to output shape
  • assign new indices.

@@ -388,63 +483,20 @@ def __init__(
"dimension"
)

full_index = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just refactored with no functionality changes except to return codes

@dcherian dcherian force-pushed the groupby-save-codes-new branch from 1891fdb to ac521d4 Compare November 1, 2022 02:31
@dcherian dcherian force-pushed the groupby-save-codes-new branch from ac521d4 to b64df5b Compare November 2, 2022 02:48
@dcherian dcherian marked this pull request as draft December 2, 2022 02:17
* upstream/main: (39 commits)
  Support the new compression argument in netCDF4 > 1.6.0 (pydata#6981)
  Remove setuptools-scm-git-archive, require setuptools-scm>=7 (pydata#7253)
  Fix mypy failures (pydata#7343)
  Docs: add example of writing and reading groups to netcdf (pydata#7338)
  Reset file pointer to 0 when reading file stream (pydata#7304)
  Enable mypy warn unused ignores (pydata#7335)
  Optimize some copying (pydata#7209)
  Add parse_dims func (pydata#7051)
  Fix coordinate attr handling in `xr.where(..., keep_attrs=True)` (pydata#7229)
  Remove code used to support h5py<2.10.0 (pydata#7334)
  [pre-commit.ci] pre-commit autoupdate (pydata#7330)
  Fix PR number in what’s new (pydata#7331)
  Enable `origin` and `offset` arguments in `resample` (pydata#7284)
  fix doctests: supress urllib3 warning (pydata#7326)
  fix flake8 config (pydata#7321)
  implement Zarr v3 spec support (pydata#6475)
  Fix polyval overloads (pydata#7315)
  deprecate pynio backend (pydata#7301)
  mypy - Remove some ignored packages and modules (pydata#7319)
  Switch to T_DataArray in .coords (pydata#7285)
  ...
@dcherian dcherian added the run-benchmark Run the ASV benchmark workflow label Dec 2, 2022
* main:
  absolufy-imports - No relative imports - PEP8 (pydata#7204)
  [skip-ci] whats-new for dev (pydata#7351)
  Whats-new: 2022.12.0 (pydata#7345)
  Fix assign_coords resetting all dimension coords to default index (pydata#7347)
@github-actions github-actions bot removed the run-benchmark Run the ASV benchmark workflow label Dec 8, 2022
@dcherian dcherian added the run-benchmark Run the ASV benchmark workflow label Feb 27, 2023
dcherian added 3 commits March 8, 2023 21:06
* main:
  Preserve `base` and `loffset` arguments in `resample` (pydata#7444)
  ignore the `pkg_resources` deprecation warning (pydata#7594)
  Update contains_cftime_datetimes to avoid loading entire variable array (pydata#7494)
  Support first, last with dask arrays (pydata#7562)
  update the docs environment (pydata#7442)
  Add xCDAT to list of Xarray related projects (pydata#7579)
  [pre-commit.ci] pre-commit autoupdate (pydata#7565)
  fix nczarr when libnetcdf>4.8.1 (pydata#7575)
  use numpys SupportsDtype (pydata#7521)
@dcherian dcherian changed the title Save groupby codes after factorizing Save groupby codes after factorizing, pass to flox Mar 9, 2023
@dcherian dcherian marked this pull request as ready for review March 9, 2023 21:20
@dcherian
Copy link
Contributor Author

I'd like to merge this soon. It's mostly a refactor moving code around, and a major improvement to the flox code path.

Let me know if it'd be easier to split it in to two PRs

@Illviljan
Copy link
Contributor

Hmm, did I mess something up? I believe I only changed typing related things.

_bins suffix seems to have disappeared:

     def test_groupby_bins_multidim(self):
        array = self.make_groupby_multidim_example_array()
        bins = [0, 15, 20]
        bin_coords = pd.cut(array["lat"].values.flat, bins).categories
        expected = DataArray([16, 40], dims="lat_bins", coords={"lat_bins": bin_coords})
        actual = array.groupby_bins("lat", bins).map(lambda x: x.sum())
>       assert_identical(expected, actual)
E       AssertionError: Left and right DataArray objects are not identical
E       Differing dimensions:
E           (lat_bins: 2) != (lat: 2)
E       Coordinates only on the left object:
E         * lat_bins  (lat_bins) object (0, 15] (15, 20]
E       Coordinates only on the right object:
E         * lat       (lat) object (0, 15] (15, 20]

Copy link
Contributor

@Illviljan Illviljan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @dcherian. This LGTM. :)

xarray/core/groupby.py Show resolved Hide resolved
xarray/core/groupby.py Show resolved Hide resolved
@dcherian dcherian added the plan to merge Final call for comments label Mar 25, 2023
@dcherian dcherian enabled auto-merge (squash) March 29, 2023 15:58
@dcherian dcherian merged commit 0ac5541 into pydata:main Mar 29, 2023
dcherian added a commit to kmsquire/xarray that referenced this pull request Mar 29, 2023
* upstream/main:
  Save groupby codes after factorizing, pass to flox (pydata#7206)
  [skip-ci] Add compute to groupby benchmarks (pydata#7690)
  Delete built-in cfgrib backend (pydata#7670)
  Added a pronunciation guide to the word Xarray in the README.MD fil… (pydata#7677)
  boundarynorm fix (pydata#7553)
  Fix lazy negative slice rewriting. (pydata#7586)
  [pre-commit.ci] pre-commit autoupdate (pydata#7687)
  Adjust sidebar font colors (pydata#7674)
  Bump pypa/gh-action-pypi-publish from 1.8.1 to 1.8.3 (pydata#7682)
  Raise PermissionError when insufficient permissions (pydata#7629)
@dcherian dcherian deleted the groupby-save-codes-new branch March 30, 2023 04:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments run-benchmark Run the ASV benchmark workflow topic-groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby_bins groups not correctly applied with built-in methods
3 participants