Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Grouper objects internally #7561

Merged
merged 39 commits into from
May 4, 2023
Merged

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Feb 27, 2023

Builds on the refactoring in #7206

@dcherian dcherian added the run-benchmark Run the ASV benchmark workflow label Feb 27, 2023
@dcherian dcherian force-pushed the grouper-objects branch 3 times, most recently from 1147411 to 78f6bda Compare March 9, 2023 21:29
@dcherian
Copy link
Contributor Author

@Illviljan could use some typing help here if you have the time :)

Copy link
Contributor

@Illviljan Illviljan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into it more later.

xarray/core/groupby.py Outdated Show resolved Hide resolved
xarray/core/groupby.py Outdated Show resolved Hide resolved
xarray/core/groupby.py Outdated Show resolved Hide resolved
xarray/core/groupby.py Outdated Show resolved Hide resolved
xarray/core/groupby.py Show resolved Hide resolved
@Illviljan
Copy link
Contributor

I'm not sure about the rest of the errors, @dcherian. Maybe IndexVariable needs to use the DataWithCoords mixin?

xarray/core/groupby.py:577: error: Value of type variable "DataAlignable" of "align" cannot be "Union[DataArray, IndexVariable]"  [type-var]
xarray/core/groupby.py:577: error: Value of type variable "DataAlignable" of "align" cannot be "Union[Dataset, DataArray, IndexVariable]"  [type-var]
xarray/tests/test_groupby.py:55: error: List item 1 has incompatible type "int"; expected "slice"  [list-item]

def align(
*objects: DataAlignable,
join: JoinOptions = "inner",
copy: bool = True,
indexes=None,
exclude=frozenset(),
fill_value=dtypes.NA,
) -> tuple[DataAlignable, ...]:

DataAlignable = TypeVar("DataAlignable", bound=DataWithCoords)

class DataWithCoords(AttrAccessMixin):
"""Shared base class for Dataset and DataArray."""

@dcherian
Copy link
Contributor Author

dcherian commented Apr 5, 2023

Variables don't have coordinates so that won't work.

mypy is correct here, it's a bug and we don't test for grouping by index variables. A commit reverting to the old len check would be great here, if you have the time.

It's not clear to me why we allow this actually. Seems like .groupby("DIMENSION") solves that use-case.

dcherian and others added 4 commits April 18, 2023 16:04
* main: (34 commits)
  Update whats-new.rst
  Fix binning by unsorted array (pydata#7762)
  Bump codecov/codecov-action from 3.1.1 to 3.1.2 (pydata#7760)
  Fix typing errors using mypy 1.2 (pydata#7752)
  [skip-ci] dev whats-new
  Add whats-new for v2023.04.0 (pydata#7757)
  remove the `black` hook (pydata#7756)
  reword the what's new entry for the `pandas` 2.0 dtype changes (pydata#7755)
  restructure the contributing guide (pydata#7681)
  Continue to use nanosecond-precision Timestamps in precision-sensitive areas (pydata#7731)
  minor doc updates to clarify extensions using accessors (pydata#7751)
  align: Avoid reindexing when join="exact" (pydata#7736)
  `pandas=2.0` support (pydata#7724)
  Clarify vectorized indexing documentation (pydata#7747)
  Avoid recasting a CFTimeIndex (pydata#7735)
  fix typo (pydata#7746)
  [pre-commit.ci] pre-commit autoupdate (pydata#7745)
  Bump pypa/gh-action-pypi-publish from 1.8.4 to 1.8.5 (pydata#7743)
  preserve boolean dtype in encoding (pydata#7720)
  [skip-ci] Add alignment benchmarks (pydata#7738)
  ...
* main:
  Bump codecov/codecov-action from 3.1.2 to 3.1.3 (pydata#7781)
  Fix whats-new
  [skip-ci] dev whats-new (pydata#7775)
  [skip-ci] Release 2023.04.2 (pydata#7774)
  Fix groupby_bins when labels are specified (pydata#7769)
  Docstrings examples for string methods (pydata#7669)
  Add dev whats-new
  Add benchmark against latest release on main. (pydata#7753)
@dcherian dcherian changed the title Introduce Grouper objects Introduce Grouper objects internally Apr 26, 2023
@dcherian
Copy link
Contributor Author

I'd like to merge this soon. It's an internal refactor with no public API changes.

I think we can expose the Grouper objects publicly in a new PR

Copy link
Contributor

@Illviljan Illviljan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some TODOs I've been thinking about, no stoppers.

xarray/core/groupby.py Outdated Show resolved Hide resolved
return len(self)

def __len__(self) -> int:
return len(self.full_index) # TODO: full_index not def, abstractmethod?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will crash if .factorize hasn't been triggered before.

Copy link
Contributor Author

@dcherian dcherian Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but it wouldn't make sense without it, and this class is internal-only. Well it would make sense if the user told us the labels or bin edges, but again its internal-only so eh...


@dataclass
class BinGrouper(Grouper):
bins: Any # TODO: What is the typing?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't have the time to figure out the typing on this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment it should be int | Sequence I think, so either number of bins, or actual bin edges in some Iterable where the order matters.

Copy link
Collaborator

@headtr1ck headtr1ck Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never really like to use Sequence because np.ndarrays are not Sequences... And I think it could be quite common to supply the bins with numpy arrays?

Copy link
Contributor Author

@dcherian dcherian Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes int or sequences or arrays

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.typing.ArrayLike?

Copy link
Contributor Author

@dcherian dcherian Apr 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but "nested sequences" won't work here:https://numpy.org/doc/stable/reference/typing.html#numpy.typing.ArrayLike, but maybe that's a tiny detail

@dcherian dcherian added the plan to merge Final call for comments label Apr 28, 2023
@dcherian dcherian merged commit fde773e into pydata:main May 4, 2023
@dcherian dcherian deleted the grouper-objects branch May 4, 2023 02:35
dcherian added a commit to dcherian/xarray that referenced this pull request May 6, 2023
* main:
  Introduce Grouper objects internally (pydata#7561)
  [skip-ci] Add cftime groupby, resample benchmarks (pydata#7795)
  Fix groupby binary ops when grouped array is subset relative to other (pydata#7798)
  adjust the deprecation policy for python (pydata#7793)
  [pre-commit.ci] pre-commit autoupdate (pydata#7803)
  Allow the label run-upstream to run upstream CI (pydata#7787)
  Update asv links in contributing guide (pydata#7801)
  Implement DataArray.to_dask_dataframe() (pydata#7635)
  `ds.to_dict` with data as arrays, not lists (pydata#7739)
  Add lshift and rshift operators (pydata#7741)
  Use canonical name for set_horizonalalignment over alias set_ha (pydata#7786)
  Remove pandas<2 pin (pydata#7785)
  [pre-commit.ci] pre-commit autoupdate (pydata#7783)
else:
newgroup = group

if newgroup.size == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With xarray 2023.5.0 I seem to now get "UnboundLocalError: local variable 'newgroup' referenced before assignment" when using groupby with a IndexVariable object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you open a new issue please?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #7919

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments run-benchmark Run the ASV benchmark workflow topic-groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants