Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement idxmax and idxmin functions #3871

Merged
merged 3 commits into from
Mar 29, 2020
Merged

Conversation

toddrjen
Copy link
Contributor

@toddrjen toddrjen commented Mar 20, 2020

This implements idxmax and idxmin functions similar to thier pandas equivalents.

This is my first time contributing to the project so I am not certain the structure or approach is the best. Please let me know if there is a better way to implement this.

This also includes two other changes.

First, it drops some code for backwards-compatibility with numpy 1.12, which isn't supported. This code was hiding an error I needed to have access to in order to get the function working.

Second, it adds an option to Dataset.map to let you map DataArray methods by name. I used this to implement the Dataset versions of idxmax and idxmin.

  • Closes Implement DataArray.idxmax() #60
  • Tests added
  • Passes isort -rc . && black . && mypy . && flake8
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

@max-sixty
Copy link
Collaborator

This looks extremely good and thorough! Thanks @toddrjen !

I'll have a proper look through later. I see a couple of minor questions I'll add in too. Others feel free to get ahead of me!

Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @max-sixty: this looks really good.

I will also have a proper look later, but for now I do think you don't have to change map to accept method names: we can pass callables that do the lookup for us or use unbound methods (something like cls.method instead of obj.method). I'd prefer unbound methods.

xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a couple of nits. Overall this does look very good, and thanks for adding lots of tests including those outside these functions.

To what extent should this support non-index coordinates?

Any other thoughts?

xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
@toddrjen
Copy link
Contributor Author

@keewis @max-sixty The map thing is purely a convenience function. I know there are other ways to do it, but since I thought this would be a useful feature for users in its own right, I did it that way. But of course I can do it another way if you disagree.

The one complication is that using DataArray.idxmax and DataArray.idxmin assumes that the Dataset would only ever contain DataArray objects. That may be mostly the case now, but I didn't want to bake that into the code. I could do it using a lambda or nested function, but as I said I thought this approach had other benefits to users.

I will address the other comments inline.

@toddrjen
Copy link
Contributor Author

@keewis @max-sixty The new commit with the requested changes has been pushed to this branch (except for the map one, pending ongoing discussion). Please take a look.

@toddrjen
Copy link
Contributor Author

@max-sixty

To what extent should this support non-index coordinates?

I am not familiar with non-index coordinates, what are those?

Do you mean non-dimension coordinates? Does that even make sense in a general way? If they are 1D and tied to just one dimension coordinate that could be done, but if they are not tied to any dimension or tied to multiple dimensions or otherwise not 1D I am not sure what it would mean to take the idxmin/idxmax of them.

xarray/core/dataarray.py Outdated Show resolved Hide resolved
@max-sixty
Copy link
Collaborator

The one complication is that using DataArray.idxmax and DataArray.idxmin assumes that the Dataset would only ever contain DataArray objects

I hear you and share the impulse that baking this in seems not ideal. Though I think it's a reasonable compromise to make, and there are no plans to deviate from it.

(Ideally maybe we have a ._contained_class attribute on the dataset which is almost always DataArray)

I think having a lambda is fine too.

@max-sixty
Copy link
Collaborator

Do you mean non-dimension coordinates?

Yes, thanks for clarifying

Does that even make sense in a general way? If they are 1D and tied to just one dimension coordinate that could be done, but if they are not tied to any dimension or tied to multiple dimensions or otherwise not 1D I am not sure what it would mean to take the idxmin/idxmax of them.

Yes, it wouldn't work in all cases, fair point. There are some cases in which it would work though, I'm unsure if it would be too complicated an interface to return them depending on whether it would work.

(and completely fine to contemplate these later)

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
@dcherian dcherian mentioned this pull request Mar 21, 2020
13 tasks
@toddrjen
Copy link
Contributor Author

I fixed the extra space in the docstring and moved the business logic to computation.py.

@max-sixty
Copy link
Collaborator

LGTM! Any other thoughts before we merge?

xarray/core/computation.py Outdated Show resolved Hide resolved
xarray/core/computation.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/computation.py Show resolved Hide resolved
xarray/core/computation.py Outdated Show resolved Hide resolved
@shoyer
Copy link
Member

shoyer commented Mar 22, 2020

In general this is really nicely put together. My main asks:

  1. Remove the unrelated API change to map
  2. Think about if the alternative of returning an arbitrary value rather than promotion or raising an error.

For details see the comments above. If we want to support (2), then it might make sense to use a string for selecting values of promote rather than True/False/None (e.g., so we can include the option to return an arbitrary coordinate value).

xarray/core/computation.py Outdated Show resolved Hide resolved
@toddrjen
Copy link
Contributor Author

toddrjen commented Mar 25, 2020 via email

@toddrjen toddrjen force-pushed the idxmax branch 2 times, most recently from 37173ec to 272a690 Compare March 26, 2020 04:17
@toddrjen
Copy link
Contributor Author

Please see the newest version with the promote argument changed to fill_value.

@toddrjen
Copy link
Contributor Author

I am not sure why the tests are suddenly failing. The tests were all working, then I rebased on the latest master and they are failing and I can't figure out why.

@toddrjen
Copy link
Contributor Author

I figured out what is going wrong. I will make a commit with a fix and include it in this pull request later today.

xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
@toddrjen toddrjen force-pushed the idxmax branch 2 times, most recently from 51e4926 to b7b8f6b Compare March 27, 2020 03:04
@toddrjen
Copy link
Contributor Author

I think I have implemented all the requested changes and all tests are passing. Please take a look.

xarray/core/duck_array_ops.py Show resolved Hide resolved
xarray/core/dataarray.py Show resolved Hide resolved
xarray/core/computation.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/computation.py Show resolved Hide resolved
@toddrjen
Copy link
Contributor Author

Here is a new commit with the discussed changes.

@max-sixty
Copy link
Collaborator

Thank you again @toddrjen, both for the content and the iteration.

@shoyer any final thoughts? Otherwise I'll merge tomorrow?

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some minor doc suggestions.

doc/whats-new.rst Outdated Show resolved Hide resolved
doc/whats-new.rst Outdated Show resolved Hide resolved
doc/whats-new.rst Outdated Show resolved Hide resolved
doc/whats-new.rst Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
@max-sixty
Copy link
Collaborator

Amazing — let's merge on green!

@toddrjen
Copy link
Contributor Author

I have gone over it one more time and made a few documentation fixes. Please take one more look before merging.

@max-sixty max-sixty merged commit 1416d5a into pydata:master Mar 29, 2020
@dcherian
Copy link
Contributor

Thanks @toddrjen that's a great contribution!

@toddrjen toddrjen deleted the idxmax branch March 29, 2020 02:00
@max-sixty
Copy link
Collaborator

@toddrjen thank you again! This ended up being quite an adventure, really appreciate you pushing all the way.

dcherian added a commit to MeraX/xarray that referenced this pull request Mar 29, 2020
* upstream/master: (75 commits)
  Implement idxmax and idxmin functions (pydata#3871)
  Update pre-commit-config.yaml (pydata#3911)
  Revert "Use `fixes` in PR template (pydata#3886)" (pydata#3912)
  update the docstring of diff (pydata#3909)
  Un-xfail test_dayofyear_after_cftime_range (pydata#3907)
  Limit repr of arrays containing long strings (pydata#3900)
  expose a few zarr backend functions as semi-public api (pydata#3897)
  Use drawstyle instead of linestyle in plot.step. (pydata#3274)
  Implementation of polyfit and polyval (pydata#3733)
  misplaced quote in whatsnew (pydata#3889)
  Rename ordered_dict_intersection -> compat_dict_intersection (pydata#3887)
  Control attrs of result in `merge()`, `concat()`, `combine_by_coords()` and `combine_nested()` (pydata#3877)
  xfail test_uamiv_format_write (pydata#3885)
  Use `fixes` in PR template (pydata#3886)
  Tweaks to "how_to_release" (pydata#3882)
  whatsnew section for 0.16.0
  Release v0.15.1
  whatsnew for 0.15.1 (pydata#3879)
  update panel documentation (pydata#3880)
  reword the whats-new entry for unit support (pydata#3878)
  ...

# Handle dask arrays.
if isinstance(array, dask_array_type):
res = dask_array.map_blocks(coordarray, indx, dtype=indx.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toddrjen @dcherian

Sorry, I might be wrong, but it seems, that the func is missing as argument to map_blocks. I tried with lambda a, b: a[b] which seems to work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more things:

  • isinstance-check need to use array.data.
  • res need to be computed, otherwise subsequent actions with res will fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re right, this is definitely broken. Anyone up for putting together a fix in a follow up PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement DataArray.idxmax()
6 participants