Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Argmin indexes #1469

Closed
wants to merge 11 commits into from
Closed

Argmin indexes #1469

wants to merge 11 commits into from

Conversation

fujiisoup
Copy link
Member

With this PR, ValueError raises if argmin() is called by a multi-dimensional array.

argmin_indexes() method is also added for xr.DataArray.
Current API design for argmin_indexes() returns
the argmin-indexes as an OrderedDict of DataArrays.

Example:

In [1]: import xarray as xr
   ...: da = xr.DataArray([[1, 2], [-1, 40], [5, 6]],
   ...:                [('x', ['c', 'b', 'a']), ('y', [1, 0])])
   ...: 
   ...: da.argmin_indexes()
   ...: 
Out[1]: 
OrderedDict([('x', <xarray.DataArray 'x' ()>
              array(1)), ('y', <xarray.DataArray 'y' ()>
              array(0))])

In [2]: da.argmin_indexes(dims='y')
Out[2]: 
OrderedDict([('y', <xarray.DataArray 'y' (x: 3)>
              array([0, 0, 0])
              Coordinates:
                * x        (x) <U1 'c' 'b' 'a')])

(Because the returned object is an OrderedDict, it is not beautifully printed. The returned type can be a xr.Dataset if we want.)

Although in #1388 argmin_indexes() was originally suggested so that we can pass the result into isel_point,

da.isel_points(**da.argmin_indexes())

current implementation of isel_points does NOT work for this case.

This is mainly because

  1. isel_points currently does not work for 0-dimensional or multi-dimensional input.
  2. Even for 1-dimensional input (the second one in the above examples), we should also pass x as an indexer rather than the coordinate of indexer.

For 1, I have prepared modification of isel_points to accept multi-dimensional arrays, but I guess it should be in another PR after the API decision.
(It is related in #475, and #974.)

For 2, we should either

  • change API of argmin_indexes to return not only the indicated dimension but also all the dimensions, like
In [2]: da.argmin_indexes(dims='y')
Out[2]: 
OrderedDict([('y', array([0, 0, 0]),
              'x', array(['c' 'b' 'a']))

or

  • change API of isel_point so that it takes care of the indexer's coordinate if xr.DataArray is passed for as indexers.

I originally worked with the second option for the modification of isel_points,
the second option breaks the backward-comaptibility and is somehow magical.

Another alternertive is to

  • change API of argmin_indexes to return xr.Dataset rather than an OrderedDict, and also change API of isel_points to accept xr.Dataset.
    It keeps backward-compatibility.

Any comments are welcome.

@shoyer
Copy link
Member

shoyer commented Jul 3, 2017

A few quick thoughts on API design:

  • The most similar pandas method is called idxmin. We may not want to use the exact same name here, but it's something to keep in mind.
  • We might want two separate methods, one like this that returns an OrderedDict/Dataset and another that returns just one DataArray (for use when reducing over only one axis). I might pick idxmin and indexes_min.
  • A keep_dims=True argument like numpy is a nice way to preserve dimensions if desired.
  • I'm a little surprised that it doesn't work to unpack a Dataset with ** in isel_points -- in theory I think it should.

@fujiisoup
Copy link
Member Author

fujiisoup commented Jul 4, 2017

@shoyer
Thanks for the comments.

  • APIs
    Sounds reasonable suggestions.
    I renamed argmin_indexs -> indexes_min.
    Also I added idxmin that works as similar to pandas's idxmin.
    The APIs are
def indexes_min(self, dims=None, skipna=True):

and

def idxmin(self, dim=None, skipna=True, keep_dims=False):

(I found keep_dims for indexes_min brings another confusion and omit from them).

  • isel_points's issue
    Currently, indexes_min works as
In [1]: import xarray as xr
   ...: da = xr.DataArray([[1, 2], [51, 40], [5, 6]],
   ...:                   [('x', ['c', 'b', 'a']), ('y', [1.4, 4.3])])
   ...: da
   ...: 
Out[1]: 
<xarray.DataArray (x: 3, y: 2)>
array([[ 1,  2],
       [51, 40],
       [ 5,  6]])
Coordinates:
  * x        (x) <U1 'c' 'b' 'a'
  * y        (y) float64 1.4 4.3

In [2]: da.indexes_min(dims='y')
Out[2]: 
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) <U1 'c' 'b' 'a'
Data variables:
    y        (x) float64 0 1 0

However, arguments of isel_points should be like

<xarray.Dataset>
Dimensions:  (points: 3)
Data variables:
    x        (points) int64 0 1 2
    y        (points) int64 0 1 0

This behavior is more intuitive?

@shoyer
Copy link
Member

shoyer commented Jul 5, 2017

OK, I think I finally understand the nuance of the return value -- thanks for describing that fully for me.

In theory (after #974 is implemented), the current return value from indexes_min should work for indexing, e.g.,

>>> indexes = da.indexes_min(dims='y')
>>> indexes
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) <U1 'c' 'b' 'a'
Data variables:
    y        (x) int64 0 1 0

>>> da.sel(x=indexes.x, y=indexes.y)  # or ds.sel(**indexes)
<xarray.DataArray (x: 3)>
array([ 1, 40, 5])
Coordinates:
  * x        (x) <U1 'c' 'b' 'a'

So maybe that is the right choice, though I'm not entirely certain yet.

Side note: I'm still not super happy with the names idxmin and indexes_min. They look too different for methods that are only a small variation on each other. Maybe idxmin_dataset or idxmin_dict?

@fujiisoup
Copy link
Member Author

@shoyer
Sorry. My example was still not good (I modified the previous example slightly).

I think my current implementation does not work even for (the future version of) .sel, because the resultant xr.Dataset has the index for 'y' and labels for 'x'.

As you suggested, it looks more xarray-like if it returns coordinates for both 'x' and 'y',

<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) <U1 'c' 'b' 'a'
Data variables:
    y        (x) float64 1.4 4.3 1.4

so that it can be passed to .sel.

This API looks much more natural to me than the current one, although it has to wait for #974.
I will update this PR.

For the function name.
Based on your suggestion, how about labels_min or argmin_labels?
I think function names of idxmin and (the current) indexes_min can be too different
because idxmin should return 'index' while indexes_min returns label.

@johnomotani
Copy link
Contributor

Any plans to finish/merge this PR? indexes_max and indexes_min would be very nice to have! (See also #3160.) Although I guess the idxmin/idxmax are superseded by #3871?

@fujiisoup
Copy link
Member Author

Hi @johnomotani .
Probably I have no time to finish this up and this is already too old.
It would be nice if someone can update this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

argmin / argmax behavior doesn't match documentation
4 participants