Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataArray.loc fails for duplicates where DataFrame works #2399

Closed
horta opened this issue Sep 5, 2018 · 11 comments
Closed

DataArray.loc fails for duplicates where DataFrame works #2399

horta opened this issue Sep 5, 2018 · 11 comments
Labels

Comments

@horta
Copy link
Contributor

horta commented Sep 5, 2018

import pandas as pd
import xarray as xr

df = pd.DataFrame(data=[0, 1], index=list("aa"))
da = xr.DataArray(df)
print(df.loc[list("a")])  # works
print(da.loc[{"dim_0": list("a")}])  # fails

Xarray fails with the exception

Traceback (most recent call last):
  File "session4.py", line 7, in <module>
    print(da.loc[{"dim_0": list("a")}])  # fails
  File "/usr/local/anaconda3/lib/python3.6/site-packages/xarray/core/dataarray.py", line 104, in __getitem__
    return self.data_array.sel(**key)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/xarray/core/dataarray.py", line 784, in sel
    indexers=indexers, drop=drop, method=method, tolerance=tolerance)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/xarray/core/dataset.py", line 1509, in sel
    self, indexers=indexers, method=method, tolerance=tolerance)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/xarray/core/coordinates.py", line 355, in remap_label_indexers
    obj, v_indexers, method=method, tolerance=tolerance
  File "/usr/local/anaconda3/lib/python3.6/site-packages/xarray/core/indexing.py", line 250, in remap_label_indexers
    dim, method, tolerance)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/xarray/core/indexing.py", line 186, in convert_label_indexer
    indexer = get_indexer_nd(index, label, method, tolerance)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/xarray/core/indexing.py", line 117, in get_indexer_nd
    flat_indexer = index.get_indexer(flat_labels, **kwargs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3244, in get_indexer
    raise InvalidIndexError('Reindexing only valid with uniquely'
pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.19.0
distributed: 1.23.0
matplotlib: 2.2.3
cartopy: None
seaborn: 0.9.0
setuptools: 40.2.0
pip: 18.0
conda: 4.5.11
pytest: 3.7.3
IPython: 6.5.0
sphinx: 1.7.7
@shoyer
Copy link
Member

shoyer commented Sep 5, 2018

Thanks for the report!

This was actually a somewhat intentional omission in xarray, but if would not be particularly difficult to add in this feature if we want it. At the very least, we should note this deviation somewhere in the docs.

There are two potentially problematic aspects to the pandas behavior:

  1. It means that you cannot count on indexing a dataframe with its own index to return something equivalent to the original dataframe, e.g., consider df.loc[['a', 'a']] in your example, which returns a dataframe with 4 rows.
  2. More generally, it means you can't count on indexing a dataframe with an array to return an object of the same size as the indexer. This is particularly problematic for xarray, because we support vectorized indexing with multi-dimensional indexers. I don't know how we could define a multi-dimensional equivalent of this -- what shape should the result have if you indexed with a multi-dimensional array instead, e.g., da.loc[{"dim_0": xr.DataArray([['a']]}]? With multiple dimensions involved, it's not clear where the extra introduced dimensions should go.

Now that you bring this up, I wonder how the existing supporting for indexing like da.loc[{"dim_0": "a"}] would work if there are other multi-dimensional indexers. I don't know if we have test coverage for this...

@horta
Copy link
Contributor Author

horta commented Sep 6, 2018

Thanks for the feedback!

  1. You can count on indexing if the is_unique flag is checked beforehand. The way pandas does indexing seems to be both clear to the user and powerful. It seems clear because indexing is the result of a Cartesian product after filtering for matching values. It is powerful because it allows indexing as complex as SQL INNER JOIN, which covers the trivial case of unique elements. For example, the following operation
import pandas as pd

df = pd.DataFrame(data=[0, 1, 2], index=list("aab"))
print(df.loc[list("ab")])
#    0
# a  0
# a  1
# b  2

is an INNER JOIN between the two indexes

INNER((a, b) x (a, a, b)) = INNER(aa, aa, ab, ba, ba, bb)
                          = (aa, aa, bb)

Another example:

import pandas as pd

df = pd.DataFrame(data=[0, 1], index=list("aa"))
print(df.loc[list("aa")])
#    0
# a  0
# a  1
# a  0
# a  1

is again an INNER JOIN between the two indexes

INNER((a, a) x (a, a)) = INNER(aa, aa, aa, aa)
                       = (aa, aa, aa, aa)
  1. Assume a bidimensional array with the following indexing:
  0 1
a ! @
a # $

This translate into an unidimensional index: (a, 0), (a, 1), (a, 0), (a, 1). As such, it can be treated as usual. Assume you index the above matrix using [('a', 0), ('a', 0)]. This implies

INNER( ((a, 0), (a, 0)) x ((a, 0), (a, 1), (a, 0), (a, 1)) ) = INNER( (a,0)(a,0),
    (a,0)(a,1), (a,0)(a,0), (a,0)(a,1), (a,0)(a,0), (a,0)(a,1),
    (a,0)(a,0), (a,0)(a,1) )
  = ((a,0)(a,0), (a,0)(a,0), (a,0)(a,0), (a,0)(a,0))

Converting it back to the matricial representation:

  0 0
a ! !
a # #

In summary, my suggestion is to consider the possibility of defining indexing B by using A (i.e., B.loc(A)) as a Cartesian product followed by match filtering. Or in SQL terms, an INNER JOIN.

The multi-dimensional indexing, as far as I can see, can always be transformed into the uni-dimensional case and treated as such.

@shoyer
Copy link
Member

shoyer commented Sep 6, 2018

Let me give a more concrete example of the issue for multi-dimensional indexing:

da_unique = xr.DataArray([0, 1], dims=['x'], coords={'x': ['a', 'b']})
da_nonunique = xr.DataArray([0, 1], dims=['x'], coords={'x': ['a', 'a']})
indexer = xr.DataArray([['a']], dims=['y', 'z'])

With a unique index, notice how the result takes on the dimensions of the indexer:

>>> da_unique.loc[indexer]
<xarray.DataArray (y: 1, z: 1)>
array([[0]])
Coordinates:
    x        (y, z) object 'a'
Dimensions without coordinates: y, z

What would you propose for the result of da_nonunique.loc[indexer]?

@horta
Copy link
Contributor Author

horta commented Sep 7, 2018

Now I see the problem. But I think it is solvable.

I will ignore the dimension names for now as I don't have
much experience with xarray yet.

The code

da_nonunique = xr.DataArray([0, 1], dims=['x'], coords={'x': ['a', 'a']}
indexer = xr.DataArray([['a']], dims=['y', 'z'])

can be understood as defining two indexed arrays:

[a, a] and [[a]]. As we are allowing for non-unique indexing,
I will denote unique array elements as [e_0, e_1] and [[r_0]]
interchangeably.

Algorithm:

  1. Align. [[a], [a]] and [[a]].
  2. Ravel. [(a,a), (a,a)] and [(a,a)].
  3. Join. [(a,a), (a,a)]. I.e., [e_0, e_1].
  4. Unravel. [[e_0, e_1]]. Notice that [e_0, e_1] has been
    picked up by r_0.
  5. Reshape. [[e_0, e_1]] (solution).

Concretely, the solution is a bi-dimensional, 1x2 array:

| 0 1 |.

There is another relevant example. Let the code be

da_nonunique = xr.DataArray([0, 1, 2], dims=['x'], coords={'x': ['a', 'a', 'b']}
indexer = xr.DataArray([['a', 'b']], dims=['y', 'z'])

We have [a, a, b] and [[a, b]], also denoted as [e_0, e_1, e_2]
and [[r_0, r_1]].

Algorithm:

  1. Align. [[a], [a], [b]] and [[a, b]].
  2. Ravel. [(a,a), (a,a), (b,b)] and [(a,a), (b,b)].
  3. Join. [(a,a), (a,a), (b,b)]. I.e., [e_0, e_1, e_2].
  4. Unravel. [[e_0, e_1, e_2]]. Notice now that [e_0, e_1] has been
    picked up by r_0 and [e_2] by r_1.
  5. Reshape. [[e_0, e_1, e_2]].

The solution is a bi-dimensional, 1x3 array:

| 0 1 2 |

Explanation

  1. Align recursively adds a new dimension in the array with lower dimensionality.
  2. Ravel recursively removes a dimension by converting elements into tuples.
  3. SQL Join operation: Cartesian product plus match.
  4. Unravel performs the inverse of 2.
  5. Reshape converts it to the indexer's dimensionality.

@shoyer
Copy link
Member

shoyer commented Sep 7, 2018

Please take a look at xarray's detailed indexing rules: http://xarray.pydata.org/en/stable/indexing.html#indexing-rules

I will ignore the dimension names for now as I don't have
much experience with xarray yet.

I think this is the crux of the problem. Put another way: why should the result of indexing be a 1x2 array instead of a 2x1 array? Currently (with the exception of indexing by a scalar with an index with duplicates), xarray determines the shape/dimensions resulting from indexing from the shape/dimensions of the indexers not the array being indexed.

@horta
Copy link
Contributor Author

horta commented Sep 9, 2018

I see. Now I read about it, let me give another shot.

Let i be

<xarray.DataArray (y: 1, z: 1)>
array([['a']], dtype='<U1')
Dimensions without coordinates: y, z

and d be

<xarray.DataArray (x: 2)>
array([0, 1])
Coordinates:
  * x        (x) <U1 'a' 'a'

The result of d.loc[i] is equal to d.sel(x=i). Also, it seems reasonable to expect the its result should be the same as d0.sel(x=i) for d0 given by

<xarray.DataArray (x: 2, dim_1: 1)>
array([[0],
       [1]])
Coordinates:
  * x        (x) <U1 'a' 'a'
Dimensions without coordinates: dim_1

as per column vector representation assumption.

Answer

Laying down the first dimension gives

y z x
a a a
a

By order, x will match with y and therefore we will append a new dimension after x to
match with z:

y z x dim_1
a a a ?
a ?

where ? means any. Joining the first and second halves of the table gives

y z x dim_1
a a a ?
a a a ?

And here is my suggestions. Use the mapping y|->x and z|->dim_1 to decide which
axis to expand for the additional element. I will choose y-axis because the additional a was
originally appended to the x-axis.

The answer is

<xarray.DataArray (y: 2, z: 1)>
array([[0],
       [1]])
Coordinates:
    x        (y, z) <U1 'a' 'a'
Dimensions without coordinates: y, z

for

>>> ans.coords["x"]
<xarray.DataArray 'x' (y: 2, z: 1)>
array([['a'],
       ['a']], dtype='<U1')
Coordinates:
    x        (y, z) <U1 'a' 'a'
Dimensions without coordinates: y, z

@horta
Copy link
Contributor Author

horta commented Sep 11, 2018

Hi again. I'm working on a precise definition of xarray and indexing. I find the official one a bit hard to understand. It might help me come up with a reasonable way to handle duplicate indices. https://drive.google.com/file/d/1uJ_U6nedkNe916SMViuVKlkGwPX-mGK7/view?usp=sharing

@shoyer
Copy link
Member

shoyer commented Sep 11, 2018

CC @fujiisoup who implemented much of this. I will also take a look at your doc when I have the chance.

I do think that handling duplicate matches with indexing is an important use-case. This comes up with nearest neighbor matching as well -- it would be useful to be able to return the full set of matches within a given distance, not just the nearest match.

I wonder if it would be more productive to consider a new indexing API for one -> many matches. sel/loc is already quite complex.

@fujiisoup
Copy link
Member

Sorry that I couldn't join the discussion here.

Thanks, @horta, for giving the nice document.
We tried to use the consistent terminology in the docs, but I agree that it would be nice to have a list of the definitions.
I think it might be better to discuss in another issue. See #2410.

For loc and sel issues.
One thing I don't agree is

The result of d.loc[i] is equal to d.sel(x=i). Also, it seems reasonable to expect the its result should be the same as d0.sel(x=i) for d0 given by

As xarray inherits not only from pandas but also from numpy's multi-dimensional array.
That is, we need to be very consistent with the resultant shape of indexing.
It would be confusing if a selection from different dimensional arrays becomes the same.

I do think that handling duplicate matches with indexing is an important use-case. This comes up with nearest neighbor matching as well -- it would be useful to be able to return the full set of matches within a given distance, not just the nearest match.

I also think that what is lacking in xarray is this functionality.
Any interest to help us for this?

@horta
Copy link
Contributor Author

horta commented Sep 11, 2018

Yes, I'm working on that doc for now to come up a very precise and as simple as possible definitions.

@stale
Copy link

stale bot commented Aug 12, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Aug 12, 2020
@stale stale bot closed this as completed Sep 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants