32- vs 64-bit coordinates in where() #6573

forman · 2022-05-05T06:57:36Z

What happened?

I'm unsure whether this is a bug or not. At the very least, I ran into some very unexpected behaviour.

For two given data arrays a and b with same dimensions and equal coordinates, c for c = a.where(b) should have equal dimensions and coordinates.

However, if the coordinates of a have dtype float32 and those of b have dtype float64, then the dimension sizes of c will always be two. The coordinates of a and b are then no longer exactly equal, of course, but from a user's perspective they represent the same labels.

The behaviour is likely caused by the fact that the indexes generated for the coordinates are no longer strictly equal, so where() picks only the two outer cells of each dimension. Allowing indexes to be passed explicitly may help here, see #6392.

What did you expect to happen?

In the case described above, the dimensions and coordinates of c should be equal to a (and b).

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float32), dims='x')
c64 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float64), dims='x')

c3 = c32.where(c64 > 0.5)
assert len(c32) == len(c3)

v32 = xr.DataArray(np.random.random(10), dims='x', coords=dict(x=c32))
v64 = xr.DataArray(np.random.random(10), dims='x', coords=dict(x=c64))

v3 = v32.where(v64 > 0.5)
assert len(v32) == len(v3)
# --> Assertion error, Expected :10, Actual :2

MVCE confirmation

Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:17:03) [MSC v.1929 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64 Family 25 Model 80 Stepping 0, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('de_DE', 'cp1252')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.21.6
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.11.3
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2022.04.1
distributed: 2022.4.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.3.0
cupy: None
pint: None
sparse: None
setuptools: 62.1.0
pip: 22.0.4
conda: None
pytest: 7.1.2
IPython: 8.2.0
sphinx: None

max-sixty · 2022-05-05T07:08:46Z

This does seem very odd. Does anyone have any ideas? As per @forman, changing

-c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float32), dims='x')
+c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float64), dims='x')

...causes the assertion to pass.

I'm not sure using floats as indexes is great, but I wouldn't have expected the results to be like this...

benbovy · 2022-05-05T20:28:38Z

The behaviour is likely caused by the fact that the indexes generated for the coordinates are no longer strictly equal, therefore where() picks only the two outer cells of each dimension.

Yes that's right:

v32.indexes["x"].intersection(v64.indexes["x"])
# Float64Index([0.0, 1.0], dtype='float64', name='x')
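The same effect can be reproduced with NumPy alone, which may make the cause easier to see. This is only a minimal sketch; the variable names are illustrative and not part of xarray.

```python
import numpy as np

# The same labels as in the report, stored at two precisions.
labels64 = np.linspace(0, 1, 10, dtype=np.float64)
labels32 = np.linspace(0, 1, 10, dtype=np.float32)

# Promoting float32 to float64 does not restore the digits lost on storage,
# so only values that are exactly representable in float32 still match.
exact_match = labels32.astype(np.float64) == labels64
common = np.intersect1d(labels32, labels64)

print(exact_match)
print(common)
```

Only the two end points, 0.0 and 1.0, survive the exact comparison, which is consistent with the two-element intersection shown above.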

I think the issue is more general than where() and relates to the alignment of Xarray objects with 32- vs 64-bit indexed coordinates:

xr.align(c32, c64)
# (<xarray.DataArray (x: 10)>
#  array([0.        , 0.11111111, 0.22222222, 0.33333334, 0.44444445,
#         0.5555556 , 0.6666667 , 0.7777778 , 0.8888889 , 1.        ],
#        dtype=float32)
#  Dimensions without coordinates: x,
#  <xarray.DataArray (x: 10)>
#  array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
#         0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ])
#  Dimensions without coordinates: x)

xr.align(v32.x, v64.x)
# (<xarray.DataArray 'x' (x: 2)>
#  array([0., 1.], dtype=float32)
#  Coordinates:
#    * x        (x) float64 0.0 1.0,
#  <xarray.DataArray 'x' (x: 2)>
#  array([0., 1.])
#  Coordinates:
#    * x        (x) float64 0.0 1.0)

A possible solution would be to handle this special case internally by converting one of the indexes according to the dtype of the coordinate labels of the other index, similarly to what we are currently doing for the labels that are passed to .sel() (#3153). This should be pretty easy to implement in PandasIndex.join() I think.

However, I'm also wondering whether or not we should consider this a bug. It would make sense to have a behavior that is consistent with .sel(), even though it is neither a free nor a transparent operation (implicit creation of a temporary pandas index). But how about .equals()? I.e.,
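A rough sketch of that idea, using plain NumPy arrays in place of indexes. The helper name `intersect_after_cast` is hypothetical and this is not how PandasIndex.join() is implemented; it only illustrates that casting to the lower-precision dtype before intersecting recovers all labels, assuming the float32 labels were derived from the float64 ones.

```python
import numpy as np


def intersect_after_cast(left, right):
    """Hypothetical helper: cast `right` to the dtype of `left`, then intersect."""
    left = np.asarray(left)
    right_cast = np.asarray(right).astype(left.dtype)
    return np.intersect1d(left, right_cast)


labels64 = np.linspace(0, 1, 10)
labels32 = labels64.astype(np.float32)  # what a float32 coordinate would store

naive = np.intersect1d(labels32, labels64)          # promotes to float64 first
casted = intersect_after_cast(labels32, labels64)   # compares in float32

print(len(naive), len(casted))
```

The direction of the cast matters in this sketch: casting the float32 labels up to float64 would not help, because the rounding has already happened.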

v32.x.equals(v64.x)
# False  -- Should we return True here?

This would be quite weird and wouldn't match the Xarray, Pandas and Numpy behavior below:

v32.indexes["x"].equals(v64.indexes["x"])
# False

c64.equals(c32)
# False

np.all(c32.values == c64.values)
# False

max-sixty · 2022-05-05T21:47:59Z

It could be coherent to have:

v32.x.equals(v64.x) be false — the indexes themselves aren't the same
the join allow some float imprecision (similar to method=nearest), which would conveniently allow cases like this to work
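To make the second point concrete, here is a minimal NumPy sketch of tolerance-based label matching. The helper `match_with_tolerance` and the default tolerance of 1e-6 are assumptions chosen for illustration; nothing like this exists in xarray under that name.

```python
import numpy as np


def match_with_tolerance(left, right, tolerance=1e-6):
    """Hypothetical helper: for each label in `left`, return the position of
    the nearest label in `right`, or -1 if none lies within `tolerance`."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    # Pairwise absolute distances, shape (len(left), len(right)).
    distance = np.abs(left[:, None] - right[None, :])
    nearest = distance.argmin(axis=1)
    within = distance[np.arange(left.size), nearest] <= tolerance
    return np.where(within, nearest, -1)


labels64 = np.linspace(0, 1, 10)
labels32 = labels64.astype(np.float32)

matches = match_with_tolerance(labels32, labels64)
print(matches)
```

With labels in [0, 1], the float32 rounding error is far below 1e-6, so every label finds its counterpart, while a label such as 0.5 that lies between grid points is reported as unmatched.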

I could also imagine raising an error here and having the user coerce the type. That seems less surprising than the current situation. Other languages don't allow floats to be compared for equality at all...
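For reference, a user-side coercion could look like the sketch below. It assumes both arrays are meant to share the same labels, so the float64 labels are simply reused; the variable names are illustrative.

```python
import numpy as np
import xarray as xr

x32 = np.linspace(0, 1, 10, dtype=np.float32)
x64 = np.linspace(0, 1, 10, dtype=np.float64)
v32 = xr.DataArray(np.random.random(10), dims="x", coords={"x": x32})
v64 = xr.DataArray(np.random.random(10), dims="x", coords={"x": x64})

# Reuse the float64 labels. Merely casting x32 to float64 would not help,
# because the rounding applied when storing the labels as float32 remains.
v32_relabelled = v32.assign_coords(x=v64["x"].values)

result = v32_relabelled.where(v64 > 0.5)
print(result.sizes)
```

After relabelling, both operands carry identical indexes, so where() no longer drops any cells.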

dcherian · 2022-05-05T21:57:58Z

Maybe we should add an explicit join kwarg, so the safe thing to specify is join="exact"
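where() has no such argument in the version discussed here, but xr.align() already accepts join="exact", which gives a feel for the proposed behaviour. A small sketch, assuming the 32- and 64-bit indexes still compare as unequal in the installed xarray:

```python
import numpy as np
import xarray as xr

v32 = xr.DataArray(
    np.zeros(10), dims="x", coords={"x": np.linspace(0, 1, 10, dtype=np.float32)}
)
v64 = xr.DataArray(
    np.zeros(10), dims="x", coords={"x": np.linspace(0, 1, 10, dtype=np.float64)}
)

# join="exact" refuses to align objects whose indexes are not equal,
# instead of silently falling back to an intersection.
try:
    xr.align(v32, v64, join="exact")
    raised = False
except ValueError as err:
    raised = True
    print(err)
```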

forman · 2022-05-16T10:50:06Z

the join allow some float imprecision (similar to method=nearest), which would conveniently allow cases like this to work

I like that.

benbovy · 2022-09-28T08:17:09Z

I also like the idea of alignment with some tolerance. There is an open PR #4489, which needs to be reworked in the context of the explicit index refactor.
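Until alignment itself supports a tolerance, something similar can be approximated by reindexing one object onto the other's labels first. The sketch below uses DataArray.reindex_like() with method="nearest"; the tolerance of 1e-6 is an arbitrary value chosen for this example.

```python
import numpy as np
import xarray as xr

x32 = np.linspace(0, 1, 10, dtype=np.float32)
x64 = np.linspace(0, 1, 10, dtype=np.float64)
v32 = xr.DataArray(np.arange(10.0), dims="x", coords={"x": x32})
v64 = xr.DataArray(np.linspace(0, 1, 10), dims="x", coords={"x": x64})

# Move v32 onto the labels of v64, accepting matches within the tolerance.
v32_on_64 = v32.reindex_like(v64, method="nearest", tolerance=1e-6)

result = v32_on_64.where(v64 > 0.5)
print(result.sizes)
```

Labels without a neighbour inside the tolerance would be filled with NaN, so the tolerance needs to be chosen with the coordinate spacing in mind.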

As an alternative to a new kwarg, we could add an index build option, e.g., ds.set_xindex("x", index_cls=PandasIndex, align_tolerance=1e-6), but then it is not obvious how to handle different tolerance values given for the indexes to compare. Maybe this could depend on the given join method? E.g., pick the smallest tolerance for join=inner, the largest for join=outer, the tolerance of the left index for join=left, etc.

forman added the bug and needs triage labels May 5, 2022

forman mentioned this issue May 5, 2022: Enhance xcube affine_transform_dataset() xcube-dev/xcube#679 (merged)

dcherian removed the needs triage label May 17, 2022

benbovy added the topic-indexing and enhancement labels and removed the bug label Sep 28, 2022

tom-andersson mentioned this issue Jul 17, 2023: Fix rounding errors in DeepSensorModel.predict coordinates from normalise-unnormalise operations alan-turing-institute/deepsensor#25 (merged)

github-project-automation bot added this to Explicit Indexes Aug 27, 2024

github-project-automation bot moved this to Would enable this in Explicit Indexes Aug 27, 2024
