
Dataset.where performance regression #7516

Open
4 tasks done
Thomas-Z opened this issue Feb 8, 2023 · 10 comments
Comments

@Thomas-Z
Contributor

Thomas-Z commented Feb 8, 2023

What happened?

Hello,

I'm using the Dataset.where function to select data based on some field values, and it takes way too much time!
The dask dashboard seems to show some tasks repeating themselves many times.

The provided example uses a 1D array, for which the selection could be done with Dataset.sel, but in our real use case we make selections on 2D variables.

This problem seems to have appeared with the 2022.6.0 xarray release; 2022.3.0 works as expected.

What did you expect to happen?

Using the 2022.3.0 release, this selection takes 1.37 seconds.
Using any release from 2022.6.0 up to 2023.2.0 (the one from yesterday), this selection takes 8.47 seconds.

This example is a very simple and small one; with real data and our real use case we simply cannot use this function anymore.

Minimal Complete Verifiable Example

import dask.array as da
import distributed as dist
import xarray as xr


client = dist.Client()

# Using small chunks emphasizes the problem
ds = xr.Dataset(
    {"field": xr.DataArray(data=da.empty(shape=10000, chunks=10), dims=("x",))}
)
sel = ds["field"] > 0

ds.where(sel, drop=True)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

Problematic version

INSTALLED VERSIONS

commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
mypy: None
IPython: 8.7.0
sphinx: 5.3.0

Working version

INSTALLED VERSIONS

commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
IPython: 8.7.0
sphinx: 5.3.0

@Thomas-Z Thomas-Z added the bug and needs triage labels Feb 8, 2023
@headtr1ck
Collaborator

Can confirm, on my machine it went from 520ms to 5s

@headtr1ck
Collaborator

Git bisect pinpoints this to #6690, which, funnily enough, is my PR haha.
I will look into it when I find time :)

@headtr1ck
Collaborator

I am a bit puzzled here...
The dask graph looks identical, so it must be the way the indexers are constructed.

The major difference I can find is:
The old version used np.unique, while the new version uses xarray's cond.any(...).

Maybe someone with more experience in dask can help out?
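The two approaches mentioned above can be contrasted on a tiny synthetic example (illustrative only; neither line is xarray's verbatim internals):

```python
# Illustrative contrast: np.unique over loaded numpy values vs. an
# xarray cond.any(...) reduction. Both recover the same row indexer.
import numpy as np
import xarray as xr

cond = xr.DataArray(
    np.array([[False, True], [True, False]]), dims=("y", "x")
)

# Old-style: operate on the loaded numpy values directly.
rows_old = np.unique(np.nonzero(cond.values)[0])

# New-style: an xarray reduction over the other dimension.
rows_new = np.nonzero(cond.any(dim="x").values)[0]

assert np.array_equal(rows_old, rows_new)
```

With a dask-backed cond, the second form stays lazy, which is where the repeated evaluation discussed below can creep in.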

@dcherian dcherian removed the needs triage label Feb 28, 2023
@dcherian
Contributor

The old code had:

nonzeros = zip(clipcond.dims, np.nonzero(clipcond.values))

This loaded the array once and then passed numpy values to the indexing code.
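That old path can be sketched in isolation (hypothetical variable names mirroring the snippet above, not xarray's full internals):

```python
# The condition is loaded once, then np.nonzero builds integer
# indexers per dimension from the eager boolean array.
import numpy as np

cond_values = np.array([False, True, True, False, True])  # like clipcond.values
dims = ("x",)

nonzeros = dict(zip(dims, np.nonzero(cond_values)))
print(nonzeros["x"])  # [1 2 4]
```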

Now, the dask array is passed to the indexing code and is computed many times. #5873 raises an error saying boolean indexing with dask arrays is not allowed.

Here, just do ds.where(sel.compute(), drop=True). It's identical to what was happening earlier.

I think we should close this.
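The suggested workaround can be sketched end-to-end on synthetic data (a minimal sketch, not the issue's actual dataset):

```python
# Compute the boolean mask once, eagerly, before handing it to
# where(..., drop=True), so dask is not re-evaluated per chunk.
import dask.array as da
import xarray as xr

ds = xr.Dataset(
    {"field": xr.DataArray(data=da.arange(100, chunks=10), dims=("x",))}
)
sel = ds["field"] > 49  # lazy, dask-backed boolean mask

result = ds.where(sel.compute(), drop=True)  # mask evaluated a single time
print(result.sizes["x"])  # 50
```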

@Thomas-Z
Contributor Author

Thomas-Z commented Feb 28, 2023

Just tried it, and it does not seem identical at all to what was happening earlier.

This is the kind of dataset I'm working with:
[screenshot of the dataset]

With this selection:
sel = (dsx["longitude"] > 0) & (dsx["longitude"] < 100)

The old xarray takes a little less than 1 minute and less than 6 GB of memory.
The new xarray with compute did not finish and had to be stopped before it consumed my 16 GB of memory.

@dcherian
Contributor

Does sel.compute() not finish?

@Thomas-Z
Contributor Author

Thomas-Z commented Mar 1, 2023

sel = (dsx["longitude"] > 0) & (dsx["longitude"] < 100)
sel.compute()

This "compute" finishes and takes more than 80 seconds on both versions, with huge memory consumption (it loads the 4 coordinates and the result itself).

I know xarray has to keep more information regarding coordinates and dimensions, but doing this (with just the dask arrays):

sel2 = (dsx["longitude"].data > 0) & (dsx["longitude"].data < 100)
sel2.compute()

Takes less than 6 seconds.
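The comparison above can be illustrated on synthetic data (a sketch, not the issue's dataset): the DataArray expression carries coordinates through the graph, while the raw .data expression does not, yet both compute the same mask.

```python
import dask.array as da
import numpy as np
import xarray as xr

longitude = xr.DataArray(
    da.from_array(np.linspace(-180.0, 180.0, 1000), chunks=100),
    dims=("points",),
    coords={"points": np.arange(1000)},
)

sel_xr = (longitude > 0) & (longitude < 100)            # DataArray, with coords
sel_np = (longitude.data > 0) & (longitude.data < 100)  # plain dask array

# Same boolean result; only the coordinate bookkeeping differs.
assert np.array_equal(sel_xr.compute().values, sel_np.compute())
```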

@dcherian
Contributor

dcherian commented Mar 1, 2023

Yeah, that was another change I guess. We could extract the variable using .variable:

.where(sel2.variable.compute(), drop=True)

Do your "_nadir" variables have smaller chunk sizes, or are they slower to read for some reason?

@Thomas-Z
Contributor Author

Thomas-Z commented Mar 2, 2023

The .variable computation is fast, but it cannot be used directly as you suggest:

dsx.where(sel.variable, drop=True)

TypeError: cond argument is <xarray.Variable (num_lines: 5761870, num_pixels: 71)>
... but must be a <class 'xarray.core.dataset.Dataset'> 
or <class 'xarray.core.dataarray.DataArray'>

Doing it like this seems to work correctly (and is fast enough):

dsx["x"] = sel.variable.compute()
dsx.where(dsx["x"], drop=True)

The _nadir variables have the same chunks and are way faster to read than the other ones (they are a lot smaller).
[screenshot attachment]
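The assignment workaround above can be written as a self-contained sketch (synthetic data; "x" as the temporary mask name, as in the comment):

```python
# Compute the mask's Variable once, attach it to the Dataset, then
# use it as the cond for where(..., drop=True).
import dask.array as da
import xarray as xr

dsx = xr.Dataset(
    {"longitude": xr.DataArray(da.arange(200, chunks=20), dims=("num_lines",))}
)
sel = (dsx["longitude"] > 0) & (dsx["longitude"] < 100)

dsx["x"] = sel.variable.compute()        # eager boolean Variable
result = dsx.where(dsx["x"], drop=True)  # fast: cond is already in memory
print(result.sizes["num_lines"])  # 99 (positions 1..99 remain)
```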

@Thomas-Z
Contributor Author

Thomas-Z commented May 3, 2023

Hello,

I'm not sure the performance problems were fully addressed (we're now forced to fully compute/load the selection expression), but the changes made in the last versions make this issue irrelevant, and I think we can close it.

Thank you!
