
Dataset.where performance regression #7516

Open
4 tasks done
Thomas-Z opened this issue Feb 8, 2023 · 10 comments
Comments

@Thomas-Z
Contributor

Thomas-Z commented Feb 8, 2023

What happened?

Hello,

I'm using the Dataset.where function to select data based on some field values, and it takes way too much time!
The dask dashboard seems to show some tasks repeating themselves many times.

The provided example uses a 1D array, for which the selection could be done with Dataset.sel, but in our real use case we make selections on 2D variables.

This problem seems to have appeared with the 2022.6.0 xarray release; 2022.3.0 works as expected.

What did you expect to happen?

Using the 2022.3.0 release, this selection takes 1.37 seconds.
Using any release from 2022.6.0 up to 2023.2.0 (the one from yesterday), this selection takes 8.47 seconds.

This example is a very simple and small one; with real data and our real use case we simply cannot use this function anymore.

Minimal Complete Verifiable Example

import dask.array as da
import distributed as dist
import xarray as xr


client = dist.Client()

# Using small chunks emphasizes the problem
ds = xr.Dataset(
    {"field": xr.DataArray(data=da.empty(shape=10000, chunks=10), dims=("x",))}
)
sel = ds["field"] > 0

ds.where(sel, drop=True)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

Problematic version

INSTALLED VERSIONS

commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
mypy: None
IPython: 8.7.0
sphinx: 5.3.0

Working version

INSTALLED VERSIONS

commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
IPython: 8.7.0
sphinx: 5.3.0

@Thomas-Z Thomas-Z added the bug and needs triage labels Feb 8, 2023
@headtr1ck
Collaborator

Can confirm, on my machine it went from 520ms to 5s

@headtr1ck
Collaborator

Git bisect pinpoints this to #6690, which, funnily enough, is my PR haha.
I will look into it when I find time :)

@headtr1ck
Collaborator

I am a bit puzzled here...
The dask graph looks identical, so it must be the way the indexers are constructed.

The major difference I can find is:
The old version used np.unique, while the new version uses xarray's cond.any(...).

Maybe someone with more experience in dask can help out?
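The two approaches mentioned above can be contrasted on a tiny synthetic example (illustrative only; neither line is xarray's verbatim internals):

```python
# Illustrative contrast: np.unique over loaded numpy values vs. an
# xarray cond.any(...) reduction. Both recover the same row indexer.
import numpy as np
import xarray as xr

cond = xr.DataArray(
    np.array([[False, True], [True, False]]), dims=("y", "x")
)

# Old-style: operate on the loaded numpy values directly.
rows_old = np.unique(np.nonzero(cond.values)[0])

# New-style: an xarray reduction over the other dimension.
rows_new = np.nonzero(cond.any(dim="x").values)[0]

assert np.array_equal(rows_old, rows_new)
```

With a dask-backed cond, the second form stays lazy, which is where the repeated evaluation discussed below can creep in.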

@dcherian dcherian removed the needs triage label Feb 28, 2023
@dcherian
Contributor

The old code had:

nonzeros = zip(clipcond.dims, np.nonzero(clipcond.values))

This loaded the array once and then passed numpy values to the indexing code.
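That old path can be sketched in isolation (hypothetical variable names mirroring the snippet above, not xarray's full internals):

```python
# The condition is loaded once, then np.nonzero builds integer
# indexers per dimension from the eager boolean array.
import numpy as np

cond_values = np.array([False, True, True, False, True])  # like clipcond.values
dims = ("x",)

nonzeros = dict(zip(dims, np.nonzero(cond_values)))
print(nonzeros["x"])  # [1 2 4]
```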

Now, the dask array is passed to the indexing code and is computed many times. #5873 raises an error saying boolean indexing with dask arrays is not allowed.

Here, just do ds.where(sel.compute(), drop=True). It's identical to what was happening earlier.

I think we should close this.
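The suggested workaround can be sketched end-to-end on synthetic data (a minimal sketch, not the issue's actual dataset):

```python
# Compute the boolean mask once, eagerly, before handing it to
# where(..., drop=True), so dask is not re-evaluated per chunk.
import dask.array as da
import xarray as xr

ds = xr.Dataset(
    {"field": xr.DataArray(data=da.arange(100, chunks=10), dims=("x",))}
)
sel = ds["field"] > 49  # lazy, dask-backed boolean mask

result = ds.where(sel.compute(), drop=True)  # mask evaluated a single time
print(result.sizes["x"])  # 50
```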

@Thomas-Z
Contributor Author

Thomas-Z commented Feb 28, 2023

Just tried it, and it does not seem identical at all to what was happening earlier.

This is the kind of dataset I'm working with:
[screenshot of the dataset]

With this selection:
sel = (dsx["longitude"] > 0) & (dsx["longitude"] < 100)

The old xarray takes a little less than 1 minute and less than 6 GB of memory.
The new xarray with compute did not finish and had to be stopped before it consumed my 16 GB of memory.

@dcherian
Contributor

Does sel.compute() not finish?

@Thomas-Z
Contributor Author

Thomas-Z commented Mar 1, 2023

sel = (dsx["longitude"] > 0) & (dsx["longitude"] < 100)
sel.compute()

This "compute" finishes and takes more than 80 seconds on both versions, with huge memory consumption (it loads the 4 coordinates and the result itself).

I know xarray has to keep more information regarding coordinates and dimensions, but doing this (with just the dask arrays):

sel2 = (dsx["longitude"].data > 0) & (dsx["longitude"].data < 100)
sel2.compute()

Takes less than 6 seconds.
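The comparison above can be illustrated on synthetic data (a sketch, not the issue's dataset): the DataArray expression carries coordinates through the graph, while the raw .data expression does not, yet both compute the same mask.

```python
import dask.array as da
import numpy as np
import xarray as xr

longitude = xr.DataArray(
    da.from_array(np.linspace(-180.0, 180.0, 1000), chunks=100),
    dims=("points",),
    coords={"points": np.arange(1000)},
)

sel_xr = (longitude > 0) & (longitude < 100)            # DataArray, with coords
sel_np = (longitude.data > 0) & (longitude.data < 100)  # plain dask array

# Same boolean result; only the coordinate bookkeeping differs.
assert np.array_equal(sel_xr.compute().values, sel_np.compute())
```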

@dcherian
Contributor

dcherian commented Mar 1, 2023

Yeah, that was another change I guess. We could extract the variable using .variable:

.where(sel2.variable.compute(), drop=True)

Do your "_nadir" variables have smaller chunk sizes, or are they slower to read for some reason?

@Thomas-Z
Contributor Author

Thomas-Z commented Mar 2, 2023

The .variable computation is fast, but it cannot be used directly as you suggest:

dsx.where(sel.variable, drop=True)

TypeError: cond argument is <xarray.Variable (num_lines: 5761870, num_pixels: 71)>
... but must be a <class 'xarray.core.dataset.Dataset'> 
or <class 'xarray.core.dataarray.DataArray'>

Doing it like this seems to work correctly (and is fast enough):

dsx["x"] = sel.variable.compute()
dsx.where(dsx["x"], drop=True)

The _nadir variables have the same chunks and are way faster to read than the other ones (they are a lot smaller).
[screenshot attachment]
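The assignment workaround above can be written as a self-contained sketch (synthetic data; "x" as the temporary mask name, as in the comment):

```python
# Compute the mask's Variable once, attach it to the Dataset, then
# use it as the cond for where(..., drop=True).
import dask.array as da
import xarray as xr

dsx = xr.Dataset(
    {"longitude": xr.DataArray(da.arange(200, chunks=20), dims=("num_lines",))}
)
sel = (dsx["longitude"] > 0) & (dsx["longitude"] < 100)

dsx["x"] = sel.variable.compute()        # eager boolean Variable
result = dsx.where(dsx["x"], drop=True)  # fast: cond is already in memory
print(result.sizes["num_lines"])  # 99 (positions 1..99 remain)
```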

@Thomas-Z
Contributor Author

Thomas-Z commented May 3, 2023

Hello,

I'm not sure the performance problems were fully addressed (we're now forced to fully compute/load the selection expression), but the changes made in the last versions make this issue irrelevant, and I think we can close it.

Thank you!
