Performance problem when doing computation between two arrays with discontinuous indexes #3755

vincentchabot · 2020-02-06T08:44:40Z

MCVE Code Sample

import xarray as xr 
import numpy as np 

# Creating array
ds = xr.Dataset()
ds["longitude"] = np.arange(0.1,2000.1,1)
ds["latitude"] = range(3000)
ds["step"] = range(50)
ds["field"] = (("step","latitude","longitude"),np.random.randn(50,3000,2000))
ds.to_netcdf("big_array.nc")

# Create another array 
ds = xr.Dataset()
ds["longitude"] = np.arange(500.1,600.1,1) # Coordinate are a continuous subset of the first array
ds["latitude"] = np.arange(510,660)
ds["id"] = range(50)
ds["field"] = (("longitude","latitude","id"),np.random.randn(100,150,50))
ds.to_netcdf("slicing.nc")

# Create another array with "discontinuity" in longitude dimension 
ds = xr.Dataset()
ds["longitude"] = list(np.arange(500.1,598.1,1)) +[622.1,640.1]
ds["latitude"] = range(510,660)
ds["id"] = range(10)
ds["mask"] = (("longitude","latitude","id"),np.random.randn(100,150,10))
ds.to_netcdf("no_slicing.nc")

# Load the Three arrays 

da = xr.open_dataset("big_array.nc")
db = xr.open_dataset("slicing.nc").isel(id=0)
dc = xr.open_dataarray("no_slicing.nc").isel(id=0)

%timeit da*db
# 32.3 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit da*dc
#2.13 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def slicing_operation(dc,da):
    """
    Slicing when knowing that dc is a subpart of da
    """
    import operator
    min_lat = np.max([dc.latitude.values.min(),da.latitude.values.min()])
    max_lat = np.min([dc.latitude.values.max(),da.latitude.values.max()])
    index_lat_field = operator.and_(da.latitude >= min_lat,da.latitude <= max_lat)
    min_lon = np.max([dc.longitude.values.min(),da.longitude.values.min()])
    max_lon = np.min([dc.longitude.values.max(),da.longitude.values.max()])
    index_lon_field = operator.and_(da.longitude >= min_lon,da.longitude <= max_lon)
    
    # Extending dc such that it covers all longitude of da 
    # Creating an empty array. 
    dempty = xr.Dataset()
    dempty["longitude"] = da.longitude.isel(longitude=index_lon_field).values
    dempty["latitude"] = da.latitude.isel(latitude=index_lat_field).values
    dc_ext = dc.broadcast_like(dempty)
    # Performing multiplication and align 
    res,_ = xr.align((dc_ext *da),dc)
    return res

%timeit slicing_operation(dc,da)
#43.9 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#Checking if we get the same results
res_0 = dc*da
res = slicing_operation(dc,da)
print(abs(res_0 - res).max())

Problem Description

A performance problem occured when performing operation between da and dc.
Computing da *db takes ~32 ms
While db and dc have the same shape, opeartion dc * da takes ~2.13s (so x 60 in cpu times).
This seems linked to the fact tha dc includes discontinuous indexes of da.

Indeed, when extending dc (such that its longitude are continuous when compared at da) the computation time take ~44 ms.
For me it seems that when doing da * dc da dataset is fully loaded in memory first, while only the part involved in computation is loaded when doing da *db.
Indeed when doing :

%time da.load()

we get the following result:

CPU times: user 0 ns, sys: 1.88 s, total: 1.88 s

Wall time: 1.89 s

On top of that, da *db scales with the number of id selected while it is not the case for da *dc .

Output of `xr.show_versions()`

INSTALLED VERSIONS ------------------ commit: None

python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48)

[GCC 7.3.0]

python-bits: 64

OS: Linux

OS-release: 3.10.0-957.27.2.el7.x86_64

machine: x86_64

processor: x86_64

byteorder: little

LC_ALL: None

LANG: en_US.UTF-8

LOCALE: en_US.UTF-8

libhdf5: 1.10.5

libnetcdf: 4.7.3

xarray: 0.14.1

pandas: 0.25.3

numpy: 1.17.4

scipy: 1.4.1

netCDF4: 1.5.3

pydap: None

h5netcdf: None

h5py: 2.10.0

Nio: None

zarr: None

cftime: 1.0.4.2

nc_time_axis: None

PseudoNetCDF: None

rasterio: None

cfgrib: 0.9.7.6

iris: None

bottleneck: 1.3.1

dask: 2.9.1

distributed: 2.9.1

matplotlib: 3.1.2

cartopy: None

seaborn: 0.9.0

numbagg: None

setuptools: 44.0.0.post20200106

pip: 19.3.1

conda: None

pytest: 5.3.2

IPython: 7.11.1

sphinx: 2.3.1

The text was updated successfully, but these errors were encountered:

dcherian added the topic-performance label Dec 3, 2020

dcherian mentioned this issue Dec 3, 2020

Comprehensive benchmarking suite #4648

Open

19 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance problem when doing computation between two arrays with discontinuous indexes #3755

Performance problem when doing computation between two arrays with discontinuous indexes #3755

vincentchabot commented Feb 6, 2020

Performance problem when doing computation between two arrays with discontinuous indexes #3755

Performance problem when doing computation between two arrays with discontinuous indexes #3755

Comments

vincentchabot commented Feb 6, 2020

MCVE Code Sample

Problem Description

Output of xr.show_versions()

Output of `xr.show_versions()`