Performance issues using map_blocks with cftime indexes. #5701
Comments
I think this could be fixed by adding … For IndexVariables we pass … (see xarray/xarray/core/variable.py, lines 2688 to 2692 in ea99445).
The overhead seems to be gone now; the map_blocks call builds in a few hundred milliseconds for me instead of 5 seconds.
I confirm the same! I don't know what was changed, but with xarray 2024.07.0 and 2024.08.0, the CFTime and Pandas cases yield the same performance.
Presumably dask is now better at tokenizing object arrays?
Yeah, we changed a bunch of things there; I haven't dug into which release actually made it faster, though.
What happened:
When using map_blocks on an object that is dask-backed and has a CFTimeIndex coordinate, the construction step (not the computation itself) is very slow. I've seen it up to 100x slower than for an equivalent object with a numpy datetime index.
What you expected to happen:
I would understand a performance difference since numpy/pandas objects are usually more optimized than cftime/xarray objects, but the difference is quite large here.
Minimal Complete Verifiable Example:
Here is an MCVE that I ran in a Jupyter notebook. Performance is measured by execution time (wall time). I have included the current workaround I use for my use case.
Anything else we need to know?:
After some exploration, I found two culprits for this slowness. I used %%prun to profile the construction phase of map_blocks and found that, in the second case (cftime time coordinate):
1. Calls to dask.base.tokenize take the most time. Precisely, tokenizing a numpy ndarray of object ("O") dtype goes through pickling the array. This is already quite slow, and cftime objects take even longer to pickle; see Unidata/cftime#253 ("Performance: speeding up pickling of cftime arrays") for the corresponding issue. Most of the construction-phase execution time is spent pickling the same datetime array at least once per chunk.
2. CFTimeIndex.__new__ is called more than twice as many times as there are chunks, and within the object creation there is this line: xarray/xarray/coding/cftimeindex.py, line 228 in 3956b73.
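The first culprit can be illustrated without xarray at all. The sketch below (not from the original issue) contrasts what tokenizing falls back to for an object-dtype array, namely pickling every element, with reading the raw buffer of an equivalent fixed-width datetime64 array; plain datetime.datetime objects stand in for cftime objects, which pickle even more slowly.

```python
import pickle
import time
from datetime import datetime, timedelta

import numpy as np

n = 100_000
base = datetime(2000, 1, 1)

# Object-dtype array of Python datetimes; real cftime objects are slower
# still to pickle (see Unidata/cftime#253).
obj_arr = np.array([base + timedelta(days=i) for i in range(n)], dtype=object)

# The same data as a contiguous, fixed-width datetime64 array.
dt64_arr = obj_arr.astype("datetime64[us]")

t0 = time.perf_counter()
pickle.dumps(obj_arr)       # roughly what tokenizing an 'O'-dtype array costs
t_obj = time.perf_counter() - t0

t0 = time.perf_counter()
dt64_arr.tobytes()          # hashing a raw buffer is nearly free by comparison
t_dt64 = time.perf_counter() - t0

print(f"object pickle: {t_obj * 1e3:.1f} ms, datetime64 buffer: {t_dt64 * 1e3:.3f} ms")
```

Paying the pickling cost once would already be bad; the profile shows it is paid at least once per chunk, for the same array every time.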
The larger the array, the more time is spent in this iteration. Changing the example above to use Nt = 50_000, the code spent a total of 25 s in dask.base.tokenize calls and 5 s in CFTimeIndex.__new__ calls.

My workaround is not the best, but it was easy to code without touching xarray. Encoding the time coordinate turns it into an integer array, which is very fast to tokenize. The construction phase speeds up because there is only one call to encode_cf_variable instead of N_chunks trips through the pickling machinery. As shown above, I have not seen a slowdown in the computation phase; I think this is mostly because the added decode_cf calls are done in parallel, but there might be other reasons I do not understand.

I do not know for sure how/why this tokenization works, but I guess the best improvement in xarray could be to have map_blocks spot cftime-backed coordinates and tokenize them more cheaply. I have no idea if that would work, but if it does, that would be the best speed-up I think.
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:39:48)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.2.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: ('en_CA', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.19.1.dev18+g4bb9d9c.d20210810
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.7.0
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.07.1
distributed: 2021.07.1
matplotlib: 3.4.2
cartopy: 0.19.0
seaborn: None
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.2.1
conda: None
pytest: None
IPython: 7.25.0
sphinx: None