ENH: Compute hash of xarray objects #4738
Comments
Interesting! Do pandas or dask have anything like this? |
Pandas has a built-in utility function, `pd.util.hash_pandas_object`:

```python
In [1]: import pandas as pd

In [3]: df = pd.DataFrame({'A': [4, 5, 6, 7], 'B': [10, 20, 30, 40], 'C': [100, 50, -30, -50]})

In [4]: df
Out[4]:
   A   B    C
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50

In [6]: row_hashes = pd.util.hash_pandas_object(df)

In [7]: row_hashes
Out[7]:
0    14190898035981950066
1    16858535338008670510
2     1055569624497948892
3     5944630256416341839
dtype: uint64
```

Combining the returned per-row hashes via `hashlib` gives an overall hash of the whole DataFrame:

```python
In [8]: import hashlib

In [10]: hashlib.sha1(row_hashes.values).hexdigest()  # Compute overall hash of all rows.
Out[10]: '1e1244d9b0489e1f479271f147025956d4994f67'
```

Regarding dask, I have no idea :) cc @TomAugspurger |
IIUC, something like https://github.com/dask/dask/blob/4a7a2438219c4ee493434042e50f4cdb67b6ec9f/dask/base.py#L778 is what you're looking for. Further down, we register tokenizers for various types like pandas DataFrames and ndarrays. |
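To make the registration mechanism concrete, here is a minimal sketch of the extension point the comment above refers to (`dask.base.normalize_token.register` is real dask API; the `Point` class is hypothetical, for illustration only):

```python
import dask.base

class Point:
    """Hypothetical user-defined type that dask can't tokenize deterministically."""

    def __init__(self, x, y):
        self.x, self.y = x, y

# Register a normalizer so tokenize() becomes deterministic for Point objects.
@dask.base.normalize_token.register(Point)
def normalize_point(p):
    # Return a deterministic, hashable structure that identifies the object.
    return ("Point", p.x, p.y)

assert dask.base.tokenize(Point(1, 2)) == dask.base.tokenize(Point(1, 2))
```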
I asked because this isn't an operation I've used directly on pandas objects in the past. I'm not opposed, but my suggestion would be to write a separate utility function, e.g., in |
@andersy005 if you can rely on dask always being present, `dask.base.tokenize` should do the job. |
👍🏽
Due to the simplicity of |
@dcherian, I just realized that `dask.base.tokenize` doesn't return a deterministic token for xarray objects:

```python
In [2]: import dask, xarray as xr

In [3]: ds = xr.tutorial.open_dataset('rasm')

In [4]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[4]: False

In [5]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[5]: False
```

The issue appears to be caused by the coordinates, which are used in the tokenization method defined in xarray/core/dataarray.py, lines 870 to 873 (at dbc02d4).
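For reference, the method at those lines is `DataArray.__dask_tokenize__`; paraphrased, it looks roughly like this (a sketch of the relevant logic, not the verbatim source):

```python
# Rough paraphrase of DataArray.__dask_tokenize__ around dbc02d4:
# the token is derived from the variable, the coords, and the name,
# so non-deterministic tokenization of _coords poisons the whole token.
def __dask_tokenize__(self):
    from dask.base import normalize_token

    return normalize_token((type(self), self._variable, self._coords, self._name))
```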
Tokenizing the underlying data alone is deterministic, while tokenizing the coords is not:

```python
In [8]: dask.base.tokenize(ds.Tair.data) == dask.base.tokenize(ds.Tair.data)
Out[8]: True

In [16]: dask.base.tokenize(ds.Tair._coords) == dask.base.tokenize(ds.Tair._coords)
Out[16]: False
```

Is this the expected behavior or am I missing something? |
I tried running the reproducer above and things seem to be working fine. I can't for the life of me understand why I got non-deterministic behavior four hours ago :(

```python
In [1]: import dask, xarray as xr

In [2]: ds = xr.tutorial.open_dataset('rasm')

In [3]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[3]: True

In [4]: dask.base.tokenize(ds.Tair._coords) == dask.base.tokenize(ds.Tair._coords)
Out[4]: True

In [5]: xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 20:33:18)
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 0.20.1
pandas: 1.3.4
numpy: 1.20.3
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.11.0
h5py: 3.6.0
Nio: None
zarr: 2.10.3
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: 3.5.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.18
sparse: None
setuptools: 59.4.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.30.0
sphinx: 4.3.1
```
|
Okay... I think the following comment is still valid:
It appears that whether tokenization is deterministic depends on whether the Dataset/DataArray contains dimension coordinates or only non-dimension coordinates:

```python
In [2]: ds = xr.tutorial.open_dataset('rasm')

In [39]: a = ds.isel(time=0)
In [40]: a
Out[40]:
<xarray.Dataset>
Dimensions: (y: 205, x: 275)
Coordinates:
time object 1980-09-16 12:00:00
xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
Tair (y, x) float64 ...
In [41]: dask.base.tokenize(a) == dask.base.tokenize(a)
Out[41]: True

In [42]: b = ds.isel(y=0)
In [43]: b
Out[43]:
<xarray.Dataset>
Dimensions: (time: 36, x: 275)
Coordinates:
* time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (x) float64 189.2 189.4 189.6 189.7 ... 293.5 293.8 294.0 294.3
yc (x) float64 16.53 16.78 17.02 17.27 ... 27.61 27.36 27.12 26.87
Dimensions without coordinates: x
Data variables:
Tair (time, x) float64 ...
In [44]: dask.base.tokenize(b) == dask.base.tokenize(b)
Out[44]: False
```

This looks like a bug, in my opinion... |
This runs with no issues at the moment:

```python
with dask.config.set({"tokenize.ensure-deterministic": True}):
    ds = xr.tutorial.open_dataset('rasm')
    b = ds.isel(y=0)
    assert dask.base.tokenize(b) == dask.base.tokenize(b)
```

With:
|
Are xarray objects robustly hashable now? |
Is your feature request related to a problem? Please describe.
I'm working on some caching/data-provenance functionality for xarray objects, and I realized that there's no standard/efficient way of computing hashes for xarray objects.
Describe the solution you'd like
It would be useful to have a configurable, reliable, standard `.hexdigest()` method on xarray objects. For example, zarr provides a `digest` method that returns a digest/hash of the data. I'm thinking that a built-in xarray hashing mechanism would provide a more reliable way to handle metadata such as global attributes, encoding, etc. during the hash computation.
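A minimal sketch of what such a utility could look like, built on `dask.base.tokenize` (the name `xr_digest` is hypothetical, not an existing xarray API, and this assumes dask is installed):

```python
import hashlib

import dask.base
import xarray as xr

def xr_digest(obj, algorithm: str = "sha1") -> str:
    """Hypothetical helper: hex digest of an xarray object.

    Uses dask.base.tokenize to reduce the object to a deterministic token,
    then runs that token through hashlib for a familiar-looking digest.
    """
    token = dask.base.tokenize(obj)
    return hashlib.new(algorithm, token.encode()).hexdigest()

ds = xr.tutorial.open_dataset("rasm")
assert xr_digest(ds) == xr_digest(ds)  # deterministic, modulo the coords issue above
```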
Describe alternatives you've considered
So far, I am using joblib's default hasher, the `joblib.hash()` function (a small usage sketch follows at the end of this issue). However, I am in favor of having a configurable, built-in hasher that is aware of xarray's data model and quirks :)

Additional context
Add any other context about the feature request here.
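For completeness, a small usage sketch of the joblib workaround mentioned above (`joblib.hash` is a real joblib utility that pickles the object and hashes the resulting bytes; md5 is its default algorithm):

```python
import joblib
import xarray as xr

ds = xr.tutorial.open_dataset("rasm")

print(joblib.hash(ds))                    # md5 hex digest (default)
print(joblib.hash(ds, hash_name="sha1"))  # sha1 variant
```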