
ENH: Compute hash of xarray objects #4738

Open

andersy005 opened this issue Dec 28, 2020 · 11 comments
@andersy005 (Member)

Is your feature request related to a problem? Please describe.

I'm working on some caching/data-provenance functionality for xarray objects, and I realized that there's no standard/efficient way of computing hashes for xarray objects.

Describe the solution you'd like

It would be useful to have a configurable, reliable/standard .hexdigest() method on xarray objects. For example, zarr provides a digest method that returns a digest/hash of the data:

In [16]: import zarr

In [17]: z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))

In [18]: z.hexdigest() # uses sha1 by default for speed
Out[18]: '7162d416d26a68063b66ed1f30e0a866e4abed60'

In [20]: z.hexdigest(hashname='sha256')
Out[20]: '46fc6e52fc1384e37cead747075f55201667dd539e4e72d0f372eb45abdcb2aa'

I'm thinking that a built-in xarray hashing mechanism would provide a more reliable way to treat metadata such as global attributes, encoding, etc. during the hash computation.

Describe alternatives you've considered

So far, I am using joblib's default hasher, the joblib.hash() function. However, I am in favor of having a configurable, built-in hasher that is aware of xarray's data model and quirks :)

In [1]: import joblib

In [2]: import xarray as xr

In [3]: ds = xr.tutorial.open_dataset('rasm')

In [5]: joblib.hash(ds, hash_name='sha1')
Out[5]: '3e5e3f56daf81e9e04a94a3dff9fdca9638c36cf'

In [8]: ds.attrs = {}

In [9]: joblib.hash(ds, hash_name='sha1')
Out[9]: 'daab25fe735657e76514040608fadc67067d90a0'


@shoyer (Member)

shoyer commented Dec 29, 2020

Interesting! Do pandas or dask have anything like this?

@andersy005 (Member, Author)

Pandas has a built-in utility function, pd.util.hash_pandas_object:

In [1]: import pandas as pd

In [3]: df = pd.DataFrame({'A': [4, 5, 6, 7], 'B': [10, 20, 30, 40], 'C': [100, 50, -30, -50]})

In [4]: df
Out[4]:
   A   B    C
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50

In [6]: row_hashes = pd.util.hash_pandas_object(df)

In [7]: row_hashes
Out[7]:
0    14190898035981950066
1    16858535338008670510
2     1055569624497948892
3     5944630256416341839
dtype: uint64

Combining the returned value of hash_pandas_object() with Python's hashlib gives something one can work with:

In [8]: import hashlib

In [10]: hashlib.sha1(row_hashes.values).hexdigest() # Compute overall hash of all rows.
Out[10]: '1e1244d9b0489e1f479271f147025956d4994f67'

Regarding dask, I have no idea :) cc @TomAugspurger

@TomAugspurger (Contributor)

IIUC, something like https://github.com/dask/dask/blob/4a7a2438219c4ee493434042e50f4cdb67b6ec9f/dask/base.py#L778 is what you're looking for. Further down we register tokenizers for various types like pandas DataFrames and ndarrays.
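The registration mechanism referred to above can be sketched as follows. This is a hedged illustration of dask's type-dispatch hook, not xarray code: `Point` is a made-up example type, and the normalizer simply returns plain deterministic components for `tokenize()` to hash.

```python
# Sketch of dask's tokenizer-registration mechanism. `Point` is a
# hypothetical type; `normalize_token.register` is dask's real
# single-dispatch hook in dask.base.
from dask.base import normalize_token, tokenize


class Point:
    def __init__(self, x, y, attrs=None):
        self.x = x
        self.y = y
        self.attrs = attrs or {}


@normalize_token.register(Point)
def normalize_point(p):
    # Reduce the object to plain, order-stable components; tokenize()
    # recursively normalizes the returned tuple and hashes the result.
    return ("Point", p.x, p.y, sorted(p.attrs.items()))


assert tokenize(Point(1, 2)) == tokenize(Point(1, 2))
assert tokenize(Point(1, 2)) != tokenize(Point(1, 3))
```

Equal objects then produce equal tokens across calls, which is exactly the determinism property being discussed in this thread.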

@shoyer (Member)

shoyer commented Dec 29, 2020

I asked because this isn't an operation I've used directly on pandas objects in the past. I'm not opposed, but my suggestion would be to write a separate utility function, e.g., in xarray.util (similar to what is in pandas) rather than making it a method on xarray objects themselves.
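A utility along those lines might look like the sketch below. The function name `hash_xarray_object` and the exact fields it hashes are my own assumptions, not an agreed design; the point is that walking `.attrs` and `.variables` explicitly makes the treatment of metadata a deliberate choice rather than whatever a generic pickler happens to do.

```python
# Minimal sketch of a possible xarray.util-style helper (name and field
# choices are assumptions). It hashes global attrs canonically, then each
# variable's name, dims, dtype, and raw bytes, in sorted order for
# determinism. It only touches .attrs and .variables, so it works on any
# object exposing that subset of the Dataset interface.
import hashlib
import json

import numpy as np


def hash_xarray_object(ds, hashname="sha1"):
    h = hashlib.new(hashname)
    # Canonical JSON of global attrs: sorted keys, stringified values.
    h.update(json.dumps(ds.attrs, sort_keys=True, default=str).encode())
    for name in sorted(ds.variables):  # fixed iteration order
        var = ds.variables[name]
        h.update(repr((name, tuple(var.dims), str(var.dtype))).encode())
        h.update(np.ascontiguousarray(var.values).tobytes())
    return h.hexdigest()
```

On a real dataset this would be called as `hash_xarray_object(ds)`, and clearing `ds.attrs` would change the digest, mirroring the joblib example earlier in the thread.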

@dcherian (Contributor)

@andersy005 if you can rely on dask always being present, dask.base.tokenize(xarray_object) will do what you want.

@andersy005 (Member, Author)

> @andersy005 if you can rely on dask always being present, dask.base.tokenize(xarray_object) will do what you want.

👍🏽 dask.base.tokenize() achieves what I need for my use case.

> I asked because this isn't an operation I've used directly on pandas objects in the past. I'm not opposed, but my suggestion would be to write a separate utility function, e.g., in xarray.util (similar to what is in pandas) rather than making it a method on xarray objects themselves.

Due to the simplicity of dask.base.tokenize(), I am now wondering whether it's even worth having a utility function in xarray.util for computing a deterministic token (~hash) for an xarray object? I'm happy to work on this if there's interest from other folks, otherwise I will close this issue.

@andersy005 (Member, Author)

andersy005 commented Dec 21, 2021

> @andersy005 if you can rely on dask always being present, dask.base.tokenize(xarray_object) will do what you want.

@dcherian, I just realized that dask.base.tokenize doesn't return a deterministic token for xarray objects:

In [2]: import dask, xarray as xr

In [3]: ds = xr.tutorial.open_dataset('rasm')

In [4]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[4]: False

In [5]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[5]: False

The issue appears to be caused by the coordinates, which are used in __dask_tokenize__:

def __dask_tokenize__(self):
    from dask.base import normalize_token

    return normalize_token((type(self), self._variable, self._coords, self._name))

In [8]: dask.base.tokenize(ds.Tair.data) == dask.base.tokenize(ds.Tair.data)
Out[8]: True
In [16]: dask.base.tokenize(ds.Tair._coords) == dask.base.tokenize(ds.Tair._coords)
Out[16]: False

Is this the expected behavior or am I missing something?

@andersy005 (Member, Author)

andersy005 commented Dec 21, 2021

> The issue appears to be caused by the coordinates, which are used in __dask_tokenize__

I tried running the reproducer above and things seem to be working fine. I can't for the life of me understand why I got non-deterministic behavior four hours ago :(

In [1]: import dask, xarray as xr

In [2]: ds = xr.tutorial.open_dataset('rasm')

In [3]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[3]: True

In [4]: dask.base.tokenize(ds.Tair._coords) == dask.base.tokenize(ds.Tair._coords)
Out[4]: True
In [5]: xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 20:33:18) 
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.20.1
pandas: 1.3.4
numpy: 1.20.3
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.11.0
h5py: 3.6.0
Nio: None
zarr: 2.10.3
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: 3.5.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.18
sparse: None
setuptools: 59.4.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.30.0
sphinx: 4.3.1

@andersy005 (Member, Author)

Okay... I think the following comment is still valid:

> The issue appears to be caused by the coordinates, which are used in __dask_tokenize__

It appears that whether tokenization is deterministic depends on whether the Dataset/DataArray contains non-dimension coordinates or dimension coordinates:

In [2]: ds = xr.tutorial.open_dataset('rasm')
In [39]: a = ds.isel(time=0)

In [40]: a
Out[40]: 
<xarray.Dataset>
Dimensions:  (y: 205, x: 275)
Coordinates:
    time     object 1980-09-16 12:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
    Tair     (y, x) float64 ...

In [41]: dask.base.tokenize(a) == dask.base.tokenize(a)
Out[41]: True
In [42]: b = ds.isel(y=0)

In [43]: b
Out[43]: 
<xarray.Dataset>
Dimensions:  (time: 36, x: 275)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (x) float64 189.2 189.4 189.6 189.7 ... 293.5 293.8 294.0 294.3
    yc       (x) float64 16.53 16.78 17.02 17.27 ... 27.61 27.36 27.12 26.87
Dimensions without coordinates: x
Data variables:
    Tair     (time, x) float64 ...

In [44]: dask.base.tokenize(b) == dask.base.tokenize(b)
Out[44]: False

This looks like a bug in my opinion...
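To narrow down which component breaks determinism, one option is to tokenize each variable twice and report the mismatches. This is a small diagnostic sketch (the helper name `find_nondeterministic` is mine), not an xarray or dask API:

```python
# Diagnostic sketch: tokenize each variable twice and collect the names
# whose token changes between calls, to localize non-deterministic
# components instead of comparing tokens of the whole Dataset.
import dask


def find_nondeterministic(ds):
    return [
        name
        for name, var in ds.variables.items()
        if dask.base.tokenize(var) != dask.base.tokenize(var)
    ]
```

Running this on `b` above should point at the offending coordinate variable(s) rather than the Dataset as a whole.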

@LunarLanding

> This looks like a bug in my opinion...

@andersy005

This runs with no issues atm:

with dask.config.set({"tokenize.ensure-deterministic":True}):
    ds = xr.tutorial.open_dataset('rasm')
    b = ds.isel(y=0)
    assert dask.base.tokenize(b) == dask.base.tokenize(b)

With:

xarray                    2022.3.0           pyhd8ed1ab_0    conda-forge
dask                      2022.5.0           pyhd8ed1ab_0    conda-forge

@marscher mentioned this issue Nov 7, 2022
@matanox

matanox commented Dec 6, 2023

Are xarray objects robustly hashable now?
