ENH: Compute hash of xarray objects #4738
Comments
Interesting! Do pandas or dask have anything like this? |
Pandas has a built-in utility function, `pd.util.hash_pandas_object`:

```python
In [1]: import pandas as pd

In [3]: df = pd.DataFrame({'A': [4, 5, 6, 7], 'B': [10, 20, 30, 40], 'C': [100, 50, -30, -50]})

In [4]: df
Out[4]:
   A   B    C
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50

In [6]: row_hashes = pd.util.hash_pandas_object(df)

In [7]: row_hashes
Out[7]:
0    14190898035981950066
1    16858535338008670510
2     1055569624497948892
3     5944630256416341839
dtype: uint64
```

Combining the returned per-row hashes via `hashlib` gives an overall hash of the whole DataFrame:

```python
In [8]: import hashlib

In [10]: hashlib.sha1(row_hashes.values).hexdigest()  # Compute overall hash of all rows.
Out[10]: '1e1244d9b0489e1f479271f147025956d4994f67'
```

Regarding dask, I have no idea :) cc @TomAugspurger |
IIUC, something like https://github.com/dask/dask/blob/4a7a2438219c4ee493434042e50f4cdb67b6ec9f/dask/base.py#L778 is what you're looking for. Further down, we register tokenizers for various types like pandas DataFrames and ndarrays. |
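To make the registration mechanism concrete, here is a minimal sketch of the extension point the comment above refers to (`dask.base.normalize_token.register` is real dask API; the `Point` class is hypothetical, for illustration only):

```python
import dask.base

class Point:
    """Hypothetical user-defined type that dask can't tokenize deterministically."""

    def __init__(self, x, y):
        self.x, self.y = x, y

# Register a normalizer so tokenize() becomes deterministic for Point objects.
@dask.base.normalize_token.register(Point)
def normalize_point(p):
    # Return a deterministic, hashable structure that identifies the object.
    return ("Point", p.x, p.y)

assert dask.base.tokenize(Point(1, 2)) == dask.base.tokenize(Point(1, 2))
```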
I asked because this isn't an operation I've used directly on pandas objects in the past. I'm not opposed, but my suggestion would be to write a separate utility function, e.g., in |
@andersy005 if you can rely on dask always being present, `dask.base.tokenize` should do the job. |
👍🏽
Due to the simplicity of |
@dcherian, I just realized that `dask.base.tokenize` doesn't return a deterministic token for xarray objects:

```python
In [2]: import dask, xarray as xr

In [3]: ds = xr.tutorial.open_dataset('rasm')

In [4]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[4]: False

In [5]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[5]: False
```

The issue appears to be caused by the coordinates, which are used in the tokenization method defined in xarray/core/dataarray.py, lines 870 to 873 (at dbc02d4).
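For reference, the method at those lines is `DataArray.__dask_tokenize__`; paraphrased, it looks roughly like this (a sketch of the relevant logic, not the verbatim source):

```python
# Rough paraphrase of DataArray.__dask_tokenize__ around dbc02d4:
# the token is derived from the variable, the coords, and the name,
# so non-deterministic tokenization of _coords poisons the whole token.
def __dask_tokenize__(self):
    from dask.base import normalize_token

    return normalize_token((type(self), self._variable, self._coords, self._name))
```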
Tokenizing the underlying data alone is deterministic, while tokenizing the coords is not:

```python
In [8]: dask.base.tokenize(ds.Tair.data) == dask.base.tokenize(ds.Tair.data)
Out[8]: True

In [16]: dask.base.tokenize(ds.Tair._coords) == dask.base.tokenize(ds.Tair._coords)
Out[16]: False
```

Is this the expected behavior or am I missing something? |
I tried running the reproducer above and things seem to be working fine. I can't for the life of me understand why I got non-deterministic behavior four hours ago :(

```python
In [1]: import dask, xarray as xr

In [2]: ds = xr.tutorial.open_dataset('rasm')

In [3]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[3]: True

In [4]: dask.base.tokenize(ds.Tair._coords) == dask.base.tokenize(ds.Tair._coords)
Out[4]: True

In [5]: xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 20:33:18)
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 0.20.1
pandas: 1.3.4
numpy: 1.20.3
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.11.0
h5py: 3.6.0
Nio: None
zarr: 2.10.3
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: 3.5.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.18
sparse: None
setuptools: 59.4.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.30.0
sphinx: 4.3.1
```
|
Okay... I think the following comment is still valid:
It appears that whether tokenization is deterministic depends on whether the Dataset/DataArray contains dimension coordinates or only non-dimension coordinates:

```python
In [2]: ds = xr.tutorial.open_dataset('rasm')

In [39]: a = ds.isel(time=0)
In [40]: a
Out[40]:
<xarray.Dataset>
Dimensions: (y: 205, x: 275)
Coordinates:
time object 1980-09-16 12:00:00
xc (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
yc (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
Tair (y, x) float64 ...
In [41]: dask.base.tokenize(a) == dask.base.tokenize(a)
Out[41]: True

In [42]: b = ds.isel(y=0)
In [43]: b
Out[43]:
<xarray.Dataset>
Dimensions: (time: 36, x: 275)
Coordinates:
* time (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (x) float64 189.2 189.4 189.6 189.7 ... 293.5 293.8 294.0 294.3
yc (x) float64 16.53 16.78 17.02 17.27 ... 27.61 27.36 27.12 26.87
Dimensions without coordinates: x
Data variables:
Tair (time, x) float64 ...
In [44]: dask.base.tokenize(b) == dask.base.tokenize(b)
Out[44]: False
```

This looks like a bug, in my opinion... |
This runs with no issues at the moment:

```python
with dask.config.set({"tokenize.ensure-deterministic": True}):
    ds = xr.tutorial.open_dataset('rasm')
    b = ds.isel(y=0)
    assert dask.base.tokenize(b) == dask.base.tokenize(b)
```

With:
|
Are xarray objects robustly hashable now? |
Is your feature request related to a problem? Please describe.
I'm working on some caching/data-provenance functionality for xarray objects, and I realized that there's no standard/efficient way of computing hashes for xarray objects.
Describe the solution you'd like
It would be useful to have a configurable, reliable, standard `.hexdigest()` method on xarray objects. For example, zarr provides a `digest` method that returns a digest/hash of the data. I'm thinking that a built-in xarray hashing mechanism would provide a more reliable way to handle metadata such as global attributes, encoding, etc. during the hash computation.
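A minimal sketch of what such a utility could look like, built on `dask.base.tokenize` (the name `xr_digest` is hypothetical, not an existing xarray API, and this assumes dask is installed):

```python
import hashlib

import dask.base
import xarray as xr

def xr_digest(obj, algorithm: str = "sha1") -> str:
    """Hypothetical helper: hex digest of an xarray object.

    Uses dask.base.tokenize to reduce the object to a deterministic token,
    then runs that token through hashlib for a familiar-looking digest.
    """
    token = dask.base.tokenize(obj)
    return hashlib.new(algorithm, token.encode()).hexdigest()

ds = xr.tutorial.open_dataset("rasm")
assert xr_digest(ds) == xr_digest(ds)  # deterministic, modulo the coords issue above
```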
Describe alternatives you've considered
So far, I am using joblib's default hasher, the `joblib.hash()` function (a small usage sketch follows at the end of this issue). However, I am in favor of having a configurable, built-in hasher that is aware of xarray's data model and quirks :)

Additional context
Add any other context about the feature request here.
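For completeness, a small usage sketch of the joblib workaround mentioned above (`joblib.hash` is a real joblib utility that pickles the object and hashes the resulting bytes; md5 is its default algorithm):

```python
import joblib
import xarray as xr

ds = xr.tutorial.open_dataset("rasm")

print(joblib.hash(ds))                    # md5 hex digest (default)
print(joblib.hash(ds, hash_name="sha1"))  # sha1 variant
```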