enhancement request: ability to export dicts as shareable/generic HDF5 #133

gpetty · 2020-06-18T16:45:01Z

It's fantastic that hickle can automatically export a dict object as an HDF5 file, but the resulting structure is not very transparent to someone (e.g., a non-Python user who doesn't have access to hickle) who might want to open and explore the file using some other HDF5 library or utility (e.g., Panoply). In particular, the partial field name 'data_0' shows up redundantly at every level, and I'm not sure why that's necessary.

I would like to be able to use hickle to create a large number of clean/generic HDF5 files for sharing with people using C, Fortran, and other languages. I hope a future version can provide that option, so that the structure of the HDF5 file can be accessed in exactly the same way as the original dict.

1313e · 2020-06-19T07:17:48Z

Hi @gpetty,

Glad to see you like the functionality that hickle brings.

This issue is very interesting and one I agree with; I thought about it quite a few times myself while rewriting hickle for v4 (see #117).
I have looked into several ways to reduce the number of groups and datasets that hickle creates, but solving this issue is much harder than it may seem.
Part of the difficulty is also the reason why the seemingly redundant 'data_0' group/dataset exists (also see #44, and note that in v4 it will be renamed to just 'data' when there is only one of them).

So, the big issue is that in Python, objects do not carry any names.
In an HDF5 file, every group or dataset must have a name, so hickle simply assigns the generic name 'data' to it.
As certain containers in Python, like tuples and lists, can contain more than one type of object, we have to store every object separately, because a dataset in HDF5 can only be of a single type.
These datasets are then given the name 'data_X', with X being the index of that object in its parent container.

The above works well for basically all types of objects, except the one you are mentioning: dicts.
In a dict, every object is associated with a name, which is its key.
However, in hickle, we use recursive functions to explore the contents of every iterable and save it.
As an object in a dict does not have a name itself (it only has a name when it is part of the dict, but not on its own), the 'data' name is used again.
Eliminating the redundant 'data' name would mean that the recursive functions no longer work for anything contained in a dict, making it quite a bit harder to write everything properly.
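
To make the naming scheme above concrete, here is a toy model in Python. It is not hickle's actual code: `dump_paths` is a hypothetical helper that only computes the HDF5-style paths a name-agnostic recursive dumper would create, assuming the 'data' and 'data_X' conventions described above and dict keys stored via their `repr`.

```python
def dump_paths(obj, path):
    """Return the HDF5-style paths a name-agnostic recursive dumper would create.

    Toy model only (hypothetical helper, not part of hickle): the object
    itself carries no name, so the caller has to supply `path`.
    """
    paths = [path]
    if isinstance(obj, dict):
        for key, value in obj.items():
            # The key becomes a group named after repr(key) ...
            key_group = f"{path}/{key!r}"
            paths.append(key_group)
            # ... but the value has no name of its own, so the generic
            # 'data' name is used again one level down.
            paths.extend(dump_paths(value, f"{key_group}/data"))
    elif isinstance(obj, (list, tuple)):
        for index, item in enumerate(obj):
            # Items of a list or tuple are named after their index.
            paths.extend(dump_paths(item, f"{path}/data_{index}"))
    return paths


for p in dump_paths({"a": 1, "b": 2}, "/data/data_0"):
    print(p)
# /data/data_0
# /data/data_0/'a'
# /data/data_0/'a'/data
# /data/data_0/'b'
# /data/data_0/'b'/data
```

In this model, dropping the extra 'data' level would require the dict branch to tell the recursive call that the key group itself is the target, which is the special-casing that makes the change awkward.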

I could take a good hard look at it again and see if there is maybe a way to do it, but I am not sure if there is.

1313e · 2020-06-19T15:28:39Z

Alright, I just had to do something similar to solve the issue raised in #90, and there it was already quite a challenge to get rid of the 'data' name when using NumPy arrays.
As dicts are far more complex, it will probably take some serious effort to remove it completely (as much as I would like it to be removed).

gpetty · 2020-06-20T02:14:46Z

Upon reflection, I recognize that part of the difficulty undoubtedly stems from hickle's generality -- it needs to be able to do pickle-like exporting for literally any possible object, no matter how arcanely structured. Perhaps what's needed is a more specialized utility specifically for dicts satisfying particular constraints, such as no data types not supported by HDF5. In my case, the dicts I'm working with are constructed from fields in an existing HDF5 file, so automatically exporting those same data fields to a new HDF5 file with analogous structure should be fairly straightforward. But as I said, I see now that that's not necessarily compatible with the overarching goal of hickle.
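
A minimal sketch of what such a constrained utility might look like, under stated assumptions: `flatten_dict` and `SCALAR_TYPES` are hypothetical names, the set of accepted leaf types is a stand-in for "types supported by HDF5", and the function only computes a path-to-value mapping rather than writing a file.

```python
# Stand-in for "data types supported by HDF5" (an assumption for this sketch;
# a real utility would more likely accept NumPy arrays and scalars).
SCALAR_TYPES = (int, float, str, bytes)


def flatten_dict(tree, prefix=""):
    """Map a nested dict with string keys to {"/group/.../name": value}.

    Nested dicts become path components (HDF5 groups); leaves become
    entries (HDF5 datasets). Anything else is rejected up front.
    """
    flat = {}
    for key, value in tree.items():
        if not isinstance(key, str) or not key or "/" in key:
            raise TypeError(f"unsupported key under {prefix or '/'}: {key!r}")
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            flat.update(flatten_dict(value, path))
        elif isinstance(value, SCALAR_TYPES):
            flat[path] = value
        else:
            raise TypeError(f"unsupported value type at {path}: {type(value).__name__}")
    return flat


print(flatten_dict({"lat": 43.07, "meta": {"source": "sensor", "version": 2}}))
# {'/lat': 43.07, '/meta/source': 'sensor', '/meta/version': 2}
```

The resulting paths mirror the dict's own structure, so each entry could then be handed to an HDF5 library of choice to create the corresponding dataset.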

1313e · 2020-06-20T02:18:12Z

@gpetty Exactly.
It is definitely something we would like to think about in the near future, but I don't think it will happen any time soon.

telegraphic · 2021-01-16T13:10:26Z

Looks like #138 addresses this!

telegraphic · 2021-12-19T03:12:14Z

Closing this as it's mostly addressed -- as well as can be while maintaining generality -- in v5.0.0. E.g., for a dict with keys 'a', 'b' and 'c', comparing a file written by v5.0.0 with one written by v4.0.4:

h5diff -v hkl_500.hkl hkl_404.hkl

file1     file2
---------------------------------------
    x      x    /
    x      x    /data
    x           /data/data0
    x           /data/data0/"a"
    x           /data/data0/"b"
    x           /data/data0/"c"
           x    /data/data_0
           x    /data/data_0/'a'
           x    /data/data_0/'a'/data
           x    /data/data_0/'b'
           x    /data/data_0/'b'/data
           x    /data/data_0/'c'
           x    /data/data_0/'c'/data

1313e mentioned this issue Jun 19, 2020

Last few changes #134

Merged

hernot mentioned this issue Jun 19, 2020

support for python copy protocol __setstate__ __getstate__ if present in object #125

Closed

hernot mentioned this issue Jun 20, 2020

H4EP 001: Container and mixed (dataset + Container) loaders (draft) #135

Closed

1313e added the enhancement label Jul 30, 2020

hernot mentioned this issue Apr 22, 2021

Hickle 5 rc #149

Merged

telegraphic closed this as completed Dec 19, 2021
