enhancement request: ability to export dicts as shareable/generic HDF5 #133

Closed
gpetty opened this issue Jun 18, 2020 · 6 comments

@gpetty

gpetty commented Jun 18, 2020

It's fantastic that hickle can automatically export a dict object as an hdf5 file, but the resulting structure is not very transparent to someone (e.g., a non-Python user who doesn't have access to hickle) who might want to open and explore the file using some other hdf5 library or utility (e.g., panoply). In particular, the partial field name 'data_0' shows up redundantly at every level, and I'm not sure why that's necessary.

I would like to be able to use hickle to create a large number of clean/generic hdf5 files for sharing with people using C, Fortran, and other languages, so I hope a future version can somehow provide that option, so that the structure of the hdf5 file can be accessed in exactly the same way as the original dict.

@1313e
Collaborator

1313e commented Jun 19, 2020

Hi @gpetty,

Glad to see you like the functionality that hickle brings.

This issue is very interesting and one I agree with, as I have thought about it myself quite a few times as well while rewriting hickle to v4 (see #117).
I have looked into several different ways before to reduce the number of groups and datasets that hickle creates, but solving this issue is much harder than it may seem.
And part of that problem is the reason why the seemingly redundant 'data_0' group/dataset exists (also see #44, and note that it will be renamed to just 'data' in v4 in case there is only one of them).

So, the big issue is that in Python, objects do not carry any names.
In an HDF5 file, every group or dataset must have a name, so hickle simply assigns the generic name 'data' to it.
Certain containers in Python, like tuples and lists, can contain more than one type of object, whereas a dataset in HDF5 can only be of a single type, so we have to store every object as a separate dataset.
These datasets are then given the name 'data_X', with X being the index of that object in its parent container.

The above works well for basically all types of objects, except the one you are mentioning: dicts.
In a dict, every object is associated with a name, which is its key.
However, in hickle, we use recursive functions to explore the contents of every iterable and save it.
As an object in a dict does not have a name itself (it only has a name when it is part of the dict, but not on its own), the 'data' name is used again.
Eliminating the redundant 'data' name would mean that all recursive functions no longer work for anything contained in a dict, making it quite a bit harder to write everything properly.
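
To make the naming scheme described above concrete, here is a minimal toy sketch of a recursive walker that only computes the group/dataset paths. It is not hickle's actual code: the function name `toy_layout` and its parameters are hypothetical, and the sketch merely reproduces the v4-style names shown in the h5diff listing at the end of this thread (`/data/data_0/'a'/data`).

```python
def toy_layout(obj, parent="", name="data_0"):
    """Return the HDF5-style paths a naive recursive dumper would create.

    Hypothetical illustration only; hickle's real implementation differs.
    """
    path = f"{parent}/{name}"
    paths = [path]
    if isinstance(obj, dict):
        for key, value in obj.items():
            # Each key becomes a group named after repr(key) ...
            key_group = f"{path}/{key!r}"
            paths.append(key_group)
            # ... but the value is dumped by the same recursive function,
            # which knows nothing about the key and falls back to 'data'.
            paths.extend(toy_layout(value, key_group, "data"))
    elif isinstance(obj, (list, tuple)):
        for index, item in enumerate(obj):
            # Items of a sequence are named after their position.
            paths.extend(toy_layout(item, path, f"data_{index}"))
    return paths


for line in toy_layout({"a": 1, "b": 2, "c": 3}, parent="/data"):
    print(line)
```

Because the recursive call receives only the value, it has no way to reuse the key as the dataset name, which is where the extra `data` level under every key comes from in this sketch.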

I could take a good hard look at it again and see if there is maybe a way to do it, but I am not sure if there is.

@1313e
Collaborator

1313e commented Jun 19, 2020

Alright, I just had to do something somewhat similar for solving the issue raised in #90, and there it was already quite a challenge to get rid of the 'data' name when using NumPy arrays.
As dicts are far more complex, it will probably take some serious effort to get it removed completely (as much as I would like it to be removed, actually).

@gpetty
Author

gpetty commented Jun 20, 2020

Upon reflection, I recognize that part of the difficulty undoubtedly stems from hickle's generality -- it needs to be able to do pickle-like exporting for literally any possible object, no matter how arcanely structured. Perhaps what's needed is a more specialized utility specifically for dicts satisfying particular constraints, such as no data types not supported by HDF5. In my case, the dicts I'm working with are constructed from fields in an existing HDF5 file, so automatically exporting those same data fields to a new HDF5 file with analogous structure should be fairly straightforward. But as I said, I see now that that's not necessarily compatible with the overarching goal of hickle.
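
As a rough sketch of what such a specialized utility could look like, the snippet below flattens a nested dict with string keys into HDF5-style paths and writes each leaf as a dataset with h5py. This is independent of hickle and rests on several assumptions: the names `flatten_dict` and `write_plain_hdf5` are hypothetical, every key is a non-empty string without `/`, and every leaf is something h5py can store directly (numbers, strings, NumPy arrays, homogeneous lists).

```python
from collections.abc import Mapping


def flatten_dict(tree, prefix=""):
    """Map a nested dict to {"/group/sub/name": leaf_value}.

    Hypothetical helper; rejects keys that cannot serve as HDF5 names.
    Empty nested dicts produce no entry and are therefore dropped.
    """
    flat = {}
    for key, value in tree.items():
        if not isinstance(key, str) or not key or "/" in key:
            raise ValueError(f"unsupported key for an HDF5 name: {key!r}")
        path = f"{prefix}/{key}"
        if isinstance(value, Mapping):
            flat.update(flatten_dict(value, path))
        else:
            flat[path] = value
    return flat


def write_plain_hdf5(filename, tree):
    """Write the leaves of `tree` as datasets named after their dict keys."""
    import h5py  # third-party dependency, imported only when writing

    with h5py.File(filename, "w") as handle:
        for path, value in flatten_dict(tree).items():
            # h5py creates the intermediate groups of a path automatically.
            handle.create_dataset(path, data=value)


print(flatten_dict({"lat": [1.0, 2.0], "meta": {"units": "K"}}))
```

A file written this way contains only groups and datasets named after the dict keys, so it can be browsed with generic HDF5 tools, but it gives up the round-trip guarantees that hickle provides for arbitrary Python objects.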

@1313e
Collaborator

1313e commented Jun 20, 2020

@gpetty Exactly.
It is definitely something we would like to think about in the future, but I don't think it will happen any time soon.

@telegraphic
Owner

Looks like #138 addresses this!

@hernot hernot mentioned this issue Apr 22, 2021
@telegraphic
Owner

Closing this as it's mostly addressed -- as well as can be while maintaining generality -- in v5.0.0. E.g. for a dict with keys 'a', 'b' and 'c':

h5diff -v hkl_500.hkl hkl_404.hkl

file1     file2
---------------------------------------
    x      x    /
    x      x    /data
    x           /data/data0
    x           /data/data0/"a"
    x           /data/data0/"b"
    x           /data/data0/"c"
           x    /data/data_0
           x    /data/data_0/'a'
           x    /data/data_0/'a'/data
           x    /data/data_0/'b'
           x    /data/data_0/'b'/data
           x    /data/data_0/'c'
           x    /data/data_0/'c'/data
