Replies: 9 comments 7 replies
-
Thank you for taking the lead on this initiative! Over at the openPMD metadata standard, we need a notion of sparseness for block-structured mesh refinement: our applications are beam, plasma, and particle accelerator modeling codes implemented against the AMReX library for block-structured AMR.
-
FWIW, I captured some thoughts here.
-
I think markcmiller86 raises some valid criticisms in the linked comment, including the point that API compatibility with readers built for dense data is not strictly necessary for datasets too large for RAM, since no existing analysis code uses the dense representation for those anyway (it uses application-level schemas/combinations of datasets instead).

My own application is on the small side, with images of "only" a million pixels, which is not too bad to hold dense in memory, but where writing a sparse representation of only the very few (e.g. 1%) bright pixels offers a faster compression path to achieve a sufficiently high frame rate. My gut feeling is that my data is better suited to a regular filter (function/plugin) with run-length encoding of zeros, so that I don't have to call an HDF5 sparse API for each individual pixel (the pixels are not clustered into dense regions of interest). I'm working towards such a filter and may post links once there's something to share.

The proposal to define an HDF5 API targeted at use cases where rectangular (hyperrectangular) regions of elements are defined (in a sea of undefined/padded values) may of course proceed regardless of the existence of alternative approaches for other use cases.

If there is a wish to make a sparse HDF5 API that can also be efficient for really sparse data, where most defined regions have size 1 (a few pixels of an image, or something like a nearly diagonal matrix), I would propose making sure there is an option of buffered lower-level storage that is fast at accepting single-element updates (writes) and, for reading, can perhaps also provide an iterator over the defined elements in a chunk. I'm not sure whether HDF5 does any chunk buffering now; I guess there could be a flush() or close() method to let an application signal when it's suitable to commit a chunk to storage, or it could happen automatically soon after the access pattern has moved on to another chunk. Or should such a buffering optimization rather be achieved by something like direct writing of application-compressed chunks, where the application would prepare a buffer with the sparse-defined-region headers etc. ready for copying to lower-level storage?
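To make the run-length-encoding filter idea above concrete, here is a minimal sketch of a user-defined HDF5 filter that run-length encodes runs of zero bytes. It uses only the standard filter-plugin API (`H5Z_class2_t`, `H5Zregister`, `H5Pset_filter`); the filter id 256 and the byte-level encoding scheme are arbitrary choices for illustration, not anything from the RFCs.

```c
/* Sketch of a user-defined HDF5 filter that run-length encodes zero bytes.
 * The filter id (256, in the application range) and the encoding scheme
 * are illustrative assumptions, not part of HDF5 or of the sparse RFCs.  */
#include <hdf5.h>
#include <stdlib.h>
#include <string.h>

#define RLE_ZERO_ID ((H5Z_filter_t)256)

static size_t rle_zero_filter(unsigned int flags, size_t cd_nelmts,
                              const unsigned int cd_values[], size_t nbytes,
                              size_t *buf_size, void **buf)
{
    const unsigned char *in = (const unsigned char *)*buf;
    unsigned char *out;
    size_t i = 0, o = 0;
    (void)cd_nelmts; (void)cd_values;

    if (flags & H5Z_FLAG_REVERSE) {                /* decode on read       */
        size_t out_len = 0;
        for (i = 0; i < nbytes; )                  /* pass 1: decoded size */
            if (in[i] == 0) { out_len += in[i + 1]; i += 2; }
            else            { out_len += 1;         i += 1; }
        if ((out = malloc(out_len)) == NULL) return 0;  /* 0 == failure    */
        for (i = 0; i < nbytes; )                  /* pass 2: expand runs  */
            if (in[i] == 0) { memset(out + o, 0, in[i + 1]); o += in[i + 1]; i += 2; }
            else            { out[o++] = in[i++]; }
        free(*buf); *buf = out; *buf_size = out_len;
        return out_len;
    }

    /* Encode on write: non-zero bytes pass through literally; a run of
     * 1..255 zero bytes becomes the pair (0x00, run-length).             */
    if ((out = malloc(2 * nbytes)) == NULL) return 0;   /* worst case 2x   */
    while (i < nbytes) {
        if (in[i] == 0) {
            size_t run = 0;
            while (i < nbytes && in[i] == 0 && run < 255) { run++; i++; }
            out[o++] = 0; out[o++] = (unsigned char)run;
        } else {
            out[o++] = in[i++];
        }
    }
    free(*buf); *buf = out; *buf_size = 2 * nbytes;
    return o;                                      /* valid encoded bytes  */
}

static const H5Z_class2_t RLE_ZERO_CLS[1] = {{
    H5Z_CLASS_T_VERS, RLE_ZERO_ID,
    1 /* encoder present */, 1 /* decoder present */,
    "rle-zero (sketch)",
    NULL /* can_apply */, NULL /* set_local */, rle_zero_filter,
}};

/* Usage: H5Zregister(RLE_ZERO_CLS); then, on the dataset creation property
 * list: H5Pset_filter(dcpl, RLE_ZERO_ID, H5Z_FLAG_MANDATORY, 0, NULL).    */
```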
-
An unrelated thought, when reading more of the implementation proposal, is that the "erase" method might become rather complicated when a user wishes to erase a few elements within what was previously defined (by one or more ROIs). Is an implementation allowed to keep the dense ROI block and set some elements to zero (or whatever the fill value is), or would it have to break each ROI block into two new blocks along the affected dimensions in order to create the proper gap, so that the erased region is reported as undefined rather than just zero? Or does it make sense for implementations to let a small "anti-defining" region hide parts of an otherwise defined region?

It seems you could define this complication away if the API just promises that erased elements will "appear to have the fill value", and only when an erased region completely covers a (chunk, ROI) intersection does that intersection need to be marked as undefined. As I didn't read very carefully, I may of course have missed some ideas about how erasing should work.
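To illustrate why the split-vs-fill question matters, here is a tiny self-contained sketch (plain C, not HDF5 or RFC API; all names are made up): erasing a sub-range from a defined 1-D ROI leaves up to two surviving pieces, and applying the same logic per axis in d dimensions is what multiplies the block count.

```c
/* Illustration only (not HDF5 or the RFC's API): erasing a sub-range from
 * a defined 1-D region leaves up to two defined pieces.  In d dimensions
 * the same per-axis logic applies, which is why an erase that punches a
 * hole in an ROI can multiply the number of blocks.                      */
#include <stdio.h>

typedef struct { long lo, hi; } range_t;      /* inclusive bounds          */

/* Assumes the erase range actually intersects the ROI.  Writes surviving
 * pieces into out[0..1] and returns how many survive (0, 1, or 2).       */
static int erase_1d(range_t roi, range_t erase, range_t out[2])
{
    int n = 0;
    if (erase.lo > roi.lo)                    /* piece left of the hole    */
        out[n++] = (range_t){ roi.lo, erase.lo - 1 };
    if (erase.hi < roi.hi)                    /* piece right of the hole   */
        out[n++] = (range_t){ erase.hi + 1, roi.hi };
    return n;
}

int main(void)
{
    range_t pieces[2];
    int n = erase_1d((range_t){0, 9}, (range_t){3, 5}, pieces);
    for (int i = 0; i < n; i++)
        printf("defined piece: [%ld, %ld]\n", pieces[i].lo, pieces[i].hi);
    /* Prints [0, 2] and [6, 9]: one erase turned one ROI into two.       */
    /* The alternative discussed above: keep the ROI intact and write the */
    /* fill value, so the hole merely "appears" erased.                   */
    return 0;
}
```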
-
I saw some new comments here, and they inspired me to ask some more questions about structured chunks (for reference, a sketch of today's normal-chunk baseline follows the list):

- How will structured chunks be the same as or different from normal chunks?
- Will they use the same buffer?
- Can a dataset be composed of both regular chunks and structured chunks, or does a producer have to decide at the outset which kind of chunk will be used and then stick with that?
- Will a dataset using structured chunks be extendible in the same way normally chunked datasets are?
- Will there be a separate chunk buffer for structured chunks?
- Will new data-sieving code need to be added to support partial requests on datasets composed of structured chunks?
- Can existing chunk-buffering algorithms be applied to structured chunks?
- Is there a point at which a structured chunk becomes so close to fully dense that it should just be converted to a normal chunk, and is/will that be possible?
- There are high-level API routines to direct-write (normal) chunks. Will there be the same now for structured chunks too, or can the existing routines handle structured chunks?
- Will there be new functionality for
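Since several of these questions compare structured chunks against today's behaviour, here is the baseline sketch mentioned above, using only the existing, standard HDF5 API (file and dataset names are arbitrary; error checking omitted for brevity): a normally chunked, unlimited-dimension dataset that is later grown with H5Dset_extent. The open questions are essentially whether each of these steps carries over unchanged to structured chunks.

```c
/* Baseline (existing HDF5 API only): a chunked, extendible 1-D dataset.   */
#include <hdf5.h>

int main(void)
{
    hsize_t dims[1]    = { 0 };                      /* start empty        */
    hsize_t maxdims[1] = { H5S_UNLIMITED };          /* extendible         */
    hsize_t chunk[1]   = { 1024 };                   /* 1024 elements/chunk*/

    hid_t file  = H5Fcreate("baseline.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);                    /* normal chunked layout */

    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    hsize_t new_dims[1] = { 4096 };                  /* grow the dataset   */
    H5Dset_extent(dset, new_dims);

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    return 0;
}
```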
-
We decided to introduce new direct chunking routines. One will be able to use the old routines on read and then do all the tedious work of "unpacking" the sections.
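For context, the existing direct-chunk read path looks roughly like the sketch below (standard HDF5 calls only: H5Dget_chunk_storage_size and H5Dread_chunk; the helper name read_raw_chunk is hypothetical). The "tedious unpacking" would then be the application parsing the structured-chunk sections out of the raw buffer, which is deliberately left as a placeholder here because that layout is defined by the RFC, not by this sketch.

```c
/* Existing direct-chunk read (standard HDF5 API): fetch the raw bytes of
 * one chunk.  For a structured chunk, interpreting the section layout of
 * those bytes would be application-side work ("unpacking"), per the RFC. */
#include <hdf5.h>
#include <stdint.h>
#include <stdlib.h>

/* dset: an open chunked dataset; offset: logical offset of the chunk.    */
static void *read_raw_chunk(hid_t dset, const hsize_t *offset, hsize_t *nbytes)
{
    uint32_t filters = 0;
    if (H5Dget_chunk_storage_size(dset, offset, nbytes) < 0)
        return NULL;                               /* query failed         */
    void *buf = malloc((size_t)*nbytes);
    if (buf && H5Dread_chunk(dset, H5P_DEFAULT, offset, &filters, buf) < 0) {
        free(buf);
        return NULL;
    }
    /* 'buf' now holds the chunk exactly as stored (filters may still be
     * applied, as reported in 'filters').  Unpacking structured-chunk
     * sections from this buffer is the application-side work mentioned
     * above and is not sketched here.                                     */
    return buf;
}
```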
-
Yes, the RFC just outlines some issues. See section 4.6 for |
-
As I mentioned above, it is an optimization, but the chunk's metadata will still be structured-chunk metadata. We cannot change a creation property of an HDF5 object.
-
Update: |
-
Hello, HDF5 community!
We have been working on adding support for sparse data to HDF5. While designing the file format and API extensions, we gave significant thought to a number of other potential HDF5 enhancements and noticed commonalities with the sparse data problem. In particular, the idea of extending the concept of a chunk to contain multiple sections that describe different facets of the values stored in the chunk seems applicable to a number of problems, for example HDF5 variable-length data and non-homogeneous arrays. This in turn raises the problem of compressing these different sections efficiently, as different compression algorithms may be optimal for different sections.
Our approach to HDF5 modifications is documented in two RFCs:
RFC: File Format Changes for Enabling Sparse Storage in HDF5 and RFC: Programming Model to Support Sparse Data in HDF5.
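Purely as a reading aid, here is a conceptual sketch of the "chunk with multiple sections" idea described above. It is not the on-disk layout defined in the RFCs; the section kinds, field names, and types are all illustrative assumptions.

```c
/* Conceptual sketch only -- NOT the RFCs' file-format definition.  It just
 * illustrates a chunk made of several independently compressed sections
 * (e.g. defined-region metadata, fixed-size data, a variable-length heap),
 * each of which may want a different filter.                             */
#include <stdint.h>

typedef enum {                    /* illustrative section kinds            */
    SECTION_DEFINED_REGIONS,      /* which elements are present (sparse)   */
    SECTION_FIXED_DATA,           /* the element values themselves         */
    SECTION_VARLEN_HEAP           /* payload for variable-length elements  */
} section_kind_t;

typedef struct {
    section_kind_t kind;
    uint32_t       filter_id;     /* each section may use its own filter   */
    uint64_t       stored_size;   /* size on disk (after filtering)        */
    uint64_t       raw_size;      /* size in memory (before filtering)     */
    uint64_t       offset;        /* byte offset within the chunk          */
} section_desc_t;

typedef struct {
    uint32_t        n_sections;
    section_desc_t *sections;     /* followed (conceptually) by the bytes  */
} structured_chunk_t;
```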
We are looking for community feedback on our proposal. In particular, we are interested in use cases for storing sparse and variable-length data in HDF5 and in suggestions on HDF5 APIs. Please note that our approach will finally enable compression of variable-length data and writing it in parallel.
I will present our work at HUG23 and, hopefully, at EHUG23.
Please add your comments to this ticket. You are also very welcome to contact me directly if you have any questions, concerns, and/or suggestions; the RFCs have my contact information.
Thank you and I look forward to your feedback.
Elena