Lifeboat, LLC was awarded a DOE SBIR Phase I grant to design extensions to the HDF5 software to support efficient access and storage for sparse data. This repository contains the design documents for the HDF5 file format extensions and public APIs that support the feature.
Sparse data is common in many scientific disciplines and experiments. Several examples are discussed in “Sparse Data Management in HDF5” [1], including High Energy Physics, Neutron and X-ray Scattering, Mass Spectrometry and Compressive Sensing experiments. In those use cases, only 0.1% to 10% of gathered data is of interest. HDF5, due to its proven track record and flexibility, remains the data format of choice. As the amount of data produced continues to grow due to higher instrument and detector resolution and higher sampling rates, there is a clear demand for efficient management of sparse data in HDF5. Adding support for sparse data will significantly simplify data processing software and widen adoption of HDF5.
In HDF5, problem-sized data is stored in multidimensional arrays of elements of a given type. Currently, the HDF5 library requires that all elements be defined, either with user-supplied values or with fill values, and it treats data as “dense”, mapping each data element to storage during I/O operations. Features such as HDF5 chunking and per-dataset compression help to optimize the storage of sparse data: chunks devoid of user-supplied values are not stored, and each chunk that is written is compressed. However, applying “dense storage” thinking to sparse data has several obvious disadvantages. Each chunk written may still contain mostly blank data, and the locations of the actual user-supplied values are not explicitly represented. Also, a sparse dataset stored as a dense dataset may, when read into memory (and after decompression), have a huge memory footprint. Therefore, a different approach to handling sparse data in HDF5 files and in memory is needed.
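A back-of-the-envelope sketch makes the storage gap concrete. The chunk shape, element size, and coordinate encoding below are hypothetical, chosen only to illustrate why dense chunks waste space when few elements are defined:

```python
# Illustrative sketch (not HDF5 code): storage cost of a dense chunk vs.
# storing only defined elements plus their coordinates.
# All sizes here are assumed values for illustration.

CHUNK_SHAPE = (64, 64)   # elements per chunk (hypothetical)
ELEMENT_SIZE = 8         # bytes per element, e.g., a double

def dense_chunk_bytes(defined_elements):
    # A dense chunk stores every element, defined or fill value,
    # regardless of how many elements actually carry data.
    rows, cols = CHUNK_SHAPE
    return rows * cols * ELEMENT_SIZE

def sparse_chunk_bytes(defined_elements, coord_size=2 * 8):
    # A sparse representation stores only the defined elements
    # plus their coordinates (two 8-byte indices assumed here).
    return defined_elements * (ELEMENT_SIZE + coord_size)

# With 1% of the 4096 elements defined (40 elements):
defined = int(64 * 64 * 0.01)
print(dense_chunk_bytes(defined))   # 32768 bytes, independent of sparsity
print(sparse_chunk_bytes(defined))  # 960 bytes
```

Compression narrows this gap on disk, but the dense in-memory footprint after decompression remains.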
As prototyped in [1], the proposed approach to sparse data management uses the existing HDF5 selection mechanism to represent sparse datasets, both in memory and in files. Since it is impractical to hold an entire sparse dataset in memory, we break the extent of the sparse dataset into user-specified, regular, n-dimensional hyper-rectangles. A sparse chunk is a hyper-rectangle endowed with an HDF5 selection that represents all defined entries in its domain. Thus, each sparse chunk has a selection (data coordinates) and associated user-defined data. This approach allows us to store only the data of interest and to operate simultaneously on several sparse chunks, using existing HDF5 facilities for serialization and deserialization and for performing partial I/O on sparse data.
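The sparse-chunk model described above can be sketched as a simple data structure. The class and method names below are hypothetical, for illustration only; they are not the proposed HDF5 API:

```python
# Conceptual sketch of a sparse chunk: a regular hyper-rectangle holding
# a selection (coordinates of defined elements) and the associated values.

class SparseChunk:
    def __init__(self, origin, shape):
        self.origin = origin    # chunk offset within the dataset extent
        self.shape = shape      # chunk dimensions (the hyper-rectangle)
        self.selection = []     # coordinates of defined elements
        self.values = []        # user-defined data, parallel to selection

    def write(self, coords, value):
        # Only defined elements are recorded; blank elements are never stored.
        self.selection.append(coords)
        self.values.append(value)

    def read(self, coords):
        # Partial I/O: look the coordinate up in the chunk's selection.
        for c, v in zip(self.selection, self.values):
            if c == coords:
                return v
        return None             # element is undefined (fill value)

# Break a 2-D extent into 100x100 chunks and write a single element.
chunk = SparseChunk(origin=(0, 0), shape=(100, 100))
chunk.write((3, 7), 42.0)
print(chunk.read((3, 7)))   # 42.0
print(chunk.read((0, 0)))   # None
```

In the actual design, the selection would be an HDF5 dataspace selection, allowing existing serialization and partial-I/O machinery to be reused.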
Our proposed implementation offers sparse array storage that is independent of the in-memory representation of the sparse data, thus making sparse data portable between applications. It also requires minimal changes to application code.
While the immediate need is for file format and API changes to support sparse data, we have given significant thought to a number of other potential HDF5 enhancements and noticed commonalities with the sparse data problem. In particular, the idea of extending the concept of a chunk to contain multiple sections, each describing a different facet of the values stored in the chunk, appears applicable to a number of problems, for example, HDF5 variable-length data and non-homogeneous arrays. This in turn raises the problem of compressing these different sections efficiently, as different compression algorithms may be optimal for different sections. Please see the documents in the design_docs directory for more details. We welcome community feedback on the proposed designs.
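To make the multi-section idea concrete, the sketch below splits a chunk of variable-length strings into an offsets section and a data section and compresses each independently. The section layout and function names are assumptions for illustration; zlib merely stands in for whichever codec suits each section best:

```python
# Sketch: a chunk with multiple sections for variable-length data.
# Each section could, in principle, use a different compression algorithm.
import zlib

def encode_vlen_chunk(strings):
    # Data section: all element bytes concatenated.
    data = b"".join(s.encode() for s in strings)
    # Offsets section: starting position of each element in the data section.
    offsets, pos = [], 0
    for s in strings:
        offsets.append(pos)
        pos += len(s.encode())
    # Compress each section independently; the highly regular offsets
    # section may favor a different codec than the raw element data.
    offsets_section = zlib.compress(
        b"".join(o.to_bytes(4, "little") for o in offsets))
    data_section = zlib.compress(data)
    return offsets_section, data_section

def decode_vlen_chunk(offsets_section, data_section):
    raw = zlib.decompress(offsets_section)
    offsets = [int.from_bytes(raw[i:i + 4], "little")
               for i in range(0, len(raw), 4)]
    data = zlib.decompress(data_section)
    bounds = offsets[1:] + [len(data)]
    return [data[s:e].decode() for s, e in zip(offsets, bounds)]

sections = encode_vlen_chunk(["alpha", "bo", "gamma"])
print(decode_vlen_chunk(*sections))   # ['alpha', 'bo', 'gamma']
```

A sparse chunk fits the same pattern: one section for the selection (coordinates) and another for the values, compressed separately.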
In Phase II (if awarded), we plan to implement the new feature and integrate the solution into the open-source HDF5 library.
References:
- [1] J. Mainzer, N. Fortner, G. Heber, et al., “Sparse Data Management in HDF5,” 2019 IEEE/ACM 1st Annual Workshop on Large-scale Experiment-in-the-Loop Computing (XLOOP), November 2019. http://dx.doi.org/10.1109/XLOOP49562.2019.00009