Replies: 9 comments 7 replies
-
Thank you for taking the lead on this initiative! Over at the openPMD metadata standard, we need a notion of sparseness for block-structured mesh refinement: our applications are beam, plasma, and particle accelerator modeling codes implemented against the AMReX library for block-structured AMR.
-
FWIW, I captured some thoughts here.
-
I think markcmiller86 raises some valid criticisms in the linked comment, including the point that API compatibility with readers built for dense data is not strictly necessary for datasets too large for RAM, since no existing analysis code uses the dense representation for those anyway (it uses application-level schemas/combinations of datasets instead).

My own application is on the small side, with images of "only" a million pixels, which is not too bad to hold dense in memory, but where writing a sparse representation of only the very few (e.g. 1%) bright pixels offers a faster compression path to achieve a sufficiently high frame rate. My gut feeling is that my data is better suited to a regular filter (function/plugin) with run-length encoding of zeros, so that I don't have to call an HDF5 sparse API for each individual pixel (the pixels are not clustered into dense regions of interest). I'm working towards such a filter and may post links once there's something to share.

The proposal to define an HDF5 API targeted at use cases where rectangular (hyperrectangular) regions of elements are defined (in a sea of undefined/padded values) may of course proceed regardless of the existence of alternative approaches for other use cases.

If there is a wish to make a sparse HDF5 API that can also be efficient for really sparse data, where most defined regions have size 1 (a few pixels of an image, or something like a nearly diagonal matrix), I would propose making sure there is an option of buffered lower-level storage that is fast at accepting single-element updates (writes) and, for reading, can perhaps also provide an iterator over the defined elements in a chunk. I'm not sure whether HDF5 does any chunk buffering now; I guess there could be a flush() or close() method to let an application signal when it's suitable to commit a chunk to storage, or it could happen automatically soon after the access pattern has moved on to another chunk. Or should such a buffering optimization rather be achieved by something like direct writing of application-compressed chunks, where the application would prepare a buffer with the sparse-defined-region headers etc. ready for copying to lower-level storage?
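To make the run-length-encoding filter idea above concrete, here is a minimal sketch of a user-defined HDF5 filter that run-length encodes runs of zero bytes. It uses only the standard filter-plugin API (`H5Z_class2_t`, `H5Zregister`, `H5Pset_filter`); the filter id 256 and the byte-level encoding scheme are arbitrary choices for illustration, not anything from the RFCs.

```c
/* Sketch of a user-defined HDF5 filter that run-length encodes zero bytes.
 * The filter id (256, in the application range) and the encoding scheme
 * are illustrative assumptions, not part of HDF5 or of the sparse RFCs.  */
#include <hdf5.h>
#include <stdlib.h>
#include <string.h>

#define RLE_ZERO_ID ((H5Z_filter_t)256)

static size_t rle_zero_filter(unsigned int flags, size_t cd_nelmts,
                              const unsigned int cd_values[], size_t nbytes,
                              size_t *buf_size, void **buf)
{
    const unsigned char *in = (const unsigned char *)*buf;
    unsigned char *out;
    size_t i = 0, o = 0;
    (void)cd_nelmts; (void)cd_values;

    if (flags & H5Z_FLAG_REVERSE) {                /* decode on read       */
        size_t out_len = 0;
        for (i = 0; i < nbytes; )                  /* pass 1: decoded size */
            if (in[i] == 0) { out_len += in[i + 1]; i += 2; }
            else            { out_len += 1;         i += 1; }
        if ((out = malloc(out_len)) == NULL) return 0;  /* 0 == failure    */
        for (i = 0; i < nbytes; )                  /* pass 2: expand runs  */
            if (in[i] == 0) { memset(out + o, 0, in[i + 1]); o += in[i + 1]; i += 2; }
            else            { out[o++] = in[i++]; }
        free(*buf); *buf = out; *buf_size = out_len;
        return out_len;
    }

    /* Encode on write: non-zero bytes pass through literally; a run of
     * 1..255 zero bytes becomes the pair (0x00, run-length).             */
    if ((out = malloc(2 * nbytes)) == NULL) return 0;   /* worst case 2x   */
    while (i < nbytes) {
        if (in[i] == 0) {
            size_t run = 0;
            while (i < nbytes && in[i] == 0 && run < 255) { run++; i++; }
            out[o++] = 0; out[o++] = (unsigned char)run;
        } else {
            out[o++] = in[i++];
        }
    }
    free(*buf); *buf = out; *buf_size = 2 * nbytes;
    return o;                                      /* valid encoded bytes  */
}

static const H5Z_class2_t RLE_ZERO_CLS[1] = {{
    H5Z_CLASS_T_VERS, RLE_ZERO_ID,
    1 /* encoder present */, 1 /* decoder present */,
    "rle-zero (sketch)",
    NULL /* can_apply */, NULL /* set_local */, rle_zero_filter,
}};

/* Usage: H5Zregister(RLE_ZERO_CLS); then, on the dataset creation property
 * list: H5Pset_filter(dcpl, RLE_ZERO_ID, H5Z_FLAG_MANDATORY, 0, NULL).    */
```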
-
An unrelated thought, when reading more of the implementation proposal, is that the "erase" method might become rather complicated when a user wishes to erase a few elements within what was previously defined (by one or more ROIs). Is an implementation allowed to keep the dense ROI block and set some elements to zero (or whatever the fill value is), or would it have to break each ROI block into two new blocks along the affected dimensions in order to create the proper gap, so that the erased region is reported as undefined rather than just zero? Or does it make sense for implementations to let a small "anti-defining" region hide parts of an otherwise defined region?

It seems you could define this complication away if the API just promises that erased elements will "appear to have the fill value", and only when an erased region completely covers a (chunk, ROI) intersection does that intersection need to be marked as undefined. As I didn't read very carefully, I may of course have missed some ideas about how erasing should work.
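To illustrate why the split-vs-fill question matters, here is a tiny self-contained sketch (plain C, not HDF5 or RFC API; all names are made up): erasing a sub-range from a defined 1-D ROI leaves up to two surviving pieces, and applying the same logic per axis in d dimensions is what multiplies the block count.

```c
/* Illustration only (not HDF5 or the RFC's API): erasing a sub-range from
 * a defined 1-D region leaves up to two defined pieces.  In d dimensions
 * the same per-axis logic applies, which is why an erase that punches a
 * hole in an ROI can multiply the number of blocks.                      */
#include <stdio.h>

typedef struct { long lo, hi; } range_t;      /* inclusive bounds          */

/* Assumes the erase range actually intersects the ROI.  Writes surviving
 * pieces into out[0..1] and returns how many survive (0, 1, or 2).       */
static int erase_1d(range_t roi, range_t erase, range_t out[2])
{
    int n = 0;
    if (erase.lo > roi.lo)                    /* piece left of the hole    */
        out[n++] = (range_t){ roi.lo, erase.lo - 1 };
    if (erase.hi < roi.hi)                    /* piece right of the hole   */
        out[n++] = (range_t){ erase.hi + 1, roi.hi };
    return n;
}

int main(void)
{
    range_t pieces[2];
    int n = erase_1d((range_t){0, 9}, (range_t){3, 5}, pieces);
    for (int i = 0; i < n; i++)
        printf("defined piece: [%ld, %ld]\n", pieces[i].lo, pieces[i].hi);
    /* Prints [0, 2] and [6, 9]: one erase turned one ROI into two.       */
    /* The alternative discussed above: keep the ROI intact and write the */
    /* fill value, so the hole merely "appears" erased.                   */
    return 0;
}
```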
-
I saw some new comments here, and they inspired me to ask some more questions about structured chunks (for reference, a sketch of today's normal-chunk baseline follows the list):

- How will structured chunks be the same as or different from normal chunks?
- Will they use the same buffer?
- Can a dataset be composed of both regular chunks and structured chunks, or does a producer have to decide at the outset which kind of chunk will be used and then stick with that?
- Will a dataset using structured chunks be extendible in the same way normally chunked datasets are?
- Will there be a separate chunk buffer for structured chunks?
- Will new data-sieving code need to be added to support partial requests on datasets composed of structured chunks?
- Can existing chunk-buffering algorithms be applied to structured chunks?
- Is there a point at which a structured chunk becomes so close to fully dense that it should just be converted to a normal chunk, and is/will that be possible?
- There are high-level API routines to direct-write (normal) chunks. Will there be the same now for structured chunks too, or can the existing routines handle structured chunks?
- Will there be new functionality for
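Since several of these questions compare structured chunks against today's behaviour, here is the baseline sketch mentioned above, using only the existing, standard HDF5 API (file and dataset names are arbitrary; error checking omitted for brevity): a normally chunked, unlimited-dimension dataset that is later grown with H5Dset_extent. The open questions are essentially whether each of these steps carries over unchanged to structured chunks.

```c
/* Baseline (existing HDF5 API only): a chunked, extendible 1-D dataset.   */
#include <hdf5.h>

int main(void)
{
    hsize_t dims[1]    = { 0 };                      /* start empty        */
    hsize_t maxdims[1] = { H5S_UNLIMITED };          /* extendible         */
    hsize_t chunk[1]   = { 1024 };                   /* 1024 elements/chunk*/

    hid_t file  = H5Fcreate("baseline.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);                    /* normal chunked layout */

    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    hsize_t new_dims[1] = { 4096 };                  /* grow the dataset   */
    H5Dset_extent(dset, new_dims);

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    return 0;
}
```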
-
We decided to introduce new direct chunking routines. One will be able to use the old routines on read and then do all the tedious work of "unpacking" the sections.
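For context, the existing direct-chunk read path looks roughly like the sketch below (standard HDF5 calls only: H5Dget_chunk_storage_size and H5Dread_chunk; the helper name read_raw_chunk is hypothetical). The "tedious unpacking" would then be the application parsing the structured-chunk sections out of the raw buffer, which is deliberately left as a placeholder here because that layout is defined by the RFC, not by this sketch.

```c
/* Existing direct-chunk read (standard HDF5 API): fetch the raw bytes of
 * one chunk.  For a structured chunk, interpreting the section layout of
 * those bytes would be application-side work ("unpacking"), per the RFC. */
#include <hdf5.h>
#include <stdint.h>
#include <stdlib.h>

/* dset: an open chunked dataset; offset: logical offset of the chunk.    */
static void *read_raw_chunk(hid_t dset, const hsize_t *offset, hsize_t *nbytes)
{
    uint32_t filters = 0;
    if (H5Dget_chunk_storage_size(dset, offset, nbytes) < 0)
        return NULL;                               /* query failed         */
    void *buf = malloc((size_t)*nbytes);
    if (buf && H5Dread_chunk(dset, H5P_DEFAULT, offset, &filters, buf) < 0) {
        free(buf);
        return NULL;
    }
    /* 'buf' now holds the chunk exactly as stored (filters may still be
     * applied, as reported in 'filters').  Unpacking structured-chunk
     * sections from this buffer is the application-side work mentioned
     * above and is not sketched here.                                     */
    return buf;
}
```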
-
Yes, the RFC just outlines some issues. See section 4.6 for |
-
As I mentioned above, it is an optimization, but the chunk's metadata will still be structured-chunk metadata. We cannot change a creation property of an HDF5 object.
-
Update: |
-
Hello, HDF5 community!
We have been working on adding support for sparse data to HDF5. While designing the file format and API extensions, we gave significant thought to a number of other potential HDF5 enhancements and noticed commonalities with the sparse data problem. In particular, the idea of extending the concept of a chunk to contain multiple sections that describe different facets of the values stored in the chunk seems applicable to a number of problems, for example HDF5 variable-length data and non-homogeneous arrays. This in turn raises the problem of compressing these different sections efficiently, as different compression algorithms may be optimal for different sections.
Our approach to HDF5 modifications is documented in two RFCs:
RFC: File Format Changes for Enabling Sparse Storage in HDF5 and RFC: Programming Model to Support Sparse Data in HDF5.
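Purely as a reading aid, here is a conceptual sketch of the "chunk with multiple sections" idea described above. It is not the on-disk layout defined in the RFCs; the section kinds, field names, and types are all illustrative assumptions.

```c
/* Conceptual sketch only -- NOT the RFCs' file-format definition.  It just
 * illustrates a chunk made of several independently compressed sections
 * (e.g. defined-region metadata, fixed-size data, a variable-length heap),
 * each of which may want a different filter.                             */
#include <stdint.h>

typedef enum {                    /* illustrative section kinds            */
    SECTION_DEFINED_REGIONS,      /* which elements are present (sparse)   */
    SECTION_FIXED_DATA,           /* the element values themselves         */
    SECTION_VARLEN_HEAP           /* payload for variable-length elements  */
} section_kind_t;

typedef struct {
    section_kind_t kind;
    uint32_t       filter_id;     /* each section may use its own filter   */
    uint64_t       stored_size;   /* size on disk (after filtering)        */
    uint64_t       raw_size;      /* size in memory (before filtering)     */
    uint64_t       offset;        /* byte offset within the chunk          */
} section_desc_t;

typedef struct {
    uint32_t        n_sections;
    section_desc_t *sections;     /* followed (conceptually) by the bytes  */
} structured_chunk_t;
```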
We are looking for community feedback on our proposal. In particular, we are interested in use cases for storing sparse and variable-length data in HDF5 and in suggestions on HDF5 APIs. Please note that our approach will finally enable compression of variable-length data and writing it in parallel.
I will present our work at HUG23 and, hopefully, at EHUG23.
Please add your comments to this ticket. You are also very welcome to contact me directly if you have any questions, concerns, and/or suggestions; the RFCs have my contact information.
Thank you and I look forward to your feedback.
Elena