You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add support for the loading of HDF5 files as catalogues, which should be more storage-efficient.
Alternatives
Alternatives include other binary file formats, but may suffer from more difficult ABI issues.
Implementation
The standard HDF5 library has a more complex interface, so the header-only, free-of-worries open-source HighFive library is favoured. However, this will introduce a dependency which must be carefully managed.
Additional context
A rough estimate suggests that HDF5 (or other binary formats) may halve the storage needed compared to plain-text.
An excerpt from GPT:
When switching from a text format (like CSV) to HDF5, the storage savings can be substantial, especially for large datasets. Here's a comparison to help estimate the savings:
Text Format (CSV):
Each double value is stored as text, typically taking 15-20 bytes due to precision and delimiters.
HDF5 Format:
Each double is stored in a binary format, taking exactly 8 bytes.
Example Calculation:
Suppose you have a dataset with 1 million double values:
CSV Size:
Assuming an average of 17 bytes per double:
(1,000,000 \times 17 = 17,000,000) bytes (17 MB)
HDF5 Size:
(\frac{9 , \text{MB}}{17 , \text{MB}} \times 100% \approx 52.9%)
Additional Benefits of HDF5:
Compression: HDF5 supports built-in compression (like gzip), which can further reduce file size significantly depending on data characteristics.
Metadata: Efficiently stores metadata alongside the data.
Scalability: Handles large datasets efficiently.
In summary, switching to HDF5 can offer around 50% savings in storage space for raw data, with potential for more savings through compression, along with additional benefits in data management and performance.
The text was updated successfully, but these errors were encountered:
Is the requested feature related to an issue?
No.
Summary
Add support for the loading of HDF5 files as catalogues, which should be more storage-efficient.
Alternatives
Alternatives include other binary file formats, but may suffer from more difficult ABI issues.
Implementation
The standard HDF5 library has a more complex interface, so the header-only, free-of-worries open-source HighFive library is favoured. However, this will introduce a dependency which must be carefully managed.
Additional context
A rough estimate suggests that HDF5 (or other binary formats) may halve the storage needed compared to plain-text.
An excerpt from GPT:
The text was updated successfully, but these errors were encountered: