[FEAT] Read in HDF5 catalogue files in C++ interface #94

Open
MikeSWang opened this issue Nov 27, 2024 · 0 comments

Comments

@MikeSWang
Owner

Is the requested feature related to an issue?

No.

Summary

Add support for loading HDF5 files as catalogues, which should be more storage-efficient than plain-text files.

Alternatives

Alternatives include other binary file formats, but these may suffer from more difficult ABI issues.

Implementation

The standard HDF5 library has a more complex interface, so the header-only, open-source HighFive library, which is simpler to use, is favoured. However, HighFive is a wrapper around the HDF5 C library rather than a replacement for it, so this will introduce dependencies which must be carefully managed.

Additional context

A rough estimate suggests that HDF5 (or other binary formats) may halve the storage needed compared to plain text.

An excerpt from GPT:

When switching from a text format (like CSV) to HDF5, the storage savings can be substantial, especially for large datasets. Here's a comparison to help estimate the savings:

Text Format (CSV): Each double value is stored as text, typically taking 15-20 bytes due to precision and delimiters.

HDF5 Format: Each double is stored in a binary format, taking exactly 8 bytes.

Example Calculation:

Suppose you have a dataset with 1 million double values.

CSV Size: Assuming an average of 17 bytes per double, $1{,}000{,}000 \times 17 = 17{,}000{,}000$ bytes (17 MB).

HDF5 Size: Each double takes 8 bytes, so $1{,}000{,}000 \times 8 = 8{,}000{,}000$ bytes (8 MB).

Estimated Savings:

Size Reduction: CSV is 17 MB and HDF5 is 8 MB, so the savings are $17\,\text{MB} - 8\,\text{MB} = 9\,\text{MB}$.

Percentage Savings: $\frac{9\,\text{MB}}{17\,\text{MB}} \times 100\% \approx 52.9\%$.

Additional Benefits of HDF5:

Compression: HDF5 supports built-in compression (like gzip), which can further reduce file size significantly depending on data characteristics.

Metadata: Efficiently stores metadata alongside the data.

Scalability: Handles large datasets efficiently.

In summary, switching to HDF5 can offer around 50% savings in storage space for raw data, with potential for more savings through compression, along with additional benefits in data management and performance.
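
The arithmetic in the excerpt above can be checked empirically with a small standard-library sketch. It is not part of the proposed implementation and does not touch HDF5 or HighFive. It only compares the byte count of comma-delimited text against raw 8-byte doubles. The function names, the mock column and the choice of 17 significant digits are all assumptions made for illustration. A real HDF5 file also carries some header and metadata overhead, and a real text catalogue may be written at lower precision, so the actual saving will differ.

```cpp
#include <cstddef>
#include <iomanip>
#include <random>
#include <sstream>
#include <vector>

// Bytes needed to hold the values as raw doubles (payload only; any
// container overhead such as HDF5 headers is deliberately ignored).
std::size_t binary_size_bytes(const std::vector<double>& values) {
  return values.size() * sizeof(double);
}

// Bytes needed to hold the values as text with up to `digits` significant
// digits, each value followed by a one-byte delimiter.
std::size_t text_size_bytes(const std::vector<double>& values, int digits) {
  std::ostringstream out;
  out << std::setprecision(digits);
  for (double v : values) {
    out << v << ',';
  }
  return out.str().size();
}

// Fraction of storage saved by the binary representation relative to text.
double fractional_saving(std::size_t text_bytes, std::size_t binary_bytes) {
  return 1.0 - static_cast<double>(binary_bytes) /
                   static_cast<double>(text_bytes);
}

// Hypothetical mock catalogue column: uniform coordinates in [0, 1000).
std::vector<double> mock_column(std::size_t count, unsigned seed) {
  std::mt19937_64 rng(seed);
  std::uniform_real_distribution<double> dist(0.0, 1000.0);
  std::vector<double> values(count);
  for (double& v : values) {
    v = dist(rng);
  }
  return values;
}
```

Calling `fractional_saving(17000000, 8000000)` reproduces the figure of roughly 0.529 quoted above. Passing the results of `text_size_bytes(column, 17)` and `binary_size_bytes(column)` for a column from `mock_column` gives an estimate based on actual formatted output rather than an assumed average width.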

@MikeSWang MikeSWang added feature Feature requests cpp-only C++-specific labels Nov 27, 2024
@MikeSWang MikeSWang self-assigned this Nov 27, 2024