
HDF5 filter and plugin based on c-blosc2? #29

Closed
mkitti opened this issue Feb 4, 2022 · 23 comments

Comments

@mkitti
Contributor

mkitti commented Feb 4, 2022

I would be interested in seeing this plugin updated to work with c-blosc2 code.

@FrancescAlted
Member

That would be great. The new API for C-Blosc2 is backward compatible with C-Blosc, so this should be easy. Just remember that the C-Blosc2 binary format is backward compatible, but not forward compatible.

@mkitti
Contributor Author

mkitti commented Feb 6, 2022

Does that mean that an HDF5 plugin based on c-blosc1 would not be able to read chunks compressed by an HDF5 plugin based on c-blosc2?

@FrancescAlted
Member

That's correct.

@mkitti
Contributor Author

mkitti commented Feb 6, 2022

Is there any way to have c-blosc2 produce a backward-compatible binary format?

@FrancescAlted
Member

I don't think so. At first I tried to keep a format that was forward compatible, but it was too much hassle, and I decided not to do it.

@t20100

t20100 commented Oct 24, 2022

Hi,

Is there a guideline (if any) regarding the compatibility of registered HDF5 filters and when a filter should use a different ID?

The bottom line is being able to read old compressed data with new versions of the filter, and I would expect passing parameters to the filter through HDF5 to remain compatible as well.
Also, ideally IMO, data written with a new version of the filter, without using new features, should be readable with older versions of the filter.
But looking at other HDF5 filters, this is not always the case: for instance, ZFP broke this when updating the underlying compression library, while the bitshuffle filter remained compatible when adding zstd.

@mkitti
Contributor Author

mkitti commented Oct 24, 2022

By the way, Blosc2 is registered with a new ID, 32026:
https://portal.hdfgroup.org/display/support/Filters#Filters-32026

@t20100

t20100 commented Oct 25, 2022

Question: In the event of using c-blosc2 as a compression library for this filter (thus breaking forward compatibility), would older versions of the filter detect that they can't read the chunk and raise an error?

@FrancescAlted
Member

We would need to try, but I am pretty sure that an error will be raised when using C-Blosc1 with chunks generated with C-Blosc2.

On the other hand, the way we are currently using the registered Blosc2 filter (ID 32026) is via a CFrame. The CFrame is more flexible than a regular chunk, and will allow the use of multidim metalayers, which should be useful for optimizing dataset reads.

After reflecting more on this, one path we could follow is to add a check in the current HDF5 Blosc1 filter so that, when an error is detected, it checks whether the data being decompressed is a CFrame and, if so, calls the actual Blosc2 filter. BTW, we have a preliminary version of an HDF5 Blosc2 filter at https://github.com/PyTables/PyTables/tree/direct-chunking-append/hdf5-blosc2/src, and one can use this for the standalone future hdf5-blosc2 filter.
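That dispatch idea could look roughly like this. A minimal sketch: it assumes a CFrame can be recognized by the "b2frame" magic string near the start of the buffer (an assumption about the contiguous-frame layout; verify against the c-blosc2 frame format spec), and `route_chunk` is a hypothetical stand-in for the decision the filter would make before decompressing:

```python
def looks_like_cframe(buf: bytes) -> bool:
    # Assumption: a Blosc2 contiguous frame (CFrame) carries the
    # "b2frame" magic string within its first header bytes, while a
    # plain Blosc1 chunk does not.  The exact offset is a detail of
    # the CFrame layout, so we just scan a small prefix.
    return b"b2frame" in buf[:16]

def route_chunk(buf: bytes) -> str:
    # Hypothetical dispatcher: decide which filter path a chunk
    # should take before attempting decompression.
    return "blosc2" if looks_like_cframe(buf) else "blosc1"
```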

@t20100

t20100 commented Oct 26, 2022

I did a very quick test (no shuffle, default compressor) updating hdf5plugin to compile hdf5-blosc with c-blosc2, and indeed it raises an error:

OSError: Can't read data (Blosc decompression error)

If you want to switch to c-blosc2 sooner or later (I expect it is better in terms of maintenance) and break forward compatibility, maybe it would be good to add a check of the Blosc version in the filter, to prepare for forward-compatibility breaks and provide a more explicit message in this case.

@t20100

t20100 commented Oct 26, 2022

Great to see an HDF5 Blosc2 filter coming up!

@FrancescAlted
Member

After pondering a bit more about Blosc/Blosc2 compatibility, I think a better approach is to make the two filters totally separate. So the current hdf5-blosc will continue supporting just the C-Blosc 1.x series, while the future hdf5-blosc2 will support just the C-Blosc 2.x series. Also, having separate HDF5 filter IDs will help with this.

@mkitti
Contributor Author

mkitti commented Oct 26, 2022

It would be convenient if we could decompress Blosc1 with c-blosc2 though.

@t20100

t20100 commented Oct 26, 2022

It would be convenient if we could decompress Blosc1 with c-blosc2 though.

From what I tested, compiling the hdf5-blosc filter with c-blosc2 lets it decompress chunks compressed with c-blosc1 (it is backward compatible)... but not the other way around.

@froody

froody commented Feb 10, 2024

Any updates on this? There seem to be some Blosc2 plugins available (e.g. PyTables), but none support arbitrary filters as far as I can tell. I need BYTEDELTA to get a good compression ratio, and I really want to be able to read/write HDF5 datasets while specifying the Blosc2 filters used to compress.

@mkitti
Contributor Author

mkitti commented Feb 10, 2024

@froody

froody commented Feb 11, 2024

I've seen that before, how does that answer my question? Is hdf5 going to adopt that proposal?

@mkitti
Contributor Author

mkitti commented Feb 11, 2024

Have you seen https://github.com/Blosc/b2h5py ?

I'm not sure if I understand your question. The Blosc2 filter has been registered as ID 32026. There's nothing more for The HDF Group to do.

@froody

froody commented Feb 11, 2024

b2h5py is mostly out of scope for me. To be clear:

  • I want SHUFFLE + BYTEDELTA blosc2 compression
  • I want to be able to store the compressed data in hdf5 datasets
  • I want to generate these datasets from C/C++
  • I want it to be easy to read from C/python i.e. I don't want to have to read a uint8_t buffer from hdf5 and then call blosc2 to decompress it

You mention the plugin 32026, where is the authoritative implementation of that plugin?

I guess I could compress data with blosc2 myself and write it with H5DOwrite_chunk, trusting that 32026 will just decompress it.

@froody

froody commented Feb 12, 2024

OK, I tried what I suggested above, but I get an error on decompression because dparams.schunk is NULL. If I set dparams.schunk to point to the schunk, then it decompresses correctly. Specifically, I believe there should be an assignment of dparams.schunk = schunk; immediately after this line

@FrancescAlted
Member

FrancescAlted commented Feb 12, 2024

@froody You are right that, with the current API, we cannot use the full functionality of the Blosc2 pipeline inside HDF5. The solution would be to use the cd_values in the HDF5 API in a more imaginative way, but that requires thought and careful execution so as to avoid collisions with existing conventions for storing metainfo in cd_values (e.g. n-dim info).

Meanwhile, I am glad that you figured out the best workaround, i.e. using direct chunking in HDF5 (via H5DOwrite_chunk). FWIW, and if others are reading this, you can use direct chunking in h5py as well. Here is an example where h5py is using grok for compressing with JPEG2000 via Blosc2: https://gist.github.com/t20100/80960ec46abd3a863e85876c013834bb

@t20100

t20100 commented Feb 12, 2024

You mention the plugin 32026, where is the authoritative implementation of that plugin?

It's in PyTables: https://github.com/PyTables/PyTables/tree/master/hdf5-blosc2/src

It's also embedded in hdf5plugin for usage with h5py.
Though, as already mentioned, to have access to all Blosc2 features when writing, one has to use HDF5 direct chunk write.

@FrancescAlted
Member

I think we can close this.
