Blosc2 Codec? #413

Open
rabernat opened this issue Dec 21, 2022 · 8 comments
Labels
New codec Suggestion for a new codec

Comments

@rabernat
Contributor

I recently noticed that Blosc2 2.0 has been released: https://twitter.com/Blosc2/status/1605529031780311041. This made me wonder whether we should revisit the idea of adding blosc2 support to Numcodecs and Zarr.

Obviously blosc2 has gone in the direction of adding more features--it's now much more than just a compression codec, and includes I/O, metadata, plugins, etc.--such that there is no clear boundary between Zarr's features and blosc2's features. So we would first want to decide which parts of blosc2 would be advantageous to expose as a codec in numcodecs. The main question would be whether there is a benefit to using the blosc2 superchunk feature to store multiple chunks in a single shard. If so, we will quickly resurrect the discussion about Caterva in Zarr (zarr-developers/zarr-python#713).

@JackKelly

Ooh, yes, I'd be really interested to see blosc2 integrated with numcodecs / Zarr.

For now, I believe that imagecodecs supports blosc2. But deep integration with numcodecs & Zarr would be very interesting.

@jakirkham
Member

@FrancescAlted and I discussed adding Blosc2 support to Zarr during the 2022 NumFOCUS Summit. I think we concluded that it should be possible. Note that this is a different approach from the one outlined in the blog post above.

The relevant thing for this discussion is the Blosc2 chunk format. This is the thing (I think) we would want to interact with.

The Blosc2 chunk format has blocks, which in Zarr we would call shards. AFAICT these are the same, but the terminology is different. Will use the term shards (as that is what we are familiar with), but keep this in mind when reading the spec.

Blosc2 tracks starts/offsets into shards, which we would likely want to extract and add to the metadata. This overlaps a bit with the sharding work @jstriebel has been doing (zarr-developers/zarr-python#1111), so it could potentially work with that approach. I think we would want to remove the header, as this information would already be stored in Zarr metadata. As a first pass this would likely entail some copying to add/remove the header. Longer term we might want the option to pass in the header separately (or something like this).
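To make the idea of extracting starts/offsets concrete, here is a minimal sketch, not the actual Blosc2 chunk layout. It assumes a shard is just a concatenation of independently compressed pieces, uses `zlib` as a stand-in for Blosc2, and the helper names `build_shard` and `read_chunk` are hypothetical. The `(offset, length)` table is the kind of information that could live in Zarr metadata instead of a per-chunk header.

```python
import zlib


def build_shard(chunks):
    """Concatenate compressed pieces and record (offset, length) for each.

    zlib is only a stand-in here; a real implementation would read the
    offsets from the Blosc2 chunk itself.
    """
    index = []
    parts = []
    offset = 0
    for raw in chunks:
        comp = zlib.compress(raw)
        index.append((offset, len(comp)))
        parts.append(comp)
        offset += len(comp)
    return b"".join(parts), index


def read_chunk(shard, index, i):
    """Decompress only piece i, using the externally stored index."""
    offset, length = index[i]
    return zlib.decompress(shard[offset:offset + length])


# Four pieces of different sizes, so the compressed lengths differ too.
chunks = [bytes([i]) * 1000 * (i + 1) for i in range(4)]
shard, index = build_shard(chunks)
piece = read_chunk(shard, index, 2)
```

With the index stored separately, a reader can fetch one byte range and decompress a single piece without touching the rest of the shard.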

Anyways this is my recollection of that conversation, which is not as fresh as it was. I may very well have forgotten/misunderstood things.

@jakirkham
Member

Perhaps another way to go about this would be to look at using Kerchunk with Blosc2.

@martindurant
Member

Yes, kerchunk is interested in accessing the chunks within a compressed stream. You could regard the compressed blocks as chunks, but they would presumably not be equal length, so additional logic would be needed.

With the release of indexed_gzip, we may be able to do something similar across all implementations. There is some tradeoff here between writing lots of references for individual chunks, storing "shard" information elsewhere, and just requesting the exact matrix offsets from the storage and having the compression layer figure out what to actually read (I don't know if the third is actually possible).
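As a rough illustration of the first option (one reference per block), the sketch below turns a list of unequal compressed block lengths into kerchunk-style `[url, offset, length]` references. The URL, the `data/<i>` key names, the block lengths, and the helper `block_references` are all made up for the example; only the `[url, offset, length]` reference shape is taken from kerchunk.

```python
from itertools import accumulate


def block_references(url, block_lengths, header_size=0, prefix="data"):
    """Build one [url, offset, length] reference per compressed block.

    Blocks are assumed to be stored back to back after an optional header,
    so each offset is the header size plus the lengths of earlier blocks.
    """
    starts = accumulate(block_lengths[:-1], initial=header_size)
    return {
        f"{prefix}/{i}": [url, start, length]
        for i, (start, length) in enumerate(zip(starts, block_lengths))
    }


url = "memory://example/shard.bin"  # hypothetical location
refs = {
    "version": 1,
    "refs": block_references(url, [120, 95, 143], header_size=32),
}
```

Because the blocks are not equal length, the offsets cannot be computed from the block number alone, which is why the lengths have to be recorded somewhere.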

@FrancescAlted

Numcodecs adopting Blosc2 would be great. BTW, what we recently released as 2.0 is Python-Blosc2, not C-Blosc2 (whose 2.0 release happened 1.5 years ago).

For what it's worth, we have just merged Caterva into the main branch of C-Blosc2, so the latter has gained multidimensional capabilities. During the merge, the API has changed a bit (mainly to adapt to the Blosc way of doing things), but the functionality in the new C-Blosc2 is the same. We will let the new API rest a bit, and once the dust has settled, we will proceed with releasing C-Blosc2 (probably 2.7.0) pretty soon.

@mkitti
Contributor

mkitti commented Feb 21, 2023

Just so the situation is clear, Blosc2 compressed data cannot be decompressed by Blosc1. On the other hand, Blosc1 compressed data can be decompressed by Blosc2.

Blosc/hdf5-blosc#29 (comment)

For this reason Blosc1 and Blosc2 are registered as separate HDF5 filter plugins:
https://portal.hdfgroup.org/display/support/Filters#Filters-32026

I suspect numcodecs will need to support both Blosc1 and Blosc2 compression, simultaneously, for the sake of backwards compatibility.
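A toy sketch of what "both simultaneously" could look like: the codec id stored in the array metadata selects the decoder, so existing arrays keep using the old codec while new arrays can opt into the new one. This is not numcodecs' registry; `REGISTRY`, `register`, and `decode_chunk` are hypothetical, and `zlib`/`lzma` merely stand in for Blosc1/Blosc2.

```python
import lzma
import zlib

REGISTRY = {}


def register(cls):
    """Register a codec class under its codec_id (hypothetical helper)."""
    REGISTRY[cls.codec_id] = cls
    return cls


@register
class OldCodec:
    codec_id = "blosc"  # stand-in for Blosc1, implemented with zlib

    def encode(self, buf):
        return zlib.compress(buf)

    def decode(self, buf):
        return zlib.decompress(buf)


@register
class NewCodec:
    codec_id = "blosc2"  # stand-in for Blosc2, implemented with lzma

    def encode(self, buf):
        return lzma.compress(buf)

    def decode(self, buf):
        return lzma.decompress(buf)


def decode_chunk(metadata, payload):
    """Pick the decoder from the codec id recorded in the metadata."""
    codec = REGISTRY[metadata["compressor"]["id"]]()
    return codec.decode(payload)


data = b"zarr" * 256
old_meta = {"compressor": {"id": "blosc"}}
new_meta = {"compressor": {"id": "blosc2"}}
old_payload = OldCodec().encode(data)
new_payload = NewCodec().encode(data)
```

Keeping the two ids distinct mirrors the separate HDF5 filter registrations mentioned above.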

You may also want to consider deprecating Blosc1 compression in favor of Blosc2 compression.

@fschwar4

Hi all,

If anyone really wants the Blosc2 compressors, they could check out the Python implementation of Blosc2. You can easily register this as a new Numcodec. A first test showed improved behaviour over Blosc1 in most cases. I will do some more rigorous testing next week.

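import blosc2
import numcodecs
from numcodecs.abc import Codec
from numcodecs.compat import ndarray_copy

# Map the codec names accepted by this class to python-blosc2's Codec enum.
enum_dict = {
    'blosclz': blosc2.Codec.BLOSCLZ,
    'lz4': blosc2.Codec.LZ4,
    'lz4hc': blosc2.Codec.LZ4HC,
    'zlib': blosc2.Codec.ZLIB,
    'zstd': blosc2.Codec.ZSTD,
    'NDLZ': blosc2.Codec.NDLZ,
    'ZFP_ACC': blosc2.Codec.ZFP_ACC,
    'ZFP_PREC': blosc2.Codec.ZFP_PREC,
    'ZFP_RATE': blosc2.Codec.ZFP_RATE,
}


class Blosc2(Codec):

    codec_id = 'blosc2'

    # The default cname must be a key of enum_dict, hence lowercase.
    def __init__(self, cname='blosclz', clevel=5, shuffle=1, blocksize=0):
        self.cname = cname
        self.clevel = clevel
        self.shuffle = shuffle
        self.blocksize = blocksize

    def encode(self, data):
        # compress2 takes compression parameters as keyword arguments;
        # the filter pipeline is passed as a list under `filters`.
        return blosc2.compress2(
            data,
            codec=enum_dict[self.cname],
            clevel=self.clevel,
            filters=[blosc2.Filter(self.shuffle)],
            blocksize=self.blocksize,
        )

    def decode(self, data, out=None):
        # numcodecs' Codec API allows an optional preallocated `out` buffer.
        dec = blosc2.decompress(data)
        return ndarray_copy(dec, out)


numcodecs.register_codec(Blosc2, 'blosc2')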

[Figures: zarr_blosc1_vs_blosc2 benchmark plots for (60, 10000), (60, 45000000), and (60, 10000000) with 100_000 chunks]

@joshmoore
Member

joshmoore commented Sep 20, 2023

Wow, thanks for the info, @fschwar4. (And of course @FrancescAlted for the PR!:wink:)


9 participants