Blosc2 Codec? #413

rabernat · 2022-12-21T17:05:33Z

I recently noticed that Blosc2 2.0 has been released: https://twitter.com/Blosc2/status/1605529031780311041. This made me wonder whether we should revisit the idea of adding blosc2 support to Numcodecs and Zarr.

Obviously blosc2 has gone in the direction of adding more features--it's now much more than just a compression codec, and includes I/O, metadata, plugins, etc.--such that there is no clear boundary between Zarr's features and blosc2's features. So we would first want to decide which parts of blosc2 would be advantageous to expose as a codec in numcodecs. The main question would be whether there is a benefit to using the blosc2 superchunk feature to store multiple chunks in a single shard. If so, we will quickly resurrect the discussion about Caterva in Zarr (zarr-developers/zarr-python#713).

JackKelly · 2023-01-18T11:10:28Z

Ooh, yes, I'd be really interested to see blosc2 integrated with numcodecs / Zarr.

For now, I believe that imagecodecs supports blosc2. But deep integration with numcodecs & Zarr would be very interesting.

jakirkham · 2023-01-21T05:34:20Z

@FrancescAlted and I discussed adding Blosc2 support to Zarr during the 2022 NumFOCUS Summit. I think we concluded that it should be possible. Note this is a different approach than what is outlined in the blogpost above.

The relevant thing for this discussion is the Blosc2 chunk format. This is the thing (I think) we would want to interact with.

The Blosc2 chunk format has blocks, which in Zarr we would call shards. AFAICT these are the same, but the terminology is different. Will use the term shards (as that is what we are familiar with), but keep this in mind when reading the spec.

Blosc2 tracks starts/offsets into shards, which we would likely want to extract and add to the metadata. This overlaps a bit with the sharding work @jstriebel has been doing (zarr-developers/zarr-python#1111), so it could potentially work with that approach. I think we would want to remove the header, as this would already be stored in Zarr metadata. As a first pass this would likely entail some copying to add/remove the header. Longer term we might want the option to pass in the header separately (or something like this).

Anyways this is my recollection of that conversation, which is not as fresh as it was. I may very well have forgotten/misunderstood things.

jakirkham · 2023-01-21T05:34:38Z

Perhaps another way to go about this would be to look at using Kerchunk with Blosc2.

martindurant · 2023-01-23T16:03:26Z

Yes, kerchunk is interested in accessing the chunks within a compressed stream. You could regard the compressed blocks as chunks, but they would presumably not be equal length, so additional logic would be needed.

With the release of indexed_gzip, we may be able to do something similar across all implementations. There is some tradeoff here between writing lots of references for individual chunks versus storing "shard" information elsewhere versus just requesting the exact matrix offsets from the storage and having the compression layer figure out what to actually read (I don't know if the third is actually possible).
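To make the "additional logic" concrete, here is a minimal sketch of the second option (storing "shard" information elsewhere). It compresses blocks independently, concatenates them, and records an `(offset, length)` pair per block so that a single block can be fetched with one byte-range read. `zlib` is used only as a stand-in for Blosc2, and `write_shard` and `read_block` are hypothetical names, not part of kerchunk, Zarr or Blosc2.

```python
import zlib


def write_shard(blocks):
    """Compress each block on its own and build an (offset, length) index."""
    payload = bytearray()
    index = []
    for block in blocks:
        compressed = zlib.compress(block)
        # Compressed blocks differ in length, so each offset must be recorded.
        index.append((len(payload), len(compressed)))
        payload += compressed
    return bytes(payload), index


def read_block(payload, index, i):
    """Decompress block i using only its own byte range."""
    offset, length = index[i]
    return zlib.decompress(payload[offset:offset + length])


# Four blocks of different sizes, so the compressed lengths differ too.
blocks = [bytes([i]) * (1000 * (i + 1)) for i in range(4)]
payload, index = write_shard(blocks)

# Only the bytes of block 2 are touched here.
third = read_block(payload, index, 2)
```

With a remote store, `payload[offset:offset + length]` would become a ranged request, and `index` is the part that would live in a reference file or in the array metadata.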

FrancescAlted · 2023-01-25T13:31:12Z

Numcodecs adopting Blosc2 would be great. BTW, what we recently released as 2.0 is Python-Blosc2, not C-Blosc2 (whose 2.0 release happened 1.5 years ago).

For what it's worth, we have just merged Caterva into the main branch of C-Blosc2, so the latter has gained multidimensional capabilities. During the merge, the API changed a bit (mainly to adapt to the Blosc way of doing things), but the functionality in the new C-Blosc2 is the same. We will let the new API rest a bit and, once the dust has settled, release C-Blosc2 (probably 2.7.0) pretty soon.

mkitti · 2023-02-21T03:32:43Z

Just so the situation is clear, Blosc2 compressed data is not decompressible by Blosc1. On the other hand, Blosc1 compressed data can be decompressed by Blosc2.

Blosc/hdf5-blosc#29 (comment)

For this reason Blosc1 and Blosc2 are registered as separate HDF5 filter plugins:
https://portal.hdfgroup.org/display/support/Filters#Filters-32026

I suspect numcodecs will need to support both Blosc1 and Blosc2 compression, simultaneously, for the sake of backwards compatibility.

You may also want to consider deprecating Blosc1 compression in favor of Blosc2 compression.

fschwar4 · 2023-09-20T08:01:36Z

Hi all,

If anyone really wants the Blosc2 compressors, they could check out the Python implementation of Blosc2. You can easily register it as a new Numcodecs codec. A first test showed improved behaviour over Blosc1 in most cases. I will do some more rigorous testing next week.

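import blosc2
import numcodecs
from numcodecs.abc import Codec
from numcodecs.compat import ensure_contiguous_ndarray, ndarray_copy

# Map codec names to blosc2.Codec enum members (keys are case-sensitive).
enum_dict = {
    'blosclz': blosc2.Codec.BLOSCLZ,
    'lz4': blosc2.Codec.LZ4,
    'lz4hc': blosc2.Codec.LZ4HC,
    'zlib': blosc2.Codec.ZLIB,
    'zstd': blosc2.Codec.ZSTD,
    'NDLZ': blosc2.Codec.NDLZ,
    'ZFP_ACC': blosc2.Codec.ZFP_ACC,
    'ZFP_PREC': blosc2.Codec.ZFP_PREC,
    'ZFP_RATE': blosc2.Codec.ZFP_RATE,
}


class Blosc2(Codec):
    """Numcodecs codec that wraps the python-blosc2 compressors."""

    codec_id = 'blosc2'

    def __init__(self, cname='blosclz', clevel=5, shuffle=1, blocksize=0):
        # cname must be a key of enum_dict; shuffle is a blosc2.Filter value
        # (0 = no filter, 1 = shuffle, 2 = bitshuffle).
        self.cname = cname
        self.clevel = clevel
        self.shuffle = shuffle
        self.blocksize = blocksize

    def encode(self, buf):
        # Blosc2 needs a contiguous buffer.
        buf = ensure_contiguous_ndarray(buf)
        return blosc2.compress2(
            buf,
            codec=enum_dict[self.cname],
            clevel=self.clevel,
            # compress2 expects a list of filters under the `filters` keyword.
            filters=[blosc2.Filter(self.shuffle)],
            blocksize=self.blocksize,
        )

    def decode(self, buf, out=None):
        # decompress2 is the counterpart of compress2.
        buf = ensure_contiguous_ndarray(buf)
        dec = blosc2.decompress2(buf)
        # Copy into `out` when the caller supplies a destination buffer.
        return ndarray_copy(dec, out)


numcodecs.register_codec(Blosc2, 'blosc2')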

joshmoore · 2023-09-20T09:46:08Z

Wow, thanks for the info, @fschwar4. (And of course @FrancescAlted for the PR!:wink:)

jstriebel mentioned this issue Feb 3, 2023

Review of the ZEP2 spec - Sharding storage transformer zarr-developers/zarr-specs#152

Closed

mkitti mentioned this issue May 18, 2023

Blosc2 acquire-project/acquire-driver-zarr#21

Open

martindurant mentioned this issue Sep 27, 2023

blosc? milesgranger/cramjam#110

Closed

jakirkham mentioned this issue Dec 11, 2023

Partial read zarr-developers/zarr-python#667

Merged

jhamman mentioned this issue May 20, 2024

Support Blosc2 codec zarr-developers/zarr-python#1896

Closed

dstansby added the "New codec" (Suggestion for a new codec) label Aug 11, 2024
