should we abstract over v2 and v3 codecs #2654

d-v-b · 2025-01-06T10:56:36Z

Over in #2647 @jni asked the following question:

To help us fix the napari side, can someone point to a compressor= dict that we can pass to zarr.open that will work on zarr-python 2.x and 3.x with zarr v2 and v3 arrays? 🙏

Unfortunately, at this time there is no dict or codec class instance that satisfies this requirement. By design, v2 and v3 chunk encoding are completely distinct entities. I wonder if this is wise.

For example, can someone explain why we really need two versions of a gzip codec (one in numcodecs, and one defined here)? From what I can tell, the only difference between these two gzip codecs is the JSON serialization: the numcodecs version serializes to {"id": "gzip", "level": <int>}, while the zarr v3 version serializes to {"name": "gzip", "configuration": {"level": <int>}}. Should users creating arrays in zarr-python have to care about this minor difference?
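To make the difference concrete, here is a rough sketch of converting between the two gzip shapes described above. The function names v2_to_v3 and v3_to_v2 are hypothetical and are not part of zarr-python or numcodecs; the sketch only assumes the two JSON forms quoted in this comment, and other codecs may differ in more than the outer wrapper.

```python
from typing import Any


def v2_to_v3(codec: dict[str, Any]) -> dict[str, Any]:
    """Wrap a v2-style codec dict in the v3-style name/configuration shape.

    Hypothetical helper: it only rearranges keys and does not translate
    any codec-specific parameter values.
    """
    config = {key: value for key, value in codec.items() if key != "id"}
    return {"name": codec["id"], "configuration": config}


def v3_to_v2(codec: dict[str, Any]) -> dict[str, Any]:
    """Flatten a v3-style codec dict into the v2-style id-plus-parameters shape."""
    return {"id": codec["name"], **codec.get("configuration", {})}


gzip_v2 = {"id": "gzip", "level": 5}
gzip_v3 = v2_to_v3(gzip_v2)
```

For gzip the round trip is lossless, because the two forms carry the same single parameter.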

I don't think users need or want to care about the differences between zarr v2 and zarr v3 codec serialization. So I propose that we should allow code like create_array(...compressor=foo, zarr_format=2) and create_array(...compressor=foo, zarr_format=3) for the same value of foo.

Here's a simple short-term solution: for codecs like blosc and gzip that exist in both zarr v2 and v3, how about we allow functions like create_array to accept either the zarr v2 or the zarr v3 codec (or its dict form)?

Here's a more complex, longer term solution: all the codecs in numcodecs should be altered to produce either zarr v2 or zarr v3 JSON serializations. That is, the numcodecs Gzip should have a serialization method zarr v2 clients can use, and a separate serialization method that zarr v3 clients can use.
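A minimal sketch of what that longer-term idea could look like, assuming a stand-in class rather than the real numcodecs Gzip. The class name GzipSketch and the methods to_v2_json, to_v3_json, and to_json are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GzipSketch:
    """Hypothetical stand-in for a gzip codec; not the numcodecs class."""

    level: int = 1

    def to_v2_json(self) -> dict[str, Any]:
        # Shape that zarr v2 clients expect.
        return {"id": "gzip", "level": self.level}

    def to_v3_json(self) -> dict[str, Any]:
        # Shape that zarr v3 clients expect.
        return {"name": "gzip", "configuration": {"level": self.level}}

    def to_json(self, zarr_format: int) -> dict[str, Any]:
        # Dispatch on the target format so callers can pass the same object.
        if zarr_format == 2:
            return self.to_v2_json()
        if zarr_format == 3:
            return self.to_v3_json()
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")


codec = GzipSketch(level=5)
```

With something like this, a single codec value could be handed to an array-creation function regardless of zarr_format, and the function would pick the serialization it needs.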

jni · 2025-01-06T12:13:36Z

I don't think users need or want to care about the differences between zarr v2 and zarr v3 codec serialization. So I propose that we should allow code like create_array(...compressor=foo, zarr_format=2) and create_array(...compressor=foo, zarr_format=3) for the same value of foo.

I'm very +1 to this: it feels like it would be a relatively small effort (maybe even fitting in before 3.0, though I say this from a "haven't really looked at the code" perspective) and would be very useful to help projects transition.

normanrz · 2025-01-06T15:18:58Z

I definitely wouldn't want to rush this.

More generally, I see the 3.0 release as a v3-first library, with v2 in support mode. The library should support reading all v2 data, but the incentive for new data should be on v3. That is why we switched the default zarr_format. Therefore, I am more interested in designing good APIs that work for v3 instead of trying to paper over the differences of the 2 format versions.

Anyways, adding an implicit conversion for a transitional period would be fine with me. But more like a hotfix than a real solution.

TomNicholas · 2025-01-06T15:59:34Z

I think this would also be useful for VirtualiZarr.
