
[v3] default compressor / codec pipeline #2267

Open
Tracked by #2412
jhamman opened this issue Sep 28, 2024 · 19 comments · May be fixed by #2470
Labels: bug (Potential issues with the zarr-python library), V3 (Affects the v3 branch)
Milestone: 3.0.0

Comments

@jhamman
Member

jhamman commented Sep 28, 2024

Zarr version

3.0.0.alpha6

Numcodecs version

N/A

Python Version

N/A

Operating System

N/A

Installation

N/A

Description

In Zarr-Python 2.x, Zarr provided default compressors for most (all?) datatypes. As of now, in 3.0, we don't provide any defaults.

Steps to reproduce

In 2.18:

In [1]: import zarr

In [2]: arr = zarr.create(shape=(10, ), path='foo', store={})

In [3]: arr.compressor
Out[3]: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)

In 3.0.0.alpha6

In [4]: arr = zarr.create(shape=(10, ), path='foo', store={}, zarr_format=2)

In [5]: arr.metadata
Out[5]: ArrayV2Metadata(shape=(10,), fill_value=0.0, chunk_grid=RegularChunkGrid(chunk_shape=(10,)), attributes={}, zarr_format=2, data_type=dtype('float64'), order='C', filters=None, dimension_separator='.', compressor=None)

Additional output

No response

@jhamman added the bug and V3 labels Sep 28, 2024
@jhamman added this to the 3.0.0 milestone Sep 28, 2024
@d-v-b
Contributor

d-v-b commented Sep 28, 2024

thoughts on defaulting to Zstandard?

@zoj613

zoj613 commented Sep 29, 2024

Does compressor mean "codec" in the v3 example? If so, then I think the "default" codec is implicitly the bytes codec, since the v3 codec pipeline requires that exactly one array->bytes codec be present.
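For illustration, here is roughly what that looks like at the metadata level (a sketch based on the v3 spec; the "bytes" and "zstd" codec names come from the spec, and the values shown are illustrative):

# v3 array metadata fragment: exactly one array->bytes codec ("bytes"),
# optionally followed by bytes->bytes codecs (compressors)
codecs = [
    {"name": "bytes", "configuration": {"endian": "little"}},   # array -> bytes
    # {"name": "zstd", "configuration": {"level": 5}},          # bytes -> bytes (optional)
]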

@d-v-b
Contributor

d-v-b commented Sep 29, 2024

the v3 example is creating a zarr v2 array, so the "compressor" concept still applies there.

But it's also worth considering what the default codec pipeline should be when creating zarr v3 arrays. If it were easy to specify, then I think defaulting to a sharding codec with an inner chunk size equal to the outer chunk size (i.e., a minimal shard) would actually be a good default, but we would have to see how this looks.
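As a rough sketch of what such a "minimal shard" default could look like with the codec classes on the v3 branch (ShardingCodec and BytesCodec are assumed to be importable from zarr.codecs; the chunk shape is illustrative):

from zarr.codecs import BytesCodec, ShardingCodec

chunk_shape = (1000, 1000)

# the inner (sub-)chunk shape equals the outer chunk shape,
# so each shard holds exactly one inner chunk
codecs = [ShardingCodec(chunk_shape=chunk_shape, codecs=[BytesCodec()])]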

@rabernat
Contributor

I think we should make the defaults exactly the same as they are in v2.

I do not think we should make sharding the default until we have spent more time optimizing and debugging it.

@d-v-b
Contributor

d-v-b commented Sep 29, 2024

I think we should make the defaults exactly the same as they are in v2.

The default compressor in v2 was Blosc() with no arguments if Blosc was available, falling back to Zlib otherwise. Blosc itself takes parameters that determine which compression scheme to use, so I think leaving everything at its defaults in v2 was a mistake that we don't want to repeat.

I'd be fine with keeping Blosc as the default in zarr-python v3 but we should explicitly configure it, and if we are doing that it might be worth considering the pros / cons of using something other than the numcodecs default blosc configuration. In numcodecs, the default configuration for Blosc uses the lz4 compressor, which iirc is rather fast but doesn't compress well. I understand the urge to preserve v2 behavior, but on the other hand if lz4 has significantly worse performance than another blosc compressor (e.g., zstd), then we should consider the value of giving zarr users a good default.
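For reference, a sketch of the difference between relying on the numcodecs defaults and configuring Blosc explicitly (the explicit values are illustrative, not a proposal):

import numcodecs

# numcodecs' implicit default: lz4 at level 5 with byte shuffle
implicit = numcodecs.Blosc()

# an explicit configuration, e.g. swapping the internal compressor for zstd
explicit = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=numcodecs.Blosc.SHUFFLE)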

cc @mkitti

@mkitti

mkitti commented Oct 1, 2024

Before defaulting to Blosc in Zarr v3, we should really fix the issue in #2171. That is, there should probably be an ArrayBytesBloscCodec that can actually transmit dtype / typesize information correctly to Blosc. Perhaps one could follow the zarr-java implementation and default the typesize based on the dtype.

Also, continuing to encode new data with Blosc v1 is unwise given the current stage-of-life of that package. Blosc v1 is in what I will term "community maintenance mode". Unless you are actively thinking about it, I would not assume that anyone is actively maintaining the package.

My recommendations from most favored to least are:

  1. No default compression. Picking a particular compressor now is probably not going to age well. Users should make explicit choices about compression given what they know about their data.
  2. Zstandard. Zstandard is designed to be a flexible compressor that can handle a wide variety of data. The compressor is widely available.
  3. Blosc v2. Blosc v2 has a superior msgpack based contiguous frame format vs Blosc v1. It is also seeing active development including the delta and bytedelta filters as well as improved SIMD support.

@dstansby
Contributor

dstansby commented Oct 1, 2024

I would be reluctant to pick Blosc, given it isn't very actively maintained and the Blosc maintainers would rather folks transitioned to Blosc2. In this context I think we have a responsibility to move away from providing Blosc (version 1) as the default compressor, and zarr-python v3 seems like a good point to do that.

I'd be 👍 to defaulting to no compression. This would then force users to learn about compression and choose a compressor that works well for them.

@dstansby
Contributor

dstansby commented Oct 1, 2024

I think we should make the defaults exactly the same as they are in v2.

@rabernat can you expand a bit on why you think we should keep the same defaults?

@d-v-b
Contributor

d-v-b commented Oct 1, 2024

This would then force users to learn about compression and choose a compressor that works well for them.

Instead of learning how zarr works, people might just get frustrated that their data isn't compressed and conclude that the format isn't worth the trouble. I like the idea but I don't think a "pedagogical default" beats a "good default" here. My preference would be that we pick a solid compressor that offers good all-around performance, and I think Zstd (sans blosc) clears that bar.

@rabernat
Contributor

rabernat commented Oct 1, 2024

@rabernat can you expand a bit on why you think we should keep the same defaults?

Just for the sake of maintaining consistency and not having to make a decision. However, if there are serious drawbacks (as it appears there are), I'm fine with zstd.

Do we have a sense of the performance implications of this choice?

@mkitti

mkitti commented Oct 1, 2024

Here are blosc's internal benchmarks:
https://www.blosc.org/posts/zstd-has-just-landed-in-blosc/

"Not the fastest, but a nicely balanced one" is a good summary. Basically, the default settings balance compression ratio and speed. Lower or negative levels provide more speed. Higher levels provide more compression ratio.

@TomAugspurger
Contributor

One note here that's relevant for #2036 and pydata/xarray#9515: the default codec can depend on the dtype of the array:

# zarr-python 2.18.3
>>> g = zarr.group(store={})
>>> g.create(name="b", shape=(3,), dtype=str).filters
[VLenUTF8()]

@d-v-b
Contributor

d-v-b commented Oct 3, 2024

Good point @TomAugspurger. In that case we probably should default to a string literal like "auto" for compressor / filters, and then use functions like auto_compressor(dtype) -> Codec / auto_filters(dtype) -> list[Codec] to transform "auto" into a concrete configuration, given the dtype (and whatever else we deem relevant).
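A rough sketch of that dispatch (the function names mirror the ones above and are hypothetical, not an existing zarr-python API; the codec choices are illustrative):

from __future__ import annotations

import numpy as np
from numcodecs import VLenUTF8, Zstd
from numcodecs.abc import Codec

def auto_filters(dtype: np.dtype) -> list[Codec]:
    # variable-length string data needs an object-array filter
    if dtype.kind in ("U", "O"):
        return [VLenUTF8()]
    return []

def auto_compressor(dtype: np.dtype) -> Codec | None:
    # a single, explicitly configured bytes->bytes compressor for everything else
    return Zstd(level=5)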

@rabernat
Contributor

rabernat commented Oct 3, 2024

Big 👍 to that idea.

@rabernat
Contributor

We now have automatic detection of the ArrayBytes codec based on dtype:

def _get_default_array_bytes_codec(
    np_dtype: np.dtype[Any],
) -> BytesCodec | VLenUTF8Codec | VLenBytesCodec:
    dtype = DataType.from_numpy(np_dtype)
    if dtype == DataType.string:
        return VLenUTF8Codec()
    elif dtype == DataType.bytes:
        return VLenBytesCodec()
    else:
        return BytesCodec()

Next step for this issue is to just add a default BytesBytes compressor.
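A sketch of what that next step could look like, following the same pattern (the helper name is hypothetical; ZstdCodec and BytesBytesCodec are assumed to be importable from zarr.codecs and zarr.abc.codec on the v3 branch, and the level is illustrative):

from typing import Any

import numpy as np
from zarr.abc.codec import BytesBytesCodec
from zarr.codecs import ZstdCodec

def _get_default_bytes_bytes_codecs(np_dtype: np.dtype[Any]) -> list[BytesBytesCodec]:
    # one explicitly configured compressor for every dtype
    return [ZstdCodec(level=5)]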

@jhamman
Member Author

jhamman commented Oct 18, 2024

Based on the discussion today, we landed on:

  • for most dtypes, use Bytes + Zstd
  • for string dtypes, VLenBytesCodec

Related to this is how to set the default. In 2.x, we told folks to set zarr.storage.default_compressor. Going forward, I'd like us to use zarr.config for this.
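A sketch of what that could look like (zarr.config exists and is donfig-based, so zarr.config.set is real; the key name and value shape below are hypothetical and would be settled by the implementation):

import zarr

# hypothetical key name -- the real one is still up for discussion
zarr.config.set({"array.default_compressor": {"name": "zstd", "configuration": {"level": 5}}})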

@akshaysubr
Contributor

+1 for potentially using Zstd or Zlib as default compressors. It would be ideal to use a format that is standardized, has a strongly specified stream definition, and is widely used and supported.

@brokkoli71 linked a pull request Nov 6, 2024 that will close this issue
@brokkoli71
Member

brokkoli71 commented Nov 6, 2024

Based on the discussion today, we landed on:

* for most dtypes, use `Bytes` + `Zstd`

* for string dtypes, `VLenBytesCodec`

Related to this is how to set the default. In 2.x, we told folks to set zarr.storage.default_compressor. Going forward, I'd like us to use zarr.config for this.

Based on this, I started working on #2470. Feel free to leave a comment.

@brokkoli71
Member

brokkoli71 commented Nov 13, 2024

@jhamman
I am a little confused; maybe you can help me: in which cases should we use VLenBytes, and in which cases VLenUTF8? Or do we need both at the same time?
Also, do we need a default compressor, or are default filters sufficient?
