
[v3] default compressor / codec pipeline #2267

Open
Tracked by #2412
jhamman opened this issue Sep 28, 2024 · 19 comments · May be fixed by #2470
Labels: bug (Potential issues with the zarr-python library), V3 (Affects the v3 branch)
Milestone: 3.0.0

Comments

@jhamman
Member

jhamman commented Sep 28, 2024

Zarr version

3.0.0.alpha6

Numcodecs version

N/A

Python Version

N/A

Operating System

N/A

Installation

N/A

Description

In Zarr-Python 2.x, Zarr provided default compressors for most (all?) datatypes. As of now, in 3.0, we don't provide any defaults.

Steps to reproduce

In 2.18:

In [1]: import zarr

In [2]: arr = zarr.create(shape=(10, ), path='foo', store={})

In [3]: arr.compressor
Out[3]: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)

In 3.0.0.alpha6

In [4]: arr = zarr.create(shape=(10, ), path='foo', store={}, zarr_format=2)

In [5]: arr.metadata
Out[5]: ArrayV2Metadata(shape=(10,), fill_value=0.0, chunk_grid=RegularChunkGrid(chunk_shape=(10,)), attributes={}, zarr_format=2, data_type=dtype('float64'), order='C', filters=None, dimension_separator='.', compressor=None)

Additional output

No response

@jhamman added the bug and V3 labels Sep 28, 2024
@jhamman added this to the 3.0.0 milestone Sep 28, 2024
@d-v-b
Contributor

d-v-b commented Sep 28, 2024

thoughts on defaulting to Zstandard?

@zoj613

zoj613 commented Sep 29, 2024

Does compressor mean "codec" in the v3 example? If so, then I think the "default" codec is implicitly the bytes codec, since the v3 codec pipeline requires that exactly one array->bytes codec be present.
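For illustration, here is roughly what that looks like at the metadata level (a sketch based on the v3 spec; the "bytes" and "zstd" codec names come from the spec, and the values shown are illustrative):

# v3 array metadata fragment: exactly one array->bytes codec ("bytes"),
# optionally followed by bytes->bytes codecs (compressors)
codecs = [
    {"name": "bytes", "configuration": {"endian": "little"}},   # array -> bytes
    # {"name": "zstd", "configuration": {"level": 5}},          # bytes -> bytes (optional)
]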

@d-v-b
Contributor

d-v-b commented Sep 29, 2024

the v3 example is creating a zarr v2 array, so the "compressor" concept still applies there.

But it's also worth considering what the default codec pipeline should be when creating zarr v3 arrays. If it were easy to specify, then I think defaulting to a sharding codec with an inner chunk size equal to the outer chunk size (i.e., a minimal shard) would actually be a good default, but we would have to see how this looks.
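As a rough sketch of what such a "minimal shard" default could look like with the codec classes on the v3 branch (ShardingCodec and BytesCodec are assumed to be importable from zarr.codecs; the chunk shape is illustrative):

from zarr.codecs import BytesCodec, ShardingCodec

chunk_shape = (1000, 1000)

# the inner (sub-)chunk shape equals the outer chunk shape,
# so each shard holds exactly one inner chunk
codecs = [ShardingCodec(chunk_shape=chunk_shape, codecs=[BytesCodec()])]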

@rabernat
Contributor

I think we should make the defaults exactly the same as they are in v2.

I do not think we should make sharding the default until we have spent more time optimizing and debugging it.

@d-v-b
Contributor

d-v-b commented Sep 29, 2024

I think we should make the defaults exactly the same as they are in v2.

The default compressor in v2 was Blosc() with no arguments if Blosc was available, falling back to Zlib otherwise. Blosc itself takes parameters that determine which compression scheme to use, so I think leaving everything at its defaults in v2 was a mistake that we don't want to repeat.

I'd be fine with keeping Blosc as the default in zarr-python v3 but we should explicitly configure it, and if we are doing that it might be worth considering the pros / cons of using something other than the numcodecs default blosc configuration. In numcodecs, the default configuration for Blosc uses the lz4 compressor, which iirc is rather fast but doesn't compress well. I understand the urge to preserve v2 behavior, but on the other hand if lz4 has significantly worse performance than another blosc compressor (e.g., zstd), then we should consider the value of giving zarr users a good default.
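For reference, a sketch of the difference between relying on the numcodecs defaults and configuring Blosc explicitly (the explicit values are illustrative, not a proposal):

import numcodecs

# numcodecs' implicit default: lz4 at level 5 with byte shuffle
implicit = numcodecs.Blosc()

# an explicit configuration, e.g. swapping the internal compressor for zstd
explicit = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=numcodecs.Blosc.SHUFFLE)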

cc @mkitti

@mkitti

mkitti commented Oct 1, 2024

Before defaulting to Blosc in Zarr v3, we should really fix the issue in #2171. That is, there should probably be an ArrayBytesBloscCodec that can actually transmit dtype / typesize information correctly to Blosc. Perhaps one could follow the zarr-java implementation and default the typesize based on the dtype.

Also, continuing to encode new data with Blosc v1 is unwise given the current stage-of-life of that package. Blosc v1 is in what I will term "community maintenance mode". Unless you are actively thinking about it, I would not assume that anyone is actively maintaining the package.

My recommendations from most favored to least are:

  1. No default compression. Picking a particular compressor now is probably not going to age well. Users should make explicit choices about compression given what they know about their data.
  2. Zstandard. Zstandard is designed to be a flexible compressor that can handle a wide variety of data. The compressor is widely available.
  3. Blosc v2. Blosc v2 has a superior msgpack based contiguous frame format vs Blosc v1. It is also seeing active development including the delta and bytedelta filters as well as improved SIMD support.

@dstansby
Contributor

dstansby commented Oct 1, 2024

I would be reluctant to pick Blosc, given it isn't very actively maintained and the Blosc maintainers would rather folks transitioned to Blosc2. In this context I think we have a responsibility to move away from providing Blosc (version 1) as the default compressor, and zarr-python v3 seems like a good point to do that.

I'd be 👍 to defaulting to no compression. This would then force users to learn about compression and choose a compressor that works well for them.

@dstansby
Contributor

dstansby commented Oct 1, 2024

I think we should make the defaults exactly the same as they are in v2.

@rabernat can you expand a bit on why you think we should keep the same defaults?

@d-v-b
Contributor

d-v-b commented Oct 1, 2024

This would then force users to learn about compression and choose a compressor that works well for them.

Instead of learning how zarr works, people might just get frustrated that their data isn't compressed and conclude that the format isn't worth the trouble. I like the idea but I don't think a "pedagogical default" beats a "good default" here. My preference would be that we pick a solid compressor that offers good all-around performance, and I think Zstd (sans blosc) clears that bar.

@rabernat
Contributor

rabernat commented Oct 1, 2024

@rabernat can you expand a bit on why you think we should keep the same defaults?

Just for the sake of maintaining consistency and not having to make a decision. However, if there are serious drawbacks (as it appears there are), I'm fine with zstd.

Do we have a sense of the performance implications of this choice?

@mkitti

mkitti commented Oct 1, 2024

Here are blosc's internal benchmarks:
https://www.blosc.org/posts/zstd-has-just-landed-in-blosc/

"Not the fastest, but a nicely balanced one" is a good summary. Basically, the default settings balance compression ratio and speed. Lower or negative levels provide more speed. Higher levels provide more compression ratio.

@TomAugspurger
Contributor

One note here that's relevant for #2036 and pydata/xarray#9515: the default codec can depend on the dtype of the array:

# zarr-python 2.18.3
>>> g = zarr.group(store={})
>>> g.create(name="b", shape=(3,), dtype=str).filters
[VLenUTF8()]

@d-v-b
Contributor

d-v-b commented Oct 3, 2024

Good point @TomAugspurger. In that case we probably should default to a string literal like "auto" for compressor / filters, and then use functions like auto_compressor(dtype) -> Codec / auto_filters(dtype) -> list[Codec] to transform "auto" into a concrete configuration, given the dtype (and whatever else we deem relevant).
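A rough sketch of that dispatch (the function names mirror the ones above and are hypothetical, not an existing zarr-python API; the codec choices are illustrative):

from __future__ import annotations

import numpy as np
from numcodecs import VLenUTF8, Zstd
from numcodecs.abc import Codec

def auto_filters(dtype: np.dtype) -> list[Codec]:
    # variable-length string data needs an object-array filter
    if dtype.kind in ("U", "O"):
        return [VLenUTF8()]
    return []

def auto_compressor(dtype: np.dtype) -> Codec | None:
    # a single, explicitly configured bytes->bytes compressor for everything else
    return Zstd(level=5)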

@rabernat
Contributor

rabernat commented Oct 3, 2024

Big 👍 to that idea.

@rabernat
Contributor

We now have automatic detection of the ArrayBytes codec based on dtype:

def _get_default_array_bytes_codec(
    np_dtype: np.dtype[Any],
) -> BytesCodec | VLenUTF8Codec | VLenBytesCodec:
    dtype = DataType.from_numpy(np_dtype)
    if dtype == DataType.string:
        return VLenUTF8Codec()
    elif dtype == DataType.bytes:
        return VLenBytesCodec()
    else:
        return BytesCodec()

Next step for this issue is to just add a default BytesBytes compressor.
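A sketch of what that next step could look like, following the same pattern (the helper name is hypothetical; ZstdCodec and BytesBytesCodec are assumed to be importable from zarr.codecs and zarr.abc.codec on the v3 branch, and the level is illustrative):

from typing import Any

import numpy as np
from zarr.abc.codec import BytesBytesCodec
from zarr.codecs import ZstdCodec

def _get_default_bytes_bytes_codecs(np_dtype: np.dtype[Any]) -> list[BytesBytesCodec]:
    # one explicitly configured compressor for every dtype
    return [ZstdCodec(level=5)]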

@jhamman
Member Author

jhamman commented Oct 18, 2024

Based on the discussion today, we landed on:

  • for most dtypes, use Bytes + Zstd
  • for string dtypes, VLenBytesCodec

Related to this is how to set the default. In 2.x, we told folks to set zarr.storage.default_compressor. Going forward, I'd like us to use zarr.config for this.
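A sketch of what that could look like (zarr.config exists and is donfig-based, so zarr.config.set is real; the key name and value shape below are hypothetical and would be settled by the implementation):

import zarr

# hypothetical key name -- the real one is still up for discussion
zarr.config.set({"array.default_compressor": {"name": "zstd", "configuration": {"level": 5}}})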

@akshaysubr
Contributor

+1 for potentially using Zstd or Zlib as default compressors. It would be ideal to use a format that is standardized, has a strongly specified stream definition, and is widely used and supported.

@brokkoli71 linked a pull request Nov 6, 2024 that will close this issue
@brokkoli71
Member

brokkoli71 commented Nov 6, 2024

Based on the discussion today, we landed on:

* for most dtypes, use `Bytes` + `Zstd`

* for string dtypes, `VLenBytesCodec`

Related to this is how to set the default. In 2.x, we told folks to set zarr.storage.default_compressor. Going forward, I'd like us to use zarr.config for this.

Based on this, I started working on #2470. Feel free to leave a comment.

@brokkoli71
Member

brokkoli71 commented Nov 13, 2024

@jhamman
I am a little confused; maybe you can help me: in which cases should we use VLenBytes, and in which cases VLenUTF8? Or do we need both at the same time?
Also, do we need a default compressor, or are default filters sufficient?
