Poor blosc compression ratios compared to v2 #2171
Comments
Nice find.
I'm also spinning up on codecs, but I wonder if it's related to a pipeline only being able to have one ArrayBytes codec, but many BytesBytes codecs? By writing it as a BytesBytesCodec … That said, trying to view the bytes as …
If blosc is aware of the dtype, I think it makes sense to redefine it as an ArrayBytesCodec.
Is it a solution if we define …
the basic problem is that arrays are a particular arrangement of bytes. so drawing a hard line between arrays and bytes will inevitably lead to situations like this, because they are not two separate categories.
I mean, if a consequence of strictly following the V3 spec is that users experience a 20x degradation in compression rates, then it seems like we should change the spec, no?
All array buffers can be interpreted as bytes. The reverse is not true. To me, an ArrayBytesCodec gets additional metadata about how to interpret the raw bytes as an array. That sounds like exactly what we want here.
Believe me, I'm not opposed to changing the v3 spec :) But if we reclassify blosc as an ArrayBytesCodec … Maybe this is OK if the performance is always terrible in this situation, but I thought blosc was designed to work on arbitrary binary data, including arrays. As I understand it, …
I think having both an ArrayBytesCodec and a BytesBytesCodec version of blosc … The usability issues can be mitigated a bit by having good defaults. In Zarr v2 I typically didn't dig into the encoding / compressor stuff. If we can match that behavior by setting the default codec pipeline to include an …
If we make a version of blosc that fits in the ArrayBytesCodec slot …
What about just automatically setting the typesize …
I don't think an …
Happy to see this garnered a lot of interesting discussion! Here are my own thoughts on some points raised here:
Even if we switch to making Blosc the default codec in v3 (as it is in v2), there are still many times I find myself needing to specify codecs manually when using Blosc since it's a meta-compressor with many different options. So in the situation where I do need to use a configuration of blosc that's different from the default, having two flavors will add some confusion from the end-user perspective. And for the maintainers of this library, that means you now have the weight of two separate implementations which fundamentally do the same thing. As far as the actual blosc compressor is concerned, the only real difference between the two is that one infers the typesize and the other doesn't, so it seems kind of overkill.
I don't see why not. Array inputs are ideal for shuffling since the typesize gets inferred, but with the bit shuffle filter you can still see marginal improvements in the compression ratio even for bytes input.
Right now, from what I can see, setting the typesize in the BloscCodec …
My personal feeling is that keeping blosc as a BytesBytesCodec …
The main reason Blosc needs to know the dtype is for byte or bit shuffling. Under zarr-python as used for Zarr v3 above, if shuffle is not set and the dtype is a single byte, BloscCodec will default to bit shuffling; for larger dtypes, byte shuffling is used (zarr-python/src/zarr/codecs/blosc.py, lines 142 to 146 at 726fdfb).
Meanwhile, numcodecs will default to byte shuffle (just called shuffle) for Zarr v2.
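For concreteness, the two defaults described above amount to roughly this (a paraphrase of the behaviour described in this thread, not code taken from either library):

```python
import numpy as np

def v3_default_shuffle(dtype: np.dtype) -> str:
    # zarr-python v3 BloscCodec, as described above: bit shuffle for
    # single-byte dtypes, byte shuffle for anything larger
    return "bitshuffle" if dtype.itemsize == 1 else "shuffle"

def v2_default_shuffle(dtype: np.dtype) -> str:
    # numcodecs Blosc under Zarr v2, as described above: byte shuffle
    # regardless of the dtype
    return "shuffle"

print(v3_default_shuffle(np.dtype("uint8")))    # bitshuffle
print(v3_default_shuffle(np.dtype("float64")))  # shuffle
print(v2_default_shuffle(np.dtype("float64")))  # shuffle
```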
The 11348 compressed length corresponds to compressing with either typesize=1 (where byte shuffle is a no-op) or typesize=8 with no shuffle.
A compressed length of 1383 corresponds to compressing with typesize=8 and byte shuffle.
In [1]: import blosc, numpy as np
In [2]: c = np.arange(10000, dtype="float64")
In [3]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.NOSHUFFLE))
Out[3]: 11348
In [4]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.SHUFFLE))
Out[4]: 11348
In [5]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.BITSHUFFLE))
Out[5]: 979
In [6]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.NOSHUFFLE))
Out[6]: 11348
In [7]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.SHUFFLE))
Out[7]: 1383
In [8]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.BITSHUFFLE))
Out[8]: 391
Technically, using a pre-filter such as shuffle only makes sense if you are providing information that the compressor does not already have.
I have no idea why you would try to shuffle variable length types.
What if the variable length type was something like a list of 4-byte values?
Are you expecting some correlation to still occur every four bytes in that case? Are the values in the list related somehow?
Yes, that's what I'm thinking. Suppose you have a data type that's a variable-length collection of points in space, and the points tend to have similar values. Would shuffling be useful here? Honest question, since I don't know much about it.
Consider a simple run length encoding (RLE) compression scheme where I first provide the length of a run and then the content of the run. Now imagine a byte sequence of related multibyte values, and suppose I shuffle the bytes so that the first byte of every value comes first, then every second byte, and so on.

All compressors use RLE at some stage, so shuffling can be really useful if you expect runs of multibyte values that are related: by putting the rarely changing higher-order bytes together, you increase the run lengths available for encoding. LZ4 is just a fancy RLE encoder, so shuffling really helps if the numbers are of similar magnitude. Zstd also has an entropy coding stage, so shuffling does not help it as much.

Thus you probably want to shuffle if you know your numbers are correlated somehow, as in an image. If your numbers are random, shuffling might not help. If your numbers fluctuate across magnitudes, it is quite possible that shuffling could hurt.
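A quick way to see this effect (using zlib as a stand-in for the compressors inside blosc; the exact numbers are only illustrative):

```python
import zlib
import numpy as np

# correlated multibyte values: a slowly increasing ramp of float64s
values = np.arange(10000, dtype="float64")
raw = values.tobytes()

# byte shuffle: group byte 0 of every value, then byte 1 of every value, etc.,
# so the rarely changing high-order bytes form long runs
shuffled = values.view("uint8").reshape(-1, 8).T.tobytes()

print(len(zlib.compress(raw, 6)))       # unshuffled
print(len(zlib.compress(shuffled, 6)))  # shuffled compresses much better for this data
```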
thanks for that explanation. my takeaway then remains the same: the …
There is utility in an array-to-bytes codec, in that the typesize can be inferred from the array's dtype.
There is also utility in a bytes-to-bytes codec where typesize is a free parameter. All of this is content dependent. To set defaults here assumes the content follows certain patterns, which it may not. I would discourage making assumptions about other people's content.
This is actually implemented already: zarr-python/src/zarr/codecs/blosc.py, lines 141 to 142 (at 680142f).
Should we just default to …
Zarr version: v3
Numcodecs version: n/a
Python Version: n/a
Operating System: n/a
Installation: n/a
Description
While playing around with v3 a bit, I noticed that the blosc codec wasn't compressing my sample data as well as I expected, and I have confirmed after multiple comparisons that I am getting different results between v3 and v2; in some cases, v3 blosc-compressed chunks end up being 10-20x larger than in v2 (see below).
Steps to reproduce
The difference isn't huge in this case, but it can be much more noticeable in others. For example:
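As a rough illustration of the effect being reported (a sketch that calls numcodecs' Blosc directly rather than going through the zarr v3 codec pipeline; the parameters are illustrative, not the original reproduction code):

```python
import numpy as np
from numcodecs import Blosc

data = np.arange(10000, dtype="float64")
codec = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)

# v2-style input: the codec sees the ndarray and infers typesize=8 from the dtype
print(len(codec.encode(data)))

# v3-style input: the chunk has already been serialized to raw bytes (BytesCodec),
# so blosc sees an item size of 1 and byte shuffling becomes a no-op
print(len(codec.encode(data.tobytes())))
```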
Cause and Possible Solution
In numcodecs, the blosc compressor is able to improve compression ratios by inferring the item size from the input buffer's numpy array dtype. But in v3, the blosc codec is implemented as a BytesBytesCodec and requires each chunk to be fed as bytes on encode (hence, BytesCodec() is required in the list of codecs in my example), so numcodecs infers an item size of 1. A simple fix for this is to make the following change in blosc.py:
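Judging from the discussion, the gist of the change is to have the codec fill in the typesize from the chunk's dtype instead of letting numcodecs fall back to an item size of 1. A minimal standalone sketch of that idea (BloscConfig and evolve_from_dtype here are made-up stand-ins, not zarr-python's actual classes):

```python
from __future__ import annotations

from dataclasses import dataclass, replace

import numpy as np


@dataclass(frozen=True)
class BloscConfig:
    # hypothetical stand-in for the BloscCodec configuration
    cname: str = "zstd"
    clevel: int = 5
    typesize: int | None = None


def evolve_from_dtype(config: BloscConfig, dtype: np.dtype) -> BloscConfig:
    # derive typesize from the dtype when the user has not set it explicitly
    if not config.typesize:
        config = replace(config, typesize=dtype.itemsize)
    return config


print(evolve_from_dtype(BloscConfig(), np.dtype("float64")).typesize)  # 8
```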
Thoughts
I am still just getting started with v3, but this has made me curious about one thing. Why is the blosc codec implemented as a BytesBytesCodec rather than as an ArrayBytesCodec, considering that it can accept (and is optimized for) numpy array input? Although the above solution does work, because I need to include the BytesCodec first when specifying my codecs in v3, it essentially first encodes each chunk into bytes, then decodes it back into an array in its original dtype, making the bytes codec effectively a pointless no-op in this case.