
Poor blosc compression ratios compared to v2 #2171

Open
agoodm opened this issue Sep 11, 2024 · 21 comments
Labels: bug (Potential issues with the zarr-python library), V3 (Affects the v3 branch)

@agoodm
Contributor

agoodm commented Sep 11, 2024

Zarr version

v3

Numcodecs version

n/a

Python Version

n/a

Operating System

n/a

Installation

n/a

Description

While playing around with v3 a bit, I noticed that the blosc codec wasn't compressing my sample data as well as I expected it to. After multiple comparisons I have confirmed that I am getting different results between v3 and v2; in some cases, v3 blosc-compressed chunks end up being 10-20x larger than in v2 (see below).

Steps to reproduce

import numpy as np
import zarr
import zarr.v2 as zarr_v2
from numcodecs import Blosc

a = np.random.rand(1000000)

# v3
codecs = [zarr.codecs.BytesCodec(), zarr.codecs.BloscCodec()]
z3 = zarr.array(a, chunks=(10000), codecs=codecs)
v3_size = len(z3.store_path.store._store_dict["c/0"]) # 75136

# v2
compressor = Blosc("zstd") # Should be equivalent to default settings for v3 Blosc codec
z2 = zarr_v2.array(a, chunks=z3.chunks, compressor=compressor)
v2_size = len(z2.store["0"]) 

print(f"v2 compressed chunk size: {v2_size}") 
print(f"v3 compressed chunk size: {v3_size}") 
# v2 compressed chunk size: 70113
# v3 compressed chunk size: 75136

The difference isn't huge in this case, but it can be much more noticeable in others. For example:

b = np.arange(1000000)
z3[:] = b
v3_size = len(z3.store_path.store._store_dict["c/0"]) 

z2[:] = b
v2_size = len(z2.store["0"]) 

print(f"v2 compressed chunk size: {v2_size}") 
print(f"v3 compressed chunk size: {v3_size}") 

# v2 compressed chunk size: 1383
# v3 compressed chunk size: 11348

Cause and Possible Solution

In numcodecs, the blosc compressor improves compression ratios by inferring the item size from the input buffer's numpy dtype. But in v3, the blosc codec is implemented as a BytesBytesCodec and each chunk must be fed to it as bytes on encode (hence BytesCodec() is required in the list of codecs in my example), so numcodecs infers an item size of 1.
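
To illustrate this outside of zarr, here is a minimal sketch (not from the issue; exact sizes will vary) showing that numcodecs' Blosc compresses the same data differently depending on whether it sees the ndarray or only its raw bytes:

import numpy as np
from numcodecs import Blosc

codec = Blosc("zstd")
a = np.arange(10000, dtype="float64")

# Passing the ndarray lets numcodecs infer itemsize=8, so the shuffle filter is effective.
from_array = codec.encode(a)

# Passing raw bytes loses the dtype, so blosc sees itemsize=1 and shuffling barely helps.
from_bytes = codec.encode(a.tobytes())

print(len(from_array), len(from_bytes))  # expect the bytes path to be noticeably larger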

A simple fix for this is to make the following change in blosc.py:

    async def _encode_single(
        self,
        chunk_bytes: Buffer,
        chunk_spec: ArraySpec,
    ) -> Buffer | None:
        # Since blosc only supports host memory, we convert the input and output of the encoding
        # between numpy array and buffer
        return await to_thread(
            lambda chunk: chunk_spec.prototype.buffer.from_bytes(
                self._blosc_codec.encode(chunk.as_numpy_array().view(chunk_spec.dtype))
            ),
            chunk_bytes,
        )
Thoughts

I am still just getting started with v3, but this has made me curious about one thing. Why is the blosc codec implemented as a BytesBytesCodec rather than as an ArrayBytesCodec, considering that it can accept (and is optimized for) numpy array input? Although the above solution does work, I still need to include the BytesCodec first when specifying my codecs in v3, so each chunk is first encoded into bytes and then viewed back as an array of its original dtype, making the bytes codec effectively a pointless no-op in this case.

@agoodm agoodm added the bug Potential issues with the zarr-python library label Sep 11, 2024
@TomAugspurger
Contributor

Nice find.

Why is the blosc codec implemented as a BytesBytesCodec rather than as an ArrayBytesCodec

I'm also spinning up on codecs, but I wonder if it's related to a pipeline only being able to have one ArrayBytes codec, but many BytesBytes codecs? By writing it as a BytesBytesCodec it composes a bit better with other codecs.

That said... trying to view the bytes as chunk_spec.dtype will I think only work if the BloscCodec is the first BytesBytes codec (which is another way of saying that it could / should be an ArrayBytes codec). If you have [BytesCodec(), GzipCodec(), BloscCodec()] for whatever reason, the bytes after gzipping wouldn't be viewable as dtype.
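
As a rough illustration of that concern (a sketch, not from the thread): once a chunk has passed through a generic bytes-to-bytes codec such as gzip, its length is arbitrary and the bytes can no longer be reinterpreted as the chunk's dtype:

import gzip
import numpy as np

chunk = np.arange(10, dtype="float64")
raw = chunk.tobytes()         # 80 bytes, still reinterpretable as float64
zipped = gzip.compress(raw)   # arbitrary length, generally not a multiple of 8

np.frombuffer(raw, dtype="float64")       # works
# np.frombuffer(zipped, dtype="float64")  # fails unless len(zipped) % 8 == 0, and the
#                                         # resulting values would be meaningless anyway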

@rabernat
Contributor

If blosc is aware of the dtype, I think it makes sense to redefine it as an ArrayBytesCodec.

@d-v-b
Contributor

d-v-b commented Sep 11, 2024

Would it be a solution to define BloscArrayBytesCodec and BloscBytesBytesCodec as two separate codecs that both use blosc? The former takes an array, the latter takes bytes. Extremely inelegant, but maybe the best we can do without changing the v3 spec.

@d-v-b
Contributor

d-v-b commented Sep 11, 2024

the basic problem is that arrays are a particular arrangement of bytes. so drawing a hard line between arrays and bytes will inevitably lead to situations like this, because they are not two separate categories.

@rabernat
Contributor

Extremely inelegant, but maybe the best we can do without changing the v3 spec.

I mean, if a consequence of strictly following the V3 spec is that users experience a 20x degradation in compression rates, then it seems like we should change the spec, no?

drawing a hard line between arrays and bytes will inevitably lead to situations like this, because they are not two separate categories.

All array buffers can be interpreted as bytes. The reverse is not true. To me, an ArrayBytesCodec gets additional metadata about how to interpret the raw bytes as an array. That sounds like exactly what we want here.

@d-v-b
Contributor

d-v-b commented Sep 11, 2024

I mean, if a consequence of strictly following the V3 spec is that users experience a 20x degradation in compression rates, then it seems like we should change the spec, no?

Believe me, I'm not opposed to changing the v3 spec :) But if we reclassify blosc as an ArrayBytesCodec, then it must occur at the array -> bytes boundary in the list of codecs, which means blosc can no longer be used as a BytesBytes codec.

Maybe this is OK if the performance is always terrible in this situation, but I thought blosc was designed to work on arbitrary binary data, including arrays. As I understand it, typesize only has an effect if shuffling is used, but shuffling is itself a free parameter. This means blosc is sometimes a BytesBytesCodec, and other times an ArrayBytesCodec. Hence my point that the array / bytes distinction is blurry. Would love to hear from a blosc expert here.

@TomAugspurger
Contributor

I think having both an ArrayBytesBloscCodec and BytesBytesBloscCodec is fine.

The usability issues can be mitigated a bit by having good defaults. In Zarr v2 I typically didn't dig into the encoding / compressor stuff. If we can match that behavior by setting the default codec pipeline to include a BloscArrayBytesCodec first, I think most users will be fine, and advanced users can dig into the thorny details if they want.

@d-v-b
Contributor

d-v-b commented Sep 11, 2024

If we make a version of blosc that fits in the ArrayBytes category, then there are a few considerations to be mindful of:

  • blosc is only sensitive to the dtype of the input when the optional bitshuffling is used. So the ArrayBytesBlosc would probably have to have shuffling enabled unconditionally.
  • Should the BytesBytesBloscCodec allow shuffling? If so, then it's de facto an ArrayBytesCodec. If not, then we are not really using blosc effectively.
  • The BytesCodec (the only non-sharding ArrayBytesCodec) takes a required endian parameter, which specifies the endianness of the output bytes for multi-byte dtypes. Would we need to add this parameter to the ArrayBytesBlosc codec? If so, do all non-sharding ArrayBytesCodec instances take an endian parameter?

@LDeakin

LDeakin commented Sep 11, 2024

What about just automatically setting the typesize of the blosc codec if it is unspecified and identified as following the bytes codec? This is what the spec already suggests:

Zarr implementations MAY allow users to leave this unspecified and have the implementation choose a value automatically based on the array data type and previous codecs in the chain

I don't think an array->bytes blosc codec is a great idea, because it would then need to take on the role of the bytes codec for fixed-size data types (and need an endian parameter as @d-v-b noted). This would worsen with variable-sized data types in the future. Too much responsibility on one codec.

If so, do all non-sharding ArrayBytesCodec instances take an endian parameter?

Potential array->bytes codecs like zfp and pcodec do not need an endian parameter.

@agoodm
Contributor Author

agoodm commented Sep 11, 2024

Happy to see this garnered a lot of interesting discussion! Here are my own thoughts on some points raised here:

I think having both an ArrayBytesBloscCodec and BytesBytesBloscCodec is fine.

The usability issues can be mitigated a bit by having good defaults. In Zarr v2 I typically didn't dig into the encoding / compressor stuff. If we can match that behavior by setting the default codec pipeline to include a BloscArrayBytesCodec first, I think most users will be fine, and advanced users can dig into the thorny details if they want.

Even if we switch to making Blosc the default codec in v3 (as it is in v2), there are still many times I find myself needing to specify codecs manually when using Blosc since it's a meta-compressor with many different options. So in the situation where I do need to use a configuration of blosc that's different from the default, having two flavors will add some confusion from the end-user perspective. And for the maintainers of this library, that means you now have the weight of two separate implementations which fundamentally do the same thing. As far as the actual blosc compressor is concerned, the only real difference between the two is that one infers the typesize and the other doesn't, so it seems kind of overkill.

Should the BytesBytesBloscCodec allow shuffling? If so, then it's de facto an ArrayBytesCodec. if not, then we are not really using blosc effectively.

I don't see why not. Array inputs are ideal for shuffling since the typesize gets inferred, but with the bit shuffle filter you can still see marginal improvements in the compression ratio even for bytes input.

What about just automatically setting the typesize of the blosc codec if it is unspecified and identified as following the bytes codec? This is what the spec already suggests:

Right now, from what I can see, setting typesize in the BloscCodec constructor does nothing as far as encode/decode is concerned, but on inspection it already automatically infers the correct value from the input dtype. The issue is that the numcodecs wrapper for blosc only infers the typesize correctly when the input is an array; otherwise it assumes a typesize of 1. So perhaps a good first solution would be to modify the numcodecs blosc wrapper to allow overriding the value of typesize, as the python-blosc library already does.
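
For reference, a small sketch of the override behavior being suggested, using python-blosc directly (the same compress call mkitti uses below); the exact output sizes are not asserted here:

import blosc
import numpy as np

raw = np.arange(10000, dtype="float64").tobytes()

# typesize=1 mimics what numcodecs currently infers for bytes input;
# typesize=8 is the explicit override being proposed.
inferred = blosc.compress(raw, typesize=1, cname="zstd", clevel=5, shuffle=blosc.SHUFFLE)
overridden = blosc.compress(raw, typesize=8, cname="zstd", clevel=5, shuffle=blosc.SHUFFLE)
print(len(inferred), len(overridden))  # the override should compress much better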

I don't think an array->bytes blosc codec is a great idea, because it would then need to take on the role of the bytes codec for fixed-size data types (and need an endian parameter as @d-v-b noted). This would worsen with variable-sized data types (zarr-developers/zeps#47 (comment)) in the future. Too much responsibility on one codec.

My personal feeling is that keeping blosc as a BytesBytesCodec still seems like a bit of an interface regression. As a user of v2, I never had to think very hard when using the blosc codec, since it seamlessly worked whether the input was an array or raw bytes. One remaining pet peeve would be the need to always manually specify a BytesCodec at the start of a codec pipeline, which is kind of cumbersome. And on top of this, internally you are still doing a conversion from array to bytes and adding that extra bit of unnecessary processing overhead, however slight.

@mkitti

mkitti commented Sep 12, 2024

The main reason Blosc needs to know the dtype is for byte or bitshuffling.

Under zarr-python as used for Zarr v3 above, if shuffle is not set and the dtype is a single byte, then BloscCodec will default to bit shuffling; for larger dtypes, byte shuffling is used.

if new_codec.shuffle is None:
    new_codec = replace(
        new_codec,
        shuffle=(BloscShuffle.bitshuffle if dtype.itemsize == 1 else BloscShuffle.shuffle),
    )

Meanwhile, numcodecs will default to byte shuffle (just called shuffle) for Zarr v2:
https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/blosc.pyx#L548

@mkitti

mkitti commented Sep 12, 2024

The 11348 compressed length corresponds to compressing with either

  • typesize=1 with no shuffle or byte shuffle, OR
  • typesize=8 with no shuffle.

A compressed length of 1383 corresponds to compressing with

  • typesize=8 with byte shuffle.

In [1]: import blosc, numpy as np

In [2]: c = np.arange(10000, dtype="float64")

In [3]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.NOSHUFFLE))
Out[3]: 11348

In [4]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.SHUFFLE))
Out[4]: 11348

In [5]: len(blosc.compress(c, cname="zstd", typesize=1, clevel=5, shuffle=blosc.BITSHUFFLE))
Out[5]: 979

In [6]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.NOSHUFFLE))
Out[6]: 11348

In [7]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.SHUFFLE))
Out[7]: 1383

In [8]: len(blosc.compress(c, cname="zstd", typesize=8, clevel=5, shuffle=blosc.BITSHUFFLE))
Out[8]: 391

@d-v-b
Contributor

d-v-b commented Sep 12, 2024

Does the typesize parameter have to correspond to the size of the elements of an array? For example, when compressing arrays with variable length types (like strings) with blosc, it might make sense to pick a typesize, even though the size of the array dtype is undefined. To phrase it differently, requiring that the blosc typesize be set to the "size" of a dtype will fail for variable length types.

@mkitti

mkitti commented Sep 30, 2024

Does the typesize parameter have to correspond to the size of the elements of an array? For example, when compressing arrays with variable length types (like strings) with blosc, it might make sense to pick a typesize, even though the size of the array dtype is undefined.

Technically, typesize could be anything. It does not have to correspond to the size of the elements of the array. I'm unclear what might happen if the typesize does not evenly divide the number of bytes in the array, though. What typesize does here is parameterize the shuffle.

Using a pre-filter such as shuffle only makes sense if you are providing information that the compressor does not already have.

To phrase it differently, requiring that the blosc typesize be set to the "size" of a dtype will fail for variable length types.

I have no idea why you would try to shuffle variable length types.

@d-v-b
Contributor

d-v-b commented Sep 30, 2024

I have no idea why you would try to shuffle variable length types.

What if the variable length type was something like list[uint32]? I.e., a variable-length collection of some numeric type.

@mkitti

mkitti commented Sep 30, 2024

Are you expecting some correlation to still occur every four bytes in that case? Are the values in the list related somehow?

@d-v-b
Contributor

d-v-b commented Sep 30, 2024

Are you expecting some correlation to still occur every four bytes in that case? Are the values in the list related somehow?

Yes that's what I'm thinking. Suppose you have a data type that's a variable-length collection of points in space, and the points tend to have similar values. Would shuffling be useful here? Honest question, since I don't know much about it.

@mkitti

mkitti commented Sep 30, 2024

Consider a simple run length encoding (RLE) compression scheme where I first provide the length of a run and then the content of the run.

For example, [0x1 0x1 0x1 0x1 0x2 0x2 0x2] would be encoded as [0x4 0x1 0x3 0x2], since there is a run of four 0x1s followed by a run of three 0x2s. Terrific, I have encoded 7 bytes into 4 bytes.

Now imagine I had the byte sequence [0x1 0x0 0x0 0x0 0x2 0x0 0x0 0x0]. The encoding would be [0x1 0x1 0x3 0x0 0x1 0x2 0x3 0x0]. Now I have encoded 8 bytes into ... 8 bytes: no compression.

Now let's say I shuffle the bytes as follows: [0x1 0x2 0x0 0x0 0x0 0x0 0x0 0x0]. These are the same bytes, but shuffled into a different order. I encode them as [0x1 0x1 0x1 0x2 0x6 0x0]. Now I have some compression again. In this case, the varying byte occurred every 4 bytes since these were 32-bit numbers, and the numbers were of similar magnitude. We took advantage of the fact that the higher-order bytes were not changing.

If the bytes were [0x1 0x1 0x1 0x1 0x2 0x2 0x2 0x2], I could encode this as [0x4 0x1 0x4 0x1]. If I shuffled the bytes to be [0x1 0x2 0x1 0x2 0x1 0x2 0x1 0x2], then RLE is a disaster. There would be 16 encoded bytes, twice the number we started with. Shuffling the bytes made the compression worse!

All compressors use RLE at some stage, so shuffling can be really useful if you expect runs of multibyte values that are related. By putting the rarely changing higher-order bytes together, you can increase the run lengths available for encoding. LZ4 is essentially a fancy RLE encoder, so shuffling really helps if the numbers are of similar magnitude. Zstd also has an entropy coding stage, so shuffling does not have as much of an effect there.

Thus you probably want to shuffle if you know your numbers are correlated somehow, as in an image. If your numbers are random, shuffling might not help. If your numbers fluctuate across magnitudes, it is quite possible that shuffling could hurt.
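
To make the shuffle transform concrete, here is a toy byte shuffle in numpy (a sketch, not blosc's actual implementation), compressed with zlib just to show the effect on run lengths:

import zlib
import numpy as np

def byte_shuffle(buf: bytes, typesize: int) -> bytes:
    # Group the i-th byte of every element together, like blosc's byte shuffle filter.
    return np.frombuffer(buf, dtype="uint8").reshape(-1, typesize).T.tobytes()

values = np.arange(10000, dtype="uint32")   # similar-magnitude numbers: high-order bytes rarely change
plain = values.tobytes()
shuffled = byte_shuffle(plain, values.itemsize)

print(len(zlib.compress(plain)), len(zlib.compress(shuffled)))
# the shuffled layout usually compresses better: the mostly-zero high-order bytes form long runs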

@d-v-b
Contributor

d-v-b commented Sep 30, 2024

thanks for that explanation. my takeaway then remains the same: the typesize attribute of the blosc codec is not perfectly equivalent to the dtype of an N-dimensional array, and so forcing blosc to be an array -> bytes codec would limit the utility of blosc.

@mkitti

mkitti commented Sep 30, 2024

There is utility in an array-to-bytes codec in that

  1. in many scenarios typesize = sizeof(dtype) is a useful default
  2. this would match the Zarr v2 behavior

There is also utility in a bytes to bytes codec where typesize is a free parameter.

All of this is content dependent. To set defaults here assumes the content follows certain patterns, which it may not. I would discourage making assumptions about other people's content.

@normanrz
Contributor

normanrz commented Nov 6, 2024

What about just automatically setting the typesize of the blosc codec if it is unspecified and identified as following the bytes codec? This is what the spec already suggests:

This is actually implemented already:

if new_codec.typesize is None:
    new_codec = replace(new_codec, typesize=dtype.itemsize)

Meanwhile, numcodecs will default to byte shuffle (just called shuffle) for Zarr v2:
https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/blosc.pyx#L548

Should we just default to shuffle to match the behavior in v2?
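
If so, a minimal sketch of what opting into byte shuffle explicitly would look like from the user side (assuming BloscShuffle and the keyword arguments shown are exported/accepted on the v3 branch):

import numpy as np
import zarr
from zarr.codecs import BytesCodec, BloscCodec, BloscShuffle

a = np.random.rand(1_000_000)
codecs = [
    BytesCodec(),
    # explicit byte shuffle, mirroring the numcodecs / Zarr v2 default
    BloscCodec(cname="zstd", clevel=5, shuffle=BloscShuffle.shuffle),
]
z = zarr.array(a, chunks=(10000,), codecs=codecs)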
