v2.metadata and v3.metadata encode fill_value bytes differently #2322

Open
rabernat opened this issue Oct 9, 2024 · 0 comments
rabernat commented Oct 9, 2024

Here I am creating an array with each metadata version and specifying the fill_value as raw bytes, b'X':

import zarr

fv = b'X'

a = zarr.create(shape=10, dtype=bytes, zarr_version=2, fill_value=fv)
ad = a.metadata.to_dict()
print(ad)
# -> {'shape': (10,), 'fill_value': 'WA==', 'attributes': {}, 'zarr_format': 2, 'order': 'C', 'filters': None, 'dimension_separator': '.', 'compressor': None, 'chunks': (10,), 'dtype': '|S0'}


b = zarr.create(shape=10, dtype=bytes, zarr_version=3, fill_value=fv)
bd = b.metadata.to_dict()
print(bd)
# -> {'shape': (10,), 'fill_value': (88,), 'chunk_grid': {'name': 'regular', 'configuration': {'chunk_shape': (10,)}}, 'attributes': {}, 'zarr_format': 3, 'data_type': <DataType.bytes: 'bytes'>, 'chunk_key_encoding': {'name': 'default', 'configuration': {'separator': '/'}}, 'codecs': ({'name': 'vlen-bytes', 'configuration': {}},), 'node_type': 'array', 'storage_transformers': ()}

assert zarr.core.metadata.v2.ArrayV2Metadata.from_dict(ad).fill_value == fv
assert zarr.core.metadata.v3.ArrayV3Metadata.from_dict(bd).fill_value == fv

As we can see, the fill value is encoded quite differently in the two cases: v2 produces the base64 string 'WA==', while v3 produces the tuple of ints (88,). Remarkably, both round-trip back to the original bytes via from_dict.

In both cases, the bytes are going through this path:

elif isinstance(value, Sequence):
    out_dict[key] = tuple(v.to_dict() if isinstance(v, Metadata) else v for v in value)

This converts the bytes to a tuple of ints.
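
To see why a generic Sequence branch turns bytes into ints, here is a minimal standalone sketch using only the standard library. It does not call zarr; the variable names are illustrative only.

```python
from collections.abc import Sequence

fv = b"X"

# bytes counts as a Sequence, so a generic `isinstance(value, Sequence)`
# branch will match it just like a list or tuple.
is_sequence = isinstance(fv, Sequence)

# Iterating over bytes yields ints, so tuple() gives a tuple of byte values.
as_tuple = tuple(fv)

# bytes() accepts an iterable of ints in [0, 255], which reverses the step.
round_tripped = bytes(as_tuple)

print(is_sequence, as_tuple, round_tripped)
# -> True (88,) b'X'
```

This is consistent with the (88,) seen in the v3 metadata dict above.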

However, for v2, #2286 added this additional special handling for fill_value:

if dtype.kind in "SV":
    fill_value_encoded = _data.get("fill_value")
    if fill_value_encoded is not None:
        fill_value = base64.standard_b64decode(fill_value_encoded)
        _data["fill_value"] = fill_value
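
For reference, the base64 round trip for this particular fill value can be reproduced with the standard library alone. This is a sketch of the encoding itself, not of zarr's code path; the variable names are made up for illustration.

```python
import base64

fv = b"X"

# Encode the raw bytes as a base64 ASCII string, as seen in the v2 dict.
encoded = base64.standard_b64encode(fv).decode("ascii")

# Decode it again, mirroring the standard_b64decode call quoted above.
decoded = base64.standard_b64decode(encoded)

print(encoded, decoded)
# -> WA== b'X'
```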

According to the V3 spec:

Raw data types (r)
An array of integers, with length equal to , where each integer is in the range [0, 255].

This seems in line with what v3 is doing here: the fill value is stored as (88,), a sequence of integers in the range [0, 255].
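
To make the relationship between the two encodings concrete, here is a rough sketch that converts between a base64 string (as in the v2 dict) and a list of byte values (as in the v3 dict). The helpers `b64_to_int_list` and `int_list_to_b64` are hypothetical names invented for this example and are not part of zarr.

```python
import base64
import json


def b64_to_int_list(encoded: str) -> list[int]:
    """Hypothetical helper: base64 string -> list of byte values."""
    return list(base64.standard_b64decode(encoded))


def int_list_to_b64(values: list[int]) -> str:
    """Hypothetical helper: list of byte values -> base64 string."""
    if any(not 0 <= v <= 255 for v in values):
        raise ValueError("each value must be in the range [0, 255]")
    return base64.standard_b64encode(bytes(values)).decode("ascii")


v2_style = "WA=="                        # base64 form from the v2 dict
v3_style = b64_to_int_list(v2_style)     # integer form, like the v3 dict
back_to_v2 = int_list_to_b64(v3_style)

# Both forms are plain JSON values, so either can be written to metadata.
as_json = json.dumps({"fill_value": v3_style})

print(v3_style, back_to_v2, as_json)
# -> [88] WA== {"fill_value": [88]}
```

Both forms carry the same bytes; they differ only in how those bytes are spelled in JSON.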

This is relevant to pydata/xarray#5475

rabernat changed the title from "v2.metadata and v3.metadata encode bytes differently" to "v2.metadata and v3.metadata encode fill_value bytes differently" on Oct 9, 2024
rabernat added the "V3 Affects the v3 branch" label on Oct 9, 2024
jhamman added this to the "After 3.0.0" milestone on Oct 18, 2024