Can't create big endian dtypes in V3 array #2324

rabernat · 2024-10-09T14:53:55Z

This works with V2 data:

zarr.create(shape=10, dtype=">i2", zarr_version=2)
# -> <Array memory://4413530368 shape=(10,) dtype=>i2>

But it raises for V3:

zarr.create(shape=10, dtype=">i2", zarr_version=3)

File ~/gh/zarr-developers/zarr-python/src/zarr/codecs/__init__.py:40, in _get_default_array_bytes_codec(np_dtype)
     37 def _get_default_array_bytes_codec(
     38     np_dtype: np.dtype[Any],
     39 ) -> BytesCodec | VLenUTF8Codec | VLenBytesCodec:
---> 40     dtype = DataType.from_numpy(np_dtype)
     41     if dtype == DataType.string:
     42         return VLenUTF8Codec()

File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata/v3.py:599, in DataType.from_numpy(cls, dtype)
    581     return DataType.bytes
    582 dtype_to_data_type = {
    583     "|b1": "bool",
    584     "bool": "bool",
   (...)
    597     "<c16": "complex128",
    598 }
--> 599 return DataType[dtype_to_data_type[dtype.str]]

KeyError: '>i2'

In the V3 spec, endianness is now handled by a codec: https://zarr-specs.readthedocs.io/en/latest/v3/codecs/bytes/v1.0.html
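To illustrate what "handled by a codec" means in practice, here is a minimal sketch using only the standard-library `struct` module (not zarr itself). The sample values are made up for illustration. It shows that byte order changes only the serialized bytes, not the logical values, which is why the V3 spec can move it out of the data type and into the bytes codec's `endian` setting.

```python
import struct

# Three 16-bit signed integers; the values are arbitrary illustration data.
values = (1, 2, 258)

# Serialize the same logical values with both byte orders.
big = struct.pack(">3h", *values)     # big endian, like dtype ">i2"
little = struct.pack("<3h", *values)  # little endian, like dtype "<i2"

print(big.hex())     # 000100020102
print(little.hex())  # 010002000201

# Decoding with the matching byte order recovers identical values.
decoded_big = struct.unpack(">3h", big)
decoded_little = struct.unpack("<3h", little)
print(decoded_big == decoded_little == values)  # True
```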

Xarray tests create data with big endian dtypes, and Zarr needs to know how to handle them.


d-v-b · 2024-10-09T15:06:18Z

If the codecs are unspecified, then I think we could automatically parametrize the BytesCodec based on the dtype. If the codecs are specified and the BytesCodec endianness doesn't match the endianness of the data, then we raise an exception.
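A rough sketch of that rule, in plain Python so it does not depend on any zarr internals. The helper names `endian_from_dtype_str` and `resolve_endian` are hypothetical and not part of zarr-python; the sketch assumes dtype strings in NumPy's `dtype.str` form, which always carry a byte-order prefix such as `>`, `<`, `=` or `|`.

```python
import sys
from typing import Optional


def endian_from_dtype_str(dtype_str: str) -> Optional[str]:
    """Map a prefixed dtype string such as '>i2' to 'big' or 'little'.

    Returns None when byte order does not apply (prefix '|', e.g. '|b1').
    """
    prefix = dtype_str[:1]
    if prefix == ">":
        return "big"
    if prefix == "<":
        return "little"
    if prefix == "=":
        return sys.byteorder  # native order of the running machine
    if prefix == "|":
        return None
    raise ValueError(f"dtype string {dtype_str!r} has no byte-order prefix")


def resolve_endian(dtype_str: str, codec_endian: Optional[str] = None) -> Optional[str]:
    """Pick the endian setting for a bytes codec (hypothetical helper).

    With no codec setting given, infer it from the dtype. With an explicit
    setting, raise if it disagrees with the dtype's byte order.
    """
    inferred = endian_from_dtype_str(dtype_str)
    if codec_endian is None:
        return inferred
    if inferred is not None and codec_endian != inferred:
        raise ValueError(
            f"codec endian {codec_endian!r} does not match dtype {dtype_str!r}"
        )
    return codec_endian


print(resolve_endian(">i2"))            # big
print(resolve_endian("<f8", "little"))  # little
```

Calling `resolve_endian(">i2", "little")` raises `ValueError`, which corresponds to the mismatch case described above.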

But a bigger problem is that, by making endianness a serialization detail, the zarr dtype model has diverged from the numpy dtype model. If our array object uses zarr v3 data type semantics, then zarr.create(..., dtype=">i2") will return an array with dtype <i2 plus a specially configured bytes codec. From the POV of NumPy's *_like functions (such as np.zeros_like), this zarr array will not have its "real" dtype; users might be surprised to see that zarr.create(..., dtype=">i2") and zarr.create(..., dtype="<i2") return arrays with the same dtype. I don't see an easy solution to this.

rabernat · 2024-10-12T12:22:25Z

One solution could be to always translate the endianness of the on-disk data to the endianness of the in-memory data. This could be done within BytesCodec. However, it would be hard, since endianness is not part of ArraySpec.
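As a rough illustration of that idea, the sketch below uses NumPy only; it is not how BytesCodec is implemented. The function name `decode_to_native` is hypothetical. It reads bytes using the stored byte order and hands back an array in the machine's native byte order, so callers never see the on-disk endianness.

```python
import numpy as np


def decode_to_native(raw: bytes, stored_dtype: str) -> np.ndarray:
    """Interpret raw bytes with the stored byte order, return a native-order copy."""
    stored = np.frombuffer(raw, dtype=np.dtype(stored_dtype))
    native_dtype = stored.dtype.newbyteorder("=")  # same type, native byte order
    # astype converts the values, swapping bytes only when the orders differ.
    return stored.astype(native_dtype)


# Simulate big-endian data as it might sit on disk.
raw = np.array([1, 2, 258], dtype=">i2").tobytes()

out = decode_to_native(raw, ">i2")
print(out.tolist())        # [1, 2, 258]
print(out.dtype.isnative)  # True
```

The cost of this approach is that the requested byte order is no longer visible on the in-memory array, so it would have to be tracked somewhere else to round-trip on write.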

rabernat mentioned this issue Oct 9, 2024

Fill value fixes for V3 TomAugspurger/xarray#1

Merged

This was referenced Nov 1, 2024

Monthly issue metrics report #2455

Closed

Monthly issue metrics report MSanKeys963/zarr-python#3

Open

rabernat mentioned this issue Nov 1, 2024

Invalid Datatype ('>f8') when trying to convert kerchunk reference to icechunk reference. earth-mover/icechunk#367

Closed

jbusecke mentioned this issue Nov 1, 2024

Tracking issue for Nov presentation jbusecke/esgf-virtual-zarr-data-access#15

Open

LDeakin mentioned this issue Nov 5, 2024

(feat): minimum working codec pipeline ilan-gold/zarrs-python#19

Merged
