Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

endian codec always required? #8

Closed
manzt opened this issue Aug 9, 2023 · 8 comments
Closed

endian codec always required? #8

manzt opened this issue Aug 9, 2023 · 8 comments

Comments

@manzt
Copy link

manzt commented Aug 9, 2023

Thanks for all your work on zarr v3. I am using this project to generate fixtures for zarrita.js and ensure it is compatible moving forward.

I noticed that the endian codec is always used, even for single byte datatypes that do no have endianness.

a = zarrita.ArrayV2.create("data.zarr/foo", shape=(3,3), chunks=(1, 1), dtype="|i1")
a.convert_to_v3() # writes an "little endian" codec

I guess right now the endian codec is the pass through to generate the array from bytes, but I think the v3 metadata for this array should ideally be an empty list of codecs? The endian codec only makes sense for data types > 1 byte.

{
  "shape": [3, 3],
  "data_type": "int8",
  "chunk_grid": {
    "configuration": {
      "chunk_shape": [1, 1]
    },
    "name": "regular"
  },
  "chunk_key_encoding": {
    "configuration": {
      "separator": "."
    },
    "name": "v2"
  },
  "fill_value": 0,
  "codecs": [
--    {
--      "configuration": {
--        "endian": "little"
--      },
--      "name": "endian"
--    }
  ],
  "attributes": {},
  "dimension_names": null,
  "zarr_format": 3,
  "node_type": "array"
}
@normanrz
Copy link
Member

After receiving feedback on ZEP1 from implementors, @jbms proposed to make one array->bytes codec mandatory zarr-developers/zarr-specs#249. I implemented that change in zarrita. It is true that the endian codec doesn't make sense for 1-byte data types, but it is the only array->bytes codec at the moment.

Looking forward to seeing v3 in zarrita.js! And, don't forget to vote on ZEP2 :-)

@jbms
Copy link

jbms commented Aug 13, 2023

It is true that it is a bit unfortunate that for 1-byte types like bool and int8 the codec is still called endian and requires that an endianness be specified even though it does not have an effect.

Arguably it would be better to have a different codec that is used for 1-byte types, e.g. "byte" or "raw", that behaves the same as "endian" but is only valid for 1-byte types and does not have any configuration options.

@normanrz
Copy link
Member

I don't think it is a good idea to have 2 codecs that essentially do the same. I would rather rename the "endian" codec into some thing else (maybe "byte" or "raw" or "continuous") and make the "endian" configuration only required for data types with > 1 byte.

@manzt
Copy link
Author

manzt commented Aug 14, 2023

Perhaps an "data_type" codec that is an array -> bytes and creates the initial array container from the bytes, and then an "endian" codec that is array -> array and enforces endianness.

EDIT: Ah yes, similar in spirit to @normanrz idea, which I think I like better since endianess is related to the interpretation of the bytes, and maybe a leaky abstraction at the "array" level.

@manzt
Copy link
Author

manzt commented Aug 14, 2023

# bytes -> array
arr = np.array(bytes, dtype="int16", shape=shape)

# array -> array
arr.dtype.byteorder = "<" if config["endian"] === "little" else ">"

EDIT: oops, dtype isn't mutable.

@jbms
Copy link

jbms commented Aug 14, 2023

I think given how arrays and data types are specified for zarr v3, as abstract types rather than specific byte representations in memory, that the suggestion from @normanrz would work better.

"raw" isn't very descriptive for bool but that is not too big of a deal.

@normanrz
Copy link
Member

We could call it "bytes".

@jbms
Copy link

jbms commented Aug 20, 2023

I think bytes is a good name --- will you submit a zarr-specs PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants