Supporting UTF-8 data type #83
As I noted in the call, I think how HDF5 supports strings (including UTF-8) is pretty sane:
I'm not sure there's a real use for ASCII these days (given that it's a strict subset of UTF-8), but there are certainly use cases for both fixed-width and variable-width UTF-8 strings.
I think having a UTF-8 string type is very important for v3. I would also be a strong proponent of variable-length UTF-8, as most text data is variable length. I am concerned by the current spec's use of fixed-length UTF-32, since it's an uncommon encoding with little support beyond NumPy.

My ideal scenario would be for the string extension spec to essentially use Arrow's string type encoding specification, i.e. a string is a variable-length list of bytes (docs on layout). This means the chunk would include multiple buffers, including an offsets buffer and a data buffer. Arrow also includes validity information for null values, which is nice but I'm not sure is necessary.

For expediency, it could make sense to include fixed-length UTF-8 strings as an extension in Zarr v3. However, I'm not sure I would update the AnnData formats to Zarr v3 until variable-length strings existed, since I'd rather not go back to the issues we had with fixed-length strings. E.g. I would really like to kerchunk together arrays of labels, and labels vary widely in size.

@DennisHeimbigner, we briefly talked about this at the end of the last Zarr call, though I hadn't had a chance to read the spec yet. You had mentioned varlength was proposed, but was that in an issue/PR?
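To make the multi-buffer idea above concrete, here is a minimal pure-Python sketch of an Arrow-style variable-size binary layout: an offsets buffer with n+1 little-endian int32 entries and a single data buffer of concatenated UTF-8 bytes. This is an illustrative model, not Arrow's actual implementation, and the function names are hypothetical.

```python
import struct

def encode_arrow_style(strings):
    """Pack strings into (offsets_buf, data_buf), Arrow-style."""
    offsets = [0]
    data = bytearray()
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))  # offset i+1 = end of item i
    # n+1 little-endian int32 offsets, as in Arrow's String type
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buf, bytes(data)

def decode_arrow_style(offsets_buf, data):
    """Recover the list of strings from the two buffers."""
    n = len(offsets_buf) // 4 - 1
    offsets = struct.unpack(f"<{n + 1}i", offsets_buf)
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(n)
    ]

offsets_buf, data = encode_arrow_style(["Hi", "Hey", ""])
print(decode_arrow_style(offsets_buf, data))  # ['Hi', 'Hey', '']
```

Note that both buffers are contiguous bytes, so empty strings cost only one extra offset entry and no per-item Python objects are needed on disk.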
I agree --- I would also like to see variable-length byte sequences and variable-length Unicode code point sequences as data types. I believe the existing fixed-length string type extensions are definitely not intended to be part of the core spec. They were added to document the existing Zarr v2 behavior, and haven't been reviewed much. Despite the fact that they don't seem terribly useful, I also don't think they are unreasonable to have as optional extensions.
A point that is a little confusing to me right now is "core", "extension", or "extension but on zarr-specs.readthedocs.io". Which were you thinking for these types?
I agree these aren't unreasonable by themselves. I think it might be bad if UTF-32 were the only Unicode representation for v3 on zarr-specs.
I think we still have to sort out exactly how extensions and other additions of features in later spec versions will be specified in the metadata. But I certainly agree that the UTF-32 encoding is not very useful.
I'd like to add my vote for adding support for variable-length strings in v3. We need this for supporting Zarr v3 in sgkit's VCF Zarr support (see sgkit-dev/bio2zarr#254). The way we are using it currently in v2 is the way that's recommended in the Zarr tutorial:

```pycon
>>> import numcodecs
>>> import zarr.v2 as zarr
>>> z = zarr.array(["Hi", "Hey"], dtype=object, object_codec=numcodecs.VLenUTF8())
>>> z
<zarr.v2.core.Array (2,) object>
>>> z[:]
array(['Hi', 'Hey'], dtype=object)
```

Perhaps Zarr v3 should take advantage of the new NumPy UTF-8 variable-width string dtype for this?
I'm not too familiar with NumPy string arrays, but my impression is that an array of a variable-length type cannot use a contiguous memory buffer for the in-memory representation. As zarr-python v3 internal APIs are very much centered around contiguous memory buffers, this might be a challenge! @normanrz do you have any insight into how variable-length types would fit into the current chunk processing framework in zarr-python v3?
I think adding variable-length strings to zarr-python would take some work but is not impossible. The numpy-backed buffers are still quite flexible. We use them for handling the object dtype in v2 arrays as well. Other buffers might need more work.
I don't think this is much help for Zarr, because "string data are stored outside the array buffer" (see https://numpy.org/neps/nep-0055-string_dtype.html#serialization), i.e. the array just stores a pointer to the actual string data.

A much better reference point would be Arrow string encoding, or more generally, the Arrow variable-size binary layout. Variable-length types require at least two buffers: one to store the actual data and one to store offsets into the data where the items begin.

We already support all of this in Zarr v2 via the numcodecs vlen codecs! https://numcodecs.readthedocs.io/en/stable/vlen.html Shouldn't it be straightforward to adapt this approach to v3? The key will be to not rely on anything Python-specific (e.g. Python objects). Arrow points the way here.
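For intuition, the vlen-codec approach mentioned above can be sketched as a simple length-prefixed encoding: a uint32 item count, then each item as a uint32 byte length followed by its UTF-8 payload. This is a hedged illustration in the spirit of numcodecs' VLenUTF8 (the actual numcodecs wire format may differ), and it requires no Python objects to decode, only bytes.

```python
import struct

def vlen_encode(strings):
    # Header: uint32 item count; then uint32 length + UTF-8 bytes per item.
    out = bytearray(struct.pack("<I", len(strings)))
    for s in strings:
        encoded = s.encode("utf-8")
        out += struct.pack("<I", len(encoded))
        out += encoded
    return bytes(out)

def vlen_decode(buf):
    (count,) = struct.unpack_from("<I", buf, 0)
    pos = 4
    items = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        items.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return items

buf = vlen_encode(["Hi", "Hey"])
print(vlen_decode(buf))  # ['Hi', 'Hey']
```

Unlike the Arrow layout, this interleaves lengths with data in a single buffer, which is simpler to stream but does not support random access to the i-th item without scanning; that trade-off is one reason an offsets-buffer design is attractive for chunked storage.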
I think this issue needs a champion who wants to write a ZEP.
Over at zarr-developers/zarr-python#2031 I have a proof of concept showing that we can very easily support UTF-8 and variable-length strings by leveraging Arrow encoding of string arrays. Would love some feedback on whether that approach seems promising.
In today's discussion the need for UTF-8 came up. Thought we already had an issue for this, but am not finding it.
Would be useful to have UTF-8 support in the spec or as a high priority extension. Raising here to start the discussion about how we want to approach this.
cc @joshmoore @alimanfoo @shoyer @Carreau