Supporting UTF-8 data type #83
As I noted in the call, I think how HDF5 supports strings (including UTF-8) is pretty sane:
I'm not sure there's a real use for ASCII these days (given that it's a strict subset of UTF-8), but there are certainly use cases for both fixed-width and variable-width UTF-8 strings.
I think having a UTF-8 string type is very important for v3. I would also be a strong proponent of variable-length UTF-8, as most text data is variable length. I am concerned by the current spec's use of fixed-length UTF-32, since it's an uncommon encoding with little support beyond NumPy.

My ideal scenario would be for the string extension spec to essentially use Arrow's string type encoding specification, i.e. a string is a variable-length list of bytes (docs on layout). This means the chunk would include multiple buffers, including an offsets buffer and a data buffer. Arrow also includes validity information for null values, which is nice but I'm not sure is necessary.

For expediency, it could make sense to include fixed-length UTF-8 strings as an extension in Zarr v3. However, I'm not sure I would update the AnnData formats to Zarr v3 until variable-length strings existed, since I'd rather not go back to the issues we had with fixed-length strings. E.g. I would really like to kerchunk together arrays of labels, and labels vary widely in size.

@DennisHeimbigner, we briefly talked about this at the end of the last Zarr call, though I hadn't had a chance to read the spec yet. You had mentioned varlength was proposed, but was that in an issue/PR?
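To make the multi-buffer idea above concrete, here is a minimal pure-Python sketch of an Arrow-style variable-size binary layout: an offsets buffer with n+1 little-endian int32 entries and a single data buffer of concatenated UTF-8 bytes. This is an illustrative model, not Arrow's actual implementation, and the function names are hypothetical.

```python
import struct

def encode_arrow_style(strings):
    """Pack strings into (offsets_buf, data_buf), Arrow-style."""
    offsets = [0]
    data = bytearray()
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))  # offset i+1 = end of item i
    # n+1 little-endian int32 offsets, as in Arrow's String type
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buf, bytes(data)

def decode_arrow_style(offsets_buf, data):
    """Recover the list of strings from the two buffers."""
    n = len(offsets_buf) // 4 - 1
    offsets = struct.unpack(f"<{n + 1}i", offsets_buf)
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(n)
    ]

offsets_buf, data = encode_arrow_style(["Hi", "Hey", ""])
print(decode_arrow_style(offsets_buf, data))  # ['Hi', 'Hey', '']
```

Note that both buffers are contiguous bytes, so empty strings cost only one extra offset entry and no per-item Python objects are needed on disk.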
I agree --- I would also like to see variable-length byte sequences and variable-length Unicode code point sequences as data types. I believe the existing fixed-length string type extensions are definitely not intended to be part of the core spec. They were added to document the existing Zarr v2 behavior, and haven't been reviewed much. Despite the fact that they don't seem terribly useful, I also don't think they are unreasonable to have as optional extensions.
A point that is a little confusing to me right now is "core", "extension", or "extension but on zarr-specs.readthedocs.io". Which were you thinking for these types?
I agree these aren't unreasonable by themselves. I think it might be bad if UTF-32 were the only Unicode representation for v3 on zarr-specs.
I think we still have to sort out exactly how extensions and other additions of features in later spec versions will be specified in the metadata. But I certainly agree that the UTF-32 encoding is not very useful.
I'd like to add my vote for adding support for variable-length strings in v3. We need this for supporting Zarr v3 in sgkit's VCF Zarr support (see sgkit-dev/bio2zarr#254). The way we are using it currently in v2 is the way that's recommended in the Zarr tutorial:

```pycon
>>> import numcodecs
>>> import zarr.v2 as zarr
>>> z = zarr.array(["Hi", "Hey"], dtype=object, object_codec=numcodecs.VLenUTF8())
>>> z
<zarr.v2.core.Array (2,) object>
>>> z[:]
array(['Hi', 'Hey'], dtype=object)
```

Perhaps Zarr v3 should take advantage of the new NumPy UTF-8 variable-width string dtype for this?
I'm not too familiar with NumPy string arrays, but my impression is that an array of a variable-length type cannot use a contiguous memory buffer for the in-memory representation. As zarr-python v3 internal APIs are very much centered around contiguous memory buffers, this might be a challenge! @normanrz do you have any insight into how variable-length types would fit into the current chunk processing framework in zarr-python v3?
I think adding variable-length strings to zarr-python would take some work but is not impossible. The numpy-backed buffers are still quite flexible. We use them for handling the object dtype in v2 arrays as well. Other buffers might need more work.
I don't think this is much help for Zarr, because "string data are stored outside the array buffer" (see https://numpy.org/neps/nep-0055-string_dtype.html#serialization), i.e. the array just stores a pointer to the actual string data.

A much better reference point would be Arrow string encoding, or more generally, the Arrow variable-size binary layout. Variable-length types require at least two buffers: one to store the actual data and one to store offsets into the data where the items begin.

We already support all of this in Zarr v2 via the numcodecs vlen codecs! https://numcodecs.readthedocs.io/en/stable/vlen.html Shouldn't it be straightforward to adapt this approach to v3? The key will be to not rely on anything Python-specific (e.g. Python objects). Arrow points the way here.
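For intuition, the vlen-codec approach mentioned above can be sketched as a simple length-prefixed encoding: a uint32 item count, then each item as a uint32 byte length followed by its UTF-8 payload. This is a hedged illustration in the spirit of numcodecs' VLenUTF8 (the actual numcodecs wire format may differ), and it requires no Python objects to decode, only bytes.

```python
import struct

def vlen_encode(strings):
    # Header: uint32 item count; then uint32 length + UTF-8 bytes per item.
    out = bytearray(struct.pack("<I", len(strings)))
    for s in strings:
        encoded = s.encode("utf-8")
        out += struct.pack("<I", len(encoded))
        out += encoded
    return bytes(out)

def vlen_decode(buf):
    (count,) = struct.unpack_from("<I", buf, 0)
    pos = 4
    items = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        items.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return items

buf = vlen_encode(["Hi", "Hey"])
print(vlen_decode(buf))  # ['Hi', 'Hey']
```

Unlike the Arrow layout, this interleaves lengths with data in a single buffer, which is simpler to stream but does not support random access to the i-th item without scanning; that trade-off is one reason an offsets-buffer design is attractive for chunked storage.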
I think this issue needs a champion who wants to write a ZEP.
Over at zarr-developers/zarr-python#2031 I have a proof of concept showing that we can very easily support UTF-8 and variable-length strings by leveraging Arrow encoding of string arrays. Would love some feedback on whether that approach seems promising.
In today's discussion the need for UTF-8 came up. Thought we already had an issue for this, but am not finding it.
Would be useful to have UTF-8 support in the spec or as a high priority extension. Raising here to start the discussion about how we want to approach this.
cc @joshmoore @alimanfoo @shoyer @Carreau