
Add string and bytes dtypes plus vlen-utf8 and vlen-bytes codecs #2036

Merged
merged 35 commits into from
Oct 8, 2024

Conversation

rabernat
Contributor

@rabernat rabernat commented Jul 14, 2024

This is an alternative approach to #2031 for implementing variable length string encoding, using the legacy numcodecs vlen-utf8 codec.

The codec is very simple. It encodes variable length data like this:

| header | item 0 len | item 0 data | item 1 len | item 1 data | ...
| uint4  | uint4      | variable    | uint4      | variable    | ...

where the header is the number of items and each item is preceded by a uint4 giving its length in bytes.
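For illustration, here is a minimal pure-Python sketch of that wire format (the real codec is numcodecs' Cython implementation; `encode_vlen` and `decode_vlen` are hypothetical names used only for this sketch):

```python
import struct

def encode_vlen(items: list[bytes]) -> bytes:
    # header: number of items as a little-endian uint32 ("uint4")
    out = [struct.pack("<I", len(items))]
    for item in items:
        # each item is preceded by its byte length, also uint32
        out.append(struct.pack("<I", len(item)))
        out.append(item)
    return b"".join(out)

def decode_vlen(buf: bytes) -> list[bytes]:
    (n,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    items = []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(buf[offset : offset + length])
        offset += length
    return items
```

For vlen-utf8 the items are UTF-8-encoded strings; vlen-bytes uses the same framing with raw byte strings.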

We could add this as an "official" Zarr V3 codec if we want. The point of this PR is to demonstrate that it is simple to implement. If desirable, we could extend this to the other vlen types (bytes and array).

I am in favor of supporting this "legacy" encoding for backwards compatibility--I'd like existing Zarr V2 data to be convertible to V3 without rewriting the chunks.

However, I would favor using Arrow encoding (#2031) as the preferred default for new data.

xref:

@rabernat
Contributor Author

rabernat commented Jul 14, 2024

I compared this encoding + Zstd compression with the arrow-based encoding in #2031 and got some surprising results. The dataset consists of 47868 city names.

encoding    compression    stored bytes
vlen-utf8   None           642616
vlen-utf8   Zstd(2)        366504
vlen-utf8   Zstd(5)        358428
arrow       None           657185
arrow       Zstd(2)        462208
arrow       Zstd(5)        455234

The surprise is that the legacy encoding compresses better than the arrow-based one.

import asyncio

import zarr
from zarr.codecs import ZstdCodec
from zarr.buffer import default_buffer_prototype
from zarr.api.synchronous import array

# df is a pandas DataFrame loaded from the attached worldcities.csv

# swap the import below to compare across the two PRs
from zarr.codecs import VLenUTF8Codec as BytesCodec
# from zarr.codecs import ArrowRecordBatchCodec as BytesCodec

a = array(df.city.values, chunks=(1000,), fill_value='', codecs=[BytesCodec(), ZstdCodec(level=5)])

# ridiculous workaround for the lack of getsize on the V3 store API
# (top-level await assumes an IPython / Jupyter session)
store_path = a._async_array.store_path
all_items = [item async for item in store_path.store.list()]

async def get_item_size(store, item):
    return len(await store.get(item, prototype=default_buffer_prototype))

all_sizes = await asyncio.gather(*[get_item_size(store_path.store, item) for item in all_items])
total = sum(all_sizes)

worldcities.csv

@tomwhite
Contributor

Thanks for opening this PR @rabernat!

I just tried your implementation of VLenUTF8Codec with the VCF to Zarr conversion in sgkit-dev/bio2zarr#254, and the test worked, which shows string encoding and decoding is working correctly in this case.

@tomwhite
Contributor

I am in favor of supporting this "legacy" encoding for backwards compatibility--I'd like existing Zarr V2 data to be convertible to V3 without rewriting the chunks.

+1

@jhamman jhamman added the V3 Affects the v3 branch label Aug 9, 2024
@TomAugspurger
Contributor

Thanks. Would it be possible to support object-dtype arrays here too (I think zarr-v2 or xarray assumes object-dtype arrays hold strings)?

Right now this raises an exception.

import zarr
import zarr.store
import numpy as np
import json
# from zarr.codecs import VLenUTF8Codec

store = zarr.store.MemoryStore(mode="a")
data = np.array(['a', 'bc', 'def', 'asdf', 'asdfg'], dtype=np.dtype(object))

arr = zarr.create(shape=data.shape, store=store, zarr_format=2, filters=[{"id": "vlen-utf8"}], fill_value="", dtype=object)
arr[:] = data
arr2 = zarr.open_array(store=store, zarr_format=2)
print(arr2[:])

Something like this seems to do the trick for zarr_format=2:

diff --git a/src/zarr/codecs/_v2.py b/src/zarr/codecs/_v2.py
index cc6129e6..c1f4944e 100644
--- a/src/zarr/codecs/_v2.py
+++ b/src/zarr/codecs/_v2.py
@@ -37,7 +37,13 @@ class V2Compressor(ArrayBytesCodec):
 
         # ensure correct dtype
         if str(chunk_numpy_array.dtype) != chunk_spec.dtype:
-            chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
+            if chunk_spec.dtype.kind == "O":
+                # I think we need to assert something about the codec pipeline here
+                # to ensure that we're followed by something that'll do the cast properly.
+                # chunk_numpy_array = chunk_numpy_array.astype(chunk_spec.dtype)
+                pass
+            else:
+                chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
 
         return get_ndbuffer_class().from_numpy_array(chunk_numpy_array)
 
diff --git a/src/zarr/core/buffer/core.py b/src/zarr/core/buffer/core.py
index 9a808e08..034711d9 100644
--- a/src/zarr/core/buffer/core.py
+++ b/src/zarr/core/buffer/core.py
@@ -472,7 +472,7 @@ class NDBuffer:
         # use array_equal to obtain equal_nan=True functionality
         _data, other = np.broadcast_arrays(self._data, other)
         return np.array_equal(
-            self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "US" else False
+            self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "OUS" else False
         )
 
     def fill(self, value: Any) -> None:

@rabernat
Contributor Author

We could do that if we assume that the object dtype only ever holds strings. But that seems like a pretty extreme assumption.

Zarr V2 allows arbitrary python objects, but V3 does not. Seems reasonable to force users to use an explicit string dtype for string arrays.

What's the use case you have in mind for object string arrays?

@TomAugspurger
Contributor

I'll look a bit closer tomorrow, but this was for reading zarr v2 data (an xarray dataset with an object-dtype coordinate array full of strings).

I agree that being explicit about the dtype here is better. Hopefully we can find a way that works well for older v2 datasets without too much hassle.

@rabernat
Contributor Author

rabernat commented Oct 1, 2024

Here's a little exploration of different types of numpy string arrays

import numpy as np
strings = ["hello", "world", "this", "is", "a", "test"]
# default
a1 = np.array(strings)
assert a1.dtype == '<U5'
# numpy 2.0 vlen string
a2 = np.array(strings, dtype='T')
# object
a3 = np.array(strings, dtype='O')

# the first two are detectable as string arrays
assert np.issubdtype(np.str_, a1.dtype)
assert np.issubdtype(np.str_, a2.dtype)

# the object is not, even though it contains all strings
assert not np.issubdtype(np.str_, a3.dtype)

# we can cast the object dtype to a vlen string or regular string
assert np.issubdtype(np.str_, a3.astype('T').dtype)

# however, we want to guard against casting if the array contains non-string elements
a4 = np.array(strings + [1], dtype='O') 
a4.astype('T')  # works
a4.astype('T', casting="same_kind")  # -> TypeError

So my proposal for object arrays would be to try to cast them using .astype('T', casting="same_kind") and, if successful, use string encoding.

edit: actually that is useless, because this also fails:

a3.astype('T', casting="same_kind")

So it looks like there is actually no reliable way to know if an object array is a string array.
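The only remaining option would be an element-wise check, since the dtype itself carries no information about the contents. A sketch of that fallback (`is_string_object_array` is a hypothetical helper name, not anything in zarr-python):

```python
import numpy as np

def is_string_object_array(arr: np.ndarray) -> bool:
    # An object dtype tells us nothing; the only reliable test
    # is to inspect every element. O(n), but unavoidable.
    return arr.dtype.kind == "O" and all(isinstance(x, str) for x in arr.flat)

strings_only = np.array(["hello", "world"], dtype="O")
mixed = np.array(["hello", 1], dtype="O")

assert is_string_object_array(strings_only)
assert not is_string_object_array(mixed)
```

This is essentially what xarray does on its side before handing object arrays to zarr, as discussed below.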

@TomAugspurger - do you have a test case you are using for object arrays? Is my example enough to cover what Xarray needs? Under what circumstances does Xarray produce object arrays?


Also, my code is relying on numpy 2.0 vlen-string arrays. I guess that's not a hard dependency for Zarr. 🤔

@TomAugspurger
Contributor

TomAugspurger commented Oct 1, 2024

do you have a test case you are using for object arrays?

In xarray, xarray/tests/test_backends.py::TestZarrDictStore::test_roundtrip_object_dtype is the failing test: https://github.com/pydata/xarray/blob/095d47fcb036441532bf6f5aed907a6c4cfdfe0d/xarray/tests/test_backends.py#L506

 In [10]: import xarray as xr, numpy as np, zarr

In [11]: ds = xr.Dataset({"a": xr.DataArray(np.array(['a', 'b', 'c'], dtype=object))})

In [12]: store = zarr.store.MemoryStore(mode="a")

In [13]: ds.to_zarr(store)

Those are manually set to object dtype. But that same test also creates float arrays, so there's something more sophisticated than "object dtype == strings". I'll do some more looking as I work through those tests, but I think this PR can proceed independently.

@TomAugspurger
Copy link
Contributor

OK, I think I understand how v2 did things a bit better now.

  1. It's xarray that treats object-dtype arrays as strings, after validating that's actually true. If so, then xarray sets the dtype passed to Zarr to str: https://github.com/pydata/xarray/blob/5c6418219e551cd59f81e3d1c6a828c1c23cd763/xarray/backends/zarr.py#L899-L900
  2. Then zarr turns dtype=str back into dtype=object (assuming an object codec is provided, which xarray does):

     zarr-python/zarr/util.py, lines 187 to 192 in 6b9ab9e:

         if isinstance(dtype, str):
             # allow ':' to delimit class from codec arguments
             tokens = dtype.split(":")
             key = tokens[0]
             if key in object_codecs:
                 dtype = np.dtype(object)

The net result is that in zarr-python 2.x writing zarr v2 data, here's the compressor and filters for various dtypes:

dtype            compressor  filters
strings (U)      Blosc
bytes (S)        Blosc
object[str] (O)  Blosc       VLenUTF8

@rabernat
Contributor Author

rabernat commented Oct 3, 2024

Thanks Tom. Planning to work on this today.

@mkitti

mkitti commented Oct 3, 2024

Could we use the existing sharding codec for new variable length data? Essentially each variable length element is just an "inner chunk" with some offset and length (nbytes)?
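That suggestion amounts to storing an (offset, nbytes) index alongside the packed bytes, much like the sharding codec's inner-chunk index. A minimal sketch of the idea (a hypothetical layout for illustration, not the actual sharding codec):

```python
import struct

def pack_with_index(items: list[bytes]) -> tuple[bytes, bytes]:
    """Pack items back-to-back and build an (offset, nbytes) index,
    analogous to treating each element as an "inner chunk"."""
    data = b"".join(items)
    index = bytearray()
    offset = 0
    for item in items:
        # one (offset, nbytes) pair of little-endian uint64 per item
        index += struct.pack("<QQ", offset, len(item))
        offset += len(item)
    return data, bytes(index)

def get_item(data: bytes, index: bytes, i: int) -> bytes:
    # random access: read the i-th index entry, then slice the data
    offset, nbytes = struct.unpack_from("<QQ", index, i * 16)
    return data[offset : offset + nbytes]
```

Unlike the length-prefixed vlen-utf8 layout, this gives O(1) random access to any element at the cost of a separate fixed-size index.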

@rabernat rabernat requested a review from jhamman October 7, 2024 14:24
Contributor Author

@rabernat rabernat left a comment


I have this PR in a place where I am pretty happy with it. I have mostly refactored the internal code to use our own DataType enum instead of numpy dtypes. This is necessary because the vlen types (strings, bytestrings) do not have such a simple 1:1 relationship with numpy dtypes.

However, I am thoroughly stuck on typing stuff. I am not able to recreate the sort of overloaded strict typing that was previously used on parse_fill_value.

I would greatly appreciate some help and advice from whoever wrote the parse_fill_value type signatures. 🙏

@rabernat
Contributor Author

rabernat commented Oct 8, 2024

Do not know why RTD build is failing, but otherwise this is GTG.

@rabernat rabernat changed the title Implement legacy vlen-utf8 codec Add string and bytes dtypes plus vlen-utf8 and vlen-bytes codecs Oct 8, 2024
@jhamman
Member

jhamman commented Oct 8, 2024

Do not know why RTD build is failing, but otherwise this is GTG.

This warning is causing the failure:

/home/docs/checkouts/readthedocs.org/user_builds/zarr/checkouts/2036/docs/_autoapi/zarr/strings/index.rst:73: WARNING: duplicate object description of zarr.strings.STRING_DTYPE, other instance in _autoapi/zarr/strings/index, use :no-index: for one of them

Member

@jhamman jhamman left a comment


Really cool @rabernat!

I have a handful of suggestions, mostly cosmetic.

Comment on lines +483 to +484

    np_dtype = dtype.to_numpy()
    np_dtype = cast(np.dtype[np.generic], np_dtype)

Member

Suggested change:

    np_dtype = cast(np.dtype[np.generic], dtype.to_numpy())