
Add string and bytes dtypes plus vlen-utf8 and vlen-bytes codecs #2036

Merged
merged 35 commits into from
Oct 8, 2024

Conversation

rabernat
Contributor

@rabernat rabernat commented Jul 14, 2024

This is an alternative approach to #2031 for implementing variable length string encoding, using the legacy numcodecs vlen-utf8 codec.

The codec is very simple. It encodes variable length data like this:

| header | item 0 len | item 0 data | item 1 len | item 1 data | ...
| uint4  | uint4      | variable    | uint4      | variable    | ...

where the header is the number of items and each item is preceded by a uint4 giving its length in bytes.
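For illustration, here is a minimal pure-Python sketch of that wire format (the real codec is numcodecs' Cython implementation; `encode_vlen` and `decode_vlen` are hypothetical names used only for this sketch):

```python
import struct

def encode_vlen(items: list[bytes]) -> bytes:
    # header: number of items as a little-endian uint32 ("uint4")
    out = [struct.pack("<I", len(items))]
    for item in items:
        # each item is preceded by its byte length, also uint32
        out.append(struct.pack("<I", len(item)))
        out.append(item)
    return b"".join(out)

def decode_vlen(buf: bytes) -> list[bytes]:
    (n,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    items = []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(buf[offset : offset + length])
        offset += length
    return items
```

For vlen-utf8 the items are UTF-8-encoded strings; vlen-bytes uses the same framing with raw byte strings.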

We could add this as an "official" Zarr V3 codec if we want. The point of this PR is to demonstrate that it is simple to implement. If desirable, we could extend this to the other vlen types (bytes and array).

I am in favor of supporting this "legacy" encoding for backwards compatibility--I'd like existing Zarr V2 data to be convertible to V3 without rewriting the chunks.

However, I would favor using Arrow encoding (#2031) as the preferred default for new data.

xref:

@rabernat
Contributor Author

rabernat commented Jul 14, 2024

I compared this encoding + Zstd compression with the arrow-based encoding in #2031 and got some surprising results. The dataset consists of 47868 city names.

encoding    compression    stored bytes
vlen-utf8   None           642616
vlen-utf8   Zstd(2)        366504
vlen-utf8   Zstd(5)        358428
arrow       None           657185
arrow       Zstd(2)        462208
arrow       Zstd(5)        455234

The surprise is that the legacy encoding compresses better than the arrow-based one.

import asyncio

import zarr
from zarr.codecs import ZstdCodec
from zarr.buffer import default_buffer_prototype
from zarr.api.synchronous import array

# df is a pandas DataFrame loaded from the attached worldcities.csv

# swap the import below to compare across the two PRs
from zarr.codecs import VLenUTF8Codec as BytesCodec
# from zarr.codecs import ArrowRecordBatchCodec as BytesCodec

a = array(df.city.values, chunks=(1000,), fill_value='', codecs=[BytesCodec(), ZstdCodec(level=5)])

# ridiculous workaround for the lack of getsize on the V3 store API
# (top-level await assumes an IPython / Jupyter session)
store_path = a._async_array.store_path
all_items = [item async for item in store_path.store.list()]

async def get_item_size(store, item):
    return len(await store.get(item, prototype=default_buffer_prototype))

all_sizes = await asyncio.gather(*[get_item_size(store_path.store, item) for item in all_items])
total = sum(all_sizes)

worldcities.csv

@tomwhite
Contributor

Thanks for opening this PR @rabernat!

I just tried your implementation of VLenUTF8Codec with the VCF to Zarr conversion in sgkit-dev/bio2zarr#254, and the test worked, which shows string encoding and decoding is working correctly in this case.

@tomwhite
Contributor

I am in favor of supporting this "legacy" encoding for backwards compatibility--I'd like existing Zarr V2 data to be convertible to V3 without rewriting the chunks.

+1

@jhamman jhamman added the V3 Affects the v3 branch label Aug 9, 2024
@TomAugspurger
Contributor

Thanks. Would it be possible to support object-dtype arrays here too (I think zarr-v2 or xarray assumes object-dtype arrays hold strings)?

Right now this raises an exception.

import zarr
import zarr.store
import numpy as np
import json
# from zarr.codecs import VLenUTF8Codec

store = zarr.store.MemoryStore(mode="a")
data = np.array(['a', 'bc', 'def', 'asdf', 'asdfg'], dtype=np.dtype(object))

arr = zarr.create(shape=data.shape, store=store, zarr_format=2, filters=[{"id": "vlen-utf8"}], fill_value="", dtype=object)
arr[:] = data
arr2 = zarr.open_array(store=store, zarr_format=2)
print(arr2[:])

Something like this seems to do the trick for zarr_format=2:

diff --git a/src/zarr/codecs/_v2.py b/src/zarr/codecs/_v2.py
index cc6129e6..c1f4944e 100644
--- a/src/zarr/codecs/_v2.py
+++ b/src/zarr/codecs/_v2.py
@@ -37,7 +37,13 @@ class V2Compressor(ArrayBytesCodec):
 
         # ensure correct dtype
         if str(chunk_numpy_array.dtype) != chunk_spec.dtype:
-            chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
+            if chunk_spec.dtype.kind == "O":
+                # I think we need to assert something about the codec pipeline here
+                # to ensure that we're followed by something that'll do the cast properly.
+                # chunk_numpy_array = chunk_numpy_array.astype(chunk_spec.dtype)
+                pass
+            else:
+                chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
 
         return get_ndbuffer_class().from_numpy_array(chunk_numpy_array)
 
diff --git a/src/zarr/core/buffer/core.py b/src/zarr/core/buffer/core.py
index 9a808e08..034711d9 100644
--- a/src/zarr/core/buffer/core.py
+++ b/src/zarr/core/buffer/core.py
@@ -472,7 +472,7 @@ class NDBuffer:
         # use array_equal to obtain equal_nan=True functionality
         _data, other = np.broadcast_arrays(self._data, other)
         return np.array_equal(
-            self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "US" else False
+            self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "OUS" else False
         )
 
     def fill(self, value: Any) -> None:

@rabernat
Contributor Author

We could do that if we assume that the object dtype only ever holds strings. But that seems like a pretty extreme assumption.

Zarr V2 allows arbitrary python objects, but V3 does not. Seems reasonable to force users to use an explicit string dtype for string arrays.

What's the use case you have in mind for object string arrays?

@TomAugspurger
Contributor

I'll look a bit closer tomorrow, but this was for reading zarr v2 data (an xarray dataset with an object-dtype coordinate array full of strings).

I agree that being explicit about the dtype here is better. Hopefully we can find a way that works well for older v2 datasets without too much hassle.

@rabernat
Contributor Author

rabernat commented Oct 1, 2024

Here's a little exploration of different types of numpy string arrays

import numpy as np
strings = ["hello", "world", "this", "is", "a", "test"]
# default
a1 = np.array(strings)
assert a1.dtype == '<U5'
# numpy 2.0 vlen string
a2 = np.array(strings, dtype='T')
# object
a3 = np.array(strings, dtype='O')

# the first two are detectable as string arrays
assert np.issubdtype(np.str_, a1.dtype)
assert np.issubdtype(np.str_, a2.dtype)

# the object is not, even though it contains all strings
assert not np.issubdtype(np.str_, a3.dtype)

# we can cast the object dtype to a vlen string or regular string
assert np.issubdtype(np.str_, a3.astype('T').dtype)

# however, we want to guard against casting if the array contains non-string elements
a4 = np.array(strings + [1], dtype='O') 
a4.astype('T')  # works
a4.astype('T', casting="same_kind")  # -> TypeError

So my proposal for object arrays would be to try to cast them using .astype('T', casting="same_kind") and, if successful, use string encoding.

edit: actually that is useless, because this also fails:

a3.astype('T', casting="same_kind")

So it looks like there is actually no reliable way to know if an object array is a string array.
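The only remaining option would be an element-wise check, since the dtype itself carries no information about the contents. A sketch of that fallback (`is_string_object_array` is a hypothetical helper name, not anything in zarr-python):

```python
import numpy as np

def is_string_object_array(arr: np.ndarray) -> bool:
    # An object dtype tells us nothing; the only reliable test
    # is to inspect every element. O(n), but unavoidable.
    return arr.dtype.kind == "O" and all(isinstance(x, str) for x in arr.flat)

strings_only = np.array(["hello", "world"], dtype="O")
mixed = np.array(["hello", 1], dtype="O")

assert is_string_object_array(strings_only)
assert not is_string_object_array(mixed)
```

This is essentially what xarray does on its side before handing object arrays to zarr, as discussed below.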

@TomAugspurger - do you have a test case you are using for object arrays? Is my example enough to cover what Xarray needs? Under what circumstances does Xarray produce object arrays?


Also, my code is relying on numpy 2.0 vlen-string arrays. I guess that's not a hard dependency for Zarr. 🤔

@TomAugspurger
Contributor

TomAugspurger commented Oct 1, 2024

do you have a test case you are using for object arrays?

In xarray, xarray/tests/test_backends.py::TestZarrDictStore::test_roundtrip_object_dtype is the failing test: https://github.com/pydata/xarray/blob/095d47fcb036441532bf6f5aed907a6c4cfdfe0d/xarray/tests/test_backends.py#L506

 In [10]: import xarray as xr, numpy as np, zarr

In [11]: ds = xr.Dataset({"a": xr.DataArray(np.array(['a', 'b', 'c'], dtype=object))})

In [12]: store = zarr.store.MemoryStore(mode="a")

In [13]: ds.to_zarr(store)

Those are manually set to object dtype. But that same test also creates float arrays, so there's something more sophisticated than "object dtype == strings". I'll do some more looking as I work through those tests, but I think this PR can proceed independently.

@TomAugspurger
Copy link
Contributor

OK, I think I understand how v2 did things a bit better now.

  1. It's xarray that treats object-dtype arrays as strings, after validating that's actually true. If so, then xarray sets the dtype passed to Zarr to str: https://github.com/pydata/xarray/blob/5c6418219e551cd59f81e3d1c6a828c1c23cd763/xarray/backends/zarr.py#L899-L900
  2. Then zarr turns dtype=str back into dtype=object (assuming an object codec is provided, which xarray does):

     zarr-python/zarr/util.py, lines 187 to 192 in 6b9ab9e:

         if isinstance(dtype, str):
             # allow ':' to delimit class from codec arguments
             tokens = dtype.split(":")
             key = tokens[0]
             if key in object_codecs:
                 dtype = np.dtype(object)

The net result is that in zarr-python 2.x writing zarr v2 data, here's the compressor and filters for various dtypes:

dtype            compressor  filters
strings (U)      Blosc
bytes (S)        Blosc
object[str] (O)  Blosc       VLenUTF8

@rabernat
Contributor Author

rabernat commented Oct 3, 2024

Thanks Tom. Planning to work on this today.

@mkitti

mkitti commented Oct 3, 2024

Could we use the existing sharding codec for new variable length data? Essentially each variable length element is just an "inner chunk" with some offset and length (nbytes)?
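That suggestion amounts to storing an (offset, nbytes) index alongside the packed bytes, much like the sharding codec's inner-chunk index. A minimal sketch of the idea (a hypothetical layout for illustration, not the actual sharding codec):

```python
import struct

def pack_with_index(items: list[bytes]) -> tuple[bytes, bytes]:
    """Pack items back-to-back and build an (offset, nbytes) index,
    analogous to treating each element as an "inner chunk"."""
    data = b"".join(items)
    index = bytearray()
    offset = 0
    for item in items:
        # one (offset, nbytes) pair of little-endian uint64 per item
        index += struct.pack("<QQ", offset, len(item))
        offset += len(item)
    return data, bytes(index)

def get_item(data: bytes, index: bytes, i: int) -> bytes:
    # random access: read the i-th index entry, then slice the data
    offset, nbytes = struct.unpack_from("<QQ", index, i * 16)
    return data[offset : offset + nbytes]
```

Unlike the length-prefixed vlen-utf8 layout, this gives O(1) random access to any element at the cost of a separate fixed-size index.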

@rabernat rabernat requested a review from jhamman October 7, 2024 14:24
Contributor Author

@rabernat rabernat left a comment


I have this PR in a place where I am pretty happy with it. I have mostly refactored the internal code to use our own DataType enum instead of numpy dtypes. This is necessary because the vlen types (strings, bytestrings) do not have such a simple 1:1 relationship with numpy dtypes.

However, I am thoroughly stuck on typing stuff. I am not able to recreate the sort of overloaded strict typing that was previously used on parse_fill_value.

I would greatly appreciate some help and advice from whoever wrote the parse_fill_value type signatures. 🙏

@rabernat
Contributor Author

rabernat commented Oct 8, 2024

Do not know why RTD build is failing, but otherwise this is GTG.

@rabernat rabernat changed the title Implement legacy vlen-utf8 codec Add string and bytes dtypes plus vlen-utf8 and vlen-bytes codecs Oct 8, 2024
@jhamman
Member

jhamman commented Oct 8, 2024

Do not know why RTD build is failing, but otherwise this is GTG.

This warning is causing the failure:

/home/docs/checkouts/readthedocs.org/user_builds/zarr/checkouts/2036/docs/_autoapi/zarr/strings/index.rst:73: WARNING: duplicate object description of zarr.strings.STRING_DTYPE, other instance in _autoapi/zarr/strings/index, use :no-index: for one of them

Member

@jhamman jhamman left a comment


Really cool @rabernat!

I have a handful of suggestions, mostly cosmetic.

Comment on lines +483 to +484

    np_dtype = dtype.to_numpy()
    np_dtype = cast(np.dtype[np.generic], np_dtype)

Member

Suggested change:

    np_dtype = cast(np.dtype[np.generic], dtype.to_numpy())