Add string and bytes dtypes plus vlen-utf8 and vlen-bytes codecs #2036

Conversation
I compared this encoding + Zstd compression with the arrow-based encoding in #2031 and got some surprising results. The dataset consists of 47868 city names.

The surprise is that the legacy encoding compresses better than the arrow-based one.

```python
import asyncio

import zarr
from zarr.codecs import ZstdCodec
from zarr.buffer import default_buffer_prototype
from zarr.api.synchronous import array

# swap to compare across PRs
from zarr.codecs import VLenUTF8Codec as BytesCodec
# from zarr.codecs import ArrowRecordBatchCodec as BytesCodec

# df is a DataFrame holding the 47868 city names
a = array(df.city.values, chunks=(1000,), fill_value='', codecs=[BytesCodec(), ZstdCodec(level=5)])

# ridiculous workaround for the lack of getsize on the V3 store API
store_path = a._async_array.store_path
all_items = [item async for item in store_path.store.list()]

async def get_item_size(store, item):
    return len(await store.get(item, prototype=default_buffer_prototype))

all_sizes = await asyncio.gather(*[get_item_size(store_path.store, item) for item in all_items])
total = sum(all_sizes)
```
Thanks for opening this PR @rabernat! I just tried your implementation of

+1
Thanks. Would it be possible to support object-dtype arrays here too (I think zarr-v2 or xarray assumes object-dtype arrays hold strings)? Right now this raises an exception.

```python
import json

import numpy as np
import zarr
import zarr.store
# from zarr.codecs import VLenUTF8Codec

store = zarr.store.MemoryStore(mode="a")
data = np.array(['a', 'bc', 'def', 'asdf', 'asdfg'], dtype=np.dtype(object))
arr = zarr.create(shape=data.shape, store=store, zarr_format=2, filters=[{"id": "vlen-utf8"}], fill_value="", dtype=object)
arr[:] = data

arr2 = zarr.open_array(store=store, zarr_format=2)
print(arr2[:])
```

Something like this seems to do the trick:

```diff
diff --git a/src/zarr/codecs/_v2.py b/src/zarr/codecs/_v2.py
index cc6129e6..c1f4944e 100644
--- a/src/zarr/codecs/_v2.py
+++ b/src/zarr/codecs/_v2.py
@@ -37,7 +37,13 @@ class V2Compressor(ArrayBytesCodec):
         # ensure correct dtype
         if str(chunk_numpy_array.dtype) != chunk_spec.dtype:
-            chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
+            if chunk_spec.dtype.kind == "O":
+                # I think we need to assert something about the codec pipeline here
+                # to ensure that we're followed by something that'll do the cast properly.
+                # chunk_numpy_array = chunk_numpy_array.astype(chunk_spec.dtype)
+                pass
+            else:
+                chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)

         return get_ndbuffer_class().from_numpy_array(chunk_numpy_array)
diff --git a/src/zarr/core/buffer/core.py b/src/zarr/core/buffer/core.py
index 9a808e08..034711d9 100644
--- a/src/zarr/core/buffer/core.py
+++ b/src/zarr/core/buffer/core.py
@@ -472,7 +472,7 @@ class NDBuffer:
         # use array_equal to obtain equal_nan=True functionality
         _data, other = np.broadcast_arrays(self._data, other)
         return np.array_equal(
-            self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "US" else False
+            self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "OUS" else False
         )

     def fill(self, value: Any) -> None:
```
We could do that if we assume that the object dtype only ever holds strings. But that seems like a pretty extreme assumption. Zarr V2 allows arbitrary Python objects, but V3 does not. It seems reasonable to force users to use an explicit string dtype for string arrays. What's the use case you have in mind for object string arrays?
I'll look a bit closer tomorrow, but this was for reading zarr v2 data (an xarray dataset with an object-dtype coordinate array full of strings). I agree that being explicit about the dtype here is better. Hopefully we can find a way that works well for older v2 datasets without too much hassle.
Here's a little exploration of different types of numpy string arrays:

```python
import numpy as np

strings = ["hello", "world", "this", "is", "a", "test"]

# default
a1 = np.array(strings)
assert a1.dtype == '<U5'

# numpy 2.0 vlen string
a2 = np.array(strings, dtype='T')

# object
a3 = np.array(strings, dtype='O')

# the first two are detectable as string arrays
assert np.issubdtype(np.str_, a1.dtype)
assert np.issubdtype(np.str_, a2.dtype)

# the object array is not, even though it contains all strings
assert not np.issubdtype(np.str_, a3.dtype)

# we can cast the object dtype to a vlen string or regular string
assert np.issubdtype(np.str_, a3.astype('T').dtype)

# however, we want to guard against casting if the array contains non-string elements
a4 = np.array(strings + [1], dtype='O')
a4.astype('T')  # works
a4.astype('T', casting="same_kind")  # -> TypeError
```

So my proposal for object arrays would be to try to cast them using `casting="same_kind"`.

edit: actually that is useless, because this also fails:

```python
a3.astype('T', casting="same_kind")
```

So it looks like there is actually no reliable way to know if an object array is a string array.

@TomAugspurger - do you have a test case you are using for object arrays? Is my example enough to cover what Xarray needs? Under what circumstances does Xarray produce object arrays?

Also, my code is relying on numpy 2.0 vlen-string arrays. I guess that's not a hard dependency for Zarr. 🤔
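(An aside, not from this thread: the one check that does reliably distinguish an all-string object array is element-wise inspection, at the cost of a full pass over the data. A minimal sketch; the helper name is invented:)

```python
# Sketch of an element-wise fallback check: reliable, but O(n) per chunk,
# so it may be too slow to run on every write.
import numpy as np

def object_array_is_all_strings(arr: np.ndarray) -> bool:
    """Return True if every element of an object array is a str."""
    assert arr.dtype.kind == "O"
    return all(isinstance(x, str) for x in arr.flat)

assert object_array_is_all_strings(np.array(["a", "bc"], dtype=object))
assert not object_array_is_all_strings(np.array(["a", 1], dtype=object))
```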
In xarray:

```python
In [10]: import xarray as xr, numpy as np, zarr

In [11]: ds = xr.Dataset({"a": xr.DataArray(np.array(['a', 'b', 'c'], dtype=object))})

In [12]: store = zarr.store.MemoryStore(mode="a")

In [13]: ds.to_zarr(store)
```

Those are manually set to object dtype. But that same test also creates float arrays, so there's something more sophisticated than "object dtype == strings". I'll do some more looking as I work through those tests, but I think this PR can proceed independently.
OK, I think I understand how v2 did things a bit better now.

The net result is the compressor and filters that zarr-python 2.x records when writing zarr v2 data, which vary by dtype (see the sketch below).
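A hedged sketch, using the zarr-python 2.x API as I understand it, for inspecting what compressor and filters get recorded for a few dtypes:

```python
# Sketch (zarr-python 2.x): create v2 arrays with different dtypes and
# inspect the compressor/filters recorded in the metadata.
import numpy as np
import numcodecs
import zarr  # zarr-python 2.x

for data in [
    np.arange(10, dtype="f8"),           # fixed-width float
    np.array(["a", "bc"], dtype="<U2"),  # fixed-width unicode
]:
    z = zarr.array(data)
    print(data.dtype, "->", z.compressor, z.filters)

# object-dtype arrays need an explicit object_codec, which zarr 2.x
# stores in the filters list
z = zarr.array(np.array(["a", "bc"], dtype=object), object_codec=numcodecs.VLenUTF8())
print(z.dtype, "->", z.compressor, z.filters)
```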
Thanks Tom. Planning to work on this today.
Could we use the existing sharding codec for new variable-length data? Essentially, each variable-length element is just an "inner chunk" with some offset and length (nbytes)?
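(A rough sketch of that idea, not from the PR: index each variable-length element with an (offset, nbytes) pair, the way the sharding codec indexes inner chunks:)

```python
# Sketch: address variable-length elements like sharding inner chunks.
items = [b"hello", b"", b"world!"]
payload = b"".join(items)

# per-item index: (offset, nbytes)
index = []
offset = 0
for b in items:
    index.append((offset, len(b)))
    offset += len(b)

# random access to item i without scanning the whole chunk
i = 2
off, n = index[i]
assert payload[off : off + n] == b"world!"
```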
I have this PR in a place where I am pretty happy with it. I have mostly refactored the internal code to use our own DataType enum instead of numpy dtypes. This is necessary because the vlen types (strings, bytestrings) do not have such a simple 1:1 relationship with numpy dtypes.

However, I am thoroughly stuck on typing stuff. I am not able to recreate the sort of overloaded strict typing that was previously used on parse_fill_value.

I would greatly appreciate some help and advice from whoever wrote the parse_fill_value type signatures. 🙏
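(For readers unfamiliar with the pattern, here is a hedged sketch of the kind of @overload-based typing in question, where the static return type narrows with the dtype argument. The signatures below are invented, not zarr's actual ones:)

```python
# Hypothetical sketch of dtype-dependent overloads; the real
# parse_fill_value in zarr has different signatures.
from typing import Any, Literal, overload

@overload
def parse_fill_value(fill_value: Any, dtype: Literal["bool"]) -> bool: ...
@overload
def parse_fill_value(fill_value: Any, dtype: Literal["float32", "float64"]) -> float: ...
@overload
def parse_fill_value(fill_value: Any, dtype: Literal["string"]) -> str: ...

def parse_fill_value(fill_value: Any, dtype: str) -> bool | float | str:
    # runtime implementation: validate/coerce fill_value for the dtype
    if dtype == "bool":
        return bool(fill_value)
    if dtype in ("float32", "float64"):
        return float(fill_value)
    return str(fill_value)
```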
Don't know why the RTD build is failing, but otherwise this is GTG.
This warning is causing the failure:
Really cool @rabernat!
I have a handful of suggestions, mostly cosmetic.
```python
np_dtype = dtype.to_numpy()
np_dtype = cast(np.dtype[np.generic], np_dtype)
```

Suggested change:

```python
np_dtype = cast(np.dtype[np.generic], dtype.to_numpy())
```
Co-authored-by: Joe Hamman <[email protected]>
This is an alternative approach to #2031 for implementing variable length string encoding, using the legacy numcodecs vlen-utf8 codec.
The codec is very simple. It encodes variable length data like this:

`header | len(item1) | item1 | len(item2) | item2 | ...`

where `header` is the number of items and each item is preceded by an int which tells how many bytes long it is.

We could add this as an "official" Zarr V3 codec if we want. The point of this PR is to demonstrate that it is simple to implement. If desirable, we could extend this to the other vlen types (bytes and array).
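(A minimal pure-Python sketch of that layout. Hedged: numcodecs' actual VLenUTF8 is implemented in Cython; 4-byte little-endian length prefixes are assumed here:)

```python
# Sketch of the vlen-utf8 layout: a uint32 item count, then each item as
# a uint32 byte-length followed by its UTF-8 payload.
import struct

def encode_vlen_utf8(items: list[str]) -> bytes:
    out = [struct.pack("<I", len(items))]          # header: number of items
    for s in items:
        b = s.encode("utf-8")
        out.append(struct.pack("<I", len(b)) + b)  # length prefix + payload
    return b"".join(out)

def decode_vlen_utf8(buf: bytes) -> list[str]:
    (n,) = struct.unpack_from("<I", buf, 0)
    items, offset = [], 4
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(buf[offset : offset + length].decode("utf-8"))
        offset += length
    return items

assert decode_vlen_utf8(encode_vlen_utf8(["a", "bc", ""])) == ["a", "bc", ""]
```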
I am in favor of supporting this "legacy" encoding for backwards compatibility--I'd like existing Zarr V2 data to be convertible to V3 without rewriting the chunks.
However, I would favor using the Arrow encoding (#2031) as the preferred default for new data.
xref: