-
-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add string
and bytes
dtypes plus vlen-utf8
and vlen-bytes
codecs
#2036
Merged
Merged
Changes from 16 commits
Commits
Show all changes
35 commits
Select commit
Hold shift + click to select a range
c05b9d1
add legacy vlen-utf8 codec
rabernat c86ddc6
Merge branch 'v3' into ryan/legacy-vlen
rabernat a322124
got it working again
rabernat 2a1e2e3
got strings working; broke everything else
rabernat 1d3d7a5
change v3.metadata.data_type type
rabernat cd40b08
merged
rabernat 988f9df
fixed tests
rabernat 507161a
satisfy mypy for tests
rabernat 1ae5e63
make strings work
rabernat 94ecdb5
add missing module
rabernat 2c7d638
Merge branch 'v3' into ryan/legacy-vlen
d-v-b b1717d8
Merge remote-tracking branch 'upstream/v3' into ryan/legacy-vlen
rabernat 79b7d43
store -> storage
rabernat a5c2a37
rename module
rabernat 717f0c7
Merge remote-tracking branch 'origin/ryan/legacy-vlen' into ryan/lega…
rabernat b90d8f3
merged
rabernat 0406ea1
add vlen bytes
rabernat 8e61a18
fix type assertions in test
rabernat 6cf7dde
much better validation of fill value
rabernat 28d58fa
retype parse_fill_value
rabernat c6de878
tests pass but not mypy
rabernat 4f026db
attempted to change parse_fill_value typing
rabernat e427c7a
restore DEFAULT_DTYPE
rabernat 7d9d897
fixup
TomAugspurger 0c21994
docstring
TomAugspurger c12ac41
update test
TomAugspurger 3aeea1e
add better DataType tests
rabernat cae7055
more progress on typing; still not passing mypy
rabernat 1aeb49a
fix typing yay!
rabernat 6714bad
make types work with numpy <, 2
rabernat 2edf3b8
Apply suggestions from code review
rabernat 12a0d65
Apply suggestions from code review
rabernat 7ba7077
apply Joe's suggestions
rabernat 1e828b4
add missing module
rabernat ba0f093
make _STRING_DTYPE private to try to make sphinx happy
rabernat File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
from __future__ import annotations | ||
|
||
from dataclasses import dataclass | ||
from typing import TYPE_CHECKING | ||
|
||
import numpy as np | ||
from numcodecs.vlen import VLenUTF8 | ||
|
||
from zarr.abc.codec import ArrayBytesCodec | ||
from zarr.core.buffer import Buffer, NDBuffer | ||
from zarr.core.common import JSON, parse_named_configuration | ||
from zarr.registry import register_codec | ||
from zarr.strings import cast_to_string_dtype | ||
|
||
if TYPE_CHECKING: | ||
from typing import Self | ||
|
||
from zarr.core.array_spec import ArraySpec | ||
|
||
|
||
# can use a global because there are no parameters | ||
vlen_utf8_codec = VLenUTF8() | ||
|
||
|
||
@dataclass(frozen=True) | ||
class VLenUTF8Codec(ArrayBytesCodec): | ||
@classmethod | ||
def from_dict(cls, data: dict[str, JSON]) -> Self: | ||
_, configuration_parsed = parse_named_configuration( | ||
data, "vlen-utf8", require_configuration=False | ||
) | ||
configuration_parsed = configuration_parsed or {} | ||
return cls(**configuration_parsed) | ||
|
||
def to_dict(self) -> dict[str, JSON]: | ||
return {"name": "vlen-utf8", "configuration": {}} | ||
|
||
def evolve_from_array_spec(self, array_spec: ArraySpec) -> Self: | ||
return self | ||
|
||
async def _decode_single( | ||
self, | ||
chunk_bytes: Buffer, | ||
chunk_spec: ArraySpec, | ||
) -> NDBuffer: | ||
assert isinstance(chunk_bytes, Buffer) | ||
|
||
raw_bytes = chunk_bytes.as_array_like() | ||
decoded = vlen_utf8_codec.decode(raw_bytes) | ||
assert decoded.dtype == np.object_ | ||
decoded.shape = chunk_spec.shape | ||
# coming out of the code, we know this is safe, so don't issue a warning | ||
as_string_dtype = cast_to_string_dtype(decoded, safe=True) | ||
return chunk_spec.prototype.nd_buffer.from_numpy_array(as_string_dtype) | ||
|
||
async def _encode_single( | ||
self, | ||
chunk_array: NDBuffer, | ||
chunk_spec: ArraySpec, | ||
) -> Buffer | None: | ||
assert isinstance(chunk_array, NDBuffer) | ||
return chunk_spec.prototype.buffer.from_bytes( | ||
vlen_utf8_codec.encode(chunk_array.as_numpy_array()) | ||
) | ||
|
||
def compute_encoded_size(self, input_byte_length: int, _chunk_spec: ArraySpec) -> int: | ||
# what is input_byte_length for an object dtype? | ||
raise NotImplementedError("compute_encoded_size is not implemented for VLen codecs") | ||
|
||
|
||
register_codec("vlen-utf8", VLenUTF8Codec) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -313,8 +313,7 @@ class NDBuffer: | |
""" | ||
|
||
def __init__(self, array: NDArrayLike) -> None: | ||
# assert array.ndim > 0 | ||
assert array.dtype != object | ||
# assert array.dtype != object | ||
rabernat marked this conversation as resolved.
Show resolved
Hide resolved
|
||
self._data = array | ||
|
||
@classmethod | ||
|
@@ -467,9 +466,12 @@ def all_equal(self, other: Any, equal_nan: bool = True) -> bool: | |
# Handle None fill_value for Zarr V2 | ||
return False | ||
# use array_equal to obtain equal_nan=True functionality | ||
# Note from Ryan: doesn't this lead to a huge amount of unnecessary memory allocation on every single chunk? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we're OK. In [10]: a, b = np.broadcast_arrays(np.ones(10), 2)
In [11]: b.base
Out[11]: array(2) I think that it just does some tricks with the strides or something?
rabernat marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# Since fill-value is a scalar, isn't there a faster path than allocating a new array for fill value | ||
# every single time we have to write data? | ||
_data, other = np.broadcast_arrays(self._data, other) | ||
return np.array_equal( | ||
self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "US" else False | ||
self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "USTO" else False | ||
TomAugspurger marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) | ||
|
||
def fill(self, value: Any) -> None: | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
from typing import Any | ||
from warnings import warn | ||
|
||
import numpy as np | ||
|
||
try: | ||
STRING_DTYPE = np.dtype("T") | ||
NUMPY_SUPPORTS_VLEN_STRING = True | ||
except TypeError: | ||
STRING_DTYPE = np.dtype("object") | ||
NUMPY_SUPPORTS_VLEN_STRING = False | ||
|
||
|
||
def cast_to_string_dtype( | ||
data: np.ndarray[Any, np.dtype[Any]], safe: bool = False | ||
) -> np.ndarray[Any, np.dtype[Any]]: | ||
if np.issubdtype(data.dtype, np.str_): | ||
return data | ||
if np.issubdtype(data.dtype, np.object_): | ||
if NUMPY_SUPPORTS_VLEN_STRING: | ||
try: | ||
# cast to variable-length string dtype, fail if object contains non-string data | ||
# mypy says "error: Unexpected keyword argument "coerce" for "StringDType" [call-arg]" | ||
return data.astype(np.dtypes.StringDType(coerce=False), copy=False) # type: ignore[call-arg] | ||
except ValueError as e: | ||
raise ValueError("Cannot cast object dtype to string dtype") from e | ||
else: | ||
out = data.astype(np.str_) | ||
if not safe: | ||
warn( | ||
f"Casted object dtype to string dtype {out.dtype}. To avoid this warning, " | ||
"cast the data to a string dtype before passing to Zarr or upgrade to NumPy >= 2.", | ||
stacklevel=2, | ||
) | ||
return out | ||
raise ValueError(f"Cannot cast dtype {data.dtype} to string dtype") |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
from typing import Any | ||
|
||
import numpy as np | ||
import pytest | ||
|
||
from zarr import Array | ||
from zarr.abc.store import Store | ||
from zarr.codecs import VLenUTF8Codec | ||
from zarr.core.metadata.v3 import ArrayV3Metadata, DataType | ||
from zarr.storage.common import StorePath | ||
from zarr.strings import NUMPY_SUPPORTS_VLEN_STRING | ||
|
||
numpy_str_dtypes: list[type | None] = [None, str, np.dtypes.StrDType] | ||
expected_zarr_string_dtype: np.dtype[Any] | ||
if NUMPY_SUPPORTS_VLEN_STRING: | ||
numpy_str_dtypes.append(np.dtypes.StringDType) | ||
expected_zarr_string_dtype = np.dtypes.StringDType() | ||
else: | ||
expected_zarr_string_dtype = np.dtype("O") | ||
|
||
|
||
@pytest.mark.parametrize("store", ["memory", "local"], indirect=["store"]) | ||
@pytest.mark.parametrize("dtype", numpy_str_dtypes) | ||
async def test_vlen_string(store: Store, dtype: None | np.dtype[Any]) -> None: | ||
strings = ["hello", "world", "this", "is", "a", "test"] | ||
data = np.array(strings).reshape((2, 3)) | ||
if dtype is not None: | ||
data = data.astype(dtype) | ||
|
||
sp = StorePath(store, path="string") | ||
a = Array.create( | ||
sp, | ||
shape=data.shape, | ||
chunk_shape=data.shape, | ||
dtype=data.dtype, | ||
fill_value="", | ||
codecs=[VLenUTF8Codec()], | ||
) | ||
assert isinstance(a.metadata, ArrayV3Metadata) # needed for mypy | ||
|
||
a[:, :] = data | ||
assert np.array_equal(data, a[:, :]) | ||
assert a.metadata.data_type == DataType.string | ||
assert a.dtype == expected_zarr_string_dtype | ||
|
||
# test round trip | ||
b = Array.open(sp) | ||
assert isinstance(b.metadata, ArrayV3Metadata) # needed for mypy | ||
assert np.array_equal(data, b[:, :]) | ||
assert b.metadata.data_type == DataType.string | ||
assert a.dtype == expected_zarr_string_dtype |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thought I had while implementing this: often the original numpy array is a fixed-length type (e.g.
<U5
). In V2, this dtype could be stored directly in the metadata, whereas now we are losing the5
, which is potentially useful information.In the future, we may want to try to resurface this information.