ArrowRecordBatchCodec and vlen string support #2031
base: main
Conversation
This experiment also suggests another interesting possibility: returning Arrow Arrays and Tables from a Zarr Array or Group. If the Zarr Arrays are all 1D, they can be represented as Arrow Arrays all the way through, and there are potentially opportunities to reduce memory copies. We could have ArrowBuffer / ArrowArrayBuffer types.
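For illustration, a minimal sketch of that possibility using plain numpy and pyarrow, with no Zarr machinery; whether the conversion is actually zero-copy depends on the dtype and on the absence of nulls:

```python
import numpy as np
import pyarrow as pa

# A 1D numpy array of a primitive dtype can usually be wrapped as an
# Arrow array without copying the underlying buffer.
x = np.arange(1_000_000, dtype="float64")
arrow_x = pa.array(x)

# Several 1D arrays (e.g. the members of a Zarr group) can then be
# assembled into an Arrow Table, reusing the same buffers.
table = pa.table({"x": arrow_x, "y": pa.array(np.arange(1_000_000))})
print(table.schema)
```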
This sounds really interesting and potentially very powerful @rabernat! Would you mind commenting on the implications of a pyarrow dependency?
I feel like it is becoming as ubiquitous as numpy in the ecosystem, so I don't consider this a major blocker. Or it could be an optional dependency if you want to read data encoded this way. But I'd be curious to hear opinions on that.
There's lots of feedback in pandas-dev/pandas#54466 on pandas adopting pyarrow as a required dependency. The primary concern raised is the size of the package, especially in serverless contexts (though it seems like AWS Lambda has some built-in support that makes this less of an issue?). There's some work being done in pyarrow to make core pieces available without having to bring in everything.
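If pyarrow stays optional, a minimal sketch of the usual import-guard pattern; the class name is taken from this PR's title, while the structure and error message are illustrative:

```python
# Defer the hard requirement: the import may fail, but the codec only
# raises when someone actually tries to use it without pyarrow installed.
try:
    import pyarrow as pa
except ImportError:
    pa = None


class ArrowRecordBatchCodec:
    """Codec that needs pyarrow only when actually instantiated."""

    def __init__(self) -> None:
        if pa is None:
            raise ImportError(
                "pyarrow is required for ArrowRecordBatchCodec; "
                "install it with `pip install pyarrow`"
            )
```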
```diff
@@ -109,6 +109,7 @@ dependencies = [
     "universal_pathlib"
 ]
 extra-dependencies = [
+    "pyarrow",
```
```python
    chunk_spec: ArraySpec,
) -> Buffer | None:
    assert isinstance(chunk_array, NDBuffer)
    arrow_array = pa.array(chunk_array.as_ndarray_like().ravel())
```
Should probably be `chunk_array.as_numpy_array()` here, since `pa.array()` doesn't recognize CuPy arrays? Would be good to add a GPU test here for safety.
In theory, it would be possible to do zero-copy transfers for CuPy arrays too, but that would need to go from CuPy -> Numba first and then Numba -> Arrow.
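A self-contained sketch of that suggestion, using a hypothetical `chunk_to_arrow` helper rather than the PR's actual codec method; the `.get()` check is a crude CuPy heuristic, not the zarr `NDBuffer` API:

```python
import numpy as np
import pyarrow as pa

def chunk_to_arrow(chunk) -> pa.Array:
    # pa.array() only understands host (numpy) data, so copy device
    # arrays back to the host first; CuPy arrays expose .get() for this
    # (numpy arrays have no .get attribute, so they pass through untouched)
    if hasattr(chunk, "get"):
        chunk = chunk.get()  # device -> host copy
    return pa.array(np.asarray(chunk).ravel())
```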
The discussion in zarr-developers/zeps#47 got me thinking: what if, instead of turning numpy arrays into bytes, we turned them into self-describing Arrow Record Batches and serialized them using the Arrow IPC format?
This would be a new type of Array -> Bytes codec. The beautiful thing about this is that it gives us variable-length string encoding for free (as well as potentially many other benefits) -- xref zarr-developers/zarr-specs#83.
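A minimal sketch of the roundtrip such a codec could perform with plain pyarrow IPC; the `encode`/`decode` names and the single `"chunk"` column are illustrative, not this PR's actual API, and reshaping back to the chunk's shape is omitted:

```python
import numpy as np
import pyarrow as pa

def encode(chunk: np.ndarray) -> bytes:
    # wrap the flattened chunk in a single-column, self-describing record batch
    batch = pa.RecordBatch.from_arrays([pa.array(chunk.ravel())], names=["chunk"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue().to_pybytes()

def decode(data: bytes) -> np.ndarray:
    reader = pa.ipc.open_stream(data)
    batch = reader.read_next_batch()
    # zero_copy_only=False lets this work for variable-length strings too
    return batch.column(0).to_numpy(zero_copy_only=False)

# variable-length strings roundtrip with no extra codec machinery
strings = np.array(["a", "bb", "ccc"], dtype=object)
assert (decode(encode(strings)) == strings).all()
```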
This PR is a proof of concept that this is feasible and in fact very easy.
There is a lot more to explore here, but I thought I would just throw this up for discussion.
TODO: