[Python] Implement __arrow_c_stream__ on ChunkedArray #38717
Comments
Indeed, it seems we missed that in the initial PR. I think it makes a lot of sense to add the stream protocol for ChunkedArray.
IIRC, Arrow C++ doesn't support exporting arrays as a C Arrow Stream, so I think that needs to be implemented first.
Ah, yes, now I remember. For the stream support, we currently generally assume you have a stream of batches, i.e. some form of RecordBatchReader.
Yes, I think this needs C++-level support (but I think it's worth it!). I will try to take a stab at an implementation before 15.0.0. I would like to use it in the R bindings as well.
To summarise where this came up in GeoArrow-land: we'd like to base our Python ecosystem on the C data/stream interfaces (e.g., functions use the dunder methods when consuming input, and return an object that implements a dunder); however, right now we can't consume or return a ChunkedArray. We could force a concatenation (maybe expensive, and maybe resulting in overflow, since one common extension type is based on binary), or we could constantly special-case the ChunkedArray and loop over chunks at the Python level (verbose, and performance degrades when chunks get small or numerous). The ChunkedArray is the most common Array representation in pyarrow (e.g., a column in a Table!), so this comes up quite a bit.
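To make the special-casing described above concrete, here is a minimal sketch of how a consumer might choose an import path by checking for the PyCapsule dunder methods. It deliberately does not use pyarrow: `pick_import_path`, `ChunkedWithoutStream`, and `ChunkedWithStream` are hypothetical stand-ins invented for illustration, and only the method names `__arrow_c_stream__` and `__arrow_c_array__` come from the Arrow PyCapsule interface.

```python
from typing import Any, List


def pick_import_path(obj: Any) -> str:
    """Hypothetical helper: report which path a consumer would take for obj."""
    if hasattr(obj, "__arrow_c_stream__"):
        # Preferred: the producer can export itself as an ArrowArrayStream.
        return "stream"
    if hasattr(obj, "__arrow_c_array__"):
        # A single contiguous array exported through the C data interface.
        return "array"
    if hasattr(obj, "chunks"):
        # Library-specific fallback: loop over chunks at the Python level.
        return "chunk-loop"
    raise TypeError(f"cannot consume object of type {type(obj).__name__}")


class ChunkedWithoutStream:
    """Stand-in for a chunked container that implements no dunder methods."""

    def __init__(self, chunks: List[List[int]]) -> None:
        self.chunks = chunks


class ChunkedWithStream(ChunkedWithoutStream):
    """Stand-in for the requested behaviour; the export itself is not modelled."""

    def __arrow_c_stream__(self, requested_schema: Any = None) -> Any:
        raise NotImplementedError("illustrative stand-in only")


without_stream = pick_import_path(ChunkedWithoutStream([[1, 2], [3]]))
with_stream = pick_import_path(ChunkedWithStream([[1, 2], [3]]))
```

In this sketch the container without the dunder falls through to the chunk loop, which is the verbose path the comment describes, while the one that defines `__arrow_c_stream__` is picked up by the generic stream branch with no library-specific code.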
… ArrowArrayStream (#39455)

### Rationale for this change

The `ChunkedArray` has no equivalent in the C data interface; however, it is the primary array structure that higher level bindings interact with (because it is a column in a `Table`). In the Python capsule interface, this means that ChunkedArrays always require a workaround involving loops in Python.

### What changes are included in this PR?

- Added `ImportChunkedArray()` and `ExportChunkedArray()`
- Generalized the classes that support import/export to relax the assumption that every `ArrowArray` in an `ArrowArrayStream` is a `RecordBatch`.

### Are these changes tested?

TODO

### Are there any user-facing changes?

Yes, two new functions are added to bridge.h.

* Closes: #38717

Lead-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
Describe the enhancement requested
I'm really excited about the new PyCapsule API.
It doesn't look like any dunder methods have been implemented on the `ChunkedArray` class. It would seem natural to implement `__arrow_c_stream__` on `ChunkedArray`. Otherwise, other bindings have to go through a pyarrow-specific API to page through each array.
Component(s)
Python