[Python] Implement __arrow_c_stream__ on ChunkedArray #38717

Closed
kylebarron opened this issue Nov 14, 2023 · 4 comments · Fixed by #39455

Comments

@kylebarron
Contributor

Describe the enhancement requested

I'm really excited about the new PyCapsule API.

It doesn't look like any dunder methods have been implemented on the ChunkedArray class. It would seem natural to implement __arrow_c_stream__ on ChunkedArray. Otherwise, other bindings have to go through a pyarrow-specific API to page through each array.
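To make the ask concrete, here is a minimal sketch of how a consuming library might decide what it can import from an arbitrary object. The dunder names and the `requested_schema` parameter come from the PyCapsule interface; the helper `arrow_export_kind` and the three stand-in classes are hypothetical and exist only for illustration. The stand-ins never produce a real capsule.

```python
# Hypothetical helper, not part of pyarrow: report which Arrow PyCapsule
# export method an object advertises, checking the stream protocol first.
def arrow_export_kind(obj):
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"
    if hasattr(obj, "__arrow_c_array__"):
        return "array"
    return "unsupported"


class StreamLike:
    """Stand-in for a producer that exposes the stream protocol."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real producer would return a PyCapsule here.
        raise NotImplementedError("illustrative stand-in only")


class ArrayLike:
    """Stand-in for a producer that exposes the single-array protocol."""

    def __arrow_c_array__(self, requested_schema=None):
        # A real producer would return a pair of PyCapsules here.
        raise NotImplementedError("illustrative stand-in only")


class ChunkedLike:
    """Stand-in for a chunked container with no dunder method at all."""

    def __init__(self, chunks):
        self.chunks = chunks


candidates = (StreamLike(), ArrayLike(), ChunkedLike([ArrayLike()]))
kinds = [arrow_export_kind(obj) for obj in candidates]
```

A container that lands in the `"unsupported"` branch is the case described here: the consumer has to know about its library-specific `chunks` attribute to get at the data.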

Component(s)

Python

@jorisvandenbossche
Member

Indeed, it seems we missed that in the initial PR. I think it makes a lot of sense to add the stream protocol for ChunkedArray.

@wjones127
Member

IIRC, Arrow C++ doesn't support exporting arrays as a C Arrow Stream, so I think that needs to be implemented first.

@jorisvandenbossche
Member

Ah, yes, now I remember. For stream support, we currently assume you have a stream of batches, i.e. some form of RecordBatchReader.

@paleolimbot
Member

Yes, I think this needs C++-level support (but I think it's worth it!). I will try to take a stab at an implementation before 15.0.0...I would like to use it in the R bindings as well.

To summarise where this came up in GeoArrow-land...we'd like to base our Python ecosystem on the C data/stream interfaces (e.g., functions use the dunder methods when consuming input, and return an object that implements a dunder); however, right now we can't consume or return ChunkedArray. We could force a concatenation (maybe expensive and maybe resulting in overflow since one common extension type is based on binary), or we could constantly special-case the ChunkedArray and loop over chunks at the Python level (verbose, and performance degrades when chunks get small/numerous). The ChunkedArray is the most common Array representation in pyarrow (e.g., column in a Table!), so this comes up quite a bit.

@raulcd raulcd modified the milestones: 15.0.0, 16.0.0 Jan 8, 2024
paleolimbot added a commit that referenced this issue Feb 7, 2024
… ArrowArrayStream (#39455)

### Rationale for this change

The `ChunkedArray` has no equivalent in the C data interface; however, it is the primary array structure that higher level bindings interact with (because it is a column in a `Table`). In the Python capsule interface, this means that ChunkedArrays always require a workaround involving loops in Python.

### What changes are included in this PR?

- Added `ImportChunkedArray()` and `ExportChunkedArray()`
- Generalized the classes that support import/export to relax the assumption that every `ArrowArray` in an `ArrowArrayStream` is a `RecordBatch`.

### Are these changes tested?

TODO

### Are there any user-facing changes?

Yes, two new functions are added to bridge.h.
* Closes: #38717

Lead-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…o/from ArrowArrayStream (apache#39455)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…o/from ArrowArrayStream (apache#39455)

thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…o/from ArrowArrayStream (apache#39455)
