[Python] Public API to consume objects supporting the PyCapsule Arrow C Data Interface #38010

jorisvandenbossche · 2023-10-04T11:10:46Z

#37797 is adding official dunder methods to expose the Arrow C Data/Stream Interface in Python using PyCapsules (#34031 / #35531).

In addition to official dunders to expose this to other libraries, we also need public APIs in pyarrow to import / consume such PyCapsules (or rather the objects implementing the dunders to give you the PyCapsule).
#37797 already added this to the pa.array(..), pa.record_batch(..) and pa.schema(..) constructors, such that you can for example create a pyarrow array with pa.array(obj) given any object obj that supports the interface by defining __arrow_c_array__.

But that's not fully complete: we certainly need a way to construct a RecordBatchReader as well, where we don't have such a factory function available. For this, we could add a from_ function (similar to the existing from_batches) like RecordBatchReader.from_stream?

[Python] RecordBatchReader constructor from stream object implementing the PyCapsule Protocol #39217

(in addition there are also the Table, Field and DataType constructors, but those all have factory functions that could support this, similar to pa.array(..) et al)

Secondly, I am also wondering if we want to provide APIs that accept PyCapsules directly, instead of an object that implements the dunders. For example, if you are a library that has data in Arrow-compatible memory and you want to convert this to pyarrow through the C Data Interface, you might want to use a PyCapsule directly if your library doesn't expose a Python class that represents that data. This would avoid having to create a small wrapper class with just the dunder to pass to the pyarrow constructor (although this is of course not difficult).


kylebarron · 2024-03-20T20:51:57Z

I also just hit an instance where having the pa.field constructor consume these objects would be helpful.

In particular, I was trying to read an arrow array with GeoArrow extension metadata but manually persist the field metadata:

import pyarrow as pa

# `data` is any object implementing __arrow_c_array__
schema_capsule, array_capsule = data.__arrow_c_array__()

class SchemaHolder:
    schema_capsule: object

    def __init__(self, schema_capsule) -> None:
        self.schema_capsule = schema_capsule

    def __arrow_c_schema__(self):
        return self.schema_capsule

class ArrayHolder:
    schema_capsule: object
    array_capsule: object

    def __init__(self, schema_capsule, array_capsule) -> None:
        self.schema_capsule = schema_capsule
        self.array_capsule = array_capsule

    def __arrow_c_array__(self, requested_schema=None):
        return self.schema_capsule, self.array_capsule

# Here the pa.field constructor doesn't accept pycapsule objects
field = pa.field(SchemaHolder(schema_capsule))
array = pa.array(ArrayHolder(field.__arrow_c_schema__(), array_capsule))
schema = pa.schema([field.with_name("geometry")])
table = pa.Table.from_arrays([array], schema=schema)

Aside from this, the only way to maintain extension metadata is to ensure that the extension types are registered with pyarrow, which is harder to control because of its global scope.

jorisvandenbossche · 2024-03-25T14:47:04Z

Yes, we should provide a public way to create a Field object as well (and from there you can also get a DataType).

(short term, I would say it is safe to use pa.Field._import_from_c_capsule, if you check that the method is available)

I suppose adding this to pa.field(..) would be the easiest, although signature-wise it's not a great addition, given that right now this constructor always requires both a name and a type argument.

Closes #425, blocked on apache/arrow#38010 (comment). The main issue is that we need a reliable way to maintain the geoarrow extension metadata through FFI. The easiest way would be if `pa.field()` were able to support `__arrow_c_schema__` input. Or alternatively, one option is to have a context manager of sorts to register global pyarrow geoarrow extension arrays, and then deregister them after use.


…rrow PyCapsule Protocol (#40818)

### Rationale for this change

See #38010 (comment) for more context. Right now for _consuming_ ArrowSchema-compatible objects that implement the PyCapsule interface, we only have the private `_import_from_c_capsule` (on Schema, Field, DataType) and we check for the protocol in the public `pa.schema(..)`. But that means you currently can only consume objects that represent the schema of a batch (struct type), and not schemas of individual arrays.

### What changes are included in this PR?

Expand the `pa.field(..)` constructor to accept objects implementing the protocol method.

### Are these changes tested?

TODO

* GitHub Issue: #38010

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>

jorisvandenbossche · 2024-04-15T08:22:36Z

Issue resolved by pull request 40818
#40818


jorisvandenbossche added the Component: Python label Oct 4, 2023

jorisvandenbossche mentioned this issue Oct 4, 2023

GH-35531: [Python] C Data Interface PyCapsule Protocol #37797

Merged

jorisvandenbossche mentioned this issue Oct 18, 2023

[Python] Add Python protocol for the Arrow C (Data/Stream) Interface #35531

Closed

kylebarron mentioned this issue Nov 15, 2023

Implement Arrow PyCapsule Interface apache/arrow-rs#5070

Merged

jorisvandenbossche added this to the 15.0.0 milestone Nov 24, 2023

jorisvandenbossche mentioned this issue Nov 24, 2023

python/adbc_driver_manager: use PyCapsule for handles to C structs apache/arrow-adbc#70

Closed

raulcd modified the milestones: 15.0.0, 16.0.0 Jan 8, 2024

This was referenced Jan 18, 2024

ENH: support the Arrow PyCapsule Interface on pandas.DataFrame (export) pandas-dev/pandas#56587

Merged

[Python] Promote usage of the Arrow PyCapsule Protocol (for the C Data Inteface) #39195

Open

timkpaine mentioned this issue Feb 23, 2024

WIP move to pycapsule timkpaine/arrow-cpp-python-nocopy#3

Merged

kylebarron mentioned this issue Mar 20, 2024

Support geoarrow array input into viz() developmentseed/lonboard#427

Merged

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Mar 27, 2024

apacheGH-38010: [Python] Construct pyarrow.Field from Arrow-compatibl…

6cf35cb

…e schema object (PyCapsule protocol)

jorisvandenbossche mentioned this issue Mar 27, 2024

GH-38010: [Python] Construct pyarrow.Field and ChunkedArray through Arrow PyCapsule Protocol #40818

Merged

github-actions bot assigned jorisvandenbossche Mar 27, 2024

raulcd modified the milestones: 16.0.0, 17.0.0 Apr 8, 2024

jorisvandenbossche modified the milestones: 17.0.0, 16.0.0 Apr 15, 2024

jorisvandenbossche closed this as completed Apr 15, 2024
