[Python] Public API to consume objects supporting the PyCapsule Arrow C Data Interface #38010
I also just hit an instance where having this would be useful. In particular, I was trying to read an Arrow array with GeoArrow extension metadata but manually persist the field metadata:

```python
import pyarrow as pa

# `data` is any object implementing __arrow_c_array__
schema_capsule, array_capsule = data.__arrow_c_array__()

class SchemaHolder:
    schema_capsule: object

    def __init__(self, schema_capsule) -> None:
        self.schema_capsule = schema_capsule

    def __arrow_c_schema__(self):
        return self.schema_capsule

class ArrayHolder:
    schema_capsule: object
    array_capsule: object

    def __init__(self, schema_capsule, array_capsule) -> None:
        self.schema_capsule = schema_capsule
        self.array_capsule = array_capsule

    def __arrow_c_array__(self, requested_schema=None):
        return self.schema_capsule, self.array_capsule

# Here the pa.field constructor doesn't accept PyCapsule protocol objects
field = pa.field(SchemaHolder(schema_capsule))
array = pa.array(ArrayHolder(field.__arrow_c_schema__(), array_capsule))
schema = pa.schema([field.with_name("geometry")])
table = pa.Table.from_arrays([array], schema=schema)
```

Aside from this, the only way to maintain extension metadata is to ensure that the extension types are registered with pyarrow, which is harder to control because of its global scope.
Yes, we should provide a public way to create a Field object as well (and from there you can also get a DataType). (Short term, I would say it is safe to use the private `_import_from_c_capsule`.) I suppose adding this to `pa.field(..)` would make sense.
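For illustration, a minimal sketch of that short-term route; it relies on the private `_import_from_c_capsule` mentioned above, whose exact signature is assumed here and may change:

```python
import pyarrow as pa

# Sketch of the short-term workaround: import a schema capsule directly as
# a Field via the private API, then get the DataType from the Field.
schema_capsule = pa.field("x", pa.int64()).__arrow_c_schema__()
field = pa.Field._import_from_c_capsule(schema_capsule)
print(field)       # x: int64
print(field.type)  # int64 -- the DataType comes along for free
```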
Closes #425, blocked on apache/arrow#38010 (comment). The main issue is that we need a reliable way to maintain the GeoArrow extension metadata through FFI. The easiest way would be if `pa.field()` were able to accept `__arrow_c_schema__` input. Alternatively, one option is to have a context manager of sorts that registers the pyarrow geoarrow extension types globally and then deregisters them after use.
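For concreteness, a rough sketch of that context-manager idea; the helper itself is hypothetical, though `pa.register_extension_type` and `pa.unregister_extension_type` are pyarrow's actual registration hooks:

```python
import contextlib
import pyarrow as pa

# Hypothetical helper sketching the "context manager of sorts" idea above:
# register extension types globally only for the duration of a block.
@contextlib.contextmanager
def registered_extension_types(*ext_types):
    for t in ext_types:
        pa.register_extension_type(t)
    try:
        yield
    finally:
        for t in ext_types:
            pa.unregister_extension_type(t.extension_name)
```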
…rrow PyCapsule Protocol (#40818)

### Rationale for this change

See #38010 (comment) for more context. Right now, for _consuming_ ArrowSchema-compatible objects that implement the PyCapsule interface, we only have the private `_import_from_c_capsule` (on Schema, Field, DataType), and we check for the protocol in the public `pa.schema(..)`. But that means you currently can only consume objects that represent the schema of a batch (struct type), and not schemas of individual arrays.

### What changes are included in this PR?

Expand the `pa.field(..)` constructor to accept objects implementing the protocol method.

### Are these changes tested?

TODO

* GitHub Issue: #38010

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
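For reference, a minimal sketch of the expanded constructor in action, following the same holder pattern as the `SchemaHolder` shown earlier in the thread (the wrapper class here is illustrative):

```python
import pyarrow as pa

# With this change, pa.field() consumes any object implementing
# __arrow_c_schema__, such as this minimal capsule holder.
class SchemaHolder:
    def __init__(self, schema_capsule):
        self.schema_capsule = schema_capsule

    def __arrow_c_schema__(self):
        return self.schema_capsule

capsule = pa.field("geometry", pa.binary()).__arrow_c_schema__()
field = pa.field(SchemaHolder(capsule))
assert field.name == "geometry" and field.type == pa.binary()
```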
Issue resolved by pull request 40818
#37797 is adding official dunder methods to expose the Arrow C Data/Stream Interface in Python using PyCapsules (#34031 / #35531).
In addition to official dunders to expose this to other libraries, we also need public APIs in pyarrow to import / consume such PyCapsules (or rather the objects implementing the dunders that give you the PyCapsule).
#37797 already added this to the `pa.array(..)`, `pa.record_batch(..)` and `pa.schema(..)` constructors, such that you can, for example, create a pyarrow array with `pa.array(obj)` given any object `obj` that supports the interface by defining `__arrow_c_array__`.
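A minimal, self-contained sketch of that consumption path; the wrapper class is illustrative, and a pyarrow array stands in for a third-party producer:

```python
import pyarrow as pa

# Any object defining __arrow_c_array__ can be consumed by pa.array().
class ArrayWrapper:
    def __init__(self, obj):
        self._obj = obj

    def __arrow_c_array__(self, requested_schema=None):
        return self._obj.__arrow_c_array__(requested_schema)

original = pa.array([1, 2, 3])
roundtripped = pa.array(ArrayWrapper(original))
assert roundtripped.equals(original)
```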
But that's not fully complete: we certainly need a way to construct a `RecordBatchReader` as well, where we don't have such a factory function available. For this, we could add a `from_` function (similar to the existing `from_batches`), like `RecordBatchReader.from_stream`?
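A sketch of how the proposed factory could look in use, assuming it lands under the name `RecordBatchReader.from_stream`; a reader built via the existing `from_batches` stands in for a third-party stream producer:

```python
import pyarrow as pa

# Sketch, assuming the proposed RecordBatchReader.from_stream exists:
# consume any object implementing __arrow_c_stream__.
batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3]})
producer = pa.RecordBatchReader.from_batches(batch.schema, [batch])
reader = pa.RecordBatchReader.from_stream(producer)
print(reader.read_all())
```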
(In addition, there are also the Table, Field and DataType constructors, but those all have factory functions that could support this, similar to `pa.array(..)` et al.)

Secondly, I am also wondering if we want to provide APIs that accept PyCapsules directly, instead of an object that implements the dunders. For example, if you are a library that has data in Arrow-compatible memory and you want to convert it to pyarrow through the C Data Interface, you might want to use a PyCapsule directly if your library doesn't expose a Python class that represents that data. That would avoid the need to create a small wrapper class with just the dunder to pass to the pyarrow constructor (although this is of course not difficult).