ENH: support Arrow PyCapsule Interface on Series for export #59587
@@ -34,6 +34,7 @@
 from pandas._libs.lib import is_range_indexer
 from pandas.compat import PYPY
 from pandas.compat._constants import REF_COUNT
+from pandas.compat._optional import import_optional_dependency
 from pandas.compat.numpy import function as nv
 from pandas.errors import (
     ChainedAssignmentError,
@@ -558,6 +559,39 @@ def _init_dict(

    # ----------------------------------------------------------------------

    def __arrow_c_stream__(self, requested_schema=None):
        """
        Export the pandas Series as an Arrow C stream PyCapsule.

        This relies on pyarrow to convert the pandas Series to the Arrow
        format (and follows the default behaviour of
        ``pyarrow.Array.from_pandas`` in its handling of the index, i.e. to
        ignore it). This conversion is not necessarily zero-copy.

        Parameters
        ----------
        requested_schema : PyCapsule, default None
            The schema to which the Series should be cast, passed as a
            PyCapsule containing a C ArrowSchema representation of the
            requested schema.

        Returns
        -------
        PyCapsule
        """
        pa = import_optional_dependency("pyarrow", min_version="16.0.0")
        if requested_schema is not None:
            # TODO: how should this be supported?
            msg = (
                "Passing `requested_schema` to `Series.__arrow_c_stream__` "
                "is not yet supported"
            )
            raise NotImplementedError(msg)
        ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
[Review comment] The pyarrow types already use a ChunkedArray for storage, right? I think we can short-circuit on that (or, in a larger PR, reassess why we use ChunkedArray for storage).

[Review comment] As Will said, I think we should short-circuit (or special-case) the conversion when the pyarrow array already is a chunked array. Right now, if you have a column using e.g. StringDtype with pyarrow storage, which uses chunked arrays under the hood, the above will apparently concatenate the result, and this conversion will not be zero-copy in a case where you actually expect it to be zero-copy (and this is actually the case which makes us use […]

[Review comment] Something else, the passing of […]
Lines 985 to 986 in bb4ab4f
You can use the same but using […]
        return ca.__arrow_c_stream__()

    # ----------------------------------------------------------------------

    @property
    def _constructor(self) -> type[Series]:
        return Series
@@ -0,0 +1,26 @@ (new test file)
import ctypes

import pytest

import pandas.util._test_decorators as td

import pandas as pd

pa = pytest.importorskip("pyarrow")


@td.skip_if_no("pyarrow", min_version="16.0")
[mroeschke marked this conversation as resolved.]
def test_series_arrow_interface():
    s = pd.Series([1, 4, 2])

    capsule = s.__arrow_c_stream__()
    assert (
        ctypes.pythonapi.PyCapsule_IsValid(
            ctypes.py_object(capsule), b"arrow_array_stream"
        )
        == 1
    )

    ca = pa.chunked_array(s)
    expected = pa.chunked_array([[1, 4, 2]])
    assert ca.equals(expected)
[Review comment] Best to add a test case here specifying the type (to cover the […]

[Review comment] (but then using […]
[Review comment] @kylebarron @jorisvandenbossche @WillAyd @PyCapsuleGang how should this be handled? I was looking at the Polars implementation and there are no tests there where requested_schema is not None.

[Review comment] I think this is fine; I believe we'd have to do a lower-level implementation to unpack the requested_schema capsule anyway.

[Review comment] In a general sense, you can ignore it.
https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#schema-requests
However, in this case you should just delegate to pyarrow's implementation.