ENH: support Arrow PyCapsule Interface on Series for export #59587

Merged · 3 commits · Aug 26, 2024
Changes from 1 commit
doc/source/whatsnew/v3.0.0.rst (1 addition, 0 deletions)

@@ -43,6 +43,7 @@ Other enhancements
- Users can globally disable any ``PerformanceWarning`` by setting the option ``mode.performance_warnings`` to ``False`` (:issue:`56920`)
- :meth:`Styler.format_index_names` can now be used to format the index and column names (:issue:`48936` and :issue:`47489`)
- :class:`.errors.DtypeWarning` improved to include column names when mixed data types are detected (:issue:`58174`)
- :class:`Series` now supports the Arrow PyCapsule Interface for export (:issue:`59518`)
- :func:`DataFrame.to_excel` argument ``merge_cells`` now accepts a value of ``"columns"`` to only merge :class:`MultiIndex` column header cells (:issue:`35384`)
- :meth:`DataFrame.corrwith` now accepts ``min_periods`` as optional arguments, as in :meth:`DataFrame.corr` and :meth:`Series.corr` (:issue:`9490`)
- :meth:`DataFrame.cummin`, :meth:`DataFrame.cummax`, :meth:`DataFrame.cumprod` and :meth:`DataFrame.cumsum` methods now have a ``numeric_only`` parameter (:issue:`53072`)
pandas/core/series.py (34 additions, 0 deletions)

@@ -34,6 +34,7 @@
from pandas._libs.lib import is_range_indexer
from pandas.compat import PYPY
from pandas.compat._constants import REF_COUNT
from pandas.compat._optional import import_optional_dependency
from pandas.compat.numpy import function as nv
from pandas.errors import (
ChainedAssignmentError,
@@ -558,6 +559,39 @@ def _init_dict(

# ----------------------------------------------------------------------

def __arrow_c_stream__(self, requested_schema=None):
"""
Export the pandas Series as an Arrow C stream PyCapsule.

This relies on pyarrow to convert the pandas Series to the Arrow
format (and follows the default behaviour of ``pyarrow.Array.from_pandas``
in its handling of the index, i.e. to ignore it).
This conversion is not necessarily zero-copy.

Parameters
----------
        requested_schema : PyCapsule, default None
            The schema to which the Series should be cast, passed as a
            PyCapsule containing a C ArrowSchema representation of the
            requested schema.

Returns
-------
PyCapsule
"""
pa = import_optional_dependency("pyarrow", min_version="16.0.0")
if requested_schema is not None:
# todo: how should this be supported?
msg = (
"Passing `requested_schema` to `Series.__arrow_c_stream__` is not yet "
"supported"
)
raise NotImplementedError(msg)
Member (Author) commented:
@kylebarron @jorisvandenbossche @WillAyd @PyCapsuleGang how should this be handled? I was looking at the Polars implementation and there's no tests there where requested_schema is not None

Member commented:

I think this is fine; I believe we'd have to do a lower level implementation to unpack the requested_schema capsule anyway

Contributor (@kylebarron, Aug 23, 2024) commented:

In a general sense, you can ignore it.

The callee should attempt to provide the data in the requested schema. However, if the callee cannot provide the data in the requested schema, they may return with the same schema as if None were passed to requested_schema.

https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#schema-requests

However, in this case you should just delegate to pyarrow's implementation

        ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
        return ca.__arrow_c_stream__(requested_schema)

ca = pa.chunked_array([pa.Array.from_pandas(self, type=requested_schema)])
Member commented:

The pyarrow types already use a ChunkedArray for storage, right? I think we can short-circuit on that (or, in a larger PR, reassess why we use ChunkedArray for storage)

Member commented:

As Will said, I think we should short-circuit (or special case it) for when the pyarrow array already is a chunked array.

Right now, if you have a column using e.g. StringDtype with pyarrow storage, which uses chunked arrays under the hood, the above will apparently concatenate the chunks, and the conversion will not be zero-copy in a case where you would actually expect it to be (and this is actually the case that makes us use __arrow_c_stream__ instead of __arrow_c_array__ in the first place)

Member commented:

Something else, the passing of requested_schema is not going to work like this, I think. See how I first converted it to a pyarrow object before passing it on:

pandas/core/frame.py, lines 985-986 (at bb4ab4f):

    if requested_schema is not None:
        requested_schema = pa.Schema._import_from_c_capsule(requested_schema)

You can use the same but using pa.DataType instead of pa.Schema

return ca.__arrow_c_stream__()

# ----------------------------------------------------------------------

@property
def _constructor(self) -> type[Series]:
return Series
pandas/tests/series/test_arrow_interface.py (26 additions, 0 deletions)

@@ -0,0 +1,26 @@
import ctypes

import pytest

import pandas.util._test_decorators as td

import pandas as pd

pa = pytest.importorskip("pyarrow")


@td.skip_if_no("pyarrow", min_version="16.0")
def test_series_arrow_interface():
s = pd.Series([1, 4, 2])

capsule = s.__arrow_c_stream__()
assert (
ctypes.pythonapi.PyCapsule_IsValid(
ctypes.py_object(capsule), b"arrow_array_stream"
)
== 1
)

ca = pa.chunked_array(s)
expected = pa.chunked_array([[1, 4, 2]])
assert ca.equals(expected)
Member commented:

Best to add a test case here specifying the type (to cover the requested_schema part). Something like:

arr = pa.array(s, type=pa.int32())
expected = pa.array([1, 4, 2], pa.int32())
assert arr.equals(expected)

Member commented:

(but then using chunked_array() instead of array(), because array(...) doesn't actually work if we only define __arrow_c_stream__ and not __arrow_c_array__)
