Add dunder method for Arrow C Data Interface to DataFrame and Column objects #279

jorisvandenbossche · 2023-10-12T13:06:03Z

The Python Arrow community is adding a public way to interchange data through the C Data Interface, using PyCapsule objects holding the C struct, similary as DLPack's python interface: http://crossbow.voltrondata.com/pr_docs/37797/format/CDataInterface/PyCapsuleInterface.html

We have DLPack support at the Buffer level, and similarly, I think it would be useful to add Arrow support at the DataFrame and Column level.

Concretely, I would propose adding an optional __arrow_c_schema__, __arrow_c_array__ and __arrow_c_stream__ methods to both the DataFrame and Column interchange objects. Those methods would be optional, with their presence indicating that this specific implementation of the interchange object supports the Arrow interface.
Consumers of the interchange protocol could then check for the presence of those methods, and try them first for an easier and faster conversion, and otherwise use the standard APIs through the Column and Buffer objects (example: pyarrow and polars interchanging data).

It might be a bit strange to add both the array and stream interface methods, but that is due to that the interchange protocol hasn't really made a distinction between a single chunk vs a chunked object (#250). But I think the array method could then raise an error if the DataFrame or Column still exists of more than 1 chunk.

This would address #48 but without being tied to a specific library implementation, but solely memory layout.

The text was updated successfully, but these errors were encountered:

kkraus14 · 2023-10-12T17:41:52Z

I think we'd want to use the C Device Data Interface as opposed to the C Data Interface in order to support non-CPU memory as well?

jorisvandenbossche · 2023-10-12T18:00:50Z

The current Python Arrow capsule protocol only supports the C Data Interface and not yet the device version, so for now it would only support CPU memory (libraries with GPU memory would thus not yet add such a method to their interchange object)

kkraus14 · 2023-10-12T19:06:35Z

I think we should just wait until there's a Python protocol for the C Device Data Interface then instead of plumbing in the C Data Interface. That gives a single implementation that could be universally supported as opposed to adding something that downstream consumers will either have to check for existence or check if it throws or something else.

rgommers · 2023-10-26T13:18:32Z

I see the device data interface is still marked as experimental in Arrow (https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html). Is there a timeline for it appearing in PyArrow?

jorisvandenbossche · 2023-11-22T16:10:36Z

Is there a timeline for it appearing in PyArrow?

(for the device interface) Hopefully in the next release (15.0), depending on the progress in apache/arrow#38325)

Some questions here on what the expected return data would be when the data does not match default Arrow types exactly.

I think in general DLPack is expected to be zero-copy (and if that is not possible (because of the data type, the device, or being distributed, ...), you raise an error instead). The question is whether we want to define the same expectation here?

The stream protocol does support chunked data, so this matches the model of the interchange protocol (so if we want zero copy, we might want to add only __arrow_c_stream__ to the DataFrame object, or let __arrow_c_array__ raise an error if the DataFrame has more than 1 chunk.
The interchange protocol supports data type layouts that are not the native ones in Arrow (although everything can be represented). So for each of those data types, we have the choice between returning the closest native Arrow type (but this might not be zero copy) or return the data as-is potentially using an Arrow extension type. Examples:
- Assume you have a float column that uses NaN as missing value sentinel. Converting that zero-copy to Arrow would give an array that has different missing value semantics. One option is to create the mapping validity bitmap on export to Arrow (not fully zero-copy), or otherwise return the data as is (potentially defining an extension type to mark it as "float-with-nan").
- Assume you have a boolean column with byte. Do we convert that to an Arrow bit-based boolean array (non-zero-copy), or do we convert that zero-copy to (u)int8 and use an extension type to indicate this represents bools? (such an extension type might actually also be useful for cudf?)

For the types (native arrow vs extension types), there is a trade-off in performance (keep everything zero-copy, let the consumer decide what conversion they need) vs usability and compatibility (those extension types will not be understood by everyone, at least not initially)

* Add example of a sklearn like pipeline * Review comments * Update comment after discussion in #279 * update --------- Co-authored-by: MarcoGorelli <[email protected]>

rgommers · 2024-01-18T16:24:04Z

The interchange protocol supports data type layouts that are not the native ones in Arrow

This may not be a problem unless the __arrow_* methods are implemented by a library that does not actually use Arrow under the hood. Which probably is not desirable? Last time this was discussed, it was more a shortcut for "can we avoid the overhead of the interchange protocol if both producer and consumer already use Arrow under the hood"? That's also in your issue description above.

jorisvandenbossche added the interchange-protocol label Oct 12, 2023

jorisvandenbossche mentioned this issue Oct 12, 2023

Interchange between two dataframe types which use the same native storage representation #48

Open

cbourjau added a commit to cbourjau/dataframe-api that referenced this issue Oct 25, 2023

Update comment after discussion in data-apis#279

33c39c8

MarcoGorelli added a commit that referenced this issue Dec 7, 2023

Add example of a sklearn like pipeline (#294)

21271f5

* Add example of a sklearn like pipeline * Review comments * Update comment after discussion in #279 * update --------- Co-authored-by: MarcoGorelli <[email protected]>

This was referenced Dec 21, 2023

[Python] Promote usage of the Arrow PyCapsule Protocol (for the C Data Inteface) apache/arrow#39195

Open

Add methods for the Arrow PyCapsule Protocol to DataFrame/Column interchange protocol objects #342

Open

jorisvandenbossche mentioned this issue Jan 12, 2024

Move interchange protocol implementation into a separate project pandas-dev/pandas#56732

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add dunder method for Arrow C Data Interface to DataFrame and Column objects #279

Add dunder method for Arrow C Data Interface to DataFrame and Column objects #279

jorisvandenbossche commented Oct 12, 2023 •

edited

Loading

kkraus14 commented Oct 12, 2023

jorisvandenbossche commented Oct 12, 2023

kkraus14 commented Oct 12, 2023

rgommers commented Oct 26, 2023

jorisvandenbossche commented Nov 22, 2023

rgommers commented Jan 18, 2024

Add dunder method for Arrow C Data Interface to DataFrame and Column objects #279

Add dunder method for Arrow C Data Interface to DataFrame and Column objects #279

Comments

jorisvandenbossche commented Oct 12, 2023 • edited Loading

kkraus14 commented Oct 12, 2023

jorisvandenbossche commented Oct 12, 2023

kkraus14 commented Oct 12, 2023

rgommers commented Oct 26, 2023

jorisvandenbossche commented Nov 22, 2023

rgommers commented Jan 18, 2024

jorisvandenbossche commented Oct 12, 2023 •

edited

Loading