
Support dataframe protocol. #10452

Open
trivialfis opened this issue Jun 19, 2024 · 9 comments

Comments

@trivialfis
Member

https://data-apis.org/dataframe-protocol/latest/index.html

@MarcoGorelli

Hi @trivialfis

Quick note to say that I'd discourage using the interchange protocol - I've collected some reasons why here: pandas-dev/pandas#56732 (comment)

If I may, I'd like to suggest Narwhals and/or the Arrow PyCapsule Interface. This is what several packages (e.g. Altair, Plotly, Vegafusion, Marimo, scikit-lego, Rio, and more) are using, with several others (Bokeh, Prophet, formulaic) considering doing the same.

Happy to give this a go if you'd be open to it

@trivialfis
Member Author

Thank you for sharing, will look into these.

@trivialfis
Member Author

trivialfis commented Nov 23, 2024

Perhaps I'm missing something, but how is it that none of these interfaces can return a read-only C pointer for each column?

@MarcoGorelli

Going to cc @kylebarron into the conversation

@kylebarron

The Arrow C Data interface returns a pointer that describes a struct-type column, which recursively contains the pointers for all nested columns. That's defined on this page: https://arrow.apache.org/docs/format/CDataInterface.html

@trivialfis
Member Author

trivialfis commented Nov 23, 2024

Thank you for the references! I will look into that.

We are hoping to avoid any C dependency (including copied struct definitions; no CPython C API either) and rely on the Python stack for passing data. In addition, we simply concatenate all the chunks in a column and use a NumPy array as the final data view.

def pandas_pa_type(ser: Any) -> np.ndarray:

As you can see, it's fragile and delicate in terms of performance and correctness. My wish is something simpler, like numpy's `__array_interface__`, for each column (plus an additional one for the mask if needed). It has no C dependency, and we could simply serialize it as a JSON document and pass it around across languages.
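To make that wish concrete, here is a minimal sketch using only numpy and the stdlib (nothing XGBoost-specific): `__array_interface__` already exposes a read-only address per array, and everything except the raw pointer is JSON-serializable metadata.

```python
import ctypes
import json

import numpy as np

col = np.arange(4, dtype=np.float64)
col.setflags(write=False)  # make the view read-only

iface = col.__array_interface__
ptr, readonly = iface["data"]  # (address, read-only flag)

# Everything but the raw pointer is plain metadata that serializes to JSON.
meta = json.dumps({k: v for k, v in iface.items() if k != "data"})

# The address can be reinterpreted as a C double* with ctypes alone,
# i.e. no C extension and no Python C API involved.
c_ptr = ctypes.cast(ptr, ctypes.POINTER(ctypes.c_double))
print(readonly, [c_ptr[i] for i in range(4)])  # True [0.0, 1.0, 2.0, 3.0]
```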

@baggiponte

baggiponte commented Nov 25, 2024

> and no cpython either

I think you mean Cython?

> We are hoping to avoid any c dependency

I am not sure what you mean by "any C dependency" - do you mean it as in "numpy depends on C" or "does it contain Cython code"? Anyway, Narwhals is just Python. No Cython. It's a unified layer to write dataframe-agnostic code.


Perhaps I'm stating something obvious here, but just to be sure. Narwhals' syntax is a subset of the Polars API (just the syntax: it's not using Polars under the hood) and is used to provide maintainers with a way to write dataframe-agnostic code. In other words: if xgboost implemented its data transformation logic with Narwhals, it would work out of the box with Polars, pandas, cuDF, modin, dask... without the maintainers handling the complexity, or having to support requests to use Polars or whatever new dataframe library will be popular in the future.

This enables any-dataframe-in -> same-dataframe-out transformations: if the user passes a pandas dataframe, pandas will be used to do the transformation; if the user goes with Polars, the Polars engine will do it. Then, of course, at the end of your data transformation pipeline you can always cast everything into a (collection of) numpy array(s) for the good ol' model.fit().

Hope this provided a bit more context! 😊

@trivialfis
Member Author

> I think you mean Cython?

I meant CPython: we use the ctypes Python module for foreign function calls. Using PyCapsule (from Arrow) requires the Python C API.

> I am not sure what you mean by "any C dependency"

Apologies for the ambiguity. I meant the ArrowSchema C struct. Using it implies we would need either to copy the definition of this struct and its helper functions into XGBoost (this project) and hope that the ABI is indeed stable, or to include the Arrow C package as a dependency.
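For what it's worth, the struct in question is small enough that a pure-Python ctypes mirror is possible; a sketch transcribed from the C Data Interface spec (field layout should be double-checked against the spec before relying on it):

```python
import ctypes

class ArrowSchema(ctypes.Structure):
    """ctypes mirror of the ArrowSchema struct from the Arrow C Data Interface."""

# Assigned after the class definition so the struct can reference itself.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),    # e.g. b"g" for float64
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_void_p),  # binary key/value blob, so not c_char_p
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.c_void_p)),
    ("private_data", ctypes.c_void_p),
]

# 9 pointer-sized fields -> 72 bytes on a 64-bit platform.
print(ctypes.sizeof(ArrowSchema))
```

Whether pinning this layout in-tree is acceptable is exactly the ABI-stability bet described above; helper libraries like nanoarrow exist to take that bet for you.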

> Hope this provided a bit more context! 😊

Thank you for the context! Yes, it's helpful. When looking into the Arrow interface, I was under the impression that XGBoost should directly consume the Arrow stream or the Arrow C arrays. (I was hoping something could help streamline the existing code and improve performance.) But now it's clear that I should continue to use numpy as the middle layer. It's still quite helpful; this isn't meant as a complaint.

@kylebarron

Narwhals is just Python, but at some level you need a "C dependency" to pass ABI-stable C data, right?

If you want to use Arrow data for interop, then you need to trust the Arrow project's guarantee that it is actually ABI stable.

I would tend to argue that you should use a helper library to receive the Arrow data rather than trying to manage that yourself. Consider pyarrow, nanoarrow, or arro3.

Arrow and NumPy do not map 1:1 to each other. Arrow includes many structured types and a nullability bitmask that NumPy cannot directly use.
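A small illustration of the nullability point in plain numpy (the bit layout follows the Arrow spec's LSB-ordered validity bitmap):

```python
import numpy as np

# Arrow validity bitmaps are bit-packed, least-significant bit first:
# bit i of byte i // 8 is 1 when element i is valid, 0 when it is null.
# NumPy has no packed-mask concept, so a consumer has to expand it into
# a byte-per-element boolean mask (and pick a sentinel for the nulls).
validity = np.array([0b00001101], dtype=np.uint8)  # elems 0, 2, 3 valid
mask = np.unpackbits(validity, bitorder="little")[:4].astype(bool)
print(mask.tolist())  # [True, False, True, True]
```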
