Support dataframe protocol. #10452
Comments
Hi @trivialfis. Quick note to say that I'd discourage using the interchange protocol. I've collected some reasons why here: pandas-dev/pandas#56732 (comment). If I may, I'd like to suggest Narwhals and/or the Arrow PyCapsule Interface. This is what several packages (e.g. Altair, Plotly, Vegafusion, Marimo, scikit-lego, Rio, and more) are using, with several others (Bokeh, Prophet, formulaic) considering doing the same. Happy to give this a go if you'd be open to it.
Thank you for sharing, will look into these.
Perhaps I'm missing something, but how come none of these interfaces can return a read-only C pointer for each column?
Going to cc @kylebarron into the conversation.
The Arrow C Data interface returns a pointer that describes a struct-type column, which recursively contains the pointers for all nested columns. That's defined on this page: https://arrow.apache.org/docs/format/CDataInterface.html
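To make the "struct-type column that recursively contains its children" idea concrete, here is a minimal sketch that mirrors the `ArrowArray` struct from the spec page linked above using only `ctypes`. The helper `child_lengths` and the hand-built `table`, `col_a`, and `col_b` values are hypothetical and exist only for illustration; a real consumer would receive the struct from a producer and would also need the matching `ArrowSchema` and a call to `release`.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    """ctypes mirror of `struct ArrowArray` from the Arrow C Data Interface."""


# The release callback takes a pointer to the struct itself.
ReleaseFn = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

# Fields are assigned after the class exists because the struct refers to itself.
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ReleaseFn),
    ("private_data", ctypes.c_void_p),
]


def child_lengths(parent):
    """Hypothetical helper: walk the child pointers of a struct-type array."""
    return [parent.children[i].contents.length for i in range(parent.n_children)]


# Hand-built stand-in for a two-column table (illustration only, no real buffers).
col_a = ArrowArray(length=3, n_buffers=2)
col_b = ArrowArray(length=3, n_buffers=2)
kids = (ctypes.POINTER(ArrowArray) * 2)(ctypes.pointer(col_a), ctypes.pointer(col_b))
table = ArrowArray(
    length=3,
    n_buffers=1,
    n_children=2,
    children=ctypes.cast(kids, ctypes.POINTER(ctypes.POINTER(ArrowArray))),
)

print(child_lengths(table))
```

Each column is reached through `children[i]`, and its data pointers live in that child's own `buffers` array, which is why a single top-level pointer is enough to describe the whole table.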
Thank you for the references! I will look into that. We are hoping to avoid any C dependency (including copied definitions, and no cpython either) and rely on the Python stack for passing data. In addition, we simply concatenate all the chunks in a column and use a NumPy array as the final data view: xgboost/python-package/xgboost/data.py Line 450 in e988b7c
As you can see, it's fragile and delicate in terms of performance and correctness. My wish is something simpler like the numpy |
I think you mean Cython?
I am not sure what you mean by "any C dependency" - do you mean it as in "numpy depends on C" or "does it contain Cython code"? Anyway, Narwhals is just Python. No Cython. It's a unified layer to write dataframe-agnostic code. Perhaps I'm stating something obvious here, but just to be sure. Narwhals' syntax is a subset of the Polars API (just the syntax: it's not using Polars under the hood) and is used to provide maintainers with a way to write dataframe-agnostic code. In other words: if xgboost implemented its data transformation logic with Narwhals, it would work out of the box with Polars, pandas, cuDF, modin, dask... without the maintainers handling the complexity, or having to support requests to use Polars or whatever new dataframe library will be popular in the future. This enables Hope this provided a bit more context! 😊
I meant cpython, we use the
Apologies for the ambiguity. I meant the
Thank you for the context! Yes, it's helpful. When looking into the Arrow interface, I was under the impression that XGBoost should directly consume the Arrow stream or the Arrow C arrays. (I was hoping that something could help streamline the existing code and improve the performance.) But now it's clear that I should continue to use NumPy as the middle layer. The discussion is still quite helpful; I don't mean to complain.
Narwhals is just Python, but at some level you need a "C dependency" to pass ABI-stable C data, right? If you want to use Arrow data for interop, then you need to trust the Arrow project's guarantee that it is actually ABI stable. I would tend to argue that you should use a helper library to receive the Arrow data rather than trying to manage that yourself. Consider pyarrow, nanoarrow, or arro3. Arrow and NumPy do not map 1:1 to each other. Arrow includes lots of structured types and includes a nullability bitmask that NumPy cannot directly use.
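As a rough illustration of the nullability point, the sketch below expands an Arrow-style validity bitmap (one bit per element, least-significant bit first, 1 meaning "valid") into per-element booleans using only the standard library. The function name `unpack_validity` and the sample data are hypothetical; in practice one of the helper libraries mentioned above would do this work.

```python
def unpack_validity(bitmap, length, offset=0):
    """Expand a validity bitmap into one bool per element.

    Bit i lives in byte i // 8 at position i % 8 (LSB first); 1 means valid.
    """
    return [
        bool((bitmap[i // 8] >> (i % 8)) & 1)
        for i in range(offset, offset + length)
    ]


# Five slots where elements 1 and 3 are null: bits (LSB first) are 1, 0, 1, 0, 1.
bitmap = bytes([0b00010101])
values = [10, 0, 30, 0, 50]  # null slots still occupy space in the data buffer

mask = unpack_validity(bitmap, len(values))
decoded = [v if ok else None for v, ok in zip(values, mask)]
print(decoded)
```

Because the mask is packed into bits rather than stored per element, a plain array view over the data buffer alone cannot tell which slots are null; the bitmap has to be unpacked or handled separately.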
https://data-apis.org/dataframe-protocol/latest/index.html