Move interchange protocol implementation into a separate project #56732
Comments
Sure, no objections to removing it from the pandas repo. I could take it into https://github.com/data-apis/dataframe-api-compat and maintain it, if people are OK with that. There should be enough projects in the community and people in the consortium interested in this that we can keep it alive 💪 I think there's only a handful of repos* currently using it anyway (plotly, seaborn, vscode), so it'd be pretty easy to manually make PRs to all of them; I could take this on too. I think that scikit-learn and altair wouldn't need to change anything:
|
Would we still have some sort of testing burden for the interchange protocol even after it's removed? |
there could be just a simple test, like there is for the api standard (which also defers to `pandas/tests/test_downstream.py`, lines 307 to 323 at commit 8fb8b9f)
|
How often does the issue pop up where tying the fixes to a release is problematic? I think that is actually preferable; syncing this across two projects seems more problematic to me.
This is true today, although I've toyed around with moving part of the implementation down to Cython to have the pandas buffers adhere to the Python buffer protocol. I don't think that is possible in a pure Python package. See #55671 |
plotly currently does this:

```python
if hasattr(args["data_frame"], "__dataframe__") and version.parse(
    pd.__version__
) >= version.parse("2.0.2"):
    import pandas.api.interchange

    df_not_pandas = args["data_frame"]
    args["data_frame"] = df_not_pandas.__dataframe__()
    # According to the interchange protocol: `def column_names(self) -> Iterable[str]:`
    # so this function can return for example a generator.
    # The easiest way is to convert `columns` to `pandas.Index` so that the
    # type is similar to the types in other code branches.
    columns = pd.Index(args["data_frame"].column_names())
    needs_interchanging = True
elif hasattr(args["data_frame"], "to_pandas"):
    args["data_frame"] = args["data_frame"].to_pandas()
    columns = args["data_frame"].columns
```

and other issues have come up since. I think it would've been cleaner to just set a minimum version on dataframe-api-compat, and handle different pandas versions in there.
OK yeah, if there's something like that which ties it to living in pandas, then it probably shouldn't be removed, thanks for bringing this up |
IIUC the PandasBuffer (which seems like the thing that would get a Cython buffer implementation mixed into it) takes an ndarray in. Really, the reason to split this out is: "This shouldn't be our problem." |
@WillAyd could you please expand on why it would be beneficial for the buffers to adhere to the Python buffer protocol? Furthermore, would it be possible to take that out into a non-pure-Python package, or would it still need to live in pandas? |
The advantage of adhering to the Python buffer protocol is that it allows other packages to access the raw data of a dataframe purely from Python. It doesn't necessarily need to live in pandas, but you would end up duplicating a lot of the same things that we already have in pandas (ex: build system, CI) just in another package. |
Thanks! Just for my understanding, would this require that other libraries also adhere to the Python buffer protocol in order to be useful? |
No, other libraries would not need to do anything. An object that adheres to the buffer protocol can have its raw bytes accessed through standard Python objects. If you look at the tests added in the above PR you can see how that works with a memoryview. |
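To illustrate what "adhering to the buffer protocol" buys in concrete terms (this sketch uses the stdlib `array` type, which already implements the protocol; it is not pandas code):

```python
from array import array

# Any object implementing the Python buffer protocol can expose its raw
# memory to pure-Python consumers via memoryview, with no copies.
a = array("d", [1.0, 2.0, 3.0])
mv = memoryview(a)

print(mv.format)    # "d" -> C double
print(mv.itemsize)  # 8 bytes per element
print(mv.nbytes)    # 24 bytes total
print(mv.tolist())  # [1.0, 2.0, 3.0]
```

If pandas buffers implemented the same protocol, a consumer could inspect a column's raw bytes the same way, without importing pandas-specific accessors.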
Sure, but I mean: would consumers be able to consume a column's raw bytes through standard Python objects in a library-agnostic manner if other libraries don't change anything? For example, I can currently write

```python
if hasattr(df, '__dataframe__'):
    df_xch = df.__dataframe__()
    schema = {
        name: col.get_buffers()['data'][1][0]
        for (name, col) in zip(df_xch.column_names(), df_xch.get_columns())
    }
```

and know that it will give me the schema of any dataframe which implements the interchange protocol. If pandas buffers were to adhere to the Python buffer protocol, then IIUC you'd also be able to do, say, |
Yea I don't think it replaces or changes any parts of the dataframe protocol; I think of it more as a lightweight convenience that a third party can use to perform smaller tasks or aid in debugging a larger implementation of the dataframe protocol |
In theory it could be brought up to the DataFrame Interchange Protocol spec to add Python buffer protocol support as a general requirement, but not sure all implementations would want to do that (for pyarrow it would be easy, since our main Buffer object already implements the buffer protocol). If we move it out, would the idea be that it becomes a required dependency of pandas, and not optional? (I assume so, otherwise consumers cannot have the guarantee calling |
it could still be optional, anyone using it (e.g. plotly, seaborn) would need to make it required |
But in general, we want to allow projects to be able to accept a pandas object, without depending on pandas. |
If a library already depends on pandas and does:

```python
if isinstance(df, pd.DataFrame):
    pass
elif hasattr(df, '__dataframe__'):
    df = pd.api.interchange.from_dataframe(df)
```

then instead of just having pandas as a dependency, they would need to also have the interchange protocol package as a dependency. Users need not change anything.

Hypothetically* speaking, if a library does not already depend on pandas, does not have a special path for pandas, and does, say:

```python
if hasattr(df, '__dataframe__'):
    df = pyarrow.interchange.from_dataframe(df)
```

then they could either:
*I don't think any such libraries currently exist |
On the call today @jorisvandenbossche brought up the C data interface that pyarrow is pushing for, which looks further along in development and would probably make the interchange protocol obsolete. If that materializes I would definitely be in favor of deprecating / removing this. I guess the only theoretical advantage this would have over the Arrow approach is that we would have the flexibility to serialize non-Arrow types, but I'm not sure if that is really useful in practice |
I'm confused, doesn't the dataframe interchange protocol use the Arrow C Data Interface? If you inspect the third element of the `dtype` tuple:

```python
In [1]: df = pd.DataFrame({'a': [pd.Timestamp('2020-01-01', tz='Asia/Kathmandu')]})

In [2]: df
Out[2]:
                          a
0 2020-01-01 00:00:00+05:45

In [3]: df.__dataframe__().get_column_by_name('a').dtype
Out[3]: (<DtypeKind.DATETIME: 22>, 64, 'tsn:Asia/Kathmandu', '=')
```

you can see that the format string follows the Arrow C Data Interface notation.
I'm probably missing something, but could you please elaborate on how it would make the dataframe interchange protocol obsolete? |
The C Data Interface documentation includes a comparison. There is definitely a lot of overlap. The difference would really come down to how important we think it is to exchange non-Arrow types. |
True, but of course it's a bit the point of the interchange protocol to allow this more easily.
While there is certainly overlap, it doesn't make it fully obsolete, there are definitely still use case for both. In fact, I actually also proposed to add the Arrow protocol to the DataFrame Interchange protocol: data-apis/dataframe-api#279 / data-apis/dataframe-api#342.
Indeed, that's the main disadvantage. For example, if you have one dataframe library that implements missing values with a sentinel or byte mask, and interchanges data with another dataframe library that uses the same representation, then passing through Arrow incurs some extra overhead (because it only has one way of representing missing values), while going through the columns/buffers of the interchange protocol allows you to get exactly the data as they are represented by the producing library.
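As a concrete sketch of the extra conversion step described above (illustrative code, not pandas internals): a producer that represents validity as a one-byte-per-value mask would have to repack it into Arrow's bit-packed validity bitmap before handing data over via Arrow, whereas the interchange protocol could pass the byte mask through unchanged.

```python
def bytemask_to_bitmap(mask):
    """Pack a byte mask (True = valid) into an Arrow-style validity
    bitmap: one bit per value, least-significant bit first."""
    out = bytearray((len(mask) + 7) // 8)
    for i, valid in enumerate(mask):
        if valid:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# [valid, null, valid, valid] -> bits 0, 2, 3 set -> 0b00001101
print(bytemask_to_bitmap([True, False, True, True]))  # b'\r'
```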
The dataframe interchange protocol uses the type format string notation of the Arrow C Data Interface as a way to have a more detailed type description than the generic enum (the first element of the `dtype` tuple). To mention it here as well: I have a PR to add the Arrow C Data Interface (PyCapsule protocol) to pandas: #56587 |
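To make the format-string notation concrete, here is a small sketch (the helper name is mine, not a library API) of parsing a timestamp format string like the `'tsn:Asia/Kathmandu'` seen earlier; per the Arrow C Data Interface spec these strings are `ts` + a unit character + `:` + an optional timezone:

```python
_TS_UNITS = {"s": "second", "m": "millisecond", "u": "microsecond", "n": "nanosecond"}


def parse_timestamp_format(fmt):
    """Parse an Arrow C Data Interface timestamp format string,
    e.g. 'tsn:Asia/Kathmandu' -> ('nanosecond', 'Asia/Kathmandu')."""
    if not (fmt.startswith("ts") and len(fmt) >= 4 and fmt[3] == ":"):
        raise ValueError(f"not a timestamp format string: {fmt!r}")
    unit = _TS_UNITS[fmt[2]]
    tz = fmt[4:] or None  # an empty timezone means timezone-naive
    return unit, tz


print(parse_timestamp_format("tsn:Asia/Kathmandu"))
# ('nanosecond', 'Asia/Kathmandu')
```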
Thanks all for comments (and ensuing conversations), interesting stuff
Indeed! I do think it is nice to be able to do `pandas_df = pd.api.interchange.from_dataframe(some_other_df)` and to know that it will work, so long as the input implements the interchange protocol. If I've understood Joris' proposal in #56587, then this would allow the Python API to be kept, but have it use (where possible) the Arrow C Data Interface. This way, plotly / seaborn / altair could keep saying "if you input something which isn't pandas/pyarrow, we can agnostically convert to them". |
The main open issues have been addressed. There are only 2 left, and they're very minor: I have a PR open for one, and will open one for the other next week (already got it working locally). It's really not that bad to deal with IMO. At this point, given that the interchange protocol already has users, and that splitting it out would require those users to take on an extra dependency, I think it may not be too bad to just keep it? |
Do we have an inventory of projects that are still using this? I think there has been a huge shift in the ecosystem to the PyArrow Capsule interface in the months since we last had this conversation. Maybe worth deprecating and pushing more people towards the PyArrow method going forward? |
As far as I'm aware:
As far as I'm aware, the only library still using it is Seaborn, but even for them it causes issues because there's no list type support mwaskom/seaborn#3533 In addition:
Will summarised his experience with it in https://willayd.com/leveraging-the-arrow-c-data-interface.html, and explains why he switched over to the PyCapsule interface (despite initially having been positive about the interchange protocol) If we can move Seaborn over, I think we could start the deprecation process |
News: cuDF have said that they'll deprecate support for the protocol: rapidsai/cudf#17282 (comment) Any objections to following suit? |
Some of us talked a little bit about this when we met in Basel. The interchange protocol was added to pandas a while ago, but we believe that it would be beneficial to move it out of the main pandas repo into a separate repository.
The main issue is that bugfixes to the interchange protocol are currently tied to the pandas release cycle, which means that users might have to pin the pandas version if they run into a bug that they care about. Having it as a separate dependency decouples the releases, which makes the protocol much more responsive to fixes, and any necessary pins would no longer be tied to the main pandas repository.
The package would be a pure, lightweight Python dependency, which makes it easier for others to contribute, because its test suite is not weighed down by the full pandas test suite.
The new optional dependency is extremely lightweight: it brings in no additional dependencies and is only around 30 kB in size, so it really doesn't matter much.
Those are really small downsides for improved interoperability and the decoupling of release cycles for both components.
cc @pandas-dev/pandas-core