-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Implement DataFrame interchange protocol #46141
Conversation
Hello @vnlitvinov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-04-24 10:03:07 UTC |
cc @jreback for preliminary feedback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didnt look in detail but some top-level organizational comments
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@jreback @jbrockmendel I've responded to your comments, but I suggest to refrain from re-reading the PR just yet - I'm in the middle of improving it yet further, and I'll make a comment when it's again ready for reviewing. Thanks again for your feedback! |
Okay, I think logic-wise it's ready to be reviewed. I still need to make CI happy about code style etc., but I don't expect a lot of changes for that. |
Please remove the |
This PR finally passes the code checks (and new functionality passes newly added tests which cover at least the basic usage of the API) on my end, so I'm marking this PR as "ready for review". |
98bfab4
to
a681598
Compare
CI failures look like flaky tests, not related to my changes. So I consider this PR ready for reviewing, ping @jbrockmendel @jreback |
c_arrow_dtype_f_str, | ||
"=", | ||
) | ||
elif is_string_dtype(dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a dataframe's column has object dtype, is_string_dtype returns True and the flow goes into this branch. Since the spec doesn't have a requirement to support object dtype, should we raise TypeError exception when calling df.__dataframe()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good question, I think I'll stub it with a NotImplementedError
for now - I think it fits better than TypeError
as there is no error on user side, but a missing spec entry...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have performed a little more research and I'm no longer sure how can I properly check for the thing being str
vs being something more complex except checking all entries in a column via isinstance()
which feels wrong... adding a TODO
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we check isinstance(df.dtype, object)
in df.__dataframe() and if it is True, then throw NotImplementedError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because strings are usually stored as objects, don't they?.. this would effectively block strings altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check be put in df.dataframe() to get an error before playing around with the dataframe implementing the protocol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vasily's answer:
I cannot find this comment somewhere where I can write an answer, so I'm going to type it as a general comment.
I think this check should be delayed as much as possible because it's potentially scanning all the items in the column, which is a heavy operation while a user might just be needing some small amount of information (or might be wanting to get some particular column but not this string/object one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what if the user play around with a df for a long time, which has a column with object dtype, not touching df.dtype, and only after a while gets the error. I think that is a controversial question. I would like to hear other opinions on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The protocol is mostly for exchanging the dataframe between certain libraries, not for some user to play around with.
I'm imagining the use case like "someone wants to plot some graphs for a few columns of a dataframe backed by library X, so they request matplotlib to show a graph; matplotlib then imports the dataframe using the protocol and shows the requested columns, it doesn't care about other columns or anything else". In this case it would be harmful to the end user of the scenario to check if any column could be represented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally looks good a few small comments
Signed-off-by: Vasily Litvinov <[email protected]>
Signed-off-by: Vasily Litvinov <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks fine for me overall, left couple of minor comments
@vnlitvinov, I see no answers to my questions above. Please take a look at them. |
Signed-off-by: Vasily Litvinov <[email protected]>
Signed-off-by: Vasily Litvinov <[email protected]>
@YarShev I hope I've answered all of them now, I'm sorry I've somehow missed that you've added more responses to initial review. @jorisvandenbossche should I rename subpackage |
Signed-off-by: Vasily Litvinov <[email protected]>
Signed-off-by: Vasily Litvinov <[email protected]>
Signed-off-by: Vasily Litvinov <[email protected]>
c_arrow_dtype_f_str, | ||
"=", | ||
) | ||
elif is_string_dtype(dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check be put in df.dataframe() to get an error before playing around with the dataframe implementing the protocol?
I cannot find this comment somewhere where I can write an answer, so I'm going to type it as a general comment. I think this check should be delayed as much as possible because it's potentially scanning all the items in the column, which is a heavy operation while a user might just be needing some small amount of information (or might be wanting to get some particular column but not this string/object one). |
Signed-off-by: Vasily Litvinov <[email protected]>
This is about handling string and object dtype. Let's continue the discussion there (link.) |
Signed-off-by: Vasily Litvinov <[email protected]>
Signed-off-by: Vasily Litvinov <[email protected]>
should there be tests that the protocol is round-trippable? e.g.
for some/most of possible dfs? (e.g. empty, various types), if they have a non-range index they should raise? what about non-string columns names? can certainly do this in another PR as well. |
There already are a few: pandas/pandas/tests/exchange/test_impl.py Lines 53 to 68 in cc94e57
and pandas/pandas/tests/exchange/test_impl.py Lines 71 to 90 in cc94e57
Maybe I should extend the second one and take a subset of pandas DataFrame using same indices and compare it with the one obtained via protocol... |
I would rather make it in a separate PR, as this one is already big... |
no for sure, pls create a todo issue (and PRs)! thanks for all of this @vnlitvinov and @YarShev for all the review! |
Do note that this PR is currently work-in-progress, mostly to facilitate the discussion on how the implementation should be going.
It also vendors the exchange spec and exchange tests, which aren't yet merged at the consortium, so I'll keep updating the vendored copies as the discussion goes there.
More tests are also to be added, as well as the implementations of some cases (a lot of non-central cases are
NotImplemented
now, as I've built this upon the prototype.doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.