ENH: Implement DataFrame interchange protocol #46141

vnlitvinov · 2022-02-24T18:01:38Z

Note that this PR is currently a work in progress, opened mostly to facilitate discussion of how the implementation should proceed.

It also vendors the exchange spec and exchange tests, which aren't yet merged at the consortium, so I'll keep updating the vendored copies as the discussion there continues.

More tests are still to be added, as well as implementations of some cases (a lot of non-central cases are NotImplemented for now, as I've built this upon the prototype).

closes #xxxx (Replace xxxx with the Github issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

pep8speaks · 2022-02-24T18:01:43Z

Hello @vnlitvinov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-04-24 10:03:07 UTC

vnlitvinov · 2022-02-24T18:02:14Z

cc @jreback for preliminary feedback

pandas/core/frame.py

jreback

didn't look in detail, but some top-level organizational comments

pandas/api/exchange/dataframe_protocol.py

pandas/api/exchange/implementation.py

pandas/tests/api/conftest.py

pandas/tests/api/test_protocol.py

github-actions · 2022-03-30T00:04:46Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

vnlitvinov · 2022-03-30T15:53:13Z

@jreback @jbrockmendel I've responded to your comments, but I suggest holding off on re-reading the PR just yet: I'm in the middle of improving it further, and I'll comment when it's ready for review again.

Thanks again for your feedback!

vnlitvinov · 2022-03-31T16:05:20Z

Okay, I think logic-wise it's ready to be reviewed.

I still need to make CI happy about code style etc., but I don't expect a lot of changes for that.

vnlitvinov · 2022-03-31T16:05:47Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Please remove the stale label

vnlitvinov · 2022-03-31T17:15:28Z

This PR finally passes the code checks (and new functionality passes newly added tests which cover at least the basic usage of the API) on my end, so I'm marking this PR as "ready for review".

vnlitvinov · 2022-04-01T09:20:41Z

CI failures look like flaky tests, not related to my changes.

So I consider this PR ready for reviewing, ping @jbrockmendel @jreback

pandas/tests/exchange/test_impl.py

YarShev · 2022-04-01T13:22:25Z

pandas/core/exchange/column.py

+                c_arrow_dtype_f_str,
+                "=",
+            )
+        elif is_string_dtype(dtype):


If a dataframe's column has object dtype, is_string_dtype returns True and the flow goes into this branch. Since the spec doesn't require support for object dtype, should we raise a TypeError when calling df.__dataframe__()?

That's a very good question, I think I'll stub it with a NotImplementedError for now - I think it fits better than TypeError as there is no error on user side, but a missing spec entry...

I have done a little more research and I'm no longer sure how I can properly check whether a column holds str values rather than something more complex, other than checking every entry in the column via isinstance(), which feels wrong... adding a TODO instead.

Why don't we check isinstance(df.dtype, object) in df.__dataframe__() and, if it is True, throw NotImplementedError?

because strings are usually stored as objects, aren't they?.. this would effectively block strings altogether.

Should this check be put in df.__dataframe__() to get an error before playing around with the dataframe implementing the protocol?

Vasily's answer:

I cannot find this comment somewhere where I can write an answer, so I'm going to type it as a general comment.

I think this check should be delayed as much as possible because it's potentially scanning all the items in the column, which is a heavy operation while a user might just be needing some small amount of information (or might be wanting to get some particular column but not this string/object one).

But what if the user plays around for a long time with a df that has an object-dtype column, never touching df.dtype, and only gets the error after a while? I think that is a controversial question. I would like to hear other opinions on this.

The protocol is mostly for exchanging the dataframe between certain libraries, not for some user to play around with.

I'm imagining the use case like "someone wants to plot some graphs for a few columns of a dataframe backed by library X, so they request matplotlib to show a graph; matplotlib then imports the dataframe using the protocol and shows the requested columns, it doesn't care about other columns or anything else". In this case it would be harmful to the end user of the scenario to check if any column could be represented.

cc @jreback, @jbrockmendel, @rgommers, @dchigarev
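The trade-off discussed in this thread (validate eagerly in df.__dataframe__() versus only when a column's data is requested) can be illustrated with a toy sketch. It does not use pandas or the vendored spec: LazyFrame, LazyColumn and their methods are hypothetical names that only loosely mirror the protocol, and the isinstance() scan stands in for whatever check the implementation ends up using.

```python
from typing import Dict, List, Sequence


class LazyColumn:
    """Hypothetical column wrapper; not part of pandas or the spec."""

    def __init__(self, name: str, values: Sequence[object]) -> None:
        self.name = name
        self._values = values  # stored as-is, nothing is validated yet

    def get_buffers(self) -> List[bytes]:
        # The O(n) scan runs only when a consumer asks for this column's data.
        if not all(isinstance(v, str) for v in self._values):
            raise NotImplementedError(
                f"column {self.name!r} holds non-string objects"
            )
        return [v.encode("utf-8") for v in self._values]


class LazyFrame:
    """Hypothetical frame wrapper that never inspects column contents."""

    def __init__(self, data: Dict[str, Sequence[object]]) -> None:
        self._data = data

    def column_names(self) -> List[str]:
        return list(self._data)

    def get_column_by_name(self, name: str) -> LazyColumn:
        return LazyColumn(name, self._data[name])


frame = LazyFrame({"label": ["a", "b"], "mixed": ["a", 1, None]})

# Building the frame and reading the pure-string column both succeed.
label_buffers = frame.get_column_by_name("label").get_buffers()

# Only touching the mixed-object column triggers the error.
try:
    frame.get_column_by_name("mixed").get_buffers()
    mixed_failed = False
except NotImplementedError:
    mixed_failed = True
```

With this layout a consumer that only needs the "label" column never pays for scanning "mixed", which is the matplotlib-style scenario described above; the cost is that the error surfaces late for anyone who does reach the unsupported column.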

pandas/core/exchange/column.py

pandas/core/exchange/buffer.py

pandas/core/exchange/column.py

pandas/core/exchange/utils.py

jreback

generally looks good, a few small comments

pandas/core/exchange/buffer.py

pandas/core/exchange/column.py

pandas/core/exchange/dataframe_protocol.py

pandas/core/exchange/dataframe.py

pandas/core/exchange/column.py

dchigarev

The implementation looks fine to me overall; I left a couple of minor comments

pandas/core/exchange/from_dataframe.py

pandas/tests/exchange/test_impl.py

YarShev · 2022-04-19T11:46:05Z

@vnlitvinov, I see no answers to my questions above. Please take a look at them.

vnlitvinov · 2022-04-20T13:52:54Z

@YarShev I hope I've answered all of them now, I'm sorry I've somehow missed that you've added more responses to initial review.

@jorisvandenbossche should I rename subpackage pandas.core.exchange to pandas.core.interchange to align with new PR title?..

pandas/tests/exchange/test_utils.py

pandas/tests/exchange/test_spec_conformance.py

pandas/core/exchange/from_dataframe.py

pandas/core/exchange/dataframe.py

YarShev · 2022-04-22T17:44:11Z

pandas/core/exchange/column.py

+                c_arrow_dtype_f_str,
+                "=",
+            )
+        elif is_string_dtype(dtype):


Should this check be put in df.__dataframe__() to get an error before playing around with the dataframe implementing the protocol?

vnlitvinov · 2022-04-23T07:35:27Z

@YarShev

Should this check be put in df.__dataframe__() to get an error before playing around with the dataframe implementing the protocol?

I cannot find this comment somewhere where I can write an answer, so I'm going to type it as a general comment.

I think this check should be delayed as much as possible because it's potentially scanning all the items in the column, which is a heavy operation while a user might just be needing some small amount of information (or might be wanting to get some particular column but not this string/object one).

YarShev · 2022-04-24T09:24:57Z

@YarShev

Should this check be put in df.__dataframe__() to get an error before playing around with the dataframe implementing the protocol?

I cannot find this comment somewhere where I can write an answer, so I'm going to type it as a general comment.

I think this check should be delayed as much as possible because it's potentially scanning all the items in the column, which is a heavy operation while a user might just be needing some small amount of information (or might be wanting to get some particular column but not this string/object one).

This is about handling string and object dtype. Let's continue the discussion there (link.)

jreback · 2022-04-26T01:25:30Z

should there be tests that the protocol is round-trippable? e.g.

tm.assert_frame_equal(df, pd.api.exchange.from_dataframe(df.__dataframe__()))

for some/most of possible dfs? (e.g. empty, various types), if they have a non-range index they should raise? what about non-string columns names?

can certainly do this in another PR as well.
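A minimal sketch of such a round-trip check, under a few assumptions: it imports from pandas.api.interchange (the name the module was released under in pandas 1.5, whereas this thread still uses pandas.api.exchange), it uses the public pd.testing.assert_frame_equal in place of the internal tm helper, and it only covers integer and float columns with a default RangeIndex. The frames are made up for illustration.

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

# Small illustrative frames; only numeric dtypes and the default index.
frames = {
    "ints": pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}),
    "floats": pd.DataFrame({"x": [0.5, 1.5, 2.5]}),
}

round_tripped = {}
for label, df in frames.items():
    # Export through the protocol, then rebuild a pandas DataFrame from it.
    result = from_dataframe(df.__dataframe__())
    pd.testing.assert_frame_equal(df, result)
    round_tripped[label] = result
```

Empty frames, non-range indexes and non-string column names are deliberately left out here, since how they should behave is exactly the open question raised in this comment.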

vnlitvinov · 2022-04-26T08:53:05Z

There already are a few:

pandas/pandas/tests/exchange/test_impl.py

Lines 53 to 68 in cc94e57

    
    @pytest.mark.parametrize("data", [("ordered", True), ("unordered", False)])
    def test_categorical_dtype(data):
        df = pd.DataFrame({"A": (test_data_categorical[data[0]])})
        col = df.__dataframe__().get_column_by_name("A")
        assert col.dtype[0] == DtypeKind.CATEGORICAL
        assert col.null_count == 0
        assert col.describe_null == (ColumnNullType.USE_SENTINEL, -1)
        assert col.num_chunks() == 1
        assert col.describe_categorical == {
            "is_ordered": data[1],
            "is_dictionary": True,
            "mapping": {0: "a", 1: "d", 2: "e", 3: "s", 4: "t"},
        }
        tm.assert_frame_equal(df, from_dataframe(df.__dataframe__()))

and

pandas/pandas/tests/exchange/test_impl.py

Lines 71 to 90 in cc94e57

    
    @pytest.mark.parametrize(
        "data", [int_data, uint_data, float_data, bool_data, datetime_data]
    )
    def test_dataframe(data):
        df = pd.DataFrame(data)
        df2 = df.__dataframe__()
        assert df2.num_columns() == NCOLS
        assert df2.num_rows() == NROWS
        assert list(df2.column_names()) == list(data.keys())
        indices = (0, 2)
        names = tuple(list(data.keys())[idx] for idx in indices)
        tm.assert_frame_equal(
            from_dataframe(df2.select_columns(indices)),
            from_dataframe(df2.select_columns_by_name(names)),
        )

Maybe I should extend the second one and take a subset of pandas DataFrame using same indices and compare it with the one obtained via protocol...
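The extension floated here could look roughly like the sketch below: take the same positional subset directly in pandas with iloc and compare it with the frame rebuilt through the protocol. It assumes the released module name pandas.api.interchange and the public pd.testing.assert_frame_equal; the frame and indices are made up for illustration.

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": [3, 4]})
indices = [0, 2]

# Subset selected on the interchange object, then converted back to pandas.
via_protocol = from_dataframe(df.__dataframe__().select_columns(indices))

# The same subset taken directly from the original pandas DataFrame.
via_pandas = df.iloc[:, indices]

pd.testing.assert_frame_equal(via_protocol, via_pandas)
```

Unlike comparing select_columns against select_columns_by_name, this would also catch a bug that affects both protocol paths in the same way.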

vnlitvinov · 2022-04-26T08:54:43Z

if they have a non-range index they should raise? what about non-string columns names?

can certainly do this in another PR as well.

I would rather make it in a separate PR, as this one is already big...

jreback · 2022-04-27T12:47:56Z

if they have a non-range index they should raise? what about non-string columns names?
can certainly do this in another PR as well.

I would rather make it in a separate PR, as this one is already big...

no for sure, pls create a todo issue (and PRs)!

thanks for all of this @vnlitvinov and @YarShev for all the review!

jbrockmendel reviewed Feb 26, 2022

pandas/core/frame.py (outdated, resolved)

jbrockmendel reviewed Feb 26, 2022

pandas/core/frame.py (outdated, resolved)

jreback added the Compat pandas objects compatability with Numpy or Python functions label Feb 27, 2022

jreback requested changes Feb 27, 2022

dchigarev mentioned this pull request Mar 21, 2022

Add more tests for the dataframe interchange protocol data-apis/dataframe-api#75

Closed

github-actions bot added the Stale label Mar 30, 2022

vnlitvinov force-pushed the df-xchg branch from c2ffc92 to 2a0c4ea Compare March 31, 2022 16:02

jbrockmendel removed the Stale label Mar 31, 2022

vnlitvinov changed the title ~~[WIP] DataFrame exchange protocol~~ ENH: Implement DataFrame exchange protocol Mar 31, 2022

vnlitvinov marked this pull request as ready for review March 31, 2022 17:35

vnlitvinov force-pushed the df-xchg branch 4 times, most recently from 98bfab4 to a681598 Compare March 31, 2022 21:12

vnlitvinov mentioned this pull request Apr 1, 2022

Declare enums explicitly, fix type hints data-apis/dataframe-api#74

Merged

YarShev reviewed Apr 1, 2022

jbrockmendel reviewed Apr 4, 2022

pandas/core/exchange/column.py (outdated, resolved)

jbrockmendel reviewed Apr 4, 2022

pandas/core/exchange/utils.py (outdated, resolved)

jreback requested changes Apr 14, 2022

pandas/core/exchange/buffer.py (outdated, resolved)

pandas/core/exchange/column.py (resolved)

pandas/core/exchange/dataframe_protocol.py (resolved)

jorisvandenbossche changed the title ~~ENH: Implement DataFrame exchange protocol~~ ENH: Implement DataFrame interchange protocol Apr 14, 2022

dchigarev reviewed Apr 14, 2022

Vendor smoke tests from consortium

ac58967

Signed-off-by: Vasily Litvinov <[email protected]>

Fix tests broken by .column_names change

5d98ebf

Signed-off-by: Vasily Litvinov <[email protected]>

dchigarev reviewed Apr 19, 2022

pandas/core/exchange/from_dataframe.py (outdated, resolved)

pandas/core/exchange/from_dataframe.py (outdated, resolved)

pandas/tests/exchange/test_impl.py (outdated, resolved)

vnlitvinov added 2 commits April 20, 2022 16:36

Add tests for datetime dtype

60379e5

Signed-off-by: Vasily Litvinov <[email protected]>

Fix from_dataframe docstring

497ca24

Signed-off-by: Vasily Litvinov <[email protected]>

vnlitvinov added 3 commits April 21, 2022 18:06

Add tests for uint dtype

39f5a5c

Signed-off-by: Vasily Litvinov <[email protected]>

Handle string dtype better

d73558a

Signed-off-by: Vasily Litvinov <[email protected]>

Add test for mixed object dtype

4ed35bf

Signed-off-by: Vasily Litvinov <[email protected]>

YarShev reviewed Apr 22, 2022

Rename spec test for clarity

2fca3c0

Signed-off-by: Vasily Litvinov <[email protected]>

vnlitvinov added 2 commits April 24, 2022 12:52

Add missing test cases in test_dtype_to_arrow_c_fmt

f030d9f

Signed-off-by: Vasily Litvinov <[email protected]>

Add comments explaing magic dtype numbers

cc94e57

Signed-off-by: Vasily Litvinov <[email protected]>

jreback added this to the 1.5 milestone Apr 26, 2022

jreback approved these changes Apr 27, 2022

jreback merged commit 90140f0 into pandas-dev:main Apr 27, 2022

This was referenced May 27, 2022

Feature request: Protocol for converting something to a pandas DataFrame #30218

Open

SLEP018 Pandas output for transformers with set_output scikit-learn/enhancement_proposals#68

Merged

honno mentioned this pull request Jun 15, 2022

[BUG-REPORT] Arrow columns in interchange dataframes have erroneous null behaviour vaexio/vaex#2083

Open

cnpryer mentioned this pull request Jun 18, 2022

WIP: Implement DataFrame Interchange Protocol pola-rs/polars#3727

Closed

3 tasks

This was referenced Jun 22, 2022

[BUG-REPORT] Interchange Column.size returns 0d arrays as opposed to Python int vaexio/vaex#2093

Closed

[BUG-REPORT] Dataframes with no columns raise errors for various operations vaexio/vaex#2094

Open

honno mentioned this pull request Jul 5, 2022

[BUG-REPORT] describe_categorical in interchange columns is a tuple, not a dict vaexio/vaex#2113

Closed

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

ENH: Implement DataFrame interchange protocol (pandas-dev#46141)

4dbcfb4
