Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fall back to to_pandas if interchange protocol fails #3534

Closed
wants to merge 3 commits into from

Conversation

MarcoGorelli
Copy link
Contributor

@MarcoGorelli MarcoGorelli commented Oct 19, 2023

Closes #3533

@codecov
Copy link

codecov bot commented Oct 19, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.34%. Comparing base (2c2c84e) to head (02c142f).
Report is 46 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #3534   +/-   ##
=======================================
  Coverage   98.34%   98.34%           
=======================================
  Files          75       75           
  Lines       24605    24617   +12     
=======================================
+ Hits        24197    24209   +12     
  Misses        408      408           
Files with missing lines Coverage Δ
seaborn/_core/data.py 97.50% <100.00%> (+0.08%) ⬆️
tests/_core/test_data.py 100.00% <100.00%> (ø)

@mwaskom
Copy link
Owner

mwaskom commented Oct 20, 2023

I've got no issues with this but it would be helpful to understand better when it's relevant. Is the idea that a library might have bugs in the hooks that frame_dataframecalls but not in to_pandas? Do we often expect those to go through very different code paths?

seaborn/_core/data.py Outdated Show resolved Hide resolved
seaborn/_core/data.py Outdated Show resolved Hide resolved
@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented Oct 21, 2023

I've got no issues with this

Great, thanks!

Is the idea that a library might have bugs in the hooks that frame_dataframe calls but not in to_pandas?

A library can control its own to_pandas method, but for pd.api.interchange.from_dataframe, they're tied down to whatever decision pandas (and the dataframe consortium) make.

In particular, if there's a datatype not supported by the consortium, then interchanging to pandas will raise regardless of whether that column will be used or not. Doesn't look like there's any real prospect of this changing data-apis/dataframe-api#288

maybe I can get an extra argument in to pandas (like, from_dataframe(..., ignore_unsupported_columns: bool))

@MarcoGorelli MarcoGorelli marked this pull request as ready for review October 21, 2023 15:32
@mwaskom
Copy link
Owner

mwaskom commented Oct 21, 2023

Thanks for explaining. TBH as an observer (but interested party) these nuances make it fairly confusing to understand what the "right" way to generically consume a dataframe object is. Not sure that that's really actionable feedback (and based on the linked thread it looks like you agree). So this feels slightly hacky but I think reasonable given the circumstances.

@MarcoGorelli
Copy link
Contributor Author

I think the ideal way would be something like this:

https://github.com/plotly/plotly.py/blob/57e4d1d33c67c5cc715bec5c3c240dd6f4c3b10d/packages/python/plotly/plotly/express/_core.py#L1428-L1431

                columns = list(necessary_columns)
                df = pd.api.interchange.from_dataframe(
                    df.__dataframe__().select_columns_by_name(columns)
                )

but I understand that knowing in advance exactly which columns you need might not be possible without a major refactor

@mwaskom
Copy link
Owner

mwaskom commented Oct 22, 2023

I gather that a relevant example looks like this:

import polars as pl
pol_df = pl.DataFrame({
    "a": pl.Series([1, 2, 3, 4]),
    "b": pl.Series([[1, 2], [3], [4, 5, 6], []]),
})

# Works, "b" column is object-typed
pol_df.to_pandas()

# Raises with "large_list<item: int64> not supported by interchange protocol"
pd.api.interchange.from_dataframe(pol_df)

(Actually, first-class list-type support would be pretty handy for seaborn 🤔...)

This makes sense to me as a current source of disagreement that we should work around.

seaborn/_core/data.py Outdated Show resolved Hide resolved
Also removed the check for the to_pandas method since we don't need it given that we're passing on any errors.
@krz
Copy link

krz commented Jan 15, 2024

I'm really hoping for a fix, because this is an annoying problem I sometimes encounter.
I'm an R convert who loves both polars and seaborn.objects

For example, this works without to_pandas()

import seaborn.objects as so
import polars as pl

mpg = pl.read_csv("https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/mpg.csv")

(
    so.Plot(mpg, x="displ", y="hwy")
    .add(so.Dot())
)

However, this does not work:

economics = pl.read_csv("https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/economics.csv", try_parse_dates=True, dtypes={"pop": pl.Float32})

(
    so.Plot(economics, x="date", y="uempmed")
    .add(so.Path())
)

to_pandas() works:

(
    so.Plot(economics.to_pandas(), x="date", y="uempmed")
    .add(so.Path())
)

@MarcoGorelli
Copy link
Contributor Author

hi @mwaskom - just for my understanding, is there anything holding this back? or will it go in with the 0.14 release?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Try to_pandas rather than erroring if interchanging to pandas doesn't work?
3 participants