[BUG] pd.api.interchange.from_dataframe fails with simple cuDF dataframe #17282

Open
MarcoGorelli opened this issue Nov 8, 2024 · 8 comments
Labels: Python (Affects Python cuDF API), wontfix (This will not be worked on)

Comments

@MarcoGorelli
Contributor

Describe the bug
pd.api.interchange.from_dataframe fails with simple cuDF dataframe

Steps/Code to reproduce bug

import cudf
import pandas as pd
df = cudf.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
pd.api.interchange.from_dataframe(df)

This crashes the session: https://colab.research.google.com/drive/1QXtKPcKQONi1g8WY9lI6FPZFhik_VVYg?usp=sharing

Expected behavior
It should convert to a pandas DataFrame with the same data.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of cuDF install: [conda, Docker, or from source]
    • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

colab notebook


@brandon-b-miller
Contributor

Thanks for reporting! I'm looking into this.

@MarcoGorelli
Contributor Author

Thanks!

Note that the PyCapsule Interface works perfectly well here, and outputs

pyarrow.Table
a: int64
b: int64
----
a: [[1,2,3]]
b: [[4,5,6]]
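
A minimal sketch of the PyCapsule route mentioned above, for reference. It assumes the object exposes `__arrow_c_stream__` and that the installed pyarrow is recent enough for `pyarrow.table` to consume such objects. The helper name `to_arrow_table` is hypothetical and is not part of cuDF, pandas, or pyarrow.

```python
def to_arrow_table(obj):
    """Convert obj to a pyarrow.Table through the Arrow PyCapsule Interface.

    Hypothetical helper for illustration only.
    """
    # Fail early with a readable message if the producer does not
    # implement the stream export method of the PyCapsule Interface.
    if not hasattr(obj, "__arrow_c_stream__"):
        raise TypeError(
            f"{type(obj).__name__} does not implement __arrow_c_stream__"
        )

    # Imported lazily so the capability check above has no dependencies.
    import pyarrow as pa

    # Assumption: this pyarrow version accepts PyCapsule producers here.
    return pa.table(obj)
```

With the dataframe from the report, `to_arrow_table(df)` would be expected to print the table shown above, though that was not re-run here.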

@brandon-b-miller
Contributor

The technical reason this happens is that pandas tries to construct dataframe columns around buffers that point to GPU data, which triggers a segfault. Calling the __dataframe__ method on a cuDF dataframe returns device columns.

The underlying reason, though, is that no standard spec says whose responsibility it is to move the data into the current memory space (CPU in this case). This needs a wider discussion about what to do in situations like these.

Hopefully you are able to use to_pandas() as a workaround here for now.

cc @mroeschke @vyasr @quasiben
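
The `to_pandas()` workaround can be wrapped so that calling code does not need to know which library produced the dataframe. This is only a sketch: the helper name `to_host_pandas` is hypothetical, and the fallback branch assumes the input is backed by host memory, since it goes through the same interchange path that crashes for GPU data.

```python
def to_host_pandas(df):
    """Return a pandas DataFrame, preferring the producer's own to_pandas().

    Hypothetical helper for illustration only.
    """
    to_pandas = getattr(df, "to_pandas", None)
    if callable(to_pandas):
        # cuDF copies device data to host inside to_pandas(), so pandas
        # never sees a GPU pointer.
        return to_pandas()

    # Fallback: only safe when the producer keeps its data in CPU memory.
    import pandas as pd

    return pd.api.interchange.from_dataframe(df)
```

For the dataframe in the report, this takes the first branch and is equivalent to calling `df.to_pandas()` directly.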

@wence-
Contributor

wence- commented Nov 19, 2024

I do not think this is a cudf bug. cudf delivers an object that obeys the dataframe protocol. However, as noted, the protocol is silent on whose responsibility it is to do cross-memory-region copies. Pandas should probably inspect the __dlpack_device__() enum tag for the location of the data and fail sensibly if the memory in the interchange protocol isn't on CPU.

I note that the pandas implementation deliberately constructs numpy arrays from the raw pointers (rather than going through dlpack, which was not supported at the time). dlpack is now supported in numpy, and if you use that instead you at least get a useful error message from numpy: RuntimeError: Unsupported device in DLTensor.

@MarcoGorelli: I don't think we can fix this here without deciding to eagerly copy everything to host (which we really don't want to do). Is this a bug report because you really want this to work, or to point out that the interchange protocol doesn't handle an obvious use case?
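
The device check suggested above could look roughly like the following on the consumer side. This is a sketch, not what pandas does: the function name `assert_host_buffers` is hypothetical, only the top-level columns are inspected (chunks are ignored), and the enum lists just a few of the device types defined by the interchange protocol.

```python
from enum import IntEnum


class DlpackDeviceType(IntEnum):
    """Subset of the device type tags used by the interchange protocol."""

    CPU = 1
    CUDA = 2
    CPU_PINNED = 3
    ROCM = 10


def assert_host_buffers(interchange_df):
    """Raise RuntimeError if any buffer of the interchange object is off-CPU.

    Hypothetical consumer-side guard; expects the object returned by
    calling __dataframe__() on a producer dataframe.
    """
    names = interchange_df.column_names()
    columns = interchange_df.get_columns()
    for name, column in zip(names, columns):
        # get_buffers() maps "data", "validity" and "offsets" to either
        # None or a (buffer, dtype) pair.
        for kind, entry in column.get_buffers().items():
            if entry is None:
                continue
            buffer, _dtype = entry
            device_type, _device_id = buffer.__dlpack_device__()
            if device_type != DlpackDeviceType.CPU:
                raise RuntimeError(
                    f"column {name!r}: {kind} buffer is on device type "
                    f"{device_type}, not CPU; copy the data to host first"
                )
```

A consumer would call `assert_host_buffers(df.__dataframe__())` before touching any raw pointers, turning the segfault into an ordinary exception.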

wence- self-assigned this Nov 19, 2024
wence- added the "0 - Waiting on Author" (Waiting for author to respond to review) label Nov 19, 2024

@MarcoGorelli
Contributor Author

Thanks for your response!

I reported this because I was expecting it to work, and was surprised that it didn't when I tested out Plotly with cuDF.

At least in plotly/plotly.py#4244 it looks like people were expecting that pd.api.interchange.from_dataframe would enable support for cuDF.

wence- removed the "0 - Waiting on Author" (Waiting for author to respond to review) label Nov 20, 2024

@wence-
Contributor

wence- commented Nov 21, 2024

OK thanks. Concretely here, because the dlpack interface does work, and the arrow C interchange protocol has gained enough adoption (and has a device version), our plan is to deprecate this interchange format (in what will be released as 25.02), and point people at those interchange options instead.

I've marked this one as wontfix, because, per my reading of the spec, cudf does not have a bug here.

wence- added the "wontfix" (This will not be worked on) label and removed the "bug" (Something isn't working) label Nov 21, 2024

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 21, 2024

thanks for looking into this!

agree on deprecating - the nice thing about the interchange protocol was that it brought people from libraries together to collaborate, and that is valuable - but at this point I think it's outlived its usefulness

@vyasr
Contributor

vyasr commented Nov 21, 2024

#17403
