Signature for a standard `from_dataframe` constructor function #42

rgommers · 2021-05-19T21:45:10Z

One of the "to be decided" items at https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md#to-be-decided is:

Should there be a standard from_dataframe constructor function? This isn't completely necessary, however it's expected that a full dataframe API standard will have such a function. The array API standard also has such a function, namely from_dlpack. Adding at least a recommendation on syntax for this function would make sense, e.g., from_dataframe(df, stream=None). Discussion at #29 (comment) is relevant.

In the announcement blog post draft I tentatively answered that with "yes", and added an example. The question is what the desired signature should be. The Pandas prototype currently has the most basic signature one can think of:

def from_dataframe(df : DataFrameObject) -> pd.DataFrame:
    """
    Construct a pandas DataFrame from ``df`` if it supports ``__dataframe__``
    """
    if isinstance(df, pd.DataFrame):
        return df

    if not hasattr(df, '__dataframe__'):
        raise ValueError("`df` does not support __dataframe__")

    return _from_dataframe(df.__dataframe__())

The above just takes any dataframe supporting the protocol, and turns the whole things in the "library-native" dataframe. Now of course, it's possible to add functionality to it, to extract only a subset of the data. Most obviously, named columns:

def from_dataframe(df : DataFrameObject, *, colnames : Optional[Iterable[str]]= None) -> pd.DataFrame:

Other things we may or may not want to support:

columns by index
get a subset of chunks

My personal feeling is:

columns by index: maybe, and if we do then with a separate keyword like col_indices=None
a subset of chunks: probably not. This is more advanced usage, and if one needs it it's likely one wants to get the object returned by __dataframe__ first, then inspect some metadata, and only then decide what chunks to get.

Thoughts?

The text was updated successfully, but these errors were encountered:

rgommers · 2021-05-27T12:39:17Z

There was a little bit of hesitation about adding this function to a public API. For the initial I'd suggest adding in phrasing along these lines:

Each dataframe library will need to implement functionality to construct a new dataframe from a foreign dataframe object with a __dataframe__ method.
This could be exposed to users within existing public API, e.g., handling it within DataFrame.__init__().
We recommend (but don't require) adding a standalone factory function for this purpose. IF an implementer decides to do this, please use the from_dataframe signature given above. And if you need extensions, come talk to us.

rgommers · 2022-09-13T12:05:21Z

This would be nice to revisit, before everyone makes up their own thing in a different namespace in their library. Like this:

>>> import pandas as pd
>>> pd.__version__
'1.5.0rc0'
>>> [name for name in dir(pd.api.interchange) if not name.startswith('_')]
['DataFrame', 'from_dataframe']

>>> pd.api.interchange.from_dataframe?
Signature: pd.core.interchange.from_dataframe.from_dataframe(df, allow_copy=True) -> 'pd.DataFrame'

See https://pandas.pydata.org/docs/dev/reference/api/pandas.api.interchange.from_dataframe.html

jorisvandenbossche · 2022-09-13T12:22:34Z

Do you want to standardize the signature, or also the namespace / location in the library?

rgommers · 2022-09-13T12:24:56Z

Good point. I think those are separate questions. Signature is more important I'd say. Namespace is only important once we have a concept of a "dataframe API standard namespace" - so that can be ignored for the purpose of this issue.

rgommers · 2022-09-15T15:02:50Z

Pandas code and signature:

def from_dataframe(df, allow_copy=True) -> pd.DataFrame:

Vaex code and signature:

def from_dataframe_to_vaex(df: DataFrameObject, allow_copy: bool = True) -> vaex.dataframe.DataFrame:

Modin code for function and code for method and signature:

def from_dataframe(df):

class PandasDataframe:
    def from_dataframe(cls, df: "ProtocolDataframe") -> "PandasDataframe":

cuDF code and signature:

def from_dataframe(df, allow_copy=False):

I found the explanation for allow_copy deviations in some older meeting notes:

@maartenbreddels: if allow_copy or allow_memory_copy, then clearer to me. I am more in favor of allow_copy being False and thus being safe (performance-wise, and that I don't accidentally crash my computer).

@jorisvandenbossche: an example would be string columns in pandas. Currently, in pandas, we cannot support arrow string columns, where two buffers. In the future, pandas will use arrow, but right now uses NumPy's object dtype. So atm, pandas would require a copy, so would always raise an exception.

Based on the above, I think we can explicitly state that allow_copy can have any default, and that libraries must add an allow_copy keyword.

rgommers · 2022-09-16T11:34:06Z

The summary of a discussion on this yesterday was:

It's be useful to align the signature across libraries
allow_copy default value may vary
It must be clear what the exact semantics of allow_copy are. Given that it's implemented as a pass-through to __dataframe__, the description there should apply: https://github.com/data-apis/dataframe-api/blob/2b37b1d/protocol/dataframe_protocol.py#L401-L404. That could still be made more precise though.
Modin has no allow_copy now because it basically always makes a copy. There's an internal "object store" where the data is copied to even if it's nicely laid out in memory already. That said, @vnlitvinov was fine with adding allow_copy=True for consistency.
It is not desired at this point to expose from_dataframe in a standard namespace, to avoid folks using (e.g.) pd.from_dataframe and still converting away from the native representation. However, Allow to reconstruct a library-specific DataFrame object from an interchange object #85 has a better alternative (expose the native converter on the exchange df object itself).

rgommers mentioned this issue May 20, 2021

Add a prototype of the dataframe interchange protocol #38

Merged

rgommers mentioned this issue Jun 25, 2021

Tracking issue: dataframe protocol implementation #46

Open

8 tasks

rgommers added the interchange-protocol label Jun 25, 2021

YarShev mentioned this issue Mar 16, 2022

FEAT-#4244: Implement dataframe exchange protocol for OmniSci modin-project/modin#4269

Merged

8 tasks

rgommers mentioned this issue Sep 19, 2022

Allow to reconstruct a library-specific DataFrame object from an interchange object #85

Open

thomasjpfan mentioned this issue Mar 21, 2023

Support other dataframes like polars and pyarrow not just pandas scikit-learn/scikit-learn#25896

Open

thomasjpfan mentioned this issue Apr 7, 2023

ENH Support dataframe exchange protocol in ColumnTransformer as input scikit-learn/scikit-learn#26115

Closed

rgommers mentioned this issue Jul 20, 2023

"levels" of the Standard #201

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Signature for a standard `from_dataframe` constructor function #42

Signature for a standard `from_dataframe` constructor function #42

rgommers commented May 19, 2021

rgommers commented May 27, 2021

rgommers commented Sep 13, 2022

jorisvandenbossche commented Sep 13, 2022

rgommers commented Sep 13, 2022

rgommers commented Sep 15, 2022

rgommers commented Sep 16, 2022

Signature for a standard from_dataframe constructor function #42

Signature for a standard from_dataframe constructor function #42

Comments

rgommers commented May 19, 2021

rgommers commented May 27, 2021

rgommers commented Sep 13, 2022

jorisvandenbossche commented Sep 13, 2022

rgommers commented Sep 13, 2022

rgommers commented Sep 15, 2022

rgommers commented Sep 16, 2022

Signature for a standard `from_dataframe` constructor function #42

Signature for a standard `from_dataframe` constructor function #42