Signature for a standard from_dataframe constructor function
#42
Comments
There was a little bit of hesitation about adding this function to a public API. For the initial I'd suggest adding in phrasing along these lines:
This would be nice to revisit, before everyone makes up their own thing in a different namespace in their library. Like this:
See https://pandas.pydata.org/docs/dev/reference/api/pandas.api.interchange.from_dataframe.html
Do you want to standardize the signature, or also the namespace / location in the library?
Good point. I think those are separate questions. Signature is more important I'd say. Namespace is only important once we have a concept of a "dataframe API standard namespace" - so that can be ignored for the purpose of this issue.
Pandas code and signature:
    def from_dataframe(df, allow_copy=True) -> pd.DataFrame:
Vaex code and signature:
    def from_dataframe_to_vaex(df: DataFrameObject, allow_copy: bool = True) -> vaex.dataframe.DataFrame:
Modin code for function and code for method and signature:
    def from_dataframe(df):
    class PandasDataframe:
        def from_dataframe(cls, df: "ProtocolDataframe") -> "PandasDataframe":
cuDF code and signature:
    def from_dataframe(df, allow_copy=False):
I found the explanation for
@maartenbreddels: if
@jorisvandenbossche: an example would be string columns in pandas. Currently, pandas cannot support Arrow string columns, which use two buffers. In the future pandas will use Arrow, but right now it uses NumPy's object dtype. So at the moment pandas would require a copy, and so would always raise an exception.
Based on the above, I think we can explicitly state that
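To illustrate the `allow_copy` behaviour discussed in this comment, here is a minimal, library-free sketch. `ToyColumn` and its `zero_copy` flag are hypothetical stand-ins invented for this example; they are not part of pandas, Vaex, Modin, cuDF, or the protocol. The only point is the control flow: reuse the data when possible, copy when allowed, and raise otherwise.

```python
from typing import Any, Dict, List


class ToyColumn:
    """Hypothetical stand-in for a producer's column (not a real library type)."""

    def __init__(self, values: List[Any], zero_copy: bool) -> None:
        self.values = values
        # Assumption for this sketch: the producer tells us whether the
        # consumer can use the data as-is, without converting it.
        self.zero_copy = zero_copy


def from_dataframe(columns: Dict[str, ToyColumn], allow_copy: bool = True) -> Dict[str, List[Any]]:
    """Build a toy "native" dataframe, represented as a plain dict of lists."""
    result: Dict[str, List[Any]] = {}
    for name, col in columns.items():
        if col.zero_copy:
            # The underlying list object is reused, so nothing is copied.
            result[name] = col.values
        elif allow_copy:
            # Conversion is needed, and the caller has permitted a copy.
            result[name] = list(col.values)
        else:
            # A copy would be required but was not allowed.
            raise RuntimeError(f"column {name!r} cannot be converted without a copy")
    return result
```

With a default of `allow_copy=True` the call always succeeds; with `allow_copy=False`, a column that needs conversion (the pandas string-column case mentioned above) makes the call raise.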
The summary of a discussion on this yesterday was:
One of the "to be decided" items at https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md#to-be-decided is:
Should there be a standard from_dataframe constructor function? This isn't completely necessary, however it's expected that a full dataframe API standard will have such a function. The array API standard also has such a function, namely from_dlpack. Adding at least a recommendation on syntax for this function would make sense, e.g., from_dataframe(df, stream=None). Discussion at #29 (comment) is relevant.
In the announcement blog post draft I tentatively answered that with "yes", and added an example. The question is what the desired signature should be. The Pandas prototype currently has the most basic signature one can think of:
The above just takes any dataframe supporting the protocol, and turns the whole thing into the "library-native" dataframe. Now of course, it's possible to add functionality to it, to extract only a subset of the data. Most obviously, named columns:
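As a rough sketch of what selecting named columns could look like: the keyword `colnames` below is a hypothetical parameter name chosen for illustration, and `ToyProtocolFrame` is a dict-backed toy whose method names are modelled on the interchange protocol's column accessors. Neither is taken from an actual implementation.

```python
from typing import Any, Dict, Iterable, List, Optional


class ToyProtocolFrame:
    """Hypothetical, dict-backed stand-in for an interchange dataframe object."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = data

    def column_names(self) -> List[str]:
        return list(self._data)

    def get_column_by_name(self, name: str) -> List[Any]:
        return self._data[name]

    def select_columns_by_name(self, names: Iterable[str]) -> "ToyProtocolFrame":
        # Returns a new frame holding only the requested columns, in the order given.
        return ToyProtocolFrame({name: self._data[name] for name in names})


def from_dataframe(
    df: ToyProtocolFrame, *, colnames: Optional[Iterable[str]] = None
) -> Dict[str, List[Any]]:
    """Convert to a toy "native" dataframe (a dict of lists), optionally a subset."""
    if colnames is not None:
        # Narrow the frame before converting, so unused columns are never touched.
        df = df.select_columns_by_name(colnames)
    return {name: list(df.get_column_by_name(name)) for name in df.column_names()}
```

Calling `from_dataframe(frame)` converts everything, while `from_dataframe(frame, colnames=["c", "a"])` converts only those two columns, in that order.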
Other things we may or may not want to support:
My personal feeling is:
col_indices=None
__dataframe__
first, then inspect some metadata, and only then decide what chunks to get.
Thoughts?
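One possible reading of the "`__dataframe__` first, then metadata, then chunks" workflow, as a sketch only. `ToyProducer`, `ToyInterchangeFrame`, and `head_chunks` are hypothetical names made up for this example; the method names `num_rows`, `num_chunks`, and `get_chunks` are modelled on the protocol, and chunks are plain lists here.

```python
from typing import Iterator, List


class ToyInterchangeFrame:
    """Hypothetical interchange object holding pre-split chunks of rows."""

    def __init__(self, chunks: List[List[int]]) -> None:
        self._chunks = chunks

    def num_chunks(self) -> int:
        return len(self._chunks)

    def num_rows(self) -> int:
        return sum(len(chunk) for chunk in self._chunks)

    def get_chunks(self) -> Iterator[List[int]]:
        return iter(self._chunks)


class ToyProducer:
    """Hypothetical library dataframe exposing the protocol entry point."""

    def __init__(self, chunks: List[List[int]]) -> None:
        self._chunks = chunks

    def __dataframe__(self, allow_copy: bool = True) -> ToyInterchangeFrame:
        return ToyInterchangeFrame(self._chunks)


def head_chunks(producer: ToyProducer, max_rows: int) -> List[List[int]]:
    """Return just enough leading chunks to cover at least max_rows rows."""
    xdf = producer.__dataframe__()      # 1. obtain the interchange object
    if xdf.num_rows() <= max_rows:      # 2. inspect metadata before touching data
        return list(xdf.get_chunks())
    selected: List[List[int]] = []
    seen = 0
    for chunk in xdf.get_chunks():      # 3. only now decide which chunks to take
        if seen >= max_rows:
            break
        selected.append(chunk)
        seen += len(chunk)
    return selected
```

Under this reading, a `from_dataframe` with few parameters would stay simple, and callers who need finer control would work with the interchange object directly.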