RFC Replace column with expression #247
@@ -63,104 +61,49 @@ def concat(dataframes: Sequence[DataFrame]) -> DataFrame:
     """
     ...


-def column_from_sequence(sequence: Sequence[Any], *, dtype: Any, name: str = '', api_version: str | None = None) -> Column[Any]:
if we get rid of Column, then we can't initialise a column. So, column_from_sequence, column_from_1d_array, and dataframe_from_dict would need to go, and the only initialiser left would be dataframe_from_2d_array
maybe we can consider adding others, but I think this is fine for now (and the only one which scikit-learn would probably need, if they're converting to ndarray and then converting back to dataframe)
@@ -63,104 +61,49 @@ def concat(dataframes: Sequence[DataFrame]) -> DataFrame:
     """
     ...


-def column_from_sequence(sequence: Sequence[Any], *, dtype: Any, name: str = '', api_version: str | None = None) -> Column[Any]:
+def any_rowwise(keys: list[str] | None = None, *, skip_nulls: bool = True) -> Expression:
any_rowwise returns an expression, so it no longer makes sense as a DataFrame method. Hence, I'm moving it to a top-level function
example usage:
def my_agnostic_func(df):
    df = df.__dataframe_consortium_standard__()
    namespace = df.__dataframe_namespace__()
    result = df.get_rows_by_mask(namespace.any_rowwise())
    return result.dataframe
@@ -90,25 +90,6 @@ def groupby(self, keys: str | list[str], /) -> GroupBy:
         """
         ...


-    def get_column_by_name(self, name: str, /) -> Column[Any]:
this can go
if somebody needs to refer to a particular column (e.g. to use for filtering / creating a new column / updating a new column), they can use namespace.col
High-level question: wouldn't this affect array interop significantly? How do you take the result of |
I won't be able to make the meeting today and will be on parental leave for the next month or so (and don't know how my schedule will align following that month...), so sharing some of my thoughts here.
I'm -1 on this RFC as it currently exists. I don't think it's a reasonable tradeoff to remove all of the constructors that we currently have to create Columns and DataFrames other than from 2d arrays.
It looks like Polars
Is there a reasonable path forward towards supporting something like both a
Maybe there's some path forward where we have some kind of
In my opinion, we should identify problematic functionality / APIs and drive consensus on what behavior we wish to support for them before we dive into solutions. Top of head for me is:
Here's an example:
@kkraus14 could you show an example of where you need a Column, and a 1-column DataFrame won't suffice?
Things just get funky and non-ergonomic in general. I.E. what does
We lose a ton of typing information, the number of columns in DataFrames needs to be checked all over the place (and you could imagine lazy implementations lazily resolving the number and names of columns), and I have a strong suspicion that, as we continue to build out the APIs we support, we will run face first into ambiguity between the behaviors of a 1-column DataFrame and a Column, because the former is inherently 2-dimensional while the latter is inherently 1-dimensional.
For someone wanting to generically handle inputs, it also makes it ambiguous whether a DataFrame with 1 column should be treated as 1d or 2d, i.e. your example above in calling
This list is very useful. Comments on a few concrete methods that we discussed yesterday:
It looks like we shouldn't have
The answer seems to be no for For |
They can be supported in expressions, e.g.
def my_func(df):
    df = df.__dataframe_consortium_standard__()
    namespace = df.__dataframe_namespace__()
    col = namespace.col
    result = df.select([col('a') - col('a').mean()])
    return result.dataframe
We'd need to explicitly state what the broadcasting rules are, but we'd only be dealing with scalars and 1D columns, so they should be simple
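To make the "scalars and 1D columns" case concrete, here's a rough sketch in plain Python. The subtract helper is made up for illustration (it isn't part of the standard), and plain lists stand in for columns; the only rules it assumes are "a scalar broadcasts against every element" and "two columns must have equal length".

```python
from statistics import mean


def subtract(column, other):
    """Element-wise `column - other`; `other` is a scalar or an equal-length column."""
    if isinstance(other, (int, float)):
        # Scalar case: broadcast the single value against every element.
        return [x - other for x in column]
    if len(other) != len(column):
        # Column case: no broadcasting between columns of different lengths.
        raise ValueError("columns must have the same length")
    return [x - y for x, y in zip(column, other)]


a = [1.0, 2.0, 6.0]
centered = subtract(a, mean(a))  # column - scalar, like col('a') - col('a').mean()
diff = subtract(a, a)            # column - column
```

With only these two cases there is no shape arithmetic to specify, which is why the rules should stay simple.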
def my_func(df):
    df = df.__dataframe_consortium_standard__()
    namespace = df.__dataframe_namespace__()
    result = df.get_rows(namespace.sorted_indices(['b', 'a']))
    return result.dataframe
My implementation's here if you want to try this out: data-apis/dataframe-api-compat#13
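For anyone unfamiliar with the sorted_indices / get_rows pairing, here's a rough eager sketch of what it amounts to. It uses a dict of lists as a stand-in table; the two helper functions below are hypothetical and only borrow the names, whereas in the proposal sorted_indices would return a lazy expression.

```python
def sorted_indices(table, keys):
    """Return the row positions that would sort `table` by `keys`, in priority order."""
    n_rows = len(next(iter(table.values())))
    return sorted(range(n_rows), key=lambda i: tuple(table[k][i] for k in keys))


def get_rows(table, indices):
    """Return a new table containing the rows at `indices`, in that order."""
    return {name: [values[i] for i in indices] for name, values in table.items()}


table = {"a": [3, 1, 2], "b": [1, 0, 1]}
order = sorted_indices(table, ["b", "a"])  # sort by 'b', ties broken by 'a'
result = get_rows(table, order)
```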
If I'm reading the temperature of the room right, people are generally:
If that's the case, I'll flesh out the proposal and implementation, and hopefully we can work something out.
If anyone's completely "-1" on the above and unwilling to change their mind, please do speak out.
I'm off work most of next week and part of the week after, so that'll limit how much I can get done, but I think I can get something ready by the next call.
closing in favour of #249
Just demoing what #229 would look like
Why?
query engine frontends
The current API makes it look like certain operations are possible. However, for some implementations (e.g. dataframes as frontends to SQL backends, polars lazyframes), the following is currently not possible:
because cross-DataFrame comparisons require a join to have been done beforehand.
However, the following is:
The API should make it clear what's allowed and what isn't
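To illustrate the SQL-backend constraint, here's a small sketch using Python's built-in sqlite3. The tables t1 and t2 and their columns are made up for illustration; the point is only that a filter can compare columns of one table directly, but needs an explicit join before it can compare against another table's column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 (id INTEGER, a INTEGER, b INTEGER);
    CREATE TABLE t2 (id INTEGER, b INTEGER);
    INSERT INTO t1 VALUES (0, 1, 5), (1, 4, 2), (2, 3, 3);
    INSERT INTO t2 VALUES (0, 0), (1, 9), (2, 1);
    """
)

# Comparing two columns of the same table needs no extra step.
same_table = conn.execute("SELECT id FROM t1 WHERE a > b ORDER BY id").fetchall()

# Referring to t2's column while selecting only from t1 is rejected by the engine.
try:
    conn.execute("SELECT id FROM t1 WHERE t1.a > t2.b")
    cross_without_join_failed = False
except sqlite3.OperationalError:
    cross_without_join_failed = True

# After an explicit join, the cross-table comparison is well-defined.
joined = conn.execute(
    "SELECT t1.id FROM t1 JOIN t2 ON t1.id = t2.id "
    "WHERE t1.a > t2.b ORDER BY t1.id"
).fetchall()
```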
Readability
We currently need to write code like
which to be honest looks like someone's taken the worst parts of pandas and made them even uglier
What's the suggestion?
Therefore, to make the API more clearly suggest what is possible, I'm suggesting to add namespace.col, which would allow the above to be rewritten as:
col('a') is an Expression. It is a function which maps a DataFrame to a column, and is only evaluated when in the context of a DataFrame method. In the example above, col('a') is kind of like lambda df: df['a'], and col('b') like lambda df: df['b']. So, when evaluated within df1, they resolve to:
Expressions can be combined. For example, here is a little demo of standardising each column of the Iris dataset (based on the example I gave at EuroSciPy 2023):
What would the consequences be?
Every method currently available on Column should be available on Expression. The only difference is that Expressions are always lazy, and can only be evaluated within DataFrame methods which accept them.
Concretely:
Column should be available on Expression, except: dtype