RFC Replace column with expression #247

Closed

Conversation

@MarcoGorelli (Contributor) commented Aug 31, 2023

Just demoing what #229 would look like

Why?

Query engine frontends

The current API makes it look like certain operations are possible. However, for some implementations (e.g. dataframes as frontends to SQL backends, polars lazyframes), the following is currently not possible:

df1: DataFrame
df2: DataFrame
df1.get_rows_by_mask(df1.get_column_by_name('a') > df2.get_column_by_name('b'))

because cross-DataFrame comparisons require a join to have been done beforehand.

However, the following is possible:

df1: DataFrame
df1.get_rows_by_mask(df1.get_column_by_name('a') > df1.get_column_by_name('b'))

The API should make it clear what's allowed and what isn't

Readability

We currently need to write code like

    plant_statistics = plant_statistics.filter(
        (
            plant_statistics.get_column_by_name("sepal_width")
            > plant_statistics.get_column_by_name("sepal_height")
        )
        | (plant_statistics.get_column_by_name("species") == "setosa")
    )

which to be honest looks like someone's taken the worst parts of pandas and made them even uglier

What's the suggestion?

Therefore, to make the API more clearly reflect what is possible, I'm proposing to add namespace.col, which would allow the above to be rewritten as:

df1: DataFrame
df1.filter(col('a') > col('b'))

col('a') is an Expression. It is a function which maps a DataFrame to a column, and is only evaluated in the context of a DataFrame method. In the example above, col('a') is kind of like lambda df: df['a'], and col('b') is like lambda df: df['b']. So, when evaluated within df1, they resolve to:

df1.loc[df1['a'] > df1['b']]
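
To make the lambda analogy concrete, here is a minimal sketch of how an Expression could be implemented as a deferred callable. It is purely illustrative: the internal accessors df._get_column and _from_mask are hypothetical, and this is not the proposed reference implementation.

class Expression:
    def __init__(self, call):
        # `call` maps a concrete DataFrame to one of its columns
        self._call = call

    def __gt__(self, other):
        # combining expressions just composes the deferred calls
        if isinstance(other, Expression):
            return Expression(lambda df: self._call(df) > other._call(df))
        return Expression(lambda df: self._call(df) > other)

def col(name):
    # roughly `lambda df: df._get_column(name)`, wrapped so it can be combined
    return Expression(lambda df: df._get_column(name))

# a method like DataFrame.filter would then evaluate the expression against `self`:
#
#     def filter(self, mask: Expression) -> DataFrame:
#         return self._from_mask(mask._call(self))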

Expressions can be combined. For example, here is a little demo of standardising each column of the Iris dataset (based on the example I gave at EuroSciPy 2023):

def my_dataframe_agnostic_function(df):
    df = df.__dataframe_consortium_standard__(api_version="2023.09-beta")
    namespace = df.__dataframe_namespace__()
    col = namespace.col

    mask = col("species") != "setosa"
    df = df.filter(mask)

    updated_columns = []
    for column_name in df.get_column_names():
        if column_name == "species":
            continue
        new_column = col(column_name)
        new_column = (new_column - new_column.mean()) / new_column.std()
        updated_columns.append(new_column)

    df = df.update_columns(*updated_columns)
    return df.dataframe

What would the consequences be?

Every method currently available on Column should be available on Expression. The only difference is that Expressions are always lazy, and can only be evaluated within DataFrame methods which accept them.

Concretely:

  • The following dataframe methods would accept Expressions instead of Columns:
    • DataFrame.get_rows_by_mask
    • DataFrame.insert_columns
    • DataFrame.update_columns
    • DataFrame.get_columns_by_name
  • The following would become top-level functions which return Expressions, rather than DataFrame functions which return Columns:
    • DataFrame.all_rowwise -> namespace.all_rowwise
    • DataFrame.any_rowwise -> namespace.any_rowwise
    • DataFrame.sorted_indices -> namespace.sorted_indices
    • DataFrame.unique_indices -> namespace.unique_indices
  • The following would be removed:
    • namespace.column_from_sequence
    • namespace.column_from_1d_array
    • namespace.dataframe_from_dict
    • DataFrame.get_column_by_name
    • Column
  • The following would be added:
    • namespace.col (a function which creates an Expression)
  • Every method currently available on Column should be available on Expression, except:
    • dtype
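
As a rough illustration of the first two bullets, user code might then read as follows (a sketch only; exact signatures are still to be settled):

def demo(df):
    df = df.__dataframe_consortium_standard__()
    namespace = df.__dataframe_namespace__()
    col = namespace.col

    # dataframe methods accept Expressions rather than Columns
    df = df.get_rows_by_mask(col("a") > col("b"))

    # former DataFrame methods become top-level functions returning Expressions
    df = df.get_rows(namespace.sorted_indices(["a", "b"]))

    # update_columns takes Expressions describing the new column values
    df = df.update_columns(col("a") - col("a").mean())
    return df.dataframe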

@@ -63,104 +61,49 @@ def concat(dataframes: Sequence[DataFrame]) -> DataFrame:
"""
...

def column_from_sequence(sequence: Sequence[Any], *, dtype: Any, name: str = '', api_version: str | None = None) -> Column[Any]:
@MarcoGorelli (Contributor, Author) commented:

if we get rid of Column, then we can't initialise a column. So, column_from_sequence, column_from_1d_array, and dataframe_from_dict would need to go, and the only initialiser left would be dataframe_from_2d_array

maybe we can consider adding others, but I think this is fine for now (and the only one which scikit-learn would probably need, if they're converting to ndarray and then converting back to dataframe)
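
For instance, a hedged sketch of that ndarray round trip (the exact signatures of to_array_object and dataframe_from_2d_array are assumed here, not checked against the spec):

def round_trip(df, array_fn):
    # convert to an array, apply an array-based routine (e.g. a scikit-learn
    # transform passed in as `array_fn`), then convert back to a dataframe
    df = df.__dataframe_consortium_standard__()
    namespace = df.__dataframe_namespace__()

    names = df.get_column_names()
    arr = array_fn(df.to_array_object())
    return namespace.dataframe_from_2d_array(arr, names=names)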

@@ -63,104 +61,49 @@ def concat(dataframes: Sequence[DataFrame]) -> DataFrame:
"""
...

def column_from_sequence(sequence: Sequence[Any], *, dtype: Any, name: str = '', api_version: str | None = None) -> Column[Any]:
def any_rowwise(keys: list[str] | None = None, *, skip_nulls: bool = True) -> Expression:
@MarcoGorelli (Contributor, Author) commented:

any_rowwise returns an expression, so it no longer makes sense as a DataFrame method

Hence, I'm moving it to a top-level function

example usage:

def my_agnostic_func(df):
    df = df.__dataframe_consortium_standard__()
    namespace = df.__dataframe_namespace__()

    result = df.get_rows_by_mask(namespace.any_rowwise())
    return result.dataframe

@@ -90,25 +90,6 @@ def groupby(self, keys: str | list[str], /) -> GroupBy:
"""
...

def get_column_by_name(self, name: str, /) -> Column[Any]:
@MarcoGorelli (Contributor, Author) commented:

this can go

if somebody needs to refer to a particular column (e.g. to use for filtering / creating a new column / updating an existing column), they can use namespace.col

@MarcoGorelli MarcoGorelli changed the title Replace column with expression RFC Replace column with expression Aug 31, 2023
@MarcoGorelli MarcoGorelli marked this pull request as ready for review August 31, 2023 14:23
@shwina (Contributor) commented Aug 31, 2023

High-level question: wouldn't this affect array interop significantly? How do you take the result of sorted_indices for example to sort a (standard) array? Alternatively, if you have an array of bools, how would you apply that as a mask to select rows of a DataFrame?

@kkraus14 (Collaborator) commented:

I won't be able to make the meeting today and will be on parental leave for the next month or so (and don't know how my schedule will align following that month...), so sharing some of my thoughts here.

I'm -1 on this RFC as it currently exists.

I don't think it's a reasonable tradeoff to remove all of the constructors that we currently have to create Columns and DataFrames other than from 2d arrays. It looks like Polars LazyFrame supports being constructed from Series, which are Polars' equivalent of Column today. Ibis doesn't actually manage memory itself, but allows constructing its Tables from Pandas / PyArrow / Polars / etc. equivalents of Column. We should keep this functionality in my opinion.

Is there a reasonable path forward towards supporting something like both a Column and a ColumnReference? You could imagine Column is as it currently exists and is generally an eager object. ColumnReference you could imagine being returned from something like DataFrame.col("a"), as opposed to namespace.col("a"). This makes it non-reusable across DataFrames, but binding the expression to the specific DataFrame would allow for resolving things like dtypes and potentially avoids some of the issues we saw in #231 related to allowing building on top of other ColumnReferences. Then, for selections of dataframe columns, maybe it could return ColumnReference objects instead of Column objects?

Maybe there's some path forward where we have some kind of ColumnBase class which both Column and ColumnReference can be derived from and then DataFrame.col() can return a ColumnBase instance? It feels a bit clunky to me but I'm really trying to make a path work that is friendly to both lazy and eager implementations without having to overly restrict functionality.
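
A rough sketch of what that hierarchy could look like (purely illustrative; only the names mentioned above are assumed):

class ColumnBase:
    # shared surface that both variants implement: __gt__, mean, std, ...
    ...

class Column(ColumnBase):
    # eager: wraps materialised data, as Column exists today
    def __init__(self, data):
        self._data = data

class ColumnReference(ColumnBase):
    # lazy: bound to the DataFrame it was created from, so dtypes can be
    # resolved and cross-DataFrame comparisons rejected up front
    def __init__(self, df, name):
        self._df = df
        self._name = name

class DataFrame:
    def col(self, name: str) -> ColumnBase:
        # DataFrame.col("a") returns a reference bound to `self`,
        # as opposed to the unbound namespace.col("a")
        return ColumnReference(self, name)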

In my opinion, we should identify problematic functionality / APIs and drive consensus on what behavior we wish to support for them before we dive into solutions. Top of head for me is:

  • Column.get_value, Column.any and other reductions, or any other function returning a scalar: Can functions that return scalars be nicely supported across lazy implementations? If so, and assuming they remain lazy, see the next point.
  • Scalar.__bool__ and other dunder methods used for Python control flow: Can they be included in the standard, which would imply implicit computation?
  • Column.__eq__ and other binary operators: Lazy implementations like Polars-lazy and Ibis don't allow comparing columns from different dataframes. Our current Column class has no concept of what DataFrame they belong to. What behaviors do we wish to allow or disallow here?
  • Column.__len__: I don't think this allows returning a ducktyped Scalar because of Python magic? If so, maybe remove support for this and add an explicit method that returns a ducktyped Scalar, which then falls into the Scalar category above.
  • DataFrame.to_array_object and Column.to_array_object: Can these be included in the standard which would likely imply implicit computation?
  • DataFrame.shape: I think this would be a tuple where the number of columns is a normal Python integer, number of rows would potentially be a lazy Scalar? Any concerns here?
  • DataFrame.insert_columns: See Add DataFrame.insert_columns #231, but lazy implementations like Polars LazyFrame and Ibis Tables don't support adding arbitrary columns, where the columns need to have been derived from the respective dataframe. What behaviors would we want to support and not support across both eager and lazy implementations?
  • DataFrame.__eq__ and other binary operators: See Remove cross-dataframe comparisons #242, but lazy implementations like Polars and Ibis don't allow cross dataframe operations unless the two dataframes are derived from the same dataframe. Proposal is to remove DataFrame <--> DataFrame support, and only support DataFrame <--> Scalar. Is this reasonable behavior?

@MarcoGorelli (Contributor, Author) commented Aug 31, 2023

How do you take the result of sorted_indices for example to sort a (standard) array?

Here's an example:

df: DataFrame
namespace = df.__dataframe_namespace__()
df = df.select(namespace.sorted_indices(['a', 'b']).rename('indices'))
df = df.collect()  # not yet in the standard
arr = df.to_array_object()
# xp = ... whatever the array api call is to get the namespace
arr = xp.squeeze(arr)

@kkraus14 could you show an example of where you need a Column, and a 1-column Dataframe won't suffice?

@kkraus14 (Collaborator) commented:

@kkraus14 could you show an example of where you need a Column, and a 1-column Dataframe won't suffice?

Things just get funky and non-ergonomic in general. I.e. what does Column.get_value look like without Columns? DataFrame.get_value? Does it throw if it's a DataFrame with more than 1 column? Does DataFrame.sorted_indices return a 1-column DataFrame, and then get_rows takes a DataFrame with 1 column?

We lose a ton of typing information, the number of columns in DataFrames needs to be checked all over the place (and you could imagine lazy implementations lazily resolving the number and names of columns), and I have a strong suspicion that as we continue to build out the APIs we support, we will run face first into ambiguity between the behaviors of a 1-column DataFrame and a Column, because the former is inherently 2-dimensional while the latter is inherently 1-dimensional.

For someone wanting to handle inputs generically, it also makes it ambiguous whether a DataFrame with 1 column should be treated as 1d or 2d; i.e. your example above, in calling squeeze, assumes the caller knows that it should be treated as logically 1d.

@rgommers (Member) commented Sep 1, 2023

In my opinion, we should identify problematic functionality / APIs and drive consensus on what behavior we wish to support for them before we dive into solutions. Top of head for me is:

This list is very useful. Comments on a few concrete methods that we discussed yesterday:

DataFrame.shape: I think this would be a tuple where the number of columns is a normal Python integer, number of rows would potentially be a lazy Scalar? Any concerns here?

It looks like we shouldn't have .shape as an attribute, but rather separate row and column counts, and we should make at least the number of rows a method (e.g., num_rows(), count_rows() or similar), so that it can actually stay lazy. The number of rows is in general unknown for lazy implementations, and computing it as a Python integer is expensive and shouldn't be done implicitly. What it looks like we want here is:

  • .num_columns(): method call, can stay lazy; number of columns is typically known also for lazy implementations, but isn't guaranteed to be
  • .num_rows(): method call, can stay lazy; typically unknown in lazy mode
  • .num_columns() > 3: if number of columns is statically known (like in polars-lazy), this is still fine in lazy mode
  • .num_rows() > 4: now this may raise in lazy mode. but that's a lot better than all of .shape being problematic. Once we whittle it down to this, we can figure out what the syntax is to let the user specify with something like .collect() that yes they really need to know this number and are okay'ing a computation to happen.
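
In user-code terms, a sketch (the method names are the candidates above, not settled API):

def check_shape(df):
    n_cols = df.num_columns()  # can stay lazy; column count is often statically known
    n_rows = df.num_rows()     # can stay lazy; row count is usually unknown in lazy mode

    wide = n_cols > 3   # fine in lazy mode when the schema is known (e.g. polars-lazy)
    long = n_rows > 4   # may raise in lazy mode until the user opts in, e.g. via .collect()
    return wide, long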

Column.get_value, Column.any and other reductions, or any other function returning a scalar: Can functions that return scalars be nicely supported across lazy implementations?

The answer seems to be no for get_value at least.

For .any and other reductions, it looks like they can be supported at the dataframe level, where they'll return a 1-row dataframe I think (see https://github.com/data-apis/dataframe-api-compat/blob/main/dataframe_api_compat/polars_standard/polars_standard.py#L1235-L1237), while at the column level it raises currently (there are no lazy scalars).
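
i.e. something along these lines (illustrative, not the linked implementation verbatim):

df: DataFrame
reduced = df.any()  # a 1-row DataFrame: one boolean per column, can stay lazy
# whereas Column.any() currently raises for lazy implementations,
# since there is no lazy Scalar to return it as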

@MarcoGorelli (Contributor, Author) commented Sep 1, 2023

For .any and other reductions, it looks like they can be supported at the dataframe level, where they'll return a 1-row dataframe I think [...] while at the column level it raises currently (there are no lazy scalars).

They can be supported in expressions, e.g.

def my_func(df):
    df = df.__dataframe_consortium_standard__()
    namespace = df.__dataframe_namespace__()
    col = namespace.col

    result = df.select([col('a') - col('a').mean()])
    return result.dataframe

get_value can also work on the expression level

We'd need to explicitly state what the broadcasting rules are, but we'd only be dealing with scalars and 1D columns, so they should be simple

Does DataFrame.sorted_indices return a 1-column DataFrame and then get_rows takes a DataFrame with 1 column?

namespace.sorted_indices is all you need, e.g.:

def my_func(df):
    df = df.__dataframe_consortium_standard__()
    namespace = df.__dataframe_namespace__()
    result = df.get_rows(namespace.sorted_indices(['b', 'a']))
    return result.dataframe

My implementation's here if you want to try this out: data-apis/dataframe-api-compat#13

@MarcoGorelli (Contributor, Author) commented Sep 1, 2023

If I'm reading the temperature of the room right, people are generally:

  • "-1" on completely removing Column and replacing it with Expression
  • open to the idea of having both Column (eager-only) and Expression (eager or lazy), and separating out DataFrame and EagerFrame, so that the lazy and eager APIs are distinct (even though, apart from .lazy and/or .collect, the former should be a subset of the latter). Not yet "+1", but at least open to the idea

If that's the case, I'll flesh out the proposal and implementation, and hopefully we can work something out

If anyone's completely "-1" on the above and unwilling to change their mind, please do speak out

I'm off-work most of next week and part of the week after, so that'll limit how much I can get done, but I think I can get something ready by the next call

@MarcoGorelli (Contributor, Author) commented:

closing in favour of #249
