Column reductions should return 1-row Column #297

MarcoGorelli · 2023-10-24T08:55:26Z

Let's talk column reductions.

I see two uses cases for them:

a user wants the exact value of a scalar, right now:
```
df: DataFrame
df.col('a').mean()
```
a user just needs to use the scalar as part of another operation, so it can stay lazy if necessary:
```
df: DataFrame
df.assign((df.col('a') - df.col('a').mean()).rename('a_centered'))
```

The Standard currently defines the return value of Column.mean to be Scalar. Implementations are supposed to figure out which of the two cases above the user wants.

I have two problems with this:

I really don't like anything related to implicit materialisation (so long as we're defining a top-level Python API)
we have an inconsistency with the DataFrame case:
- DataFrame.mean returns a 1-row DataFrame
- Column.mean returns a Scalar

Proposal

Column reductions return 1-row Columns (just like how DataFrame reductions return 1-row DataFrames).

Broadcasting rules: a binary operation between a n-row Column and a 1-row Column, the 1-row Column is broadcast to be of length-n. So column - column.mean() is well-defined, and everything can stay lazy if necessary.

If someone really need the value of a reduction now, they can call .get_value(0). And behaviour of scalars may vary based on implementations, but I think that's fine.

At least, for the (much more common) case when reductions are used as part of other operations, the operations can stay completely within the DataFrame API now, the rules become predictable, and everything is well-defined

The text was updated successfully, but these errors were encountered:

MarcoGorelli · 2023-10-24T11:23:47Z

@kkraus14 @shwina @rgommers @ cbourjau I'd really like to discuss this one on Thursday - gentle ping to have a read through before then if you have time, thanks 🙏

cbourjau · 2023-10-24T12:24:09Z

I totally agree that this is a problematic point in the API as it stands now. Thanks for the suggestion!

I am a bit concerned about making Column objects broadcastable, though. The behavior would be rather surprising coming from Pandas (even though I realize that the idea of the standard is not to copy what Pandas is doing):

>>> import pandas as pd
>>> s = pd.Series([1, 2])
>>> s + pd.Series([1])
0    2.0
1    NaN
dtype: float64

I'd rather raise an exception in such a case.

I think this may be an instance of a more general question that I am wondering about: To what extent should the data frame standard piggy-back on the array-api? If the dataframe-api were to fully embrace the array-api, then the answer to the problem at hand would be clear: Reduction functions on Column such as mean and std return a rank-0 standard array (of an array-api-implementation chosen by the dataframe implementation). This would mean that the Column's dtype must have a "physical" dtype which is representable by an array-api dtype (maybe a good general restriction?).

Alternatively, one may consider that reduction functions are always meant to be called on an array object (i.e. column.to_array_object()). However I think that would pose two problems:

I could imagine that this would cause issues for an SQL-backed implementation?
It would be an awkward user experience since a standard array object currently has no member function mean. Instead, the free function of the array namespace would need to be called.

MarcoGorelli · 2023-10-24T12:49:02Z

Thanks for looking into this

The behavior would be rather surprising coming from Pandas

It's surprising only if you expect index alignment, which the consortium is explicitly forbidding (it's not even optional, it's strictly forbidden 😄 )

So if you think of columns as array-like, without an index, then this broadcasting behaviour is exactly what you'd expect:

In [5]: s = np.array([1, 2])

In [6]: s + np.array([1])
Out[6]: array([2, 3])

kkraus14 · 2023-10-24T17:24:05Z

I think this may be an instance of a more general question that I am wondering about: To what extent should the data frame standard piggy-back on the array-api? If the dataframe-api were to fully embrace the array-api, then the answer to the problem at hand would be clear: Reduction functions on Column such as mean and std return a rank-0 standard array (of an array-api-implementation chosen by the dataframe implementation). This would mean that the Column's dtype must have a "physical" dtype which is representable by an array-api dtype (maybe a good general restriction?).

This doesn't work for String, date, datetime, etc. dtype columns that the array API doesn't support or situations where you need to return NULL.

I believe the decision as of now was that Scalars were to follow Python scalars via ducktyping to allow preventing implicit materialization. I.E. in your example above of df.col('a').mean() vs df.assign((df.col('a') - df.col('a').mean()).rename('a_centered')). Assuming df.col('a').mean() returns a Scalar, there's nothing forcing materialization in the second case, and in the first case what forces materialization is __repr__ which is where I think we should focus discussion.

I'm -1 on broadcasting columns because people will run into issues where some data ends up being 1 row unexpectedly and things end up being broadcast instead of failing due to a shape mismatch. Scalars allow us to handle that behavior more nicely, we just need to align on what type of behavior we want to have for things like __repr__, __bool__, __int__, etc.

MarcoGorelli · 2023-10-24T17:35:07Z

thanks for taking a look

we just need to align on what type of behavior we want to have for things like repr, bool, int, etc.

ok let's pivot to that (or rather, punt on it), I'll just raise for now

MarcoGorelli added the API design label Oct 24, 2023

MarcoGorelli mentioned this issue Oct 24, 2023

Add example of a sklearn like pipeline #294

Merged

MarcoGorelli closed this as completed Oct 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Column reductions should return 1-row Column #297

Column reductions should return 1-row Column #297

MarcoGorelli commented Oct 24, 2023 •

edited

Loading

MarcoGorelli commented Oct 24, 2023

cbourjau commented Oct 24, 2023

MarcoGorelli commented Oct 24, 2023

kkraus14 commented Oct 24, 2023

MarcoGorelli commented Oct 24, 2023

Column reductions should return 1-row Column #297

Column reductions should return 1-row Column #297

Comments

MarcoGorelli commented Oct 24, 2023 • edited Loading

Let's talk column reductions.

Proposal

MarcoGorelli commented Oct 24, 2023

cbourjau commented Oct 24, 2023

MarcoGorelli commented Oct 24, 2023

kkraus14 commented Oct 24, 2023

MarcoGorelli commented Oct 24, 2023

MarcoGorelli commented Oct 24, 2023 •

edited

Loading