Column reductions should return 1-row Column #297
I totally agree that this is a problematic point in the API as it stands now. Thanks for the suggestion!

I am a bit concerned about making a 1-row column broadcast like this, though. For comparison, here is what pandas does today:

>>> import pandas as pd
>>> s = pd.Series([1, 2])
>>> s + pd.Series([1])
0    2.0
1    NaN
dtype: float64

I'd rather raise an exception in such a case.

I think this may be an instance of a more general question that I am wondering about: to what extent should the data frame standard piggy-back on the array-api? If the dataframe-api were to fully embrace the array-api, then the answer to the problem at hand would be clear: reduction functions on a column would return a zero-dimensional, scalar-like result, just as they do for arrays. Alternatively, one may consider that reduction functions are always meant to be called on an array object (i.e. on the column after converting it to an array).
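For reference, a rough NumPy illustration of what the two array-flavoured options could look like (NumPy is only a stand-in here, not the standard's API):

import numpy as np

x = np.array([1.0, 2.0, 3.0])

m = np.mean(x)                  # array-api style reduction: a zero-dimensional, scalar-like result
m1 = np.mean(x, keepdims=True)  # keepdims=True retains the reduced axis: array([2.0]), a length-1 result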
Thanks for looking into this!

It's surprising only if you expect index alignment, which the consortium is explicitly forbidding (it's not even optional, it's strictly forbidden 😄). So if you think of columns as array-like, without an index, then this broadcasting behaviour is exactly what you'd expect:

In [5]: s = np.array([1, 2])

In [6]: s + np.array([1])
Out[6]: array([2, 3])
This doesn't work for that; I believe the decision as of now was that Scalars were to follow Python scalars via duck typing, to allow preventing implicit materialization. I.e. in your example above of column - column.mean(), the .mean() call would hand back such a Scalar rather than an already-materialized value.

I'm -1 on broadcasting columns, because people will run into issues where some data unexpectedly ends up being 1 row and things end up being broadcast instead of failing due to a shape mismatch. Scalars allow us to handle that behaviour more nicely; we just need to align on what type of behaviour we want for cases like this one.
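A minimal sketch of what such a duck-typed, non-materializing scalar could look like (LazyScalar is a hypothetical name, not something defined by the standard):

class LazyScalar:
    """Hypothetical duck-typed scalar: wraps a deferred computation and only
    materializes when a real Python value is demanded."""

    def __init__(self, compute):
        # compute: zero-argument callable producing the underlying value
        self._compute = compute

    def __sub__(self, other):
        # Arithmetic stays lazy: build a new deferred computation.
        return LazyScalar(lambda: self._compute() - float(other))

    def __float__(self):
        # Materialization only happens here.
        return float(self._compute())


# The mean is not computed until float() forces it.
mean = LazyScalar(lambda: sum([1.0, 2.0, 3.0]) / 3)
print(float(mean - 1.0))  # 1.0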
Thanks for taking a look.

OK, let's pivot to that (or rather, punt on it). I'll just raise for now.
Let's talk column reductions.

I see two use cases for them:

1. getting the value of a reduction out right away, to use outside the DataFrame API;
2. using the reduction as part of further operations (e.g. subtracting a column's mean from the column), where everything can stay lazy.

The Standard currently defines the return value of Column.mean to be Scalar. Implementations are supposed to figure out which of the two cases above the user wants.

I have two problems with this:

- DataFrame.mean returns a 1-row DataFrame
- Column.mean returns a Scalar
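For comparison, pandas today has a similar asymmetry (only an analogy: pandas reduces a DataFrame to a Series rather than to a 1-row DataFrame):

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

df.mean()        # Series with one entry per column: still a pandas object
df["a"].mean()   # plain scalar: leaves the DataFrame world entirely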
Proposal

Column reductions return 1-row Columns (just like how DataFrame reductions return 1-row DataFrames).

Broadcasting rules: in a binary operation between an n-row Column and a 1-row Column, the 1-row Column is broadcast to length n. So column - column.mean() is well-defined, and everything can stay lazy if necessary.

If someone really needs the value of a reduction now, they can call .get_value(0). The behaviour of scalars may vary based on implementations, but I think that's fine.

At least, for the (much more common) case when reductions are used as part of other operations, the operations can stay completely within the DataFrame API, the rules become predictable, and everything is well-defined.
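A rough illustration of the proposed rule, using NumPy arrays as stand-ins for Columns (an analogy, not the standard's API):

import numpy as np

column = np.array([1.0, 2.0, 3.0])

mean_1row = column.mean(keepdims=True)  # the reduction as a length-1 "column": array([2.0])
centered = column - mean_1row           # the length-1 operand broadcasts to length 3: array([-1., 0., 1.])
value_now = mean_1row[0]                # analogue of .get_value(0) when the value is needed eagerly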