Inconsistencies in groupby aggregation with non-numeric types #13416
Comments
Hmm, yeah, this looks suspect. Note that I am only referring to the inbuilt methods.
Normal aggregation also has inconsistencies. It excludes the categorical column if the frame has a timestamp column, and includes it otherwise.
Maybe it should be excluded by default even if the categorical's internal values are numeric, and included if
As for normal aggregation, there's also this:
df[['C1']].sum()
Out[10]:
C1    23
dtype: int64
df['C1'].sum()
...
TypeError: Categorical cannot perform the operation sum
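The DataFrame/Series difference above can be checked on whatever pandas version is installed. This is only a sketch: the one-column frame and the `try_call` helper are hypothetical, introduced for illustration, and the result of the DataFrame call depends on the pandas version (older releases returned a value, newer ones may raise).

```python
import pandas as pd


def try_call(fn):
    """Run fn and return ('ok', result) or ('error', exception class name)."""
    try:
        return ("ok", fn())
    except Exception as exc:
        return ("error", type(exc).__name__)


# Hypothetical data: a single categorical column with numeric categories.
df = pd.DataFrame({"C1": pd.Categorical([3, 8, 12])})

# Same column, reduced once through the DataFrame and once through the Series.
frame_result = try_call(lambda: df[["C1"]].sum())
series_result = try_call(lambda: df["C1"].sum())

print("DataFrame.sum:", frame_result)
print("Series.sum:   ", series_result)
```

If the two lines disagree (one `ok`, one `error`), the installed version still shows the inconsistency described here.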
Please also consider the following case: if I call DataFrame.mean(axis=1), all values are NaN and the dtype becomes float64, because _ensure_numeric is called, which raises "TypeError: Could not convert A(..) to numeric".
Every op in the OP with non-numeric data now raises due to #46560.
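For anyone who wants the old "skip non-numeric columns" behaviour back, a minimal sketch, assuming a pandas version where groupby reductions accept `numeric_only`. The frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: a string grouper, one numeric column, one string column.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "value": [1.0, 3.0, 5.0],
        "label": ["x", "y", "z"],
    }
)

# Opt in explicitly to dropping non-numeric columns from the reduction.
means = df.groupby("key").mean(numeric_only=True)
print(means)
```

Only `value` is expected in the result; `label` is left out instead of being reduced.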
xref #13992
Some of the issues and inconsistencies I noticed.
(Apologies for the somewhat lengthy input.)
Settings:
Usually, non-numeric types are skipped in aggregation functions:
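A version-independent way to get this "numeric columns only" result is to select the numeric columns explicitly before aggregating. This is a sketch on a hypothetical frame, not the data used in this report.

```python
import pandas as pd

# Hypothetical frame: string grouper, one numeric column, one timestamp column.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "value": [1, 2, 3],
        "when": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03"]),
    }
)

# Pick the numeric columns up front so the aggregation never sees the others.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
result = df.groupby("key")[numeric_cols].sum()
print(result)
```

Selecting the columns by hand makes the intent explicit and does not depend on how a given release treats timestamps or categoricals.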
Issues:
But if there are no numeric types, the output varies. (I use subframes df_xxx of the original data frame here.)
Some other issues:
Multiple groupers with a categorical one (it's already addressed in #13204).
apply
transform
What should be the expected output?
Some ideas for aggregation (with sum()) when there are no numeric types (1)-(10):
.mean() does)
(but (a) should .mean() do the same? (b) should groupers be returned when as_index=False?)
output of pd.show_versions()
Exactly the same output with:
Just a thought about a possible approach:
A call to _python_apply_general inside _python_agg_general
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L795
seems to trigger some of the issues. (It doesn't exclude grouper columns when as_index=False, treats categoricals as their underlying types, and possibly just replicates some actions from the earlier part of _python_agg_general.)
The following change affects (and possibly solves, at least partially) issues (5), (6), (7), (9), (10), (14):