API/PERF: Don't reorder categoricals when grouping by an unordered categorical and sort=False
#48749
Comments
Do we have an idea what kind of extra time cost the sorting imposes for something sensible, like 10,000 rows and 100 categories?
With the example in dask/dask#9486 (comment) (2592000 rows, 2481967 groups), the perf difference was ~2x slower for integer categories and ~10x slower for string categories with
I'm thinking we shouldn't reorder the categories at all - the categories are part of the dtype, not the result. We shouldn't be changing the dtype on the user. We're also not consistent - see examples in #49129.
Looking into this, it is complicated by the fact that the _libs groupby code returns the I may be missing something, but I think that since the groupby libs assume codes are in the order of the desired output, we'll have to recode categorical codes in groupby. That said, we can fix the result so that it does not reorder the categories by reordering the categories of Grouping.group_index.
Having not looked at the groupby code in a while and given your insight, I was wondering if internally the result could be returned in the order of
I believe that's what is going on with the recoding for categoricals. Instead of using factorize, groupby uses the categorical codes. But for this, one must recode [2, 1, 2, 0] as [0, 1, 0, 2] when
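To make the recoding mentioned above concrete, here is a minimal sketch in plain Python. It assumes that "recode" means relabeling the codes by order of first appearance. The helper name `recode_by_appearance` is hypothetical and is not a pandas function.

```python
def recode_by_appearance(codes):
    """Relabel integer codes so that new labels follow order of first appearance.

    Hypothetical helper for illustration only; not part of pandas.
    """
    mapping = {}   # old code -> new code
    recoded = []
    for code in codes:
        if code not in mapping:
            # The next unused label is the number of distinct codes seen so far.
            mapping[code] = len(mapping)
        recoded.append(mapping[code])
    return recoded, mapping


recoded, mapping = recode_by_appearance([2, 1, 2, 0])
print(recoded)  # [0, 1, 0, 2]
print(mapping)  # {2: 0, 1: 1, 0: 2}
```

The `mapping` dictionary is the piece that matters for this issue: the codes can be relabeled for the grouping step, and the same mapping can in principle be inverted afterwards instead of permanently reordering the categories stored on the dtype.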
@mroeschke - I am wondering if we should be sorting values when
I would support this behavior. xref #42482 IIUC since
This appears to be fixed on main
Not sure if it needs tests.
xref: dask/dask#9486 (comment)
TLDR: When calling `df.groupby(key=categorical<ordered=False>, sort=False, observed=False)`, the resulting `CategoricalIndex` will have its values and categories unordered.

It's reasonable that the values are not sorted, but a lot of extra work can be spent un-ordering the categories in:
pandas/pandas/core/groupby/categorical.py
Lines 77 to 92 in 44a4f16
This may have been an outcome of fixing #8868, but if, when grouping with `sort=False`, the values can be obtained without reordering the categories, there would probably be a nice performance benefit.
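A small way to inspect the behavior described in this issue is sketched below. The column names and data are made up for illustration. The order of the values and categories on the resulting index depends on the pandas version, so no expected output is shown for those.

```python
import pandas as pd

# Unordered categorical key; "d" is a category that never appears in the data.
cat = pd.Categorical(
    ["c", "a", "c", "b"],
    categories=["a", "b", "c", "d"],
    ordered=False,
)
df = pd.DataFrame({"key": cat, "value": [1, 2, 3, 4]})

result = df.groupby("key", sort=False, observed=False)["value"].sum()

# Compare the categories stored on the input dtype with those on the result.
print("input categories: ", list(df["key"].cat.categories))
print("result index:     ", list(result.index))
print("result categories:", list(result.index.categories))
```

If the two category lists differ, the groupby has changed the dtype's categories, which is the reordering this issue is about.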