PERF: df.groupby(categorical, sort=False) #48976
lgtm, nice speedup!
pandas/core/groupby/categorical.py
Outdated
# TODO: Probably why dropna=False doesn't work GH 36327
unique_notnan_codes = unique1d(c.codes[c.codes != -1])
I'm working on a branch that fixes dropna=False for categorical (currently writing tests, but at least reducers and transformers seem to be working); this section of code is unchanged in that branch. My thinking here is that this is determining the categories, and categories cannot include nan, so dropping -1 is okay even for dropna=False.
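As a rough illustration of the point above, the following sketch shows that missing values in a Categorical are encoded with the code -1 while the categories themselves hold no NaN. The sample data is hypothetical, and the public pd.unique is used here as a stand-in for the internal unique1d helper from the diff; this is not the code in the PR.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value, and category "c" is never observed.
c = pd.Categorical(["b", np.nan, "a", "b"], categories=["a", "b", "c"])

codes = c.codes                          # missing values are encoded as -1
has_nan_category = c.categories.hasnans  # whether the categories contain NaN

# pd.unique keeps order of first appearance; it stands in for the
# internal unique1d helper (an assumption made for this sketch).
observed_codes = pd.unique(codes[codes != -1])

print(codes.tolist())
print(has_nan_category)
print(observed_codes.tolist())
```

Filtering out -1 therefore only affects which codes are treated as observed; it does not remove any category.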
Ah okay, thanks for the insights. I'll remove my comment referencing dropna=False then. I just found it odd that this generates self.grouping_vector and specifically drops -1 codes, unlike sort=True.
return c.reorder_categories(cat.categories), None
# xref GH:46909: Re-ordering codes faster than using (set|add|reorder)_categories
all_codes = np.arange(c.categories.nunique(), dtype=np.int8)
@mroeschke - why the specification of int8 here? This causes overflow when there are more than 256 categories. Found this via the failing asv in #49131 - if it's okay, going to remove the specification of dtype there.
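A minimal sketch of the concern, assuming a hypothetical count of 300 categories: np.int8 is a signed 8-bit type whose largest value is 127, so a full range of codes does not fit once the count grows past that. The example casts explicitly with astype to show the wrap-around; it does not reproduce the exact call from the diff.

```python
import numpy as np

n_categories = 300  # hypothetical number of categories

int8_max = np.iinfo(np.int8).max  # largest value np.int8 can hold

# Without an explicit dtype, np.arange uses the platform default integer,
# which is wide enough for every code.
wide_codes = np.arange(n_categories)

# Forcing the same values into int8 silently wraps around
# (astype uses "unsafe" casting by default).
narrow_codes = wide_codes.astype(np.int8)

print(int8_max)
print(wide_codes[128], narrow_codes[128])
```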
I think I was trying to match the output of Categorical.codes here, which was int8, but here specifically I think ensuring the dtype is int should be sufficient.
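For context, a small sketch of why Categorical.codes may have looked like int8: pandas appears to pick the narrowest signed integer dtype that fits the number of categories, so the codes dtype is not fixed. The two Categoricals below are hypothetical examples.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: a handful of categories versus a few hundred.
small = pd.Categorical(list("abc"))
large = pd.Categorical(list(range(300)))

small_dtype = small.codes.dtype  # narrow dtype for few categories
large_dtype = large.codes.dtype  # wider dtype once int8 no longer fits

print(small_dtype, large_dtype)
```

Matching a hard-coded int8 would therefore only agree with Categorical.codes when the number of categories is small.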
Feel free to change it in #49131
* refactor using codes
* PERF: df.groupby(categorical, sort=False)
* Add pr number
* remove comment
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
xref: #48749