PERF: df.groupby(categorical, sort=False) #48976

Merged
mroeschke merged 5 commits into pandas-dev:main from perf/groupby/cat_codes
Oct 7, 2022

Conversation

mroeschke
Member

@mroeschke mroeschke commented Oct 6, 2022

xref: #48749

In [1]: df = pd.DataFrame({"key": pd.Categorical(range(1_000_000 - 1, -1, -1), range(1_000_000))})

In [2]: %timeit df.groupby("key", sort=False)  # PR
66.7 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit df.groupby("key", sort=False)  # main
586 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
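For anyone without IPython at hand, the timing above can be approximated with the standard-library `timeit` module. This is only a sketch: the size `n` is reduced from the 1,000,000 used in the PR so it runs quickly, the helper name `build_groupby` is made up for illustration, and `observed=False` is passed explicitly as an assumption to keep the behaviour the same across pandas versions. Absolute numbers will differ by machine.

```python
import timeit

import pandas as pd

# Assumption: a smaller n than the PR's 1_000_000 so the sketch runs quickly.
n = 100_000

# Same shape as the PR benchmark: values in reverse order of the categories.
df = pd.DataFrame({"key": pd.Categorical(range(n - 1, -1, -1), range(n))})


def build_groupby():
    # Hypothetical helper: only constructs the GroupBy object, as in the PR's %timeit.
    return df.groupby("key", sort=False, observed=False)


calls = 5
best = min(timeit.repeat(build_groupby, number=calls, repeat=3)) / calls
print(f"{best * 1e3:.2f} ms per call")
```

Running the same script on a build before and after the change should show the difference reported above, scaled to the smaller input.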

@mroeschke mroeschke added the Groupby, Performance, and Categorical labels Oct 6, 2022
Member

@rhshadrach rhshadrach left a comment

lgtm, nice speedup!

Comment on lines 82 to 83
# TODO: Probably why dropna=False doesn't work GH 36327
unique_notnan_codes = unique1d(c.codes[c.codes != -1])
Member

I'm working on a branch that fixes dropna=False for categorical (currently writing tests, but at least reducers and transformers seem to be working); this section of code is unchanged in that branch. My thinking here is that this code determines the categories, and categories cannot include nan, so dropping -1 is okay even for dropna=False.
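To illustrate the point about -1 codes, here is a minimal sketch using only public pandas API. The line under review uses the internal helper `unique1d`; the public `pd.unique` is swapped in here as an assumption that it behaves the same way for this purpose (unique values in order of first appearance). The example data is made up.

```python
import pandas as pd

# Made-up data: one missing value, and category "c" never observed.
c = pd.Categorical(["b", None, "a", "b"], categories=["a", "b", "c"])

# Missing values are stored as code -1; they never appear in the categories.
codes = c.codes
print(codes.tolist())          # [1, -1, 0, 1]
print(c.categories.hasnans)    # False

# Mirror of the reviewed line, with pd.unique standing in for unique1d.
unique_notnan_codes = pd.unique(codes[codes != -1])

# Map the surviving codes back to the categories that were actually observed.
observed_categories = c.categories.take(unique_notnan_codes)
print(list(observed_categories))  # ['b', 'a']
```

Because the filtered codes are only used to pick categories, removing -1 here does not by itself decide whether rows with missing keys are kept in the result.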

Member Author

@mroeschke mroeschke Oct 7, 2022

Ah okay thanks for the insights. I'll remove my comment then referencing dropna=False.

I just found it odd that this generates self.grouping_vector and it specifically drops -1 codes unlike sort=True
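For context on the sort=True versus sort=False paths, a small sketch of the user-visible difference for a categorical key. The frame and column names are invented for illustration, and `observed=True` is passed explicitly as an assumption so that only categories present in the data are returned.

```python
import pandas as pd

# Made-up frame: keys appear in the order c, a, b, a.
df = pd.DataFrame(
    {
        "key": pd.Categorical(["c", "a", "b", "a"], categories=["a", "b", "c"]),
        "val": [1, 2, 3, 4],
    }
)

# sort=False: groups come back in order of first appearance.
by_appearance = df.groupby("key", sort=False, observed=True)["val"].sum()

# sort=True: groups come back in category order.
by_category = df.groupby("key", sort=True, observed=True)["val"].sum()

print(list(by_appearance.index), by_appearance.tolist())
print(list(by_category.index), by_category.tolist())
```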

@mroeschke mroeschke added this to the 1.6 milestone Oct 7, 2022
@mroeschke mroeschke merged commit 77d4c3e into pandas-dev:main Oct 7, 2022
@mroeschke mroeschke deleted the perf/groupby/cat_codes branch October 7, 2022 21:05
@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

return c.reorder_categories(cat.categories), None
# xref GH:46909: Re-ordering codes faster than using (set|add|reorder)_categories
all_codes = np.arange(c.categories.nunique(), dtype=np.int8)
Member

@mroeschke - why the specification of int8 here? This causes overflow when there are more than 256 categories. Found this via the failing asv in #49131 - if it's okay, going to remove the specification of dtype there.

Member Author

I think I was trying to match the output of Categorical.codes here which was int8, but here specifically I think ensuring the dtype is int should be sufficient.

Member Author

Feel free to change it in #49131
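A short sketch of why a hard-coded int8 is fragile here. A signed 8-bit integer holds values from -128 to 127, so any code above 127 cannot be represented. The cast below uses `astype` on a default-integer range to make the wrap-around visible in a deterministic way; it is an illustration of the dtype limit, not the exact code path in pandas. The count of 300 categories is arbitrary.

```python
import numpy as np
import pandas as pd

n_categories = 300  # arbitrary count, chosen to exceed the int8 range

# Largest value an int8 can hold.
print(np.iinfo(np.int8).max)  # 127

# Forcing a full range of codes into int8 wraps values past 127.
wrapped = np.arange(n_categories).astype(np.int8)
print(wrapped[126:130].tolist())  # [126, 127, -128, -127]

# Leaving the dtype to NumPy's default integer keeps every code distinct.
all_codes = np.arange(n_categories)
print(len(np.unique(wrapped)), len(np.unique(all_codes)))  # 256 300

# Categorical.codes picks a width that fits the number of categories.
small_dtype = pd.Categorical(range(10)).codes.dtype
large_dtype = pd.Categorical(range(n_categories)).codes.dtype
print(small_dtype, large_dtype)  # int8 int16
```

This matches the remark above that `Categorical.codes` was int8 in the case being looked at: that is true only while the number of categories is small.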

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* refactor using codes

* PREF: df.groupby(categorical, sort=False)

* Add pr number

* remove comment