PERF: df.groupby(categorical, sort=False) #48976

mroeschke · 2022-10-06T18:34:36Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

In [1]: df = pd.DataFrame({"key": pd.Categorical(range(1_000_000 - 1, -1, -1), range(1_000_000))})

In [2]: %timeit df.groupby("key", sort=False)  # PR
66.7 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit df.groupby("key", sort=False)  # main
586 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
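
To reproduce this comparison outside IPython, a rough sketch using the standard-library timeit module might look like the following. The helper name time_groupby and its parameters are invented for illustration and are not part of pandas or of this PR. Absolute timings will vary with machine and pandas version.

```python
import timeit

import pandas as pd


def time_groupby(n: int = 1_000_000, number: int = 10) -> float:
    """Return mean seconds per call of df.groupby("key", sort=False).

    Hypothetical helper for illustration only.
    """
    # Same setup as the PR description: values appear in reverse category order.
    key = pd.Categorical(range(n - 1, -1, -1), categories=range(n))
    df = pd.DataFrame({"key": key})
    total = timeit.timeit(lambda: df.groupby("key", sort=False), number=number)
    return total / number
```

Calling time_groupby() once on main and once on the PR branch should give numbers comparable in spirit to the %timeit output above.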

rhshadrach

lgtm, nice speedup!

rhshadrach · 2022-10-07T02:49:12Z

pandas/core/groupby/categorical.py

+    # TODO: Probably why dropa=False doesn't work GH 36327
+    unique_notnan_codes = unique1d(c.codes[c.codes != -1])


I'm working on a branch that fixes dropna=False for categorical (currently writing tests, but at least reducers and transformers seem to be working); this section of code is unchanged there. My thinking is that this code determines the categories, and categories cannot include NaN, so dropping -1 is okay even for dropna=False.

Ah okay, thanks for the insight. I'll remove my comment referencing dropna=False then.

I just found it odd that this generates self.grouping_vector and specifically drops -1 codes, unlike sort=True.
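
To illustrate the point about -1 codes, here is a small sketch using only public pandas and NumPy calls. pd.unique stands in for the internal unique1d shown in the diff, and the steps after it (appending unobserved categories) are an assumption about how an appearance-ordered category list could be built, not a copy of the pandas implementation.

```python
import numpy as np
import pandas as pd

c = pd.Categorical(["b", None, "a", "b"], categories=["a", "b", "c"])

# Missing values live only in the codes (as -1); the categories hold no NaN.
codes = c.codes
has_nan_category = c.categories.hasnans

# Public stand-in for unique1d: unique codes in order of first appearance.
unique_notnan_codes = pd.unique(codes[codes != -1])

# Assumed follow-up: observed categories first, then the unobserved ones.
all_codes = np.arange(len(c.categories))
missing_codes = np.setdiff1d(all_codes, unique_notnan_codes, assume_unique=True)
take_codes = np.concatenate((unique_notnan_codes, missing_codes))
appearance_order = c.categories.take(take_codes)
```

Because NaN is never a category, filtering out -1 only affects which codes are considered when ordering the categories. The rows with missing keys are still present in c.codes.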

rhshadrach · 2022-10-21T02:32:08Z

pandas/core/groupby/categorical.py

-
-    return c.reorder_categories(cat.categories), None
+    # xref GH:46909: Re-ordering codes faster than using (set|add|reorder)_categories
+    all_codes = np.arange(c.categories.nunique(), dtype=np.int8)


@mroeschke - why specify int8 here? It overflows once there are more than 128 categories (int8 tops out at 127). I found this via the failing asv in #49131; if it's okay, I'm going to remove the dtype specification there.

I think I was trying to match the output of Categorical.codes, which was int8 here, but for this line I think ensuring the dtype is int should be sufficient.

Feel free to change it in #49131
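
A short sketch of the dtype issue discussed in this thread, using only public NumPy and pandas calls. The sizes 100 and 200 are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

# int8 holds values from -128 to 127.
int8_max = np.iinfo(np.int8).max

# Forcing larger positions into int8 wraps around to negative values.
wrapped = np.arange(200).astype(np.int8)

# Letting NumPy pick its default integer dtype avoids the wrap-around.
all_codes = np.arange(200)

# Categorical.codes itself uses a wider dtype once int8 is too small.
small = pd.Categorical(range(100)).codes.dtype
large = pd.Categorical(range(200)).codes.dtype
```

Since Categorical.codes already widens on its own, hard-coding int8 for the positional array only matches it when the number of categories is small.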

mroeschke added 2 commits October 6, 2022 08:54

refactor using codes

eb58451

PREF: df.groupby(categorical, sort=False)

c8da563

mroeschke added the Groupby, Performance (Memory or execution speed performance), and Categorical (Categorical Data Type) labels Oct 6, 2022

Add pr number

a08f936

rhshadrach approved these changes Oct 7, 2022

mroeschke added 2 commits October 7, 2022 10:48

Merge remote-tracking branch 'upstream/main' into perf/groupby/cat_codes

88c11eb

remove comment

330e3c3

mroeschke added this to the 1.6 milestone Oct 7, 2022

mroeschke merged commit 77d4c3e into pandas-dev:main Oct 7, 2022

mroeschke deleted the perf/groupby/cat_codes branch October 7, 2022 21:05

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

rhshadrach reviewed Oct 21, 2022

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

PERF: df.groupby(categorical, sort=False) (pandas-dev#48976)

bf3b613

* refactor using codes
* PREF: df.groupby(categorical, sort=False)
* Add pr number
* remove comment
