Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/PERF: Don't reorder categoricals when grouping by an unordered categorical and sort=False #48749

Open
mroeschke opened this issue Sep 23, 2022 · 9 comments
Labels
Categorical Categorical Data Type Groupby Performance Memory or execution speed performance

Comments

@mroeschke
Copy link
Member

xref: dask/dask#9486 (comment)

TLDR: When calling df.groupby(key=categocial<order=False>, sort=True, observed=False) the resulting CategoricalIndex will have it's values and categories unordered.

In [1]:     df = DataFrame(
   ...:         [
   ...:             ["(7.5, 10]", 10, 10],
   ...:             ["(7.5, 10]", 8, 20],
   ...:             ["(2.5, 5]", 5, 30],
   ...:             ["(5, 7.5]", 6, 40],
   ...:             ["(2.5, 5]", 4, 50],
   ...:             ["(0, 2.5]", 1, 60],
   ...:             ["(5, 7.5]", 7, 70],
   ...:         ],
   ...:         columns=["range", "foo", "bar"],
   ...:     )

In [2]: col = "range"

In [3]: df["range"] = Categorical(df["range"], ordered=False)

In [4]: df.groupby(col, sort=True, observed=False).first().index
Out[4]: CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], categories=['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], ordered=False, dtype='category', name='range')

In [5]: df.groupby(col, sort=False, observed=False).first().index
Out[5]: CategoricalIndex(['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], categories=['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], ordered=False, dtype='category', name='range')

It's reasonable that the values are not sorted, but a lot of extra work can be spent un-ordering the categories in:

# sort=False should order groups in as-encountered order (GH-8868)
cat = c.unique()
# See GH-38140 for block below
# exclude nan from indexer for categories
take_codes = cat.codes[cat.codes != -1]
if cat.ordered:
take_codes = np.sort(take_codes)
cat = cat.set_categories(cat.categories.take(take_codes))
# But for groupby to work, all categories should be present,
# including those missing from the data (GH-13179), which .unique()
# above dropped
cat = cat.add_categories(c.categories[~c.categories.isin(cat.categories)])
return c.reorder_categories(cat.categories), None

May have been an outcome of fixing #8868, but if grouping and sort=False the values can be achieved without reordering the categories, there would probably be a nice performance benefit.

@mroeschke mroeschke added Groupby Performance Memory or execution speed performance Categorical Categorical Data Type Needs Info Clarification about behavior needed to assess issue Needs Discussion Requires discussion from core team before further action and removed Needs Info Clarification about behavior needed to assess issue labels Sep 23, 2022
@AlexKirko
Copy link
Member

Do we have an idea what kind of extra time cost the sorting imposes for something sensible, like 10 000 rows and a 100 categories?

@mroeschke
Copy link
Member Author

With the example in dask/dask#9486 (comment) (2592000 rows, 2481967 groups), the perf difference was ~2x slower for integer categories and ~10x slower for string categories with sort=False

@rhshadrach
Copy link
Member

I'm thinking we shouldn't reorder the categories at all - the categories are part of the dtype, not the result. We shouldn't be changing the dtype on the user. We're also not consistent - see examples in #49129.

@rhshadrach
Copy link
Member

rhshadrach commented Oct 16, 2022

but if grouping and sort=False the values can be achieved without reordering the categories, there would probably be a nice performance benefit.

Looking into this, it is complicated by the fact that the _libs groupby code returns the codes (called labels there) in sorted order. So if the input has categorical codes [2, 1, 2, 0], the result returned from e.g. _libs.groupby.group_sum is in the order of [0, 1, 2]. We then slap grouper.result_index on it, which has the order [2, 1, 0]. This gives incorrect results (namely groups 0 and 2 are switched).

I may be missing something, but I think since the groupby libs assumes codes are in the order of the desired output, we'll have to recode categorical codes in groupby.

That said, we can fix the result to not reorder the categories by reordering the categories of Grouping.group_index.

@mroeschke
Copy link
Member Author

So if the input has categorical codes [2, 1, 2, 0], the result returned from e.g. _libs.groupby.group_sum is in the order of [0, 1, 2].

Having not looked at the groupby code in a while and given your insight, I was wondering if internally the result could be returned in the order of [2, 1, 0] if sort=False. Generally I would hope that the values, grouping labels and resulting index could be processed to respect sort=True/False before going through the _libs groupby code?

@rhshadrach
Copy link
Member

Generally I would hope that the values, grouping labels and resulting index could be processed to respect sort=True/False before going through the _libs groupby code?

I believe that's what is going on with the recoding for categoricals. Instead of using factorize, groupby uses the categorical codes. But for this, one must recode [2, 1, 2, 0] as [0, 1, 0, 2] when sort=False to respect the order of occurrences. Now I think you have to reorder the categories to respect the new codes - e.g. code 0 refers to the first category.

@rhshadrach
Copy link
Member

@mroeschke - I am wondering if we should be sorting values when ordered=True with .groupby(..., sort=False). I'm thinking that categorical is the only dtype that still sorts values in this situation, which is a bit odd. I'd suggest having categorical respect sort=False here.

@mroeschke
Copy link
Member Author

I'd suggest having categorical respect sort=False here.

I would support this behavior. xref #42482

IIUC since ordered refers to the categories having an order, I think respecting sort=False for the values should be okay

@rhshadrach
Copy link
Member

rhshadrach commented Nov 18, 2023

This appears to be fixed on main

CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], categories=['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], ordered=False, dtype='category', name='range')
CategoricalIndex(['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], categories=['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], ordered=False, dtype='category', name='range')

Not sure if it needs tests.

@rhshadrach rhshadrach removed the Needs Discussion Requires discussion from core team before further action label Nov 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Groupby Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

3 participants