API/PERF: Don't reorder categoricals when grouping by an unordered categorical and `sort=False` #48749

mroeschke · 2022-09-23T20:45:55Z

TLDR: When calling df.groupby(key=categocial<order=False>, sort=True, observed=False) the resulting CategoricalIndex will have it's values and categories unordered.

In [1]:     df = DataFrame(
   ...:         [
   ...:             ["(7.5, 10]", 10, 10],
   ...:             ["(7.5, 10]", 8, 20],
   ...:             ["(2.5, 5]", 5, 30],
   ...:             ["(5, 7.5]", 6, 40],
   ...:             ["(2.5, 5]", 4, 50],
   ...:             ["(0, 2.5]", 1, 60],
   ...:             ["(5, 7.5]", 7, 70],
   ...:         ],
   ...:         columns=["range", "foo", "bar"],
   ...:     )

In [2]: col = "range"

In [3]: df["range"] = Categorical(df["range"], ordered=False)

In [4]: df.groupby(col, sort=True, observed=False).first().index
Out[4]: CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], categories=['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], ordered=False, dtype='category', name='range')

In [5]: df.groupby(col, sort=False, observed=False).first().index
Out[5]: CategoricalIndex(['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], categories=['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], ordered=False, dtype='category', name='range')

It's reasonable that the values are not sorted, but a lot of extra work can be spent un-ordering the categories in:

pandas/pandas/core/groupby/categorical.py

Lines 77 to 92 in 44a4f16

    
           # sort=False should order groups in as-encountered order (GH-8868) 
        
           cat = c.unique() 
        
           # See GH-38140 for block below 
        
           # exclude nan from indexer for categories 
        
           take_codes = cat.codes[cat.codes != -1] 
        
           if cat.ordered: 
        
               take_codes = np.sort(take_codes) 
        
           cat = cat.set_categories(cat.categories.take(take_codes)) 
        
           # But for groupby to work, all categories should be present, 
        
           # including those missing from the data (GH-13179), which .unique() 
        
           # above dropped 
        
           cat = cat.add_categories(c.categories[~c.categories.isin(cat.categories)]) 
        
           return c.reorder_categories(cat.categories), None

May have been an outcome of fixing #8868, but if grouping and sort=False the values can be achieved without reordering the categories, there would probably be a nice performance benefit.

The text was updated successfully, but these errors were encountered:

AlexKirko · 2022-09-25T11:44:48Z

Do we have an idea what kind of extra time cost the sorting imposes for something sensible, like 10 000 rows and a 100 categories?

mroeschke · 2022-09-26T17:18:46Z

With the example in dask/dask#9486 (comment) (2592000 rows, 2481967 groups), the perf difference was ~2x slower for integer categories and ~10x slower for string categories with sort=False

rhshadrach · 2022-10-16T14:09:15Z

I'm thinking we shouldn't reorder the categories at all - the categories are part of the dtype, not the result. We shouldn't be changing the dtype on the user. We're also not consistent - see examples in #49129.

rhshadrach · 2022-10-16T16:35:41Z

but if grouping and sort=False the values can be achieved without reordering the categories, there would probably be a nice performance benefit.

Looking into this, it is complicated by the fact that the _libs groupby code returns the codes (called labels there) in sorted order. So if the input has categorical codes [2, 1, 2, 0], the result returned from e.g. _libs.groupby.group_sum is in the order of [0, 1, 2]. We then slap grouper.result_index on it, which has the order [2, 1, 0]. This gives incorrect results (namely groups 0 and 2 are switched).

I may be missing something, but I think since the groupby libs assumes codes are in the order of the desired output, we'll have to recode categorical codes in groupby.

That said, we can fix the result to not reorder the categories by reordering the categories of Grouping.group_index.

mroeschke · 2022-10-17T17:28:35Z

So if the input has categorical codes [2, 1, 2, 0], the result returned from e.g. _libs.groupby.group_sum is in the order of [0, 1, 2].

Having not looked at the groupby code in a while and given your insight, I was wondering if internally the result could be returned in the order of [2, 1, 0] if sort=False. Generally I would hope that the values, grouping labels and resulting index could be processed to respect sort=True/False before going through the _libs groupby code?

rhshadrach · 2022-10-18T04:14:23Z

Generally I would hope that the values, grouping labels and resulting index could be processed to respect sort=True/False before going through the _libs groupby code?

I believe that's what is going on with the recoding for categoricals. Instead of using factorize, groupby uses the categorical codes. But for this, one must recode [2, 1, 2, 0] as [0, 1, 0, 2] when sort=False to respect the order of occurrences. Now I think you have to reorder the categories to respect the new codes - e.g. code 0 refers to the first category.

rhshadrach · 2022-10-28T17:19:16Z

@mroeschke - I am wondering if we should be sorting values when ordered=True with .groupby(..., sort=False). I'm thinking that categorical is the only dtype that still sorts values in this situation, which is a bit odd. I'd suggest having categorical respect sort=False here.

mroeschke · 2022-10-28T18:20:45Z

I'd suggest having categorical respect sort=False here.

I would support this behavior. xref #42482

IIUC since ordered refers to the categories having an order, I think respecting sort=False for the values should be okay

rhshadrach · 2023-11-18T00:55:34Z

This appears to be fixed on main

CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], categories=['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], ordered=False, dtype='category', name='range')
CategoricalIndex(['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], categories=['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], ordered=False, dtype='category', name='range')

Not sure if it needs tests.

ian-r-rose mentioned this issue Sep 23, 2022

Groupby result disagrees with Pandas for categorical keys dask/dask#9515

Open

mroeschke mentioned this issue Oct 6, 2022

PERF: df.groupby(categorical, sort=False) #48976

Merged

3 tasks

rhshadrach mentioned this issue Oct 16, 2022

BUG: groupby.nth reorders categories #49129

Closed

rhshadrach mentioned this issue Oct 16, 2022

BUG: groupby reorders categorical categories #49131

Merged

4 tasks

rhshadrach mentioned this issue Oct 28, 2022

BUG: groupby with CategoricalIndex doesn't include unobserved categories #49373

Merged

6 tasks

rhshadrach mentioned this issue Nov 12, 2023

REF: Compute complete result_index upfront in groupby #55738

Merged

11 tasks

rhshadrach removed the Needs Discussion Requires discussion from core team before further action label Nov 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API/PERF: Don't reorder categoricals when grouping by an unordered categorical and `sort=False` #48749

API/PERF: Don't reorder categoricals when grouping by an unordered categorical and `sort=False` #48749

mroeschke commented Sep 23, 2022

AlexKirko commented Sep 25, 2022

mroeschke commented Sep 26, 2022

rhshadrach commented Oct 16, 2022

rhshadrach commented Oct 16, 2022 •

edited

Loading

mroeschke commented Oct 17, 2022

rhshadrach commented Oct 18, 2022

rhshadrach commented Oct 28, 2022

mroeschke commented Oct 28, 2022

rhshadrach commented Nov 18, 2023 •

edited

Loading

API/PERF: Don't reorder categoricals when grouping by an unordered categorical and sort=False #48749

API/PERF: Don't reorder categoricals when grouping by an unordered categorical and sort=False #48749

Comments

mroeschke commented Sep 23, 2022

AlexKirko commented Sep 25, 2022

mroeschke commented Sep 26, 2022

rhshadrach commented Oct 16, 2022

rhshadrach commented Oct 16, 2022 • edited Loading

mroeschke commented Oct 17, 2022

rhshadrach commented Oct 18, 2022

rhshadrach commented Oct 28, 2022

mroeschke commented Oct 28, 2022

rhshadrach commented Nov 18, 2023 • edited Loading

API/PERF: Don't reorder categoricals when grouping by an unordered categorical and `sort=False` #48749

API/PERF: Don't reorder categoricals when grouping by an unordered categorical and `sort=False` #48749

rhshadrach commented Oct 16, 2022 •

edited

Loading

rhshadrach commented Nov 18, 2023 •

edited

Loading