
groupby with categorical type returns all combinations #17594

Closed
bear24rw opened this issue Sep 19, 2017 · 17 comments · Fixed by #20583
Labels
Categorical, Groupby

Comments

@bear24rw
Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'a': ['x','x','y'], 'b': [0,1,0], 'c': [7,8,9]})
print(df.groupby(['a','b']).mean().reset_index())
df['a'] = df['a'].astype('category')
print(df.groupby(['a','b']).mean().reset_index())

Returns two different results:

   a  b  c
0  x  0  7
1  x  1  8
2  y  0  9

   a  b    c
0  x  0  7.0
1  x  1  8.0
2  y  0  9.0
3  y  1  NaN

Problem description

Performing a groupby with a categorical type returns all combinations of the groupby columns. This is a problem in my actual application, as it results in a massive dataframe that is mostly filled with NaNs. I would also prefer not to move off of the category dtype, since it provides necessary memory savings.

Expected Output

   a  b  c
0  x  0  7
1  x  1  8
2  y  0  9

   a  b  c
0  x  0  7
1  x  1  8
2  y  0  9

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 33.1.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Sep 20, 2017

This is, by definition, what groupby on a categorical does. Categorical is serving two masters: acting as a 'real' categorical and providing some memory savings.

@jreback jreback closed this as completed Sep 20, 2017
@jreback jreback added the Categorical and Groupby labels Sep 20, 2017
@jreback jreback added this to the No action milestone Sep 20, 2017
@bear24rw
Author

What is an example use case for a groupby aggregation returning all combinations, with rows full of NaNs? It seems unintuitive that the dtype of a column would change the groupby behavior. It took me a while to figure out why I was getting NaN rows.

@jreback
Contributor

jreback commented Sep 23, 2017

It seems non intuitive that the dtype of a column would change the groupby behavior.

It's not; groupby gives you the observed values, and here the observed values are, by definition, the categories. If you don't want this, then just change your grouper.

@alejcas

alejcas commented Feb 28, 2018

But... when working with big datasets we use categoricals to use less memory (without categoricals, the data load just crashes with a memory error).
I am using them only for the memory savings.
But now I have discovered that I cannot groupby as expected.

I can't use the object dtype because it will blow up my PC's memory.

What can I do? Any ideas?

@alejcas

alejcas commented Mar 8, 2018

@jreback how can I change this behaviour by changing the Grouper? I don't see how in the docs:
pandas.Grouper

@jreback
Contributor

jreback commented Mar 9, 2018

In [11]: df.groupby([df.a.astype(object), 'b']).mean()
Out[11]: 
     c
a b   
x 0  7
  1  8
y 0  9

@alejcas

alejcas commented Mar 12, 2018

@jreback this defeats the purpose of using categoricals in my use case... Again, I can't convert to object because I hit a memory error.

Using categoricals was the perfect answer to avoid memory errors... if only groupby worked as expected.
Grouping with 5 categorical columns results in an exploding dataframe, as every possible combination is produced even when there is no data at all for 98% of the combinations!

I understand that how categoricals work in groupby operations is correct, but there must be a solution to our problem.
That is: using categoricals ONLY to avoid memory errors, while having the same behaviour as object-like columns.

I appreciate your answers! Thanks!

@jreback
Contributor

jreback commented Mar 12, 2018

@janscas this was never the purpose of categoricals. Sure, you can use them to save memory, and folks do.

I don't know of an elegant solution here; you can do something like the following.

I wouldn't object to having a function do this (maybe via a keyword on groupby)

In [13]: result = df.groupby([df.a.cat.codes, 'b']).mean()

In [14]: result.index = result.index.set_levels(df.a.cat.categories, level=0)

In [15]: result
Out[15]: 
     c
  b   
x 0  7
  1  8
y 0  9
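For reference, a hedged sketch of the workaround above as a complete, runnable script. The final `set_names` call is an addition not shown in the original snippet: grouping by the codes drops the level name, and this restores it (assuming you want the original column name 'a' back).

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [0, 1, 0], 'c': [7, 8, 9]})
df['a'] = df['a'].astype('category')

# Group on the integer codes rather than the categorical itself, so the
# groupby only sees codes that actually occur in the data, then map the
# codes back to their category labels on the result index.
result = df.groupby([df.a.cat.codes, 'b'])[['c']].mean()
result.index = result.index.set_levels(df.a.cat.categories, level=0)

# The codes-based level comes back unnamed; restore the column name.
result.index = result.index.set_names('a', level=0)
print(result)
```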

@bear24rw
Author

@janscas this was never the purpose of categoricals. Sure, you can use them to save memory, and folks do.

The first bullet of the categorical documentation advertises its use for memory savings:

The categorical data type is useful in the following cases:

A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.

So if it's not primarily for memory savings, what is its primary use case?

Additionally, nowhere can I find an example of why you would want categoricals to expand after a groupby. Is it a side effect of some other desirable behavior?

@alejcas

alejcas commented Mar 13, 2018

I wouldn't object to having a function do this (maybe via a keyword on groupby)

@jreback I would greatly appreciate this. I think many more people out there need this.

Thanks for the solution provided; I'm going to test it.

@batterseapower
Contributor

batterseapower commented Mar 19, 2018

I keep getting bitten by this special case. It's just really surprising that groupby works differently for categoricals specifically. Just like @janscas, I'm using categoricals for memory savings as advised by the docs, but I periodically try to groupby a categorical column and blow up my memory, because pandas wants to generate a result filled with tons of NaNs.

It's a bit weird that pandas tries to "guess" what the observed values are from the categories rather than just using the data it was given. For example, say I was studying diet; I might do something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_items([
    ('food',   pd.Categorical(['BURGER', 'CHIPS', 'SALAD', 'RICE'] * 2)),
    ('day',    ([0] * 4) + ([1] * 4)),
    ('weight', np.arange(8)),
])

df2 = df[df['food'].isin({'BURGER', 'CHIPS'})]
print(df2.groupby(['day', 'food']).mean())

When I work with df2 I'm studying junk food specifically, but Pandas insists on telling me about the healthy foods I explicitly filtered out.

What's even weirder is that the behaviour changes depending on how many columns your group-by result has! With pandas 0.22.0, the following alternative groupby operation does not include the extra NaN rows for SALAD and RICE:

print(df2.groupby(['day', 'food'])['weight'].mean())

This seems like a bug?
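One workaround for this filtered-frame case, sketched below under the assumption that the filtered subset really is all you care about: drop the unused categories before grouping with Series.cat.remove_unused_categories, so groupby has nothing extra to expand. (Note `pd.DataFrame.from_items` from the snippet above was later removed; a plain dict constructor is used here instead.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'food':   pd.Categorical(['BURGER', 'CHIPS', 'SALAD', 'RICE'] * 2),
    'day':    [0] * 4 + [1] * 4,
    'weight': np.arange(8),
})

# The filtered frame's categorical still remembers SALAD and RICE as
# possible categories; dropping the unused ones makes groupby report
# only the combinations actually present in the data.
df2 = df[df['food'].isin({'BURGER', 'CHIPS'})].copy()
df2['food'] = df2['food'].cat.remove_unused_categories()
result = df2.groupby(['day', 'food']).mean()
print(result)
```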

@mattharrison

mattharrison commented Oct 23, 2019

Not meaning to pile on, but this behavior was completely unexpected.

Where is the defined behavior of category documented? I.e.:

This is, by definition, what groupby on a categorical does. Categorical is serving two masters: acting as a 'real' categorical and providing some memory savings.

Perhaps an option to enable/disable Cartesian products could be added to the .groupby method

@jreback
Contributor

jreback commented Oct 23, 2019

this is exactly what observed does
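A minimal sketch of the observed keyword on the original example from this issue (observed was added to groupby in pandas 0.23, after this thread began):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [0, 1, 0], 'c': [7, 8, 9]})
df['a'] = df['a'].astype('category')

# observed=True restricts the result to group combinations that occur
# in the data, instead of the full Cartesian product of categories,
# so no all-NaN rows for ('y', 1).
result = df.groupby(['a', 'b'], observed=True).mean().reset_index()
print(result)
```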

@mattharrison

mattharrison commented Oct 23, 2019

Oh, the observed parameter in .groupby! Awesome, thanks!

I would prefer it to default to True and also to work with any grouper.

@JacobHayes

Would a PR changing the default to observed=True be considered for/before pandas 1.0.0? That seems to match expectations (and most intentions) more closely. I'd be happy to submit it.

@jreback
Contributor

jreback commented Jan 17, 2020

1.0.0rc1 has been out for a few weeks, so no, it won't be considered; it's possible to add a deprecation so that the default would change in a future release, for 1.1.

@JacobHayes

Thanks @jreback, that makes sense. I'll take a look into a deprecation.
