groupby with categorical type returns all combinations #17594
Comments
this is by definition what groupby of a categorical is/does. Categorical is serving 2 masters: acting as a 'real' categorical and providing some memory savings.
What is an example use case for a groupby aggregation returning all combinations with rows full of NaNs? It seems non-intuitive that the dtype of a column would change the groupby behavior. It took me a while to figure out why I was getting NaN rows.
it's not, groupby gives you observed values; here the observed values are by definition the categories. if you don't want this, then just change your grouper.
But... When working with big datasets we use categoricals to use less memory (without categoricals the data load just crashes with a memory error)... Can't use object type because it will blast my PC's memory. What can I do? Any ideas?
@jreback how can I change this behaviour by changing the Grouper? I don't see how in the docs.
@jreback this defeats the purpose of using categoricals in my use case... Again, I can't convert to object because I hit a memory error. Using categoricals was the perfect answer to avoid memory errors... if only groupby worked as expected. I understand that how categoricals work in groupby operations is correct, but there must be a solution to our problem. I appreciate your answers! Thanks!
@janscas this was never the purpose of categoricals. sure you can use them to save memory and folks do. I don't know of an elegant solution here; you can do something along the lines of the sketch below. I wouldn't object to having a function do this (maybe via a keyword on groupby)
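A minimal sketch of one such workaround (assuming a single categorical grouping column whose unused categories can simply be dropped before grouping; the data and names here are illustrative):

```python
import pandas as pd

# Illustrative frame: 'food' is categorical and rows for some categories were
# filtered out, so SALAD and RICE remain as unobserved categories.
df = pd.DataFrame({
    'food': pd.Categorical(['BURGER', 'CHIPS', 'BURGER', 'CHIPS'],
                           categories=['BURGER', 'CHIPS', 'SALAD', 'RICE']),
    'weight': [1.0, 2.0, 3.0, 4.0],
})

# Group on a grouper with the unused categories removed; the result then only
# contains the observed groups, with no NaN rows for SALAD/RICE.
print(df.groupby(df['food'].cat.remove_unused_categories())['weight'].mean())
```

This keeps the size of the result proportional to the groups actually present in the data rather than to the full category set.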
The first bullet of the categorical documentation advertises its use for memory saving.
So if it's not primarily for memory savings, what is its primary use case? Additionally, nowhere can I find an example of why you would want categoricals to expand after a groupby. Is it a side effect of some other desirable behavior?
@jreback I would greatly appreciate this. I think many more people out there need this. Thanks for the solution provided, I'm going to test it.
I keep getting bitten by this special case. It's just really surprising that groupby works differently for categoricals specifically. Just like @janscas I'm using categoricals for memory savings as advised by the docs, but I periodically try to groupby a categorical column and blow up my memory because pandas wants to generate a result filled with tons of NaNs. It's a bit weird that pandas tries to "guess" what the observed values are from the categories rather than just using the data it was given. For example, let's say I was studying diet, I might do something like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame.from_items([
    ('food', pd.Categorical(['BURGER', 'CHIPS', 'SALAD', 'RICE'] * 2)),
    ('day', ([0] * 4) + ([1] * 4)),
    ('weight', np.arange(8)),
])
df2 = df[df['food'].isin({'BURGER', 'CHIPS'})]
print(df2.groupby(['day', 'food']).mean())
```

When I work with df2 I'm studying junk food specifically, but pandas insists on telling me about the healthy foods I explicitly filtered out. What's even weirder is that the behaviour changes depending on how many columns your group-by result has! With pandas 0.22.0 the following alternative groupby operation does not include the extra NaN rows for SALAD and RICE:

```python
print(df2.groupby(['day', 'food'])['weight'].mean())
```

This seems like a bug?
Not meaning to pile on, but this behavior was completely unexpected. Where is this behavior of the category dtype actually defined/documented?
Perhaps an option to enable/disable Cartesian products could be added to the groupby call?
this is exactly what `observed` does
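A minimal sketch of the `observed` keyword (added to groupby in pandas 0.23; the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'food': pd.Categorical(['BURGER', 'CHIPS'],
                           categories=['BURGER', 'CHIPS', 'SALAD', 'RICE']),
    'weight': [1.0, 2.0],
})

# observed=False (the long-standing default) returns one row per category,
# including NaN rows for SALAD and RICE; observed=True keeps only the groups
# that actually appear in the data.
print(df.groupby('food', observed=True)['weight'].mean())
```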
Oh, the `observed` keyword. I would prefer it to default to `True`.
Would a PR changing the default of `observed` to `True` be accepted?
1.0.0rc1 has been out for a few weeks, so the default can't simply be switched now; it would need to go through a deprecation cycle first.
Thanks @jreback, that makes sense. I'll take a look into a deprecation.
Code Sample, a copy-pastable example if possible
Returns two different results:
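A minimal sketch of the kind of comparison involved (illustrative column names and data; any small frame with a categorical grouping column shows the effect):

```python
import pandas as pd

df = pd.DataFrame({
    'a': pd.Categorical(['x', 'x', 'y'], categories=['x', 'y', 'z']),
    'b': ['p', 'q', 'p'],
    'value': [1.0, 2.0, 3.0],
})

# Under the default observed=False, grouping on the categorical column produces
# a row for every combination of categories of 'a' with values of 'b', with
# NaNs for combinations not present in the data.
print(df.groupby(['a', 'b']).mean())

# The same groupby after casting 'a' to object only contains the observed groups.
print(df.astype({'a': object}).groupby(['a', 'b']).mean())
```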
Problem description
Performing a groupby with a categorical type returns all combinations of the groupby columns. This is a problem in my actual application, as it results in a massive dataframe that is mostly filled with NaNs. I would also prefer not to move off of the category dtype, since it provides necessary memory savings.
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 33.1.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None