BUG: Groupby drops NA even though dropna=False with categorical dtype #48492

adam-carruthers · 2022-09-10T06:42:27Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

print(pd.show_versions())

df_normal = pd.DataFrame({"x": ["a", "b", "b", np.nan]})

print(df_normal, end="\n\n")

print(df_normal.groupby("x", dropna=False).size(), end="\n\n")

df_categorical = pd.DataFrame({"x": ["a", "b", "b", np.nan]}).astype(
    {"x": pd.CategoricalDtype()}
)

print(df_categorical, end="\n\n")

print(df_categorical.groupby("x", dropna=False).size(), end="\n\n")

Issue Description

Although dropna=False is passed, pandas drops the NA group when the grouping column is categorical: the NaN row is missing from the resulting Series.

If the column is not categorical, groupby correctly keeps the NA group, as shown by df_normal.
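Since the non-categorical column keeps its NA group, one possible stopgap on affected versions is to group by an object-dtype copy of the key. This is only a sketch that follows from the df_normal observation, not a fix suggested by the pandas maintainers; the names key and sizes are made up for illustration.

```python
import numpy as np
import pandas as pd

df_categorical = pd.DataFrame({"x": ["a", "b", "b", np.nan]}).astype(
    {"x": "category"}
)

# Assumption: casting the key back to object makes groupby treat it like
# df_normal, so dropna=False keeps the NaN group.
key = df_categorical["x"].astype(object)
sizes = df_categorical.groupby(key, dropna=False).size()

print(sizes)
```

The cast gives up the categorical dtype on the result index, so it is probably only worth doing while the categorical path is broken.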

Expected Behavior

With dropna=False, groupby does not drop the NA group for a categorical column.
The normal and the categorical DataFrame behave the same.

Installed Versions

INSTALLED VERSIONS

commit : ca60aab
python : 3.9.13.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19043
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : English_United Kingdom.1252

pandas : 1.4.4
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
None


erg0dic · 2022-09-11T13:12:27Z

First, as a disclaimer, I'm not a regular here, but I tried looking into this. I think this is expected behavior, though it could be highlighted better, maybe with a warning or something.

From what I found, NaN cannot be a valid category. Also, when you specify a restricted categorical type like pd.CategoricalDtype(['a', 'b']), any value in the pd.Series that isn't one of the listed categories automatically becomes NaN and isn't considered by group operations. Have a look at this issue and at the test files for categorical types; there are example test cases that you might find convincing.
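A small sketch of the two points above, in case it helps: values outside the declared categories come out as NaN (stored as code -1), and NaN itself is rejected as a category. The variable names are just for illustration.

```python
import numpy as np
import pandas as pd

# "c" is not one of the declared categories, so it is stored as missing.
s = pd.Series(["a", "b", "c"], dtype=pd.CategoricalDtype(["a", "b"]))
print(s.cat.codes.tolist())  # missing values use the sentinel code -1
print(s.isna().tolist())

# NaN cannot be listed as a category; pandas raises ValueError here.
try:
    pd.CategoricalDtype(["a", np.nan])
    nan_category_rejected = False
except ValueError:
    nan_category_rejected = True
print(nan_category_rejected)
```

So missing values in a categorical live outside the categories, which is probably why the grouping code ended up treating them differently from an ordinary object column.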

AkshayJain1995 · 2022-09-12T08:35:19Z

Hi @goodyguts, I would like to pick this up. Can you assign it to me?

radekSF · 2022-09-22T10:20:56Z

Is this a duplicate of #36327 ?

adam-carruthers · 2022-09-23T16:28:50Z

Yes it is

CBMollerup · 2022-12-05T09:26:59Z

#36327 has been closed by #49652

rhshadrach · 2022-12-15T01:41:07Z

Thanks @CBMollerup; the output of the last line in the OP is now:

x
a      1
b      2
NaN    1
dtype: int64

which is correct. As identified, this was closed by #49652.
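For anyone who wants to check their own install, something like the following should tell whether the NaN group survives. It is a rough sketch rather than an official test; has_na_group is a name made up here, and observed=False is passed explicitly only to keep the behavior the same across versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": pd.Categorical(["a", "b", "b", np.nan])})

# With the fix, the NaN key appears in the result index; on affected
# versions it is missing and one row goes uncounted.
sizes = df.groupby("x", dropna=False, observed=False).size()
has_na_group = bool(sizes.index.isna().any())

print(pd.__version__, has_na_group, int(sizes.sum()))
```

If has_na_group is False, the total comes to 3 instead of 4, which is the behavior described in the OP.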

mdruiter · 2023-01-30T10:28:23Z

The bug is back in 1.5.1. I didn't find an open issue to fix this in a future release. What's the plan?

radekSF · 2023-01-30T10:52:21Z

@mdruiter Perhaps @rhshadrach can confirm, but #49652 is in milestone 2.0, so the bug will not be fixed in 1.5. The fix has been implemented, though, and will be released at some point in the future.

rhshadrach · 2023-01-31T04:56:38Z

Yes - that's correct. I believe we're aiming to have a release candidate in early February and then release 2.0 several weeks later.

adam-carruthers added the Bug and Needs Triage labels Sep 10, 2022

rhshadrach closed this as completed Dec 15, 2022

rhshadrach added the Groupby, Missing-data, Categorical, and Duplicate Report labels and removed the Needs Triage label Dec 15, 2022
