BUG: Groupby drops NA even though dropna=False with categorical dtype #48492

Closed
2 of 3 tasks
adam-carruthers opened this issue Sep 10, 2022 · 9 comments
Labels
Bug, Categorical, Duplicate Report, Groupby, Missing-data

Comments

adam-carruthers commented Sep 10, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

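import pandas as pd
import numpy as np

# show_versions() prints its report and returns None,
# so the outer print() also emits a trailing "None".
print(pd.show_versions())

# Object-dtype column containing one missing value.
df_normal = pd.DataFrame({"x": ["a", "b", "b", np.nan]})

print(df_normal, end="\n\n")

# With dropna=False the NaN group is kept for the object column.
print(df_normal.groupby("x", dropna=False).size(), end="\n\n")

# Same data, but with the column cast to a categorical dtype.
df_categorical = pd.DataFrame({"x": ["a", "b", "b", np.nan]}).astype(
    {"x": pd.CategoricalDtype()}
)

print(df_categorical, end="\n\n")

# Bug: the NaN group is missing here even though dropna=False.
print(df_categorical.groupby("x", dropna=False).size(), end="\n\n")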

Issue Description

Although dropna=False is passed, pandas drops the NA group for the categorical column: no NaN entry appears in the resulting Series.

If the data type is not categorical, the groupby correctly keeps the NA group, as shown by df_normal.

Expected Behavior

  • Pandas does not drop the NA group for the categorical column in the groupby when dropna=False
  • The normal and the categorical DataFrame behave the same
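
A minimal sketch of a check for the two expectations above, assuming only the data from the reproducible example. The helper name group_sizes is hypothetical, and observed=False is passed explicitly as an assumption so that the call behaves the same across pandas versions. Whether the categorical case keeps the NaN group depends on the installed pandas version, so no fixed output is shown.

```python
import numpy as np
import pandas as pd

values = ["a", "b", "b", np.nan]
df_normal = pd.DataFrame({"x": values})
df_categorical = df_normal.astype({"x": "category"})


def group_sizes(df):
    # Hypothetical helper: group on "x" and ask pandas to keep the NA group.
    return df.groupby("x", dropna=False, observed=False).size()


normal_sizes = group_sizes(df_normal)
categorical_sizes = group_sizes(df_categorical)

# True when a NaN key is present in the group index.
normal_keeps_na = bool(normal_sizes.index.isna().any())
categorical_keeps_na = bool(categorical_sizes.index.isna().any())

print("pandas", pd.__version__)
print("non-categorical keeps NaN group:", normal_keeps_na)
print("categorical keeps NaN group:", categorical_keeps_na)
```

If the second flag prints False, the installed version shows the behavior reported in this issue.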

Installed Versions

INSTALLED VERSIONS

commit : ca60aab
python : 3.9.13.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19043
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : English_United Kingdom.1252

pandas : 1.4.4
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
None

adam-carruthers added the Bug and Needs Triage labels on Sep 10, 2022

erg0dic commented Sep 11, 2022

First, as a disclaimer: I'm not a regular here, but I looked into this and I think it is expected behavior, although it could be highlighted better, perhaps with a warning.

From what I found, NaN cannot be a valid category. Also, when you specify a restricted categorical type such as pd.CategoricalDtype(['a', 'b']), any value in the pd.Series that is not one of the listed categories automatically becomes NaN and isn't considered by group operations. Have a look at this issue and at the test files for categorical types; they contain example test cases that you might find convincing.
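
A small sketch of the coercion described above, using made-up sample values: with a restricted CategoricalDtype, a value outside the listed categories is stored as missing (code -1), just like an explicit missing value, and NaN never appears among the categories.

```python
import pandas as pd

dtype = pd.CategoricalDtype(["a", "b"])

# "c" is not a listed category, so it is coerced to NaN on construction.
s = pd.Series(["a", "b", "c", None], dtype=dtype)

codes = s.cat.codes.tolist()         # -1 marks a missing value
n_missing = int(s.isna().sum())      # counts both "c" and None
categories = list(s.cat.categories)  # NaN is not one of the categories

print(codes)
print(n_missing)
print(categories)
```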

@AkshayJain1995

Hi @goodyguts, I would like to pick this up. Can you assign it to me?

radekSF commented Sep 22, 2022

Is this a duplicate of #36327?

@adam-carruthers
Author

Yes, it is.

@CBMollerup

#36327 has been closed by #49652

@rhshadrach
Member

Thanks @CBMollerup; the output of the last line in the OP is now:

x
a      1
b      2
NaN    1
dtype: int64

which is correct. As identified, this was closed by #49652.

rhshadrach added the Groupby, Missing-data, Categorical, and Duplicate Report labels and removed the Needs Triage label on Dec 15, 2022
@mdruiter
Contributor

The bug is back in 1.5.1. I didn't find an open issue to fix this in a future release. What's the plan?

radekSF commented Jan 30, 2023

@mdruiter Perhaps @rhshadrach can confirm, but #49652 is in the 2.0 milestone, so the bug will not be fixed in the 1.5 series. However, the fix has been implemented and will ship in a future release.

@rhshadrach
Member

Yes - that's correct. I believe we're aiming to have a release candidate in early February and then release 2.0 several weeks later.
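
For anyone on a version without the fix, one possible workaround (not proposed in this thread, so treat it as an assumption) is to turn the missing values into an explicit category before grouping. The sentinel label "<missing>" is a hypothetical placeholder and must not collide with a real category.

```python
import numpy as np
import pandas as pd

SENTINEL = "<missing>"  # hypothetical placeholder label, not a pandas constant

df = pd.DataFrame({"x": ["a", "b", "b", np.nan]}).astype({"x": "category"})

# Register the sentinel as a category, then fill the missing values with it.
filled = df["x"].cat.add_categories([SENTINEL]).fillna(SENTINEL)

# The missing rows now form an ordinary group, so dropna is not needed.
sizes = df.assign(x=filled).groupby("x", observed=True).size()
result = sizes.to_dict()

print(result)
```

The trade-off is that the group key becomes a string label rather than NaN, so downstream code has to map it back if it expects a missing value.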

7 participants