Pandas groupby sum misbehaves when one of the columns has string objects. #24196

kaushikb11 · 2018-12-10T07:52:06Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'a b c'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
df['B'] = np.nan  # make column B all-NaN
df.groupby(lambda x: x, axis=1).sum(min_count=1)

Problem description

With min_count=1, column 'B' should come back as NaN, as described in the DataFrame.sum documentation, but the code above returns 0 instead.
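For reference, here is a minimal sketch of the `min_count` behavior on a plain Series with no groupby involved. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan, np.nan])

# min_count defaults to 0, so an all-NaN sum is 0.0
default_sum = all_nan.sum()

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN
guarded_sum = all_nan.sum(min_count=1)

print(default_sum)  # 0.0
print(guarded_sum)  # nan
```

The report expects the grouped sum to follow the second form for column 'B'.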

Actual Output

	A	B	C
0	a	0.0	4
1	b	0.0	6
2	c	0.0	5

Expected Output

	A	B	 C
0	a	NaN	 4
1	b	NaN	 6
2	c	NaN	 5

When I remove the string column, column 'B' does return NaN. Columns in a DataFrame should be independent of each other, so this looks like a bug.

del df['A']
df.groupby(lambda x:x, axis=1).sum(min_count=1)

Output

	B	C
0	NaN	4.0
1	NaN	6.0
2	NaN	5.0
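For comparison, a small sketch of the same expectation when grouping rows by a key column instead of using `axis=1`. The `key` column and the data are made up for illustration and are not part of the original report.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "key": ["x", "x", "y"],      # hypothetical grouping column
    "B": [np.nan, np.nan, 1.0],  # group "x" holds only NaN in B
    "C": [4, 6, 5],
})

# Only numeric columns are selected, so no string data is involved.
summed = frame.groupby("key")[["B", "C"]].sum(min_count=1)
print(summed)
```

Group "x" has no valid values in B, so with `min_count=1` its sum is NaN, while C is summed as usual.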

Output of `pd.show_versions()`

pandas: 0.23.4
pytest: 4.0.1
pip: 8.1.1
setuptools: 40.6.2
Cython: 0.29.1
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: 1.6.2
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 0.999
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.2.0
pandas_gbq: None
pandas_datareader: None


WillAyd · 2018-12-11T06:58:45Z

Hmm OK that is strange. Definitely some unwanted casting going on with the presence of column A.

Investigation and PRs are always welcome if you care to take a look.

Koustav-Samaddar · 2018-12-28T14:46:29Z

This is the most stripped down case where I am still able to reproduce the bug.

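import numpy as np
import pandas as pd

# bug_var = 1  # numeric value: does not trigger the bug
bug_var = 'a'  # string value: triggers the bug

df = pd.DataFrame({
    'A': [bug_var, np.nan]
})

# Group each row by its own index label, then sum with min_count=1
gresult = df.groupby(lambda x: x)
result = gresult.sum(min_count=1)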

This yields,

result
   A
0  a
1  0

but,

expected
     A
0    a
1  NaN

Because the column is non-numeric, self._cython_agg_general at groupby.py:1256 raises an error (DataError if numeric_only=True, otherwise AttributeError). The code then falls back to self.aggregate (groupby.py:1262), which is where the bug stems from.
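To make the fallback pattern concrete, here is a toy model in plain Python. It is not pandas code: every function name is hypothetical, and the assumption that the fallback drops `min_count` is my reading of the symptom, not something confirmed in the pandas source.

```python
import math


def _is_nan(value):
    # Only floats can be NaN; strings and ints always count as valid.
    return isinstance(value, float) and math.isnan(value)


def fast_path_sum(values, min_count=0):
    """Hypothetical numeric-only path that honours min_count."""
    valid = [v for v in values if not _is_nan(v)]
    return sum(valid) if len(valid) >= min_count else math.nan


def fallback_sum(values):
    """Hypothetical generic path; note it takes no min_count argument."""
    valid = [v for v in values if not _is_nan(v)]
    if valid and all(isinstance(v, str) for v in valid):
        return "".join(valid)
    return sum(valid)  # an empty group sums to 0


def grouped_sum(column, keys, min_count=0):
    groups = {}
    for key, value in zip(keys, column):
        groups.setdefault(key, []).append(value)
    try:
        if any(isinstance(v, str) for v in column):
            raise TypeError("fast path only supports numeric data")
        return {k: fast_path_sum(v, min_count) for k, v in groups.items()}
    except TypeError:
        # Assumed bug pattern: min_count is not forwarded to the fallback.
        return {k: fallback_sum(v) for k, v in groups.items()}


numeric = grouped_sum([1.0, math.nan], keys=[0, 1], min_count=1)
mixed = grouped_sum(["a", math.nan], keys=[0, 1], min_count=1)
print(numeric)  # {0: 1.0, 1: nan}
print(mixed)    # {0: 'a', 1: 0}
```

In this model, a single string anywhere in the column sends every group down the fallback, which is why the all-NaN group comes back as 0 instead of NaN.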

I'll keep looking at this tomorrow; in the meantime, I'm posting this here in case anyone finds it useful.

ikramersh · 2021-11-05T18:16:13Z

This issue no longer occurs when tested with pandas version 1.3.4.

gmaiwald · 2023-04-18T13:30:52Z

take

WillAyd added the Bug and Groupby labels Dec 11, 2018

WillAyd added this to the Contributions Welcome milestone Dec 11, 2018

jbrockmendel added the Reduction Operations (sum, mean, min, max, etc.) and Nuisance Columns (identifying/dropping nuisance columns in reductions, groupby.add, DataFrame.apply) labels Sep 21, 2020

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel added the Needs Tests (unit test(s) needed to prevent regressions) label Mar 29, 2023

phofl mentioned this issue Apr 9, 2023

Add regression tests noatamir/pyladies-workshop#7

Closed

26 tasks

github-actions bot assigned gmaiwald Apr 18, 2023

gmaiwald added a commit to gmaiwald/pandas that referenced this issue Apr 18, 2023

pandas-devGH-24196: regression test on sum nan implemented

7c9b0f1

gmaiwald added a commit to gmaiwald/pandas that referenced this issue Apr 18, 2023

pandas-devGH-24196: typo fixed

132f0e8

gmaiwald added a commit to gmaiwald/pandas that referenced this issue Apr 18, 2023

pandas-devGH-24196: linting rules applied

3e10c6a

gmaiwald mentioned this issue Apr 18, 2023

TST: Test groupby for columns with string objects #52757

Merged

5 tasks

phofl closed this as completed in #52757 Apr 22, 2023
