Pandas groupby sum misbehaves when one of the columns has string objects. #24196

Closed
Tracked by #7
kaushikb11 opened this issue Dec 10, 2018 · 4 comments · Fixed by #52757
Labels
Bug Groupby Needs Tests Unit test(s) needed to prevent regressions Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply Reduction Operations sum, mean, min, max, etc.

Comments

@kaushikb11

kaushikb11 commented Dec 10, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'a b c'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
df['B'] = np.nan
df.groupby(lambda x: x, axis=1).sum(min_count=1)

Problem description

With min_count=1, the code above should return NaN for column 'B', as described in the DataFrame.sum documentation, but it returns 0 instead.
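
For reference, a minimal sketch of the documented min_count behaviour on a plain Series, independent of groupby. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan, np.nan, np.nan])

# Default min_count=0: NaN values are skipped, so an all-NaN sum
# is treated like an empty sum and yields 0.0.
default_sum = all_nan.sum()

# min_count=1: at least one non-NaN value is required,
# otherwise the result is NaN.
strict_sum = all_nan.sum(min_count=1)

print(default_sum, strict_sum)
```

This is the behaviour column 'B' is expected to show in the grouped result as well.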

Actual Output

	A	B	C
0	a	0.0	4
1	b	0.0	6
2	c	0.0	5

Expected Output

	A	B	C
0	a	NaN	4
1	b	NaN	6
2	c	NaN	5

So I removed the column containing strings, and the result is then NaN for column 'B'. However, the columns of a DataFrame should be independent of each other, so this looks like a bug.

del df['A']
df.groupby(lambda x:x, axis=1).sum(min_count=1)

Output

	B	C
0	NaN	4.0
1	NaN	6.0
2	NaN	5.0
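
The same workaround can be sketched without deleting the column in place, assuming that dropping non-numeric columns is acceptable for the use case. Grouping over a transposed frame is used here instead of axis=1, on the assumption that a recent pandas release is installed where axis=1 in groupby is deprecated; this is an illustration, not the fix that was applied in pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": "a b c".split(), "B": [1, 2, 3], "C": [4, 6, 5]})
df["B"] = np.nan

# Keep only numeric columns so the string column cannot influence the result.
numeric = df.select_dtypes(include="number")

# Group the columns by their label: transpose, group the rows, transpose back.
result = numeric.T.groupby(lambda label: label).sum(min_count=1).T

print(result)
```

The original df is left unchanged, and column 'B' stays NaN in the result.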

Output of pd.show_versions()

pandas: 0.23.4
pytest: 4.0.1
pip: 8.1.1
setuptools: 40.6.2
Cython: 0.29.1
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: 1.6.2
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 0.999
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.2.0
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Dec 11, 2018

Hmm OK that is strange. Definitely some unwanted casting going on with the presence of column A.

Investigation and PRs are always welcome if you care to take a look

@WillAyd WillAyd added this to the Contributions Welcome milestone Dec 11, 2018
@Koustav-Samaddar
Contributor

This is the most stripped down case where I am still able to reproduce the bug.

import numpy as np
import pandas as pd

# bug_var = 1
bug_var = 'a'

df = pd.DataFrame({
    'A': [bug_var, np.nan]
})

gresult = df.groupby(lambda x: x)
result = gresult.sum(min_count=1)

This yields,

result
   A
0  a
1  0

but,

expected
     A
0    a
1  NaN

Because the column is non-numeric, self._cython_agg_general (groupby.py:1256) raises an error (DataError if numeric_only=True, otherwise AttributeError). The code then falls back to self.aggregate (groupby.py:1262), which is where the bug stems from.

I'll continue to take a look at this tomorrow, but in the meantime I'm putting this here if anyone finds this useful.
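
The stripped-down case above could be turned into a regression check along these lines. This is only a sketch: the helper name grouped_sum is hypothetical, the string column is built with an explicit object dtype as an assumption to keep the dtype stable across pandas versions, and the checks are expected to pass only on a pandas version where the issue is fixed.

```python
import numpy as np
import pandas as pd


def grouped_sum(values):
    """Group each row by its own index label and sum with min_count=1."""
    df = pd.DataFrame({"A": values})
    return df.groupby(lambda x: x).sum(min_count=1)


# Numeric case: the group that holds only NaN should stay NaN.
numeric_result = grouped_sum([1, np.nan])

# String case from the report, with an explicit object dtype (assumption).
string_result = grouped_sum(pd.Series(["a", np.nan], dtype=object))

# On affected versions the second check reportedly fails,
# because the missing value comes back as 0.
print(pd.isna(numeric_result.loc[1, "A"]))
print(pd.isna(string_result.loc[1, "A"]))
```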

@jbrockmendel jbrockmendel added Reduction Operations sum, mean, min, max, etc. Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply labels Sep 21, 2020
@ikramersh

This issue no longer occurs when tested with pandas version 1.3.4

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel jbrockmendel added the Needs Tests Unit test(s) needed to prevent regressions label Mar 29, 2023
@gmaiwald
Contributor

take

gmaiwald added a commit to gmaiwald/pandas that referenced this issue Apr 18, 2023
7 participants