BUG: Nullable integer type ("Int64") lost after summing along columns-index [df.sum(axis=1)] #50438

Closed
2 of 3 tasks
brobr opened this issue Dec 26, 2022 · 4 comments
Labels
Bug; Duplicate Report (duplicate issue or pull request); NA - MaskedArrays (related to pd.NA and nullable extension arrays); Reduction Operations (sum, mean, min, max, etc.)

Comments

brobr commented Dec 26, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

di = pd.DataFrame({'A':[1, 2, 4],'B':[3, 4, -5]}, dtype='Int64')
di.sum().dtypes        # dtype('int64')

di.sum(axis=1).dtypes  # dtype('float64') ??

di.T.sum().T.dtypes    # dtype('int64') 

di.T.sum().T - di.sum(axis=1)
# 0    0.0
# 1    0.0
# 2    0.0
# dtype: float64

df = pd.DataFrame({'A':[1, 2, pd.NA, 4],'B':[3, pd.NA, 4, -5]}, dtype="Int64")
assert ((df.sum().dtypes == 'int64') and
        (df.sum(axis=1).dtypes == 'float64') and
        (df.T.sum().T.dtypes == 'int64'))

dg = pd.DataFrame({'A':[1, 2, None, 4],'B':[3, None, 4, -5]}, dtype='int') # FutureWarning
assert( (dg.sum().dtypes == 'O') and (dg.sum(axis=1).dtypes == 'float64'))

dh = pd.DataFrame({'A':[1, 2, 4],'B':[3, 4, -5]}, dtype='int')
assert(dh.sum().dtypes == dh.sum(axis=1).dtypes == 'int64')

Issue Description

With plain integers (dh.sum(), dh.sum(axis=1)) the sums are integers as well. As soon as a value is missing, things go odd: dg.sum() gives an object dtype and dg.sum(axis=1) gives a float.

The proposed solution for this, the nullable integer type ('Int64'), only partly works here.
Summing along the index axis (0, the default) with di.sum() keeps an integer result, as would be expected.
But the integer type is not kept when summing over rows, along the columns axis.
See the code example: with di.sum(axis=1) the resulting sums have dtype 'float64', not 'Int64'.

Building on the expected behaviour for axis=0, one can keep an integer dtype when summing rows by means of double transposition (di.T.sum().T).

A pd.NA missing value in a DataFrame of dtype 'Int64' also yields a float after summing rows (df.sum(axis=1)).
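A possible workaround, besides the double transposition, is to cast the row sums back to 'Int64'. This is only a sketch: it assumes the sums are integer-valued and small enough to pass through float64 without loss of precision, and the names row_sums and row_sums_na are illustrative.

```python
import pandas as pd

di = pd.DataFrame({'A': [1, 2, 4], 'B': [3, 4, -5]}, dtype='Int64')
df = pd.DataFrame({'A': [1, 2, pd.NA, 4], 'B': [3, pd.NA, 4, -5]}, dtype='Int64')

# Cast the row-wise sums back to the nullable integer dtype.
# If the sums already come back as 'Int64', the cast changes nothing.
row_sums = di.sum(axis=1).astype('Int64')

# sum() skips pd.NA by default, so every row still gets an integer total.
row_sums_na = df.sum(axis=1).astype('Int64')

print(row_sums.dtype, row_sums.tolist())
print(row_sums_na.dtype, row_sums_na.tolist())
```

The cast does not help for values beyond the range that float64 represents exactly, so it hides the problem rather than solving it.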

Expected Behavior

Nullability of 'Int64' should mean that integers do not become floats due to other datatypes or after an ordinary operation on the DataFrame that would not affect plain integers, such as summing.

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.9.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.17
Version : #1 SMP PREEMPT_DYNAMIC Mon Oct 24 13:00:29 CDT 2022
machine : x86_64
processor : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.1.1
pip : 22.2.2
Cython : 0.29.28
pytest : 7.2.0
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.7.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.1
snappy : None
sqlalchemy : 1.4.45
tables : None
tabulate : None
xarray : 2022.12.0
xlrd : 1.1.0
xlwt : None
zstandard : None
tzdata : None

brobr added the Bug and Needs Triage labels Dec 26, 2022
Member

phofl commented Dec 26, 2022

Hi, thanks for your report. We have a bunch of open issues discussing this for reduction operations. Please search the issue tracker.

phofl closed this as completed Dec 26, 2022
phofl added the Duplicate Report, NA - MaskedArrays and Reduction Operations labels and removed the Needs Triage label Dec 26, 2022
Author

brobr commented Dec 26, 2022

Sorry for the bother, but whatever you meant by "discussing this for reduction operations", I did not find this seemingly quite weird error mentioned among the open issues for 'Int64'.

Could you maybe explain what exactly this is a duplicate of? There is talk of 'reduction operations' in #49603 (which does not mention that summing integer values over one axis changes the type; it starts with objects), while #42895, referenced there, concerned the bug that a mean of 'Int64' values did not give a float (which was to be expected). When summing 'Int64' values you would expect the type to be kept, or at least the output to be consistent.

Possibly all this stuff is programmatically related, but I am not too familiar with the inner workings of pandas. In view of the experimental state of "Int64" I hoped this user feedback would be helpful.

Please, don't get me wrong, I appreciate pandas enormously (it made an idea possible I had carried around for years before getting any clue how to do it until I learnt a bit of pandas). Keep up the good work.

Member

phofl commented Dec 26, 2022

The underlying reason is 1D EAs (extension arrays), as in #42895; the behavior you observed is off for all reduction operations.
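To illustrate what "1D" may refer to here (a reading of the comment above, not a description of the pandas reduction code itself): each 'Int64' column is backed by its own one-dimensional masked array. A sum along axis=0 can be computed on each of those arrays separately, whereas a sum along axis=1 has to combine values across them. The names col, per_column and row_wise are illustrative.

```python
import pandas as pd

di = pd.DataFrame({'A': [1, 2, 4], 'B': [3, 4, -5]}, dtype='Int64')

# Each column is stored as a separate one-dimensional extension array.
col = di['A'].array
print(type(col).__name__, col.ndim)

# axis=0: every column can be reduced on its own 1D array.
per_column = {name: int(di[name].array.sum()) for name in di.columns}
print(per_column)

# axis=1 needs values from several arrays at once. Element-wise addition
# of the columns stays within 'Int64', although unlike sum() it would
# propagate pd.NA instead of skipping it.
row_wise = di['A'] + di['B']
print(row_wise.dtype, row_wise.tolist())
```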

Author

brobr commented Dec 26, 2022

Thanks, that 1D 'eas' explains it all; sorry I missed that.
