BUG: DataFrame[Int64].mean().dtype is object, should be Float64 #42895

Closed
jbrockmendel opened this issue Aug 4, 2021 · 10 comments
Labels
Bug; Dtype Conversions (Unexpected or buggy dtype conversions); NA - MaskedArrays (Related to pd.NA and nullable extension arrays); Reduction Operations (sum, mean, min, max, etc.)

Comments

@jbrockmendel
Member

This breaks the last corner case I want to test for #33036.

import numpy as np
import pandas as pd

arr = np.random.randn(4, 3).astype("int64")
df = pd.DataFrame(arr).astype("Int64")
df.iloc[:, 1] = pd.NA  # <-- incorrectly casts to object, let's cast back and ignore that for now
df = df.astype("Int64")

res = df.mean()

>>> res
0     0.0
1    <NA>
2   -0.25
dtype: object

Trivial to fix with 2D EAs, not sure how/if to fix it without.
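
A minimal sketch of the same check that avoids the `iloc` assignment entirely, so the casting problem noted in the comment above does not come into play. It builds the all-NA Int64 column directly with `pd.array`; the column values are made up for illustration. The resulting dtype depends on the pandas version: affected versions report `object`, while the issue title asks for `Float64`.

```python
import pandas as pd

# Build three Int64 columns directly; column 1 is entirely NA.
df = pd.DataFrame(
    {
        0: pd.array([1, 2, 3, 4], dtype="Int64"),
        1: pd.array([pd.NA] * 4, dtype="Int64"),
        2: pd.array([0, -1, 0, 0], dtype="Int64"),
    }
)

# Confirm every column really is Int64 before reducing.
column_dtypes = df.dtypes.astype(str).tolist()
print(column_dtypes)

res = df.mean()
# The dtype printed here is version dependent (object on affected versions).
print(res.dtype)
```

Checking `res.dtype` rather than the printed values is the point of the report: the numbers are right either way, only the dtype is wrong.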

@jbrockmendel added the Bug, Needs Triage, NA - MaskedArrays, and Reduction Operations labels and removed the Needs Triage label on Aug 4, 2021
@simonjayhawkins added this to the Contributions Welcome milestone on Aug 27, 2021
@simonjayhawkins added the Dtype Conversions label on Aug 27, 2021
@Demetrio92

df.iloc[:, 1] = pd.NA  # <-- incorrectly casts to object, let's cast back and ignore that for now

This is a whole separate issue.

Reported here: #44199

@Demetrio92

Also, I now know how to fix your example.

Fixed:

arr = np.random.randn(4, 3).astype("int64")
df = pd.DataFrame(arr).astype(pd.Int64Dtype())
df.iloc[:, 1] = pd.NA  # no more casting issues
# df = df.astype("Int64")  # not needed anymore

res = df.mean()
>>> res
0   -0.50
1     NaN
2    0.25
dtype: float64

@jbrockmendel
Member Author

Also, I now know how to fix your example.

Not quite. The point of the example is that we get unwanted behavior when we have an all-NA Int64 column.

@Demetrio92

@jbrockmendel pd.NA is experimental. You can use np.nan instead, and it again will fix your issue.

This is really not about .mean or an all-NA column. Anything in pandas that will touch pd.NA will be "broken" if using current default types.

@jreback
Contributor

jreback commented Oct 27, 2021

@Demetrio92 maybe it's not clear:
@jbrockmendel opened this issue to have a reference in order to work on it. How else do you think things get fixed?

@Demetrio92

@jreback
As I see it, @jbrockmendel is trying to test whether .mean works correctly on nullable integers. I can show that it does, and that the problem is in .astype("Int64"), which does not seem to do what is intended.

So, for testing .mean, you can use .astype(pd.Int64Dtype()), and open a separate issue on why .astype(pd.Int64Dtype()) does not do the same as .astype("Int64"). That deserves a far more thorough investigation.

Issues get fixed when devs see what the problem is. We could create 1000 tickets all stating the same ".dtype is object, should be Float64" for .mean, .var, .median. It would create a lot of noise and get nothing fixed.

@jbrockmendel
Member Author

you can use .astype(pd.Int64Dtype()), and open a separate issue on why .astype(pd.Int64Dtype()) does not do the same as astype("Int64").

They look the same to me. If you think there is something different, please open a separate issue for that to avoid derailing this one.
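
One way to check that claim is to compare the two dtype spellings directly. This is a small sketch, not something posted in the thread; `pd.api.types.pandas_dtype` is used here only to resolve the string alias, and the sample Series is made up.

```python
import pandas as pd

# Resolve the string alias and compare it with the dtype class instance.
from_string = pd.api.types.pandas_dtype("Int64")
from_class = pd.Int64Dtype()
same_dtype = from_string == from_class
print(same_dtype)

# Casting with either spelling should give the same resulting dtype.
s = pd.Series([1, 2, 3])
same_after_cast = s.astype("Int64").dtype == s.astype(pd.Int64Dtype()).dtype
print(same_after_cast)
```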

test whether .mean is correctly working on nullable integers. I can show that it does

Incorrect. In both your example and the OP, after setting df.iloc[:, 1] = pd.NA, we end up with df.dtypes[1] == object. Without re-casting (the need for which you correctly identify as a separate issue) the .mean call is no longer testing the dtype of interest.
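
A short sketch of how to verify which dtype `.mean` is actually being called on: inspect `df.dtypes` right after the assignment, then again after re-casting. Whether column 1 shows up as `object` or stays `Int64` after the assignment depends on the pandas version, so no particular output is assumed here.

```python
import numpy as np
import pandas as pd

arr = np.random.randn(4, 3).astype("int64")
df = pd.DataFrame(arr).astype("Int64")
df.iloc[:, 1] = pd.NA

# On the versions discussed in this thread, column 1 is reported as object here.
dtypes_after_setitem = df.dtypes.astype(str).tolist()
print(dtypes_after_setitem)

# Re-casting restores Int64 for every column, so the reduction below
# exercises the all-NA Int64 case the issue is about.
recast = df.astype("Int64")
dtypes_after_recast = recast.dtypes.astype(str).tolist()
print(dtypes_after_recast)
print(recast.mean().dtype)
```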

@Demetrio92

I was digging into the related issue and figured out that, despite the casting problem, .mean() actually works as the OP expects if the second cast is removed.

arr = np.random.randn(4, 3).astype("int64")
df = pd.DataFrame(arr).astype("Int64")
df.iloc[:, 1] = pd.NA  # <-- incorrectly casts to object, let's cast back and ignore that for now
# df = df.astype("Int64")

res = df.mean()

Not sure why, but the above code works as expected:

0    0.00
1     NaN
2   -0.25
dtype: float64

Tested using pandas 1.3.4

@jbrockmendel
Member Author

@Demetrio92 the underlying cause here is the lack of 2D support for IntegerArray. Further investigation is not a good use of your time. A place where investigative eyeballs would be very helpful is https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1265 "# TODO: in all tests we have mask.all(); can we rely on that?"
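
To illustrate why a 2D masked representation would make this straightforward, here is a rough NumPy-only sketch of a column-wise masked mean. It is not pandas internals: `masked_mean_2d` is a hypothetical helper, and the values/mask pair only mimics the idea of a masked array. The result keeps a single float64 buffer plus a mask, so an all-NA column does not force a fallback to object.

```python
import numpy as np

def masked_mean_2d(values, mask):
    """Column-wise mean of `values`, skipping entries where `mask` is True.

    Returns (means, out_mask); out_mask is True for columns with no valid entries.
    """
    valid = ~mask
    counts = valid.sum(axis=0)
    totals = np.where(valid, values, 0).sum(axis=0)
    out_mask = counts == 0
    # Use a dummy count of 1 for all-masked columns to avoid dividing by zero;
    # those positions are flagged in out_mask and their value is meaningless.
    safe_counts = np.where(out_mask, 1, counts)
    means = (totals / safe_counts).astype("float64")
    return means, out_mask

values = np.array([[1, 0, 0], [2, 0, -1], [3, 0, 0], [4, 0, 0]], dtype="int64")
mask = np.zeros(values.shape, dtype=bool)
mask[:, 1] = True  # column 1 is entirely "NA"

means, out_mask = masked_mean_2d(values, mask)
print(means.dtype, means, out_mask)
```

Because the whole frame is reduced in one pass, every column's result lands in the same float64 array, which is the property a column-by-column reduction has to reassemble afterwards.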

@mroeschke
Member

Closed by #52788
