BUG: DataFrame[Int64].mean().dtype is object, should be Float64 #42895

jbrockmendel · 2021-08-04T23:38:27Z

This breaks the last corner case I want to test for #33036.

import numpy as np
import pandas as pd

arr = np.random.randn(4, 3).astype("int64")
df = pd.DataFrame(arr).astype("Int64")
df.iloc[:, 1] = pd.NA  # <-- incorrectly casts to object, let's cast back and ignore that for now
df = df.astype("Int64")

res = df.mean()

>>> res
0     0.0
1    <NA>
2   -0.25
dtype: object

Trivial to fix with 2D EAs, not sure how/if to fix it without.
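To separate the reduction from the setitem cast noted in the code comment above, the all-NA column can be built directly with the nullable dtype. This is a minimal sketch, not the test from #33036: the column values are made up for illustration, and the dtype returned by df.mean() depends on the pandas version (object in this report, Float64 being the expected result), so it is printed rather than asserted.

```python
import pandas as pd

# Build every column as a 1D nullable-integer array up front, so no
# setitem (and no re-cast) is needed to get an all-NA "Int64" column.
df = pd.DataFrame(
    {
        0: pd.array([1, 2, 3, 4], dtype="Int64"),
        1: pd.array([pd.NA] * 4, dtype="Int64"),  # all-NA column
        2: pd.array([5, 6, 7, 8], dtype="Int64"),
    }
)

# Confirm every column really is Int64 before the reduction.
all_int64 = all(isinstance(dtype, pd.Int64Dtype) for dtype in df.dtypes)

res = df.mean()

# The dtype of interest: version-dependent, so it is only reported here.
print(res.dtype)
print(res)
```

Because the frame never goes through df.iloc[:, 1] = pd.NA, any unexpected result dtype here can be attributed to the reduction rather than to the assignment.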

Demetrio92 · 2021-10-27T00:25:15Z

> df.iloc[:, 1] = pd.NA  # <-- incorrectly casts to object, let's cast back and ignore that for now

This is a whole separate issue, reported in #44199.

Demetrio92 · 2021-10-27T00:30:22Z

Also, I now know how to fix your example.

Fixed:

arr = np.random.randn(4, 3).astype("int64")
df = pd.DataFrame(arr).astype(pd.Int64Dtype())
df.iloc[:, 1] = pd.NA  # no more casting issues
# df = df.astype("Int64")  # not needed anymore

res = df.mean()
>>> res
0   -0.50
1     NaN
2    0.25
dtype: float64

jbrockmendel · 2021-10-27T02:18:11Z

> Also, I now know how to fix your example.

Not quite. The point of the example is that we get unwanted behavior when we have an all-NA Int64 column.

Demetrio92 · 2021-10-27T11:16:22Z

@jbrockmendel pd.NA is experimental. You can use np.nan instead, and that will again fix your issue.

This is really not about .mean or an all-NA column. Anything in pandas that touches pd.NA will be "broken" when using the current default types.

jreback · 2021-10-27T11:31:31Z

@Demetrio92 maybe it's not clear
@jbrockmendel opened this issue to have a reference in order to work on it. How else do you think things get fixed?

Demetrio92 · 2021-10-27T12:32:55Z

@jreback
As I see it, @jbrockmendel is trying to test whether .mean is correctly working on nullable integers. I can show that it does, and that the problem is in .astype("Int64"), which does not seem to do what is intended.

So, for testing .mean, you can use .astype(pd.Int64Dtype()), and open a separate issue on why .astype(pd.Int64Dtype()) does not do the same as astype("Int64"). That deserves a far more thorough investigation of its own.

Issues get fixed when devs can see what the problem is. We could create 1000 tickets all stating the same ".dtype is object, should be Float64" by using .mean, .var, .median. That would create a lot of noise and get nothing fixed.

jbrockmendel · 2021-10-27T15:25:20Z

> you can use .astype(pd.Int64Dtype()), and open a separate issue on why .astype(pd.Int64Dtype()) does not do the same as astype("Int64").

They look the same to me. If you think there is something different, please open a separate issue for that to avoid derailing this one.
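As a quick check of the claim that the two spellings are equivalent, the string alias can be resolved and compared against the dtype instance. This is a minimal sketch using public pandas API only; the sample frame is arbitrary and not taken from the thread.

```python
import pandas as pd

# Resolve the string alias to a dtype object and compare it with the instance.
alias_dtype = pd.api.types.pandas_dtype("Int64")
same_dtype = alias_dtype == pd.Int64Dtype()

# Cast the same frame both ways and compare the resulting column dtypes.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
by_alias = df.astype("Int64")
by_instance = df.astype(pd.Int64Dtype())
same_result = list(by_alias.dtypes) == list(by_instance.dtypes)

print(same_dtype, same_result)
```

If both flags are true, any difference observed between the two examples has to come from somewhere other than the astype argument.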

> test whether .mean is correctly working on nullable integers. I can show that it does

Incorrect. In both your example and the OP, after setting df.iloc[:, 1] = pd.NA, we end up with df.dtypes[1] == object. Without re-casting (the need for which you correctly identify as a separate issue) the .mean call is no longer testing the dtype of interest.
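One way to make sure the .mean call exercises the dtype of interest is to inspect the column dtypes after the assignment and re-cast only when needed. A minimal sketch: the helper name ensure_int64 is hypothetical, and whether the assignment changes the dtype depends on the pandas version, so the dtypes are printed rather than assumed.

```python
import numpy as np
import pandas as pd


def ensure_int64(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: re-cast to "Int64" if any column lost that dtype."""
    if all(isinstance(dtype, pd.Int64Dtype) for dtype in df.dtypes):
        return df
    return df.astype("Int64")


arr = np.random.randn(4, 3).astype("int64")
df = pd.DataFrame(arr).astype("Int64")
df.iloc[:, 1] = pd.NA

# Report what the assignment did to the dtypes on this pandas version.
print(df.dtypes)

# Only after this step is df.mean() guaranteed to see Int64 columns.
df = ensure_int64(df)
```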

Demetrio92 · 2021-11-07T14:52:59Z

While digging into the related issue, I found that, despite the casting problem, .mean() actually works as the OP expects if the second cast is removed.

arr = np.random.randn(4, 3).astype("int64")
df = pd.DataFrame(arr).astype("Int64")
df.iloc[:, 1] = pd.NA  # <-- incorrectly casts to object, let's cast back and ignore that for now
# df = df.astype("Int64")

res = df.mean()

Not sure why, but the above code works as expected:

0    0.00
1     NaN
2   -0.25
dtype: float64

Tested using pandas 1.3.4

jbrockmendel · 2021-11-07T15:33:42Z

@Demetrio92 the underlying cause here is the lack of 2D support for IntegerArray. Further investigation is not a good use of your time. A place where investigative eyeballs would be very helpful is https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1265 "# TODO: in all tests we have mask.all(); can we rely on that?"
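One way to picture why missing 2D support matters: if each column is a separate 1D array, a column-wise reduction yields one scalar per column, and those scalars must then be reassembled into a Series whose dtype has to be inferred. The sketch below imitates that by hand. It is an illustrative assumption, not the actual pandas code path, and the column values are made up.

```python
import pandas as pd

columns = {
    0: pd.array([1, 2, 3, 4], dtype="Int64"),
    1: pd.array([pd.NA] * 4, dtype="Int64"),  # all-NA column
    2: pd.array([5, 6, 7, 8], dtype="Int64"),
}

# Reduce each 1D nullable array separately; the all-NA column yields pd.NA.
scalars = [pd.Series(values).mean() for values in columns.values()]

# Collecting a mix of floats and pd.NA without a dtype falls back to object.
naive = pd.Series(scalars, index=list(columns))

# Supplying the nullable float dtype explicitly keeps the result typed.
typed = pd.Series(scalars, index=list(columns), dtype="Float64")

print(naive.dtype, typed.dtype)
```

The second construction shows the result the issue title asks for: a Float64 Series in which the all-NA column stays missing instead of forcing the whole result to object.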

mroeschke · 2023-07-13T16:40:50Z

Closed by #52788

simonjayhawkins added this to the Contributions Welcome milestone Aug 27, 2021

simonjayhawkins added the Dtype Conversions Unexpected or buggy dtype conversions label Aug 27, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel mentioned this issue Dec 21, 2022

BUG: DataFrame reductions with object dtype and axis=1 #49603

Closed

brobr mentioned this issue Dec 26, 2022

BUG: Nullable integer type ("Int64") lost after summing along columns-index [df.sum(axis=1) #50438

Closed

3 tasks

jbrockmendel mentioned this issue Mar 14, 2023

REGR: Performance regression in axis=1 DataFrame ops #51923

Closed

4 tasks

jbrockmendel mentioned this issue Apr 6, 2023

BUG: DataFrame reductions losing EA dtypes #52261

Closed

6 tasks

topper-123 mentioned this issue Apr 19, 2023

ENH: better dtype inference when doing DataFrame reductions #52788

Merged

1 task

mroeschke closed this as completed Jul 13, 2023
