Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: support downcasting of nullable EAs in pd.to_numeric #38746

Merged
merged 20 commits into from
Dec 30, 2020

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented Dec 28, 2020

Picking up #33435.

In pd.to_numeric when encountering an IntegerArray and FloatingArray (or Series built from them) we downcast _data then reconstruct the array using the downcast _data and the _mask from the original array.

@arw2019 arw2019 added NA - MaskedArrays Related to pd.NA and nullable extension arrays Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Numeric Operations Arithmetic, Comparison, and Logical operations and removed Dtype Conversions Unexpected or buggy dtype conversions labels Dec 28, 2020
@arw2019 arw2019 changed the title [WIP] ENH: downcasting of nullable EAs in pd.to_numeric [WIP] ENH: support downcasting of nullable EAs in pd.to_numeric Dec 28, 2020
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this one!


if isinstance(arg, ABCSeries):
is_series = True
values = arg.values
if is_extension_array_dtype(arg) and isinstance(values, NumericArray):
is_numeric_extension_dtype = True
values = extract_array(arg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this extract_array line needed? (the values = arg.values from above should have worked fine, I think)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not, I reverted this bit

pandas/tests/tools/test_to_numeric.py Show resolved Hide resolved
@arw2019 arw2019 changed the title [WIP] ENH: support downcasting of nullable EAs in pd.to_numeric ENH: support downcasting of nullable EAs in pd.to_numeric Dec 28, 2020
@@ -142,6 +147,10 @@ def to_numeric(arg, errors="raise", downcast=None):
else:
values = arg

if is_extension_array_dtype(arg) and isinstance(values, NumericArray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this really should be part of the above logic. also mask needs to be defined for all cases (can be default to None)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this really should be part of the above logic.

The reason for having it down here is to handle EA dtype Series and array in a single place (and Index when #34159/#37869 go through)

also mask needs to be defined for all cases (can be default to None)

done

@@ -142,6 +147,10 @@ def to_numeric(arg, errors="raise", downcast=None):
else:
values = arg

if is_extension_array_dtype(arg) and isinstance(values, NumericArray):
is_numeric_extension_dtype = True
mask, values = values._mask, values._data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you likely need to just take the masked values only for the following block of code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -188,6 +197,16 @@ def to_numeric(arg, errors="raise", downcast=None):
if values.dtype == dtype:
break

if is_numeric_extension_dtype:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L194 might need to handle the mask

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean?

            float_32_ind = typecodes.index(float_32_char)

I feel like there's a testcase I haven't looked at if yes (the ones I have work as is)

arw2019 and others added 2 commits December 28, 2020 14:16
Co-authored-by: Joris Van den Bossche <[email protected]>
@arw2019
Copy link
Member Author

arw2019 commented Dec 28, 2020

Green + addressed comments

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also update the doc-string of to_numeric as well as add some examples.

pandas/core/tools/numeric.py Show resolved Hide resolved
([-1, -1], "Int32", "unsigned", "Int32"),
([1, 1], "Float64", "float", "Float32"),
([1, 1.1], "Float64", "float", "Float32"),
([1, 1], "Float64", "integer", "Int8"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe move this one up to the other "integer" downcast cases

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

([-1, -1], "Int32", "unsigned", "Int32"),
([1, 1], "Float64", "float", "Float32"),
([1, 1.1], "Float64", "float", "Float32"),
([1, 1], "Float64", "integer", "Int8"),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

pandas/tests/tools/test_to_numeric.py Show resolved Hide resolved
pandas/core/tools/numeric.py Show resolved Hide resolved
@@ -108,6 +110,21 @@ def to_numeric(arg, errors="raise", downcast=None):
2 2.0
3 -3.0
dtype: float64

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring updated.

pandas/core/tools/numeric.py Outdated Show resolved Hide resolved
pandas/core/tools/numeric.py Show resolved Hide resolved
Co-authored-by: Joris Van den Bossche <[email protected]>
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @arw2019

2 3
dtype: Int8
>>> s = pd.Series([1.0, 2.1, 3.0], dtype="Float64")
>>> pd.to_numeric(s, downcast="float")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may want to also accept Float and Integer as aliases for float | integer (separate issue)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should do that, since those are not actually dtypes, but rather values to a downcast keyword which eg also accepts "signed"/"unsigned"

pandas/core/tools/numeric.py Show resolved Hide resolved
@jreback jreback added this to the 1.3 milestone Dec 30, 2020
@jreback jreback merged commit 94810d1 into pandas-dev:master Dec 30, 2020
@jorisvandenbossche
Copy link
Member

Thanks @arw2019 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. NA - MaskedArrays Related to pd.NA and nullable extension arrays Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Support downcasting of nullable dtypes in to_numeric
3 participants