-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: support downcasting of nullable EAs in pd.to_numeric #38746
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this one!
pandas/core/tools/numeric.py
Outdated
|
||
if isinstance(arg, ABCSeries): | ||
is_series = True | ||
values = arg.values | ||
if is_extension_array_dtype(arg) and isinstance(values, NumericArray): | ||
is_numeric_extension_dtype = True | ||
values = extract_array(arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this extract_array line needed? (the values = arg.values
from above should have worked fine, I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not, I reverted this bit
pandas/core/tools/numeric.py
Outdated
@@ -142,6 +147,10 @@ def to_numeric(arg, errors="raise", downcast=None): | |||
else: | |||
values = arg | |||
|
|||
if is_extension_array_dtype(arg) and isinstance(values, NumericArray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this really should be part of the above logic. also mask needs to be defined for all cases (can be default to None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/core/tools/numeric.py
Outdated
@@ -142,6 +147,10 @@ def to_numeric(arg, errors="raise", downcast=None): | |||
else: | |||
values = arg | |||
|
|||
if is_extension_array_dtype(arg) and isinstance(values, NumericArray): | |||
is_numeric_extension_dtype = True | |||
mask, values = values._mask, values._data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you likely need to just take the masked values only for the following block of code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/tools/numeric.py
Outdated
@@ -188,6 +197,16 @@ def to_numeric(arg, errors="raise", downcast=None): | |||
if values.dtype == dtype: | |||
break | |||
|
|||
if is_numeric_extension_dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L194 might need to handle the mask
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean?
float_32_ind = typecodes.index(float_32_char)
I feel like there's a testcase I haven't looked at if yes (the ones I have work as is)
Co-authored-by: Joris Van den Bossche <[email protected]>
Green + addressed comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also update the doc-string of to_numeric as well as add some examples.
([-1, -1], "Int32", "unsigned", "Int32"), | ||
([1, 1], "Float64", "float", "Float32"), | ||
([1, 1.1], "Float64", "float", "Float32"), | ||
([1, 1], "Float64", "integer", "Int8"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe move this one up to the other "integer" downcast cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
([-1, -1], "Int32", "unsigned", "Int32"), | ||
([1, 1], "Float64", "float", "Float32"), | ||
([1, 1.1], "Float64", "float", "Float32"), | ||
([1, 1], "Float64", "integer", "Int8"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -108,6 +110,21 @@ def to_numeric(arg, errors="raise", downcast=None): | |||
2 2.0 | |||
3 -3.0 | |||
dtype: float64 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring updated.
Co-authored-by: Joris Van den Bossche <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @arw2019
2 3 | ||
dtype: Int8 | ||
>>> s = pd.Series([1.0, 2.1, 3.0], dtype="Float64") | ||
>>> pd.to_numeric(s, downcast="float") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we may want to also accept Float
and Integer
as aliases for float
| integer
(separate issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should do that, since those are not actually dtypes, but rather values to a downcast
keyword which eg also accepts "signed"/"unsigned"
Thanks @arw2019 ! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Picking up #33435.
In
pd.to_numeric
when encountering anIntegerArray
andFloatingArray
(orSeries
built from them) we downcast_data
then reconstruct the array using the downcast_data
and the_mask
from the original array.