ARROW-4324: [Python] Triage broken type inference logic in presence of a mix of NumPy dtype-having objects and other scalar values #4527
Conversation
@@ -161,7 +169,7 @@ class NumPyDtypeUnifier {
     _NUMPY_UNIFY_NOOP(UINT8);
     _NUMPY_UNIFY_NOOP(UINT16);
     _NUMPY_UNIFY_NOOP(UINT32);
-    _NUMPY_UNIFY_PROMOTE(FLOAT32);
+    _NUMPY_UNIFY_PROMOTE_TO(FLOAT32, FLOAT64);
Casting 32/64-bit integers to Float32 without a warning was pretty unsafe; that's what these changes are about. There are no tests checking this, though, so we should probably open a JIRA for that as risk mitigation.
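To illustrate the concern (a hedged sketch, not code from the PR): float32 only has a 24-bit significand, so unifying 64-bit integers to float32 can silently lose values, whereas float64 stays exact up to 2**53.

```python
import numpy as np

# A 64-bit integer that needs more than 24 significant bits.
big = np.int64(2**40 + 1)

print(np.float32(big) == big)  # False: 2**40 + 1 is not representable in float32
print(np.float64(big) == big)  # True: float64 is exact for integers below 2**53
```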
python/pyarrow/array.pxi (outdated)
-    values = get_series_values(obj)
+    values = get_series_values(obj, &is_pandas_object)
+    if is_pandas_object:
+        from_pandas = True
If the user passes a pandas.Series to pyarrow.array, they probably mean from_pandas=True, would you all agree?
Sounds ok to me.
I would only update the docstring then to mention this.
The default could also be changed to None, leaving the option to the user to force True or False regardless of the container type (but not sure that is worth it).
I thought about that (making the default None). I can go ahead and make that change.
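A minimal sketch of the semantics being discussed, assuming the auto-detection from this PR (a pandas object switches on from_pandas behavior unless the caller overrides it): with from_pandas semantics, NaN is treated as null.

```python
import numpy as np
import pandas as pd
import pyarrow as pa

s = pd.Series([1.0, np.nan, 3.0])

arr = pa.array(s)                     # pandas input: NaN becomes a null
print(arr.null_count)                 # expected: 1

arr = pa.array(s, from_pandas=False)  # explicit override: NaN stays a float value
print(arr.null_count)                 # expected: 0
```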
-    # max(uint64) is too large for the inferred int64 type
-    expected += [0, np.iinfo(np.int64).max]
+    expected += [np_scalar(np.iinfo(np_scalar).min),
+                 np_scalar(np.iinfo(np_scalar).max)]
The numpy.uint64 case got fixed in passing; it was buggy before.
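For reference, a small hedged illustration of that case: the maximum uint64 value does not fit in int64, so inference has to pick uint64 for it rather than the default int64 (the exact output is an assumption based on the test expectation above).

```python
import numpy as np
import pyarrow as pa

# uint64 max exceeds the int64 range, so int64 cannot be the inferred type.
print(np.iinfo(np.uint64).max > np.iinfo(np.int64).max)  # True

arr = pa.array([0, np.uint64(np.iinfo(np.uint64).max)])
print(arr.type)  # expected: uint64
```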
Looks good to me.
…g objects and other typed values, pending more serious refactor in ARROW-5564
…rom ARROW-2240 that's been changed
+1
In investigating the innocuous bug report from ARROW-4324, I stumbled on a pile of hacks and flawed design around type inference.

It turns out there are several issues:

- The numpy.nan value is a PyFloat, not a NumPy float64 scalar, which the inference logic did not account for.
- Type inference for lists mixing NumPy scalars with other values was broken: [np.float16(1.5), 2.5] would yield a pa.float16() output type. Yuck.

I inserted some hacks to force what I believe to be the correct behavior and fixed a couple of unit tests that actually exhibited buggy behavior before (see within). I don't have time to do the "right thing" right now, which is to more or less rewrite the hot path of arrow/python/inference.cc, so at least this gets the unit tests asserting what is correct so that the refactoring will be more productive later.
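A hedged sketch of the second issue above: mixing a NumPy scalar with a plain Python float. Before this change the NumPy dtype "won" and the result was float16; the intended behavior is to widen to a type that can hold both values.

```python
import numpy as np
import pyarrow as pa

# A Python float is a double, so unifying with a float16 scalar should widen.
arr = pa.array([np.float16(1.5), 2.5])
print(arr.type)  # expected after the fix: double (float64), not halffloat
```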