Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-4324: [Python] Triage broken type inference logic in presence of a mix of NumPy dtype-having objects and other scalar values #4527

Closed
wants to merge 4 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Jun 12, 2019

In investigating the innocuous bug report from ARROW-4324 I stumbled on a pile of hacks and flawed design around type inference

test_list = [np.dtype('int32').type(10), np.dtype('float32').type(0.5)]
test_array = pa.array(test_list)

# Expected
# test_array
# <pyarrow.lib.DoubleArray object at 0x7f009963bf48>
# [
#   10,
#   0.5
# ]

# Got
# test_array
# <pyarrow.lib.Int32Array object at 0x7f009963bf48>
# [
#   10,
#   0
# ]

It turns out there are several issues:

  • There was a kludge around handling the numpy.nan value which is a PyFloat, not a NumPy float64 scalar
  • Type inference assumed "NaN is null", which should not be hard coded, so I added a flag to switch between pandas semantics and non-pandas
  • Mixing NumPy scalar values and non-NumPy scalars (like our evil friend numpy.nan) caused the output type to be simply incorrect. For example [np.float16(1.5), 2.5] would yield pa.float16() output type. Yuck

In inserted some hacks to force what I believe to be the correct behavior and fixed a couple unit tests that actually exhibited buggy behavior before (see within). I don't have time to do the "right thing" right now which is to more or less rewrite the hot path of arrow/python/inference.cc, so at least this gets the unit tests asserting what is correct so that refactoring will be more productive later.

@wesm wesm requested a review from pitrou June 12, 2019 03:55
@@ -161,7 +169,7 @@ class NumPyDtypeUnifier {
_NUMPY_UNIFY_NOOP(UINT8);
_NUMPY_UNIFY_NOOP(UINT16);
_NUMPY_UNIFY_NOOP(UINT32);
_NUMPY_UNIFY_PROMOTE(FLOAT32);
_NUMPY_UNIFY_PROMOTE_TO(FLOAT32, FLOAT64);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Casting 32/64-bit integers to Float32 without a warning was pretty unsafe, that's what these changes are about. There are no tests to check, though, so we should probably open a JIRA to do that as risk mitigation

values = get_series_values(obj)
values = get_series_values(obj, &is_pandas_object)
if is_pandas_object:
from_pandas = True
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user passes pandas.Series to pyarrow.array, they probably mean from_pandas=True=, would you all agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds ok to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would only update the docstring then to mention this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default could also be change to None, leaving the option to user to force True or False, regardless the container type (but not sure that is worth it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that (making the default None). I can go ahead and make that change

# max(uint64) is too large for the inferred int64 type
expected += [0, np.iinfo(np.int64).max]
expected += [np_scalar(np.iinfo(np_scalar).min),
np_scalar(np.iinfo(np_scalar).max)]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numpy.uint64 case got fixed in passing, it was buggy before

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

values = get_series_values(obj)
values = get_series_values(obj, &is_pandas_object)
if is_pandas_object:
from_pandas = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would only update the docstring then to mention this.

values = get_series_values(obj)
values = get_series_values(obj, &is_pandas_object)
if is_pandas_object:
from_pandas = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default could also be change to None, leaving the option to user to force True or False, regardless the container type (but not sure that is worth it)

wesm added 2 commits June 12, 2019 14:48
…g objects and other typed values, pending more serious refactor in ARROW-5564
@wesm
Copy link
Member Author

wesm commented Jun 12, 2019

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants