ARROW-4324: [Python] Triage broken type inference logic in presence of a mix of NumPy dtype-having objects and other scalar values #4527

wesm · 2019-06-12T03:55:20Z

While investigating the innocuous-looking bug report from ARROW-4324, I stumbled on a pile of hacks and flawed design around type inference:

import numpy as np
import pyarrow as pa

# Mix a NumPy int32 scalar with a NumPy float32 scalar
test_list = [np.dtype('int32').type(10), np.dtype('float32').type(0.5)]
test_array = pa.array(test_list)

# Expected
# test_array
# <pyarrow.lib.DoubleArray object at 0x7f009963bf48>
# [
#   10,
#   0.5
# ]

# Got
# test_array
# <pyarrow.lib.Int32Array object at 0x7f009963bf48>
# [
#   10,
#   0
# ]

It turns out there are several issues:

There was a kludge around handling the numpy.nan value, which is a PyFloat, not a NumPy float64 scalar.
Type inference assumed "NaN is null", which should not be hard-coded, so I added a flag to switch between pandas and non-pandas semantics.
Mixing NumPy scalar values and non-NumPy scalars (like our evil friend numpy.nan) caused the output type to be simply incorrect. For example, [np.float16(1.5), 2.5] would yield a pa.float16() output type. Yuck.
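
To make the "NaN is null" switch concrete, here is a minimal pure-Python sketch of the idea. It is not the code in inference.cc; the function name infer_type, the nan_is_null parameter, and the string type names are hypothetical and chosen only for illustration.

```python
import math


def infer_type(values, nan_is_null=False):
    """Infer 'null', 'int64' or 'double' for a list of Python scalars.

    Hypothetical sketch: nan_is_null=True mimics pandas semantics, where a
    float NaN marks a missing value instead of forcing a float type.
    """
    seen_int = False
    seen_float = False
    for value in values:
        if value is None:
            continue  # None is always null
        if isinstance(value, bool):
            raise TypeError("bool is not handled in this sketch")
        if isinstance(value, float):
            if nan_is_null and math.isnan(value):
                continue  # pandas semantics: NaN counts as null
            seen_float = True
        elif isinstance(value, int):
            seen_int = True
        else:
            raise TypeError(f"unsupported value: {value!r}")
    if seen_float:
        return "double"
    if seen_int:
        return "int64"
    return "null"
```

With the flag off, a NaN is an ordinary float and pulls the result to double; with the flag on, it is skipped like None, so a list of integers plus NaN stays integer-typed.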

I inserted some hacks to force what I believe to be the correct behavior and fixed a couple of unit tests that actually exhibited buggy behavior before (see within). I don't have time to do the "right thing" right now, which is to more or less rewrite the hot path of arrow/python/inference.cc, so at least this gets the unit tests asserting what is correct, so that refactoring will be more productive later.

wesm · 2019-06-12T03:57:10Z

cpp/src/arrow/python/inference.cc

@@ -161,7 +169,7 @@ class NumPyDtypeUnifier {
      _NUMPY_UNIFY_NOOP(UINT8);
      _NUMPY_UNIFY_NOOP(UINT16);
      _NUMPY_UNIFY_NOOP(UINT32);
-      _NUMPY_UNIFY_PROMOTE(FLOAT32);
+      _NUMPY_UNIFY_PROMOTE_TO(FLOAT32, FLOAT64);


Casting 32/64-bit integers to Float32 without a warning was pretty unsafe; that's what these changes are about. There are no tests checking this, though, so we should probably open a JIRA to add them as risk mitigation.
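
A small sketch of why this cast is unsafe: float32 has a 24-bit significand, so integers above 2**24 cannot all be represented exactly. The helper names below (roundtrip_float32, unify_with_float32) and the string dtype names are hypothetical; the promotion rule is an assumed reading of the comment above, not a transcription of NumPyDtypeUnifier.

```python
import struct


def roundtrip_float32(x):
    """Round x to the nearest float32 and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


# Integer dtypes too wide to be stored exactly in float32 (assumed rule)
WIDE_INTS = {"int32", "int64", "uint32", "uint64"}


def unify_with_float32(int_dtype):
    """Pick the float type used when int_dtype meets float32 (sketch)."""
    return "float64" if int_dtype in WIDE_INTS else "float32"
```

For example, roundtrip_float32(2**24 + 1) comes back as 16777216.0, silently dropping the low bit, which is the kind of loss that promoting to float64 avoids for 32-bit integers.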

wesm · 2019-06-12T03:58:40Z

python/pyarrow/array.pxi

-        values = get_series_values(obj)
+        values = get_series_values(obj, &is_pandas_object)
+        if is_pandas_object:
+            from_pandas = True


If the user passes a pandas.Series to pyarrow.array, they probably mean from_pandas=True. Would you all agree?

Sounds ok to me.

I would only update the docstring then to mention this.

The default could also be changed to None, leaving the option to the user to force True or False, regardless of the container type (but not sure that is worth it).

I thought about that (making the default None). I can go ahead and make that change.
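
The tri-state default discussed in this thread could be resolved roughly as follows. This is a sketch only: resolve_from_pandas is a hypothetical helper, not a function in array.pxi, and is_pandas_object stands in for the flag set by get_series_values in the diff above.

```python
def resolve_from_pandas(from_pandas, is_pandas_object):
    """Decide whether pandas null semantics apply (hypothetical helper).

    from_pandas=None means "not specified": fall back to the container
    type. An explicit True or False from the user always wins.
    """
    if from_pandas is None:
        return is_pandas_object
    return bool(from_pandas)
```

With this shape, passing a pandas Series with from_pandas=False still honors the user's explicit choice, while leaving the argument unset picks pandas semantics for pandas containers only.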

wesm · 2019-06-12T03:59:29Z

python/pyarrow/tests/test_convert_builtin.py

-        # max(uint64) is too large for the inferred int64 type
-        expected += [0, np.iinfo(np.int64).max]
+    expected += [np_scalar(np.iinfo(np_scalar).min),
+                 np_scalar(np.iinfo(np_scalar).max)]


The numpy.uint64 case got fixed in passing; it was buggy before.
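
For context on the removed comment in the diff above ("max(uint64) is too large for the inferred int64 type"), this sketch shows the range check involved. The helper fits_in_integer is hypothetical and only illustrates the arithmetic.

```python
def fits_in_integer(value, bits=64, signed=True):
    """Return True if value is representable in the given integer type."""
    if signed:
        low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    else:
        low, high = 0, 2 ** bits - 1
    return low <= value <= high
```

The largest uint64 value, 2**64 - 1, fails the signed 64-bit check but passes the unsigned one, so inferring an unsigned type lets the test cover the full range of the scalar type.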

jorisvandenbossche

Looks good to me.

jorisvandenbossche · 2019-06-12T14:14:24Z

python/pyarrow/array.pxi

-        values = get_series_values(obj)
+        values = get_series_values(obj, &is_pandas_object)
+        if is_pandas_object:
+            from_pandas = True


I would only update the docstring then to mention this.

jorisvandenbossche · 2019-06-12T14:15:38Z

python/pyarrow/array.pxi

-        values = get_series_values(obj)
+        values = get_series_values(obj, &is_pandas_object)
+        if is_pandas_object:
+            from_pandas = True


The default could also be changed to None, leaving the option to the user to force True or False, regardless of the container type (but not sure that is worth it).

wesm · 2019-06-12T21:52:21Z

+1

wesm requested a review from pitrou June 12, 2019 03:55

jorisvandenbossche reviewed Jun 12, 2019

wesm added 2 commits June 12, 2019 14:48

Triage type inference logic in presence of a mix of NumPy dtype-having objects and other typed values, pending more serious refactor in ARROW-5564

4bc8c81

Remove outdated unit test, add Python unit test that shows behavior from ARROW-2240 that's been changed

e1b8393

wesm force-pushed the ARROW-4324 branch from 42171f0 to e1b8393 Compare June 12, 2019 19:58

wesm added 2 commits June 12, 2019 15:04

Set from_pandas to None by default in pyarrow.array so that user wishes can be respected

754468a

Add unit test for passing pandas Series with from_pandas=False

e396958

wesm closed this in 25b4a46 Jun 12, 2019

wesm deleted the ARROW-4324 branch June 12, 2019 22:15

asfimport mentioned this pull request Jun 12, 2019

[Python] Array dtype inference incorrect when created from list of mixed numpy scalars #20895

Closed
