-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
ARROW-4324: [Python] Triage broken type inference logic in presence o…
…f a mix of NumPy dtype-having objects and other scalar values In investigating the innocuous bug report from ARROW-4324 I stumbled on a pile of hacks and flawed design around type inference ``` test_list = [np.dtype('int32').type(10), np.dtype('float32').type(0.5)] test_array = pa.array(test_list) # Expected # test_array # <pyarrow.lib.DoubleArray object at 0x7f009963bf48> # [ # 10, # 0.5 # ] # Got # test_array # <pyarrow.lib.Int32Array object at 0x7f009963bf48> # [ # 10, # 0 # ] ``` It turns out there are several issues: * There was a kludge around handling the `numpy.nan` value which is a PyFloat, not a NumPy float64 scalar * Type inference assumed "NaN is null", which should not be hard coded, so I added a flag to switch between pandas semantics and non-pandas * Mixing NumPy scalar values and non-NumPy scalars (like our evil friend numpy.nan) caused the output type to be simply incorrect. For example `[np.float16(1.5), 2.5]` would yield `pa.float16()` output type. Yuck In inserted some hacks to force what I believe to be the correct behavior and fixed a couple unit tests that actually exhibited buggy behavior before (see within). I don't have time to do the "right thing" right now which is to more or less rewrite the hot path of `arrow/python/inference.cc`, so at least this gets the unit tests asserting what is correct so that refactoring will be more productive later. Author: Wes McKinney <[email protected]> Closes #4527 from wesm/ARROW-4324 and squashes the following commits: e396958 <Wes McKinney> Add unit test for passing pandas Series with from_pandas=False 754468a <Wes McKinney> Set from_pandas to None by default in pyarrow.array so that user wishes can be respected e1b8393 <Wes McKinney> Remove outdated unit test, add Python unit test that shows behavior from ARROW-2240 that's been changed 4bc8c81 <Wes McKinney> Triage type inference logic in presence of a mix of NumPy dtype-having objects and other typed values, pending more serious refactor in ARROW-5564
- Loading branch information
Showing
9 changed files
with
182 additions
and
87 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.