
[Python] Inconsistent cast behavior between array and scalar for int64 #34901

Closed
rohanjain101 opened this issue Apr 5, 2023 · 16 comments

@rohanjain101

Describe the bug, including details regarding any error messages, version, and platform.

>>> scal = pa.scalar(6312878760374611856, type=pa.int64())
>>> scal.cast(pa.float64())
<pyarrow.DoubleScalar: 6.312878760374612e+18>
>>> arr = pa.array([6312878760374611856], type=pa.int64())
>>> arr.cast(pa.float64())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\array.pxi", line 926, in pyarrow.lib.Array.cast
  File "C:\pandas2_ps_04323\lib\site-packages\pyarrow\compute.py", line 391, in cast
    return call_function("cast", [arr], options)
  File "pyarrow\_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow\_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 6312878760374611856 not in range: -9007199254740992 to 9007199254740992
>>>

Behavior is not consistent between casting an array and casting a scalar. The array behavior of raising does not seem correct; it seems an int64 should always be able to be cast to float64.

Component(s)

Python

@danepitkin
Member

Hi @rohanjain101,

I am able to reproduce this example on pyarrow v11 using macOS 13.3.

What you are experiencing is the difference between safe vs. unsafe casting: the number you chose cannot be exactly represented in the target type. It is not true that all int64 values can be safely converted to float64. Due to the way floating-point precision works, there are integers representable in int64 that a double simply skips over. See https://en.wikipedia.org/wiki/Double-precision_floating-point_format, which states that only integers from −2^53 to 2^53 (−9,007,199,254,740,992 to 9,007,199,254,740,992) can be exactly represented. My guess is that the underlying implementation enforces this hard limit, since technically some larger int64 values can still be represented exactly when cast to float64 (such as 18,014,398,509,481,984, but not 18,014,398,509,481,983).

>>> arr = pa.array([18014398509481984], type=pa.int64())
>>> arr.cast(pa.float64())
Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992
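A plain-Python check (no pyarrow involved) illustrates the same rounding boundary; this is only a sketch of the IEEE 754 behavior, not of the cast kernel itself:

>>> float(18014398509481984) == 18014398509481984  # 2**54, exactly representable
True
>>> float(18014398509481983) == 18014398509481983  # 2**54 - 1, gets rounded
False
>>> int(float(18014398509481983))
18014398509481984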

It appears the scalar cast defaults to allow unsafe casting, while the array defaults to safe casting. You can allow unsafe casting in the array like this:

>>> arr.cast(pa.float64(), safe=False)
<pyarrow.lib.DoubleArray object at 0x126a40ee0>
[
  6.312878760374612e+18
]

There are no options to choose safe vs unsafe cast in scalar APIs at the moment. The documentation does state the scalar will perform a safe cast, though, which it is not doing: https://arrow.apache.org/docs/python/generated/pyarrow.Int64Scalar.html#pyarrow.Int64Scalar

This is either a bug in scalar safe casting or the documentation is wrong. Ideally, Scalars would also allow you to choose safe vs. unsafe casting with an option. Either way, some more investigation is still needed.

@rohanjain101
Author

@danepitkin thank you for the clarification. In numpy, however, the cast succeeds, and it seems as if the full value is preserved:

>>> np.array([18014398509481984]).astype("float64")
array([1.80143985e+16])

Is there an internal difference in how double values are stored between arrow and numpy that would cause the difference?

@danepitkin
Member

@danepitkin thank you for the clarification. In numpy, however, the cast succeeds, and it seems as if the full value is preserved:

>>> np.array([18014398509481984]).astype("float64")
array([1.80143985e+16])

Is there an internal difference in how double values are stored between arrow and numpy that would cause the difference?

In your example, 18,014,398,509,481,984 can be converted to float64 safely according to the floating-point specification, so it is not a good example to use. Instead, let's try 18,014,398,509,481,983, which is not a multiple of 2 (a requirement for integers between 2^53 and 2^54 to be exactly representable).

You will lose data in this numpy cast. (And yes, my guess is that numpy applies the floating-point spec slightly differently, judging purely from the different behavior.)

>>> np.array([18014398509481983]).astype("float64").astype("int64") == 18014398509481983
array([False])

>>> np.array([18014398509481983]).astype("float64").astype("int64")
array([18014398509481984])
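The underlying reason is the spacing of doubles: above 2**53 consecutive doubles are 2 apart, so only even integers survive. A minimal standard-library illustration (math.ulp needs Python >= 3.9):

>>> import math
>>> math.ulp(float(2**53))      # gap to the next double at 2**53
2.0
>>> math.ulp(float(2**53 - 1))  # just below 2**53, every integer is still representable
1.0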

Numpy defaults to unsafe casting (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html), but it seems it also doesn't perform safety checks properly all of the time.

>>> type(np.array([18014398509481984])[0])
<class 'numpy.int64'>


# Bug? No safety error for int64 -> float64
>>> np.array([18014398509481983]).astype("float64", casting="safe").astype("int64") == 18014398509481983
array([False])


# Good? Errors out on float64 -> int64, but the bug happened in the int64 -> float64 and was somehow propagated..
>>> np.array([18014398509481983]).astype("float64", casting="safe").astype("int64", casting="safe") == 18014398509481983
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

@danepitkin
Member

danepitkin commented Apr 6, 2023

For pyarrow, we should probably:

  1. Allow both safe and unsafe conversion options for scalar APIs (feature)
  2. Default to safe conversion for scalars, which appears is not happening (bug)
  3. Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)

@AlenkaF @jorisvandenbossche what do you think?

@rohanjain101
Author

But in the example where the cast is safe, for 18,014,398,509,481,984, shouldn't that succeed in pyarrow since it can be done safely? In that example, the array cast still raises even though the cast is safe. Should it only raise for 18,014,398,509,481,983?

@danepitkin
Member

But in the example where the cast is safe, for 18,014,398,509,481,984, shouldn't that succeed in pyarrow since it can be done safely? In that example, the array cast still raises even though the cast is safe. Should it only raise for 18,014,398,509,481,983?

If pyarrow were to follow the floating point specification exactly, then yes it would. Right now, it seems to be a limitation of the implementation. You could argue that option (3) above should be a bug instead of a feature.

@danepitkin
Member

danepitkin commented Apr 6, 2023

I'd recommend filing an issue with numpy about this, too:

>>> import numpy as np
>>> np.__version__
'1.24.2'

# Bug: No safety error for initial int64 -> float64 conversion
>>> np.array([18014398509481983]).astype("float64", casting="safe").astype(str)
array(['1.8014398509481984e+16'], dtype='<U32')

@danepitkin
Member

Thanks for raising this issue by the way. I don't think I expressed that earlier. Your contributions are appreciated!

@AlenkaF changed the title Inconsistent cast behavior between array and scalar for int64 → [Python] Inconsistent cast behavior between array and scalar for int64 on Apr 11, 2023
@jorisvandenbossche
Member

Numpy defaults to unsafe casting (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html), but it seems it also doesn't perform safety checks properly all of the time.

Sidenote: numpy doesn't really have the same concept of "safe" casting as how we use it in pyarrow. In pyarrow it depends on the values, while in numpy it is just a property of a cast between two dtypes. So to say whether a cast from one dtype to another is safe or not, numpy needs to make some generalization/assumption, and it seems it decided that casting int to float is generally safe (true, except for large ints) and casting float to int is generally not safe (true, except for floats that are whole numbers):

>>> np.can_cast(np.int64(), np.float64(), casting="safe")
True
>>> np.can_cast(np.float64(), np.int64(), casting="safe")
False
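To make the contrast concrete, here is a small sketch of the value-dependent behavior in pyarrow (same int64 -> float64 cast, different outcomes depending on the data; the exact error text may differ by version):

import pyarrow as pa

pa.array([1, 2, 3], type=pa.int64()).cast(pa.float64())    # fine: every value is exactly representable
pa.array([2**53 + 1], type=pa.int64()).cast(pa.float64())  # raises ArrowInvalid: value outside +/- 2**53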

@jorisvandenbossche
Member

For pyarrow, we should probably:

  1. Allow both safe and unsafe conversion options for scalar APIs (feature)
  2. Default to safe conversion for scalars, which appears is not happening (bug)

Yes, fully agreed, I opened a separate issue for specifically this aspect: #35040

  3. Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)

For checking the safety of casting int to float, we indeed use this fixed range:

// ----------------------------------------------------------------------
// Implement fast integer to floating point cast

// These are the limits for exact representation of whole numbers in floating
// point numbers
template <typename T>
struct FloatingIntegerBound {};

template <>
struct FloatingIntegerBound<float> {
  static const int64_t value = 1LL << 24;
};

template <>
struct FloatingIntegerBound<double> {
  static const int64_t value = 1LL << 53;
};

template <typename InType, typename OutType, typename InT = typename InType::c_type,
          typename OutT = typename OutType::c_type,
          bool IsSigned = is_signed_integer_type<InType>::value>
Status CheckIntegerFloatTruncateImpl(const ExecValue& input) {
  using InScalarType = typename TypeTraits<InType>::ScalarType;
  const int64_t limit = FloatingIntegerBound<OutT>::value;
  InScalarType bound_lower(IsSigned ? -limit : 0);
  InScalarType bound_upper(limit);
  return CheckIntegersInRange(input.array, bound_lower, bound_upper);
}

Status CheckForIntegerToFloatingTruncation(const ExecValue& input, Type::type out_type) {
  switch (input.type()->id()) {
    // Small integers are all exactly representable as whole numbers
    case Type::INT8:
    case Type::INT16:
    case Type::UINT8:
    case Type::UINT16:
      return Status::OK();
    case Type::INT32: {
      if (out_type == Type::DOUBLE) {
        return Status::OK();
      }
      return CheckIntegerFloatTruncateImpl<Int32Type, FloatType>(input);
    }
    case Type::UINT32: {
      if (out_type == Type::DOUBLE) {
        return Status::OK();
      }
      return CheckIntegerFloatTruncateImpl<UInt32Type, FloatType>(input);
    }
    case Type::INT64: {
      if (out_type == Type::FLOAT) {
        return CheckIntegerFloatTruncateImpl<Int64Type, FloatType>(input);
      } else {
        return CheckIntegerFloatTruncateImpl<Int64Type, DoubleType>(input);
      }
    }
    case Type::UINT64: {
      if (out_type == Type::FLOAT) {
        return CheckIntegerFloatTruncateImpl<UInt64Type, FloatType>(input);
      } else {
        return CheckIntegerFloatTruncateImpl<UInt64Type, DoubleType>(input);
      }
    }
    default:
      break;
  }
  DCHECK(false);
  return Status::OK();
}

Status CastIntegerToFloating(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
  const auto& options = checked_cast<const CastState*>(ctx->state())->options;
  Type::type out_type = out->type()->id();
  if (!options.allow_float_truncate) {
    RETURN_NOT_OK(CheckForIntegerToFloatingTruncation(batch[0], out_type));
  }
  CastNumberToNumberUnsafe(batch[0].type()->id(), out_type, batch[0].array,
                           out->array_span_mutable());
  return Status::OK();
}

I am not fully sure this is something we should change. First, it is a lot simpler in implementation to just check that values are within the fixed range, compared to checking for the specific integers outside of that range that can still be represented exactly as a float. But it also seems easier for the user to understand and gives more consistent behaviour: with the default safe=True, everything outside of that range fails, rather than the outcome depending on the exact value.
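For comparison, a per-value exactness check (the alternative being discussed) boils down to logic like the following — a rough Python sketch of the idea, not the actual C++ kernel:

def roundtrips_exactly(value: int) -> bool:
    """True if an int64 value survives an int -> float64 -> int round trip unchanged."""
    return int(float(value)) == value

# 2**54 survives the round trip, 2**54 - 1 does not,
# even though both are above the fixed 2**53 bound.
print([roundtrips_exactly(v) for v in (2**54, 2**54 - 1)])  # [True, False]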

@rohanjain101
Author

rohanjain101 commented Apr 15, 2023

Regarding 3, the current error is not very clear:

>>> pa.array([18014398509481984], type=pa.float64())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\array.pxi", line 320, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 18014398509481984 is outside of the range exactly representable by a IEEE 754 double precision value

Since this integer meets the IEEE 754 specification for what can be represented as a double, should the error also clarify that it's an internal limitation?

@westonpace
Member

Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)

The options within the C++ lib are very fine-grained already:

  bool allow_int_overflow;
  bool allow_time_truncate;
  bool allow_time_overflow;
  bool allow_decimal_truncate;
  bool allow_float_truncate;
  // Indicate if conversions from Binary/FixedSizeBinary to string must
  // validate the utf8 payload.
  bool allow_invalid_utf8;

If we are discussing ideal behavior then I think something like...

// Allow converting integers to floats when the integer cannot be exactly represented by
// an IEEE-754 float and must be rounded.
bool allow_float_inexact;

...would be very reasonable. If we wanted to be even more extreme 😄 we could have:

enum class IeeeRoundingMode : int8_t {
  TIE_TO_EVEN = 0,
  TIE_AWAY_FROM_ZERO = 1,
  TOWARD_ZERO = 2,
  TOWARD_POSITIVE_INFINITY = 3,
  TOWARD_NEGATIVE_INFINITY = 4,
  ERROR = 5
};

However, all of point number 3 sounds like a separate issue from this one.

@jorisvandenbossche
Member

If we are discussing ideal behavior then I think something like...

// Allow converting integers to floats when the integer cannot be exactly represented by
// an IEEE-754 float and must be rounded.
bool allow_float_inexact;

It's not super clear from the name, but we already use the existing allow_float_truncate option for the int->float cast (the name suggests to me that it is mostly about float->int truncation when the float is not a "round" float).

>>> pa.array([18014398509481984], type=pa.int64()).cast(pa.float64())
...
ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992

>>> pa.array([18014398509481984], type=pa.int64()).cast(options=pc.CastOptions(pa.float64(), allow_float_truncate=True))
<pyarrow.lib.DoubleArray object at 0x7f4e1f7dee00>
[
  1.8014398509481984e+16
]

But your suggestion would be that an option like allow_float_inexact would use different logic to determine when the cast is allowed (actually checking whether the result value is inexact, instead of checking whether it is outside the generally safe range)?

(personally I would say that allow_float_inexact is a better name for what we currently call allow_float_truncate for the int->float cast)

If we wanted to be even more extreme 😄 we could have:

enum class IeeeRoundingMode : int8_t {
  TIE_TO_EVEN = 0,
  ...

I would say that if a user wants to control the rounding, they should use the round kernel instead of a cast?

@westonpace
Member

I would say that if a user wants to control the rounding, they should use the round kernel instead of a cast?

The rounding kernel allows for conversion from one valid IEEE value to another. This rounding is about going from an infinite precision value that cannot be represented in IEEE to a valid IEEE value.

I'll walk my comment back though.

Technically, IEEE rounding is something that has to be considered in just about any operation (e.g. addition, subtraction) because the infinite-precision result isn't representable.

In practice, we'd probably be better off just saying we always use TIE_TO_EVEN and it's not configurable. This is what every other engine seems to do (TIE_TO_EVEN is the default for most / all modern CPUs).
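For reference, Python's own int -> float conversion already uses ties-to-even, so the effect is easy to see (a minimal illustration, unrelated to any Arrow option):

>>> float(2**53 + 1)  # halfway between 2**53 and 2**53 + 2, rounds to the even significand (down)
9007199254740992.0
>>> float(2**53 + 3)  # halfway between 2**53 + 2 and 2**53 + 4, rounds to the even significand (up)
9007199254740996.0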

It's not super clear from the name, but so we already use the existing allow_float_truncate options for the int->float cast (the name suggests to me this is mostly about float->int truncating the float if it is not a "round" float).

I didn't realize this. So this issue is about allowing these sorts of casts when safe=true and the value happens to be representable? I thought this was an ask for a new kind of unsafe cast.

danepitkin added a commit to danepitkin/arrow that referenced this issue May 10, 2023
danepitkin added a commit to danepitkin/arrow that referenced this issue May 11, 2023
AlenkaF pushed a commit that referenced this issue May 11, 2023

### Rationale for this change

Scalar cast should use the compute kernel just like Arrays, instead of its own custom implementation.

### Are these changes tested?

Added test cases for GH-35370, GH-34901, and GH-35040

### Are there any user-facing changes?

The Scalar.cast() API is enhanced and backwards compatible. 
* Closes: #35040

Authored-by: Dane Pitkin <[email protected]>
Signed-off-by: Alenka Frim <[email protected]>
@danepitkin
Member

danepitkin commented May 11, 2023

So this issue is about allowing these sorts of casts when safe=true and the value happens to be representable?

Yes, from my understanding.

The original issue reported is now fixed (#35395). We can either repurpose this issue for the above feature request, or close this issue and file a new one.
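With the scalar cast going through the compute kernel, the expected behavior is roughly the following — a sketch assuming the post-fix Scalar.cast accepts the same safe/options arguments as Array.cast (see the linked PR for the actual signature):

import pyarrow as pa

scal = pa.scalar(6312878760374611856, type=pa.int64())
scal.cast(pa.float64())              # now expected to raise ArrowInvalid by default (safe cast)
scal.cast(pa.float64(), safe=False)  # expected to succeed, matching Array.cast(..., safe=False)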

@danepitkin
Member

I captured the feature request to safely cast representable int64s larger than 2^53 to float64 here: #35563.

Closing this issue since the original bug report is resolved in #35395

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023