
[Python] Inconsistent cast behavior between array and scalar for int64 #34901

Closed
rohanjain101 opened this issue Apr 5, 2023 · 16 comments

@rohanjain101

Describe the bug, including details regarding any error messages, version, and platform.

>>> scal = pa.scalar(6312878760374611856, type=pa.int64())
>>> scal.cast(pa.float64())
<pyarrow.DoubleScalar: 6.312878760374612e+18>
>>> arr = pa.array([6312878760374611856], type=pa.int64())
>>> arr.cast(pa.float64())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\array.pxi", line 926, in pyarrow.lib.Array.cast
  File "C:\pandas2_ps_04323\lib\site-packages\pyarrow\compute.py", line 391, in cast
    return call_function("cast", [arr], options)
  File "pyarrow\_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow\_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 6312878760374611856 not in range: -9007199254740992 to 9007199254740992
>>>

Behavior is not consistent between casting an array and casting a scalar. The array behavior of raising does not seem correct; it seems an int64 should always be able to be cast to float64.

Component(s)

Python

@danepitkin
Member

Hi @rohanjain101,

I am able to reproduce this example on pyarrow v11 using macOS 13.3.

What you are experiencing is the difference between safe vs. unsafe casting: the number you chose cannot be exactly represented in the target type. It is not true that all int64 values can be safely converted to float64. Due to the way floating-point precision works, there are integers representable in int64 that a double simply skips over. See https://en.wikipedia.org/wiki/Double-precision_floating-point_format, which states that only integers from −2^53 to 2^53 (−9,007,199,254,740,992 to 9,007,199,254,740,992) can be exactly represented. My guess is that the underlying implementation enforces this hard limit, since technically some larger int64 values can still be represented exactly when cast to float64 (such as 18,014,398,509,481,984, but not 18,014,398,509,481,983).

>>> arr = pa.array([18014398509481984], type=pa.int64())
>>> arr.cast(pa.float64())
Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992
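A plain-Python check (no pyarrow involved) illustrates the same rounding boundary; this is only a sketch of the IEEE 754 behavior, not of the cast kernel itself:

>>> float(18014398509481984) == 18014398509481984  # 2**54, exactly representable
True
>>> float(18014398509481983) == 18014398509481983  # 2**54 - 1, gets rounded
False
>>> int(float(18014398509481983))
18014398509481984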

It appears the scalar cast defaults to allow unsafe casting, while the array defaults to safe casting. You can allow unsafe casting in the array like this:

>>> arr.cast(pa.float64(), safe=False)
<pyarrow.lib.DoubleArray object at 0x126a40ee0>
[
  6.312878760374612e+18
]

There are no options to choose safe vs unsafe cast in scalar APIs at the moment. The documentation does state the scalar will perform a safe cast, though, which it is not doing: https://arrow.apache.org/docs/python/generated/pyarrow.Int64Scalar.html#pyarrow.Int64Scalar

This is either a bug in scalar safe casting or the documentation is wrong. Ideally, Scalars would also allow you to choose safe vs. unsafe casting with an option. Either way, some more investigation is still needed.

@rohanjain101
Author

@danepitkin thank you for the clarification. In numpy, however, the cast succeeds, and it seems as if the full value is preserved:

>>> np.array([18014398509481984]).astype("float64")
array([1.80143985e+16])

Is there an internal difference in how double values are stored between arrow and numpy that would cause the difference?

@danepitkin
Member

@danepitkin thank you for the clarification. In numpy, however, the cast succeeds, and it seems as if the full value is preserved:

>>> np.array([18014398509481984]).astype("float64")
array([1.80143985e+16])

Is there an internal difference in how double values are stored between arrow and numpy that would cause the difference?

In your example, 18,014,398,509,481,984 can be converted to float64 safely according to the floating-point specification, so it is not a good example to use. Instead, let's try 18,014,398,509,481,983, which is not a multiple of 2 (a requirement for integers between 2^53 and 2^54 to be exactly representable).

You will lose data in this numpy cast. (And yes, my guess is that numpy applies the floating-point spec slightly differently, judging purely from the different behavior.)

>>> np.array([18014398509481983]).astype("float64").astype("int64") == 18014398509481983
array([False])

>>> np.array([18014398509481983]).astype("float64").astype("int64")
array([18014398509481984])
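The underlying reason is the spacing of doubles: above 2**53 consecutive doubles are 2 apart, so only even integers survive. A minimal standard-library illustration (math.ulp needs Python >= 3.9):

>>> import math
>>> math.ulp(float(2**53))      # gap to the next double at 2**53
2.0
>>> math.ulp(float(2**53 - 1))  # just below 2**53, every integer is still representable
1.0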

Numpy defaults to unsafe casting (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html), but it seems it also doesn't perform safety checks properly all of the time.

>>> type(np.array([18014398509481984])[0])
<class 'numpy.int64'>


# Bug? No safety error for int64 -> float64
>>> np.array([18014398509481983]).astype("float64", casting="safe").astype("int64") == 18014398509481983
array([False])


# Good? Errors out on float64 -> int64, but the bug happened in the int64 -> float64 and was somehow propagated..
>>> np.array([18014398509481983]).astype("float64", casting="safe").astype("int64", casting="safe") == 18014398509481983
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

@danepitkin
Member

danepitkin commented Apr 6, 2023

For pyarrow, we should probably:

  1. Allow both safe and unsafe conversion options for scalar APIs (feature)
  2. Default to safe conversion for scalars, which appears is not happening (bug)
  3. Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)

@AlenkaF @jorisvandenbossche what do you think?

@rohanjain101
Author

But in the example where the cast is safe, for 18,014,398,509,481,984, shouldn't that succeed in pyarrow since it can be done safely? In that example, the array cast still raises even though the cast is safe. Should it only raise for 18,014,398,509,481,983?

@danepitkin
Member

But in the example where the cast is safe, for 18,014,398,509,481,984, shouldn't that succeed in pyarrow since it can be done safely? In that example, the array cast still raises even though the cast is safe. Should it only raise for 18,014,398,509,481,983?

If pyarrow were to follow the floating point specification exactly, then yes it would. Right now, it seems to be a limitation of the implementation. You could argue that option (3) above should be a bug instead of a feature.

@danepitkin
Member

danepitkin commented Apr 6, 2023

I'd recommend filing an issue with numpy about this, too:

>>> import numpy as np
>>> np.__version__
'1.24.2'

# Bug: No safety error for initial int64 -> float64 conversion
>>> np.array([18014398509481983]).astype("float64", casting="safe").astype(str)
array(['1.8014398509481984e+16'], dtype='<U32')

@danepitkin
Member

Thanks for raising this issue by the way. I don't think I expressed that earlier. Your contributions are appreciated!

@AlenkaF changed the title Inconsistent cast behavior between array and scalar for int64 → [Python] Inconsistent cast behavior between array and scalar for int64 on Apr 11, 2023
@jorisvandenbossche
Member

Numpy defaults to unsafe casting (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html), but it seems it also doesn't perform safety checks properly all of the time.

Sidenote: numpy doesn't really have the same concept of "safe" casting as how we use it in pyarrow. In pyarrow it depends on the values, while in numpy it is just a property of a cast between two dtypes. So to say whether a cast from one dtype to another is safe or not, numpy needs to make some generalization/assumption, and it seems it decided that casting int to float is generally safe (true, except for large ints) and casting float to int is generally not safe (true, except for floats that are whole numbers):

>>> np.can_cast(np.int64(), np.float64(), casting="safe")
True
>>> np.can_cast(np.float64(), np.int64(), casting="safe")
False
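To make the contrast concrete, here is a small sketch of the value-dependent behavior in pyarrow (same int64 -> float64 cast, different outcomes depending on the data; the exact error text may differ by version):

import pyarrow as pa

pa.array([1, 2, 3], type=pa.int64()).cast(pa.float64())    # fine: every value is exactly representable
pa.array([2**53 + 1], type=pa.int64()).cast(pa.float64())  # raises ArrowInvalid: value outside +/- 2**53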

@jorisvandenbossche
Member

For pyarrow, we should probably:

  1. Allow both safe and unsafe conversion options for scalar APIs (feature)
  2. Default to safe conversion for scalars, which appears is not happening (bug)

Yes, fully agreed, I opened a separate issue for specifically this aspect: #35040

  3. Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)

For checking the safety of casting int to float, we indeed use this fixed range:

// ----------------------------------------------------------------------
// Implement fast integer to floating point cast

// These are the limits for exact representation of whole numbers in floating
// point numbers
template <typename T>
struct FloatingIntegerBound {};

template <>
struct FloatingIntegerBound<float> {
  static const int64_t value = 1LL << 24;
};

template <>
struct FloatingIntegerBound<double> {
  static const int64_t value = 1LL << 53;
};

template <typename InType, typename OutType, typename InT = typename InType::c_type,
          typename OutT = typename OutType::c_type,
          bool IsSigned = is_signed_integer_type<InType>::value>
Status CheckIntegerFloatTruncateImpl(const ExecValue& input) {
  using InScalarType = typename TypeTraits<InType>::ScalarType;
  const int64_t limit = FloatingIntegerBound<OutT>::value;
  InScalarType bound_lower(IsSigned ? -limit : 0);
  InScalarType bound_upper(limit);
  return CheckIntegersInRange(input.array, bound_lower, bound_upper);
}

Status CheckForIntegerToFloatingTruncation(const ExecValue& input, Type::type out_type) {
  switch (input.type()->id()) {
    // Small integers are all exactly representable as whole numbers
    case Type::INT8:
    case Type::INT16:
    case Type::UINT8:
    case Type::UINT16:
      return Status::OK();
    case Type::INT32: {
      if (out_type == Type::DOUBLE) {
        return Status::OK();
      }
      return CheckIntegerFloatTruncateImpl<Int32Type, FloatType>(input);
    }
    case Type::UINT32: {
      if (out_type == Type::DOUBLE) {
        return Status::OK();
      }
      return CheckIntegerFloatTruncateImpl<UInt32Type, FloatType>(input);
    }
    case Type::INT64: {
      if (out_type == Type::FLOAT) {
        return CheckIntegerFloatTruncateImpl<Int64Type, FloatType>(input);
      } else {
        return CheckIntegerFloatTruncateImpl<Int64Type, DoubleType>(input);
      }
    }
    case Type::UINT64: {
      if (out_type == Type::FLOAT) {
        return CheckIntegerFloatTruncateImpl<UInt64Type, FloatType>(input);
      } else {
        return CheckIntegerFloatTruncateImpl<UInt64Type, DoubleType>(input);
      }
    }
    default:
      break;
  }
  DCHECK(false);
  return Status::OK();
}

Status CastIntegerToFloating(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
  const auto& options = checked_cast<const CastState*>(ctx->state())->options;
  Type::type out_type = out->type()->id();
  if (!options.allow_float_truncate) {
    RETURN_NOT_OK(CheckForIntegerToFloatingTruncation(batch[0], out_type));
  }
  CastNumberToNumberUnsafe(batch[0].type()->id(), out_type, batch[0].array,
                           out->array_span_mutable());
  return Status::OK();
}

I am not fully sure this is something we should change. First, it is a lot simpler in implementation to just check that values are within the fixed range, compared to checking for the specific integers outside of that range that can still be represented exactly as a float. But it also seems easier for the user to understand and gives more consistent behaviour: with the default safe=True, everything outside of that range fails, rather than the outcome depending on the exact value.
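For comparison, a per-value exactness check (the alternative being discussed) boils down to logic like the following — a rough Python sketch of the idea, not the actual C++ kernel:

def roundtrips_exactly(value: int) -> bool:
    """True if an int64 value survives an int -> float64 -> int round trip unchanged."""
    return int(float(value)) == value

# 2**54 survives the round trip, 2**54 - 1 does not,
# even though both are above the fixed 2**53 bound.
print([roundtrips_exactly(v) for v in (2**54, 2**54 - 1)])  # [True, False]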

@rohanjain101
Author

rohanjain101 commented Apr 15, 2023

Regarding 3, the current error is not very clear:

>>> pa.array([18014398509481984], type=pa.float64())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\array.pxi", line 320, in pyarrow.lib.array
  File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 18014398509481984 is outside of the range exactly representable by a IEEE 754 double precision value

Since this integer meets the IEEE 754 specification for what can be represented as a double, should the error also clarify that it's an internal limitation?

@westonpace
Member

Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)

The options within the C++ lib are very fine-grained already:

  bool allow_int_overflow;
  bool allow_time_truncate;
  bool allow_time_overflow;
  bool allow_decimal_truncate;
  bool allow_float_truncate;
  // Indicate if conversions from Binary/FixedSizeBinary to string must
  // validate the utf8 payload.
  bool allow_invalid_utf8;

If we are discussing ideal behavior then I think something like...

// Allow converting integers to floats when the integer cannot be exactly represented by
// an IEEE-754 float and must be rounded.
bool allow_float_inexact;

...would be very reasonable. If we wanted to be even more extreme 😄 we could have:

enum class IeeeRoundingMode : int8_t {
  TIE_TO_EVEN = 0,
  TIE_AWAY_FROM_ZERO = 1,
  TOWARD_ZERO = 2,
  TOWARD_POSITIVE_INFINITY = 3,
  TOWARD_NEGATIVE_INFINITY = 4,
  ERROR = 5
};

However, all of point number 3 sounds like a separate issue from this one.

@jorisvandenbossche
Member

If we are discussing ideal behavior then I think something like...

// Allow converting integers to floats when the integer cannot be exactly represented by
// an IEEE-754 float and must be rounded.
bool allow_float_inexact;

It's not super clear from the name, but we already use the existing allow_float_truncate option for the int->float cast (the name suggests to me that it is mostly about float->int truncation when the float is not a "round" float).

>>> pa.array([18014398509481984], type=pa.int64()).cast(pa.float64())
...
ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992

>>> pa.array([18014398509481984], type=pa.int64()).cast(options=pc.CastOptions(pa.float64(), allow_float_truncate=True))
<pyarrow.lib.DoubleArray object at 0x7f4e1f7dee00>
[
  1.8014398509481984e+16
]

But your suggestion would be that an option like allow_float_inexact would use different logic to determine when the cast is allowed (actually checking whether the result value is inexact, instead of checking whether it is outside the generally safe range)?

(personally I would say that allow_float_inexact is a better name for what we currently call allow_float_truncate for the int->float cast)

If we wanted to be even more extreme 😄 we could have:

enum class IeeeRoundingMode : int8_t {
  TIE_TO_EVEN = 0,
  ...

I would say that if a user wants to control the rounding, they should use the round kernel instead of a cast?

@westonpace
Member

I would say that if a user wants to control the rounding, they should use the round kernel instead of a cast?

The rounding kernel allows for conversion from one valid IEEE value to another. This rounding is about going from an infinite precision value that cannot be represented in IEEE to a valid IEEE value.

I'll walk my comment back though.

Technically, IEEE rounding is something that has to be considered in just about any operation (e.g. addition, subtraction) because the infinite-precision result isn't representable.

In practice, we'd probably be better off just saying we always use TIE_TO_EVEN and it's not configurable. This is what every other engine seems to do (TIE_TO_EVEN is the default for most / all modern CPUs).
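For reference, Python's own int -> float conversion already uses ties-to-even, so the effect is easy to see (a minimal illustration, unrelated to any Arrow option):

>>> float(2**53 + 1)  # halfway between 2**53 and 2**53 + 2, rounds to the even significand (down)
9007199254740992.0
>>> float(2**53 + 3)  # halfway between 2**53 + 2 and 2**53 + 4, rounds to the even significand (up)
9007199254740996.0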

It's not super clear from the name, but so we already use the existing allow_float_truncate options for the int->float cast (the name suggests to me this is mostly about float->int truncating the float if it is not a "round" float).

I didn't realize this. So this issue is about allowing these sorts of casts when safe=true and the value happens to be representable? I thought this was an ask for a new kind of unsafe cast.

danepitkin added a commit to danepitkin/arrow that referenced this issue May 10, 2023
danepitkin added a commit to danepitkin/arrow that referenced this issue May 11, 2023
AlenkaF pushed a commit that referenced this issue May 11, 2023

### Rationale for this change

Scalar cast should use the compute kernel just like Arrays, instead of its own custom implementation.

### Are these changes tested?

Added test cases for GH-35370, GH-34901, and GH-35040

### Are there any user-facing changes?

The Scalar.cast() API is enhanced and backwards compatible. 
* Closes: #35040

Authored-by: Dane Pitkin <[email protected]>
Signed-off-by: Alenka Frim <[email protected]>
@danepitkin
Member

danepitkin commented May 11, 2023

So this issue is about allowing these sorts of casts when safe=true and the value happens to be representable?

Yes, from my understanding.

The original issue reported is now fixed (#35395). We can either repurpose this issue for the above feature request, or close this issue and file a new one.
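With the scalar cast going through the compute kernel, the expected behavior is roughly the following — a sketch assuming the post-fix Scalar.cast accepts the same safe/options arguments as Array.cast (see the linked PR for the actual signature):

import pyarrow as pa

scal = pa.scalar(6312878760374611856, type=pa.int64())
scal.cast(pa.float64())              # now expected to raise ArrowInvalid by default (safe cast)
scal.cast(pa.float64(), safe=False)  # expected to succeed, matching Array.cast(..., safe=False)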

@danepitkin
Member

I captured the feature request to safely cast representable int64s larger than 2^53 to float64 here: #35563.

Closing this issue since the original bug report is resolved in #35395

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023