[Python] Inconsistent cast behavior between array and scalar for int64 #34901
Hi @rohanjain101, I am able to reproduce this example on pyarrow v11 using macOS 13.3. What you are experiencing is the difference between safe vs unsafe casting, since the number you chose probably cannot be fully represented in the new type. It is not true that all int64 values can be safely converted to float64. Due to the way precision works in floating point, there are numbers that may be skipped that could otherwise be represented by int64. See https://en.wikipedia.org/wiki/Double-precision_floating-point_format, which notes that only integers in the range −2^53 to 2^53 can be represented exactly in a double.
It appears the scalar cast defaults to allow unsafe casting, while the array defaults to safe casting. You can allow unsafe casting in the array like this:
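A minimal sketch of what that looks like (assuming the `safe` flag accepted by `Array.cast`; the value is the one from the original report):

```python
import pyarrow as pa

arr = pa.array([18014398509481984], type=pa.int64())
# safe=False turns off the value-range check, so the cast goes through even
# though values outside +/-2**53 may not round-trip exactly.
arr.cast(pa.float64(), safe=False)
```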
There are no options to choose safe vs unsafe casting in the scalar APIs at the moment. The documentation does state that the scalar will perform a safe cast, though, which it is not doing: https://arrow.apache.org/docs/python/generated/pyarrow.Int64Scalar.html#pyarrow.Int64Scalar. This is either a bug in scalar safe casting or the documentation is wrong. Ideally, Scalars would also allow you to choose safe vs unsafe casting with an option. Either way, some more investigation is still needed.
@danepitkin thank you for the clarification. In numpy however, the cast succeeds, it seems as if full value is preserved:
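Presumably something along these lines (a sketch of the numpy behaviour being described, using the value from the original report):

```python
import numpy as np

v = np.array([18014398509481984], dtype=np.int64)
f = v.astype(np.float64)  # numpy's astype defaults to casting="unsafe", so this succeeds
int(f[0])                 # 18014398509481984 -- this particular value round-trips exactly
```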
Is there an internal difference in how double values are stored between arrow and numpy that would cause the difference?
In your example, 18,014,398,509,481,984 can be converted to float64 safely according to the floating point specification so it is not a good example to use. Instead let's try 18,014,398,509,481,983, which is not a multiple of 2 (required by integers between 2^53 and 2^54 for safe conversion). You will lose data in this numpy cast. (And yes, my guess is they adhere to the floating point spec slightly differently purely based on the different behavior).
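For example (a sketch of the silent data loss being described):

```python
import numpy as np

v = np.array([18014398509481983], dtype=np.int64)  # 2**54 - 1, odd, not representable as float64
f = v.astype(np.float64)                           # succeeds silently
int(f[0])                                          # 18014398509481984 -- rounded to the nearest representable integer
```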
Numpy defaults to unsafe casting (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html), but it seems it also doesn't perform safety checks properly all of the time.
For pyarrow, we should probably:
@AlenkaF @jorisvandenbossche what do you think?
But in the example where the cast is safe, for 18,014,398,509,481,984, shouldn't that then succeed in pyarrow if it can be done safely? In my example, the array case is still raising even if the cast is safe. Should it only raise for 18,014,398,509,481,983?
If pyarrow were to follow the floating point specification exactly, then yes it would. Right now, it seems to be a limitation of the implementation. You could argue that option (3) above should be a bug instead of a feature.
I'd recommend filing an issue with numpy about this, too.
Thanks for raising this issue by the way. I don't think I expressed that earlier. Your contributions are appreciated!
Sidenote: numpy doesn't really have the same concept of "safe" casting as how we use it in pyarrow. In pyarrow safety depends on the values, while in numpy it is just a property of a cast between two dtypes. So to say whether a cast from one dtype to another is safe, numpy needs to make some generalization/assumption, and it seems it decided that casting int to float is generally safe (indeed, except for large ints) and casting float to int is generally not safe (indeed, except if you have rounded floats):

>>> np.can_cast(np.int64(), np.float64(), casting="safe")
True
>>> np.can_cast(np.float64(), np.int64(), casting="safe")
False
Yes, fully agreed, I opened a separate issue for specifically this aspect: #35040
For checking the safety of casting int to float, we indeed use this fixed range: `arrow/cpp/src/arrow/compute/kernels/scalar_cast_numeric.cc`, lines 171 to 250 (at e488942).
I am not fully sure this is something we should change. First, I think it is a lot simpler in implementation to just check for values within the range, compared to checking for certain integers that can still be represented as float outside of that range. But also for the user this seems easier to understand and gives more consistent behaviour? (Just everything outside of that range will fail with the default safe casting.)
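A sketch of that fixed-range behaviour (assuming the bounds reported in the ArrowInvalid message shown below, i.e. ±2**53):

```python
import pyarrow as pa

# Within the fixed +/-2**53 range the default (safe) cast succeeds.
pa.array([2**52], type=pa.int64()).cast(pa.float64())

# Outside of it the safe cast raises ArrowInvalid, even for a value like 2**54
# that is exactly representable as a double.
pa.array([2**54], type=pa.int64()).cast(pa.float64())  # raises ArrowInvalid
```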
Regarding 3, the current error is not very clear.
Since this integer meets the IEEE 754 specification for what can be represented as a double, should the error also clarify that it's an internal limitation?
The options within the C++ lib are very fine-grained already:
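For a sense of that granularity from the Python side, `pyarrow.compute.CastOptions` exposes per-category flags; a sketch (the keyword names are assumed from the pyarrow docs):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Each flag relaxes one specific class of "unsafe" conversion.
options = pc.CastOptions(
    pa.float64(),
    allow_int_overflow=False,
    allow_time_truncate=False,
    allow_float_truncate=True,  # permit int -> float casts that may lose precision
)
pa.array([18014398509481984], type=pa.int64()).cast(options=options)
```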
If we are discussing ideal behavior then I think something like...
...would be very reasonable. If we wanted to be even more extreme 😄 we could have:
However, all of point number 3 sounds like a separate issue from this one.
It's not super clear from the name, but we already use the existing `allow_float_truncate` option for this:

>>> pa.array([18014398509481984], type=pa.int64()).cast(pa.float64())
...
ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992

>>> pa.array([18014398509481984], type=pa.int64()).cast(options=pc.CastOptions(pa.float64(), allow_float_truncate=True))
<pyarrow.lib.DoubleArray object at 0x7f4e1f7dee00>
[
  1.8014398509481984e+16
]

But your suggestion would be that an option like (personally I would say that
I would say that if a user wants to control the rounding, they should use the round kernel instead of a cast?
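For reference, a sketch of what that would look like with the round kernel (using `pyarrow.compute.round`; the `round_mode` value is assumed from the pyarrow docs):

```python
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([2.5, 3.5], type=pa.float64())
# "half_to_even" is the ties-to-even ("banker's rounding") mode.
pc.round(arr, ndigits=0, round_mode="half_to_even")
# -> [2, 4]
```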
The rounding kernel allows for conversion from one valid IEEE value to another. This rounding is about going from an infinite precision value that cannot be represented in IEEE to a valid IEEE value. I'll walk my comment back though. Technically, IEEE rounding is something that has to be considered in just about any operation (e.g. addition, subtraction) because the infinite-precision result isn't representable. In practice, we'd probably be better off just saying we always use TIE_TO_EVEN and it's not configurable. This is what every other engine seems to do (TIE_TO_EVEN is the default for most / all modern CPUs).
I didn't realize this. So this issue is about allowing these sorts of casts when the value is exactly representable?
### Rationale for this change

Scalar cast should use the compute kernel just like Arrays, instead of its own custom implementation.

### Are these changes tested?

Added test cases for GH-35370, GH-34901, and GH-35040.

### Are there any user-facing changes?

The Scalar.cast() API is enhanced and backwards compatible.

* Closes: #35040

Authored-by: Dane Pitkin <[email protected]>
Signed-off-by: Alenka Frim <[email protected]>
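If the enhanced Scalar.cast() mirrors Array.cast() (an assumption based on the description above, not a confirmed signature), scalar casts would accept the same safety controls, e.g.:

```python
import pyarrow as pa
import pyarrow.compute as pc

s = pa.scalar(18014398509481984, type=pa.int64())
# Hypothetical usage, assuming Scalar.cast() now takes safe/options like Array.cast().
s.cast(pa.float64(), safe=False)
s.cast(options=pc.CastOptions(pa.float64(), allow_float_truncate=True))
```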
Yes, from my understanding. The original issue reported is now fixed (#35395). We can either repurpose this issue for the above feature request, or close this issue and file a new one.
Describe the bug, including details regarding any error messages, version, and platform.
Behavior is not consistent when casting between array and scalar. The array behavior of raising does not seem correct, as it seems an int64 should always be able to be cast to float64.
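A sketch of the inconsistency being reported (reconstructed from the discussion above; on the reported pyarrow version the scalar path exposes no safety option):

```python
import pyarrow as pa

x = 18014398509481984  # 2**54, outside pyarrow's +/-2**53 safe-cast range

# Scalar cast: performs the conversion without raising.
pa.scalar(x, type=pa.int64()).cast(pa.float64())

# Array cast: raises ArrowInvalid under the default safe casting.
pa.array([x], type=pa.int64()).cast(pa.float64())
```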
Component(s)
Python