[BUG] Cast from string to long should return null if string is not a valid long #5110
Comments
I assume you mean |
Oops, yes, that is what I meant. |
This is working as intended and documented here:
@kkraus14 @harrism should this behavior be changed? Looks like Pandas throws an exception and does not convert these rows to null:
We may want to introduce a new parameter for the specific behavior of converting out-of-range values to null. |
I think the last solution David mentions is probably the best way. Similar to @kkraus14's answer here (#5160 (comment)), we can enable users to check that their data is all within range and get a mask of the out-of-range values. This way, well-formed data doesn't pay the (likely high) performance cost of the checks at cast time. |
Thanks for the responses. I think "introduce a new parameter for the specific behavior of converting out-of-range values to null" would be the best option in this case. The existing |
Passing strings with floating-point notation including decimals to a string-to-integer conversion function is not supported in C++, Java, or Python.
C++:
The C++ std behavior is also the behavior of Python and Java throw exceptions if the decimal or other invalid character is included in the string.
Java:
I would recommend using the |
Yes, Java does not support it directly, but Spark does. Isn't it wonderful? But that is not a big deal; we can always run a regexp to strip off the |
Sounds like you should convert to floats first and then cast to integers. This will likely be less of a penalty than using a regex. |
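A minimal sketch of the "convert to floats first, then cast to integers" idea for a single string, in plain standard C++ rather than libcudf. The helper name `float_string_to_int64` is hypothetical and is not an API mentioned in this thread; the range check before the cast is an added assumption, since casting an out-of-range double to an integer is undefined behavior in C++.

```cpp
#include <cerrno>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper (not a libcudf API): parse the string as a double,
// then truncate toward zero. Returns std::nullopt when the string is not a
// complete floating-point literal or the value does not fit in int64.
std::optional<std::int64_t> float_string_to_int64(const std::string& s) {
    if (s.empty()) return std::nullopt;
    errno = 0;
    char* end = nullptr;
    const double d = std::strtod(s.c_str(), &end);
    // Reject trailing garbage and values outside the double range.
    // Note: strtod also accepts leading whitespace, "inf", "nan" and hex floats.
    if (end != s.c_str() + s.size() || errno == ERANGE) return std::nullopt;
    if (!std::isfinite(d)) return std::nullopt;
    // A double can hold 2^63 exactly, so compare against it directly:
    // valid int64 values satisfy -2^63 <= d < 2^63.
    constexpr double kTwo63 = 9223372036854775808.0;
    if (d < -kTwo63 || d >= kTwo63) return std::nullopt;
    return static_cast<std::int64_t>(d);  // truncates toward zero
}
```

One caveat with this route: a double has a 53-bit significand, so integer strings larger in magnitude than 2^53 may not round-trip exactly.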
So @davidwendt what is the decision, or what still needs to be decided? |
To all: I'm working on this. The current I propose to return a pair of return values instead. Similar to many C++ std APIs, I propose to have this API (for an individual string row)
In the returned pair, the first value is the parsed integer, while the second value is a boolean indicating whether the parsing was successful or not. We can regenerate the null mask for the column based on these boolean values for all rows. Please let me know if you have any suggestions. |
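To make the pair-returning idea concrete, here is a rough host-side sketch in standard C++17. The names `parse_int64`, `ParsedColumn`, and `parse_column` are hypothetical, the proposed signature itself is not shown in this thread, and the real implementation would run per row on the GPU. The sketch leans on `std::from_chars`, which reports out-of-range input instead of wrapping.

```cpp
#include <charconv>
#include <cstdint>
#include <string_view>
#include <system_error>
#include <utility>
#include <vector>

// Hypothetical per-row parser: first = parsed value (0 on failure),
// second = whether the whole string was a valid in-range int64.
// std::from_chars accepts a leading '-' but not '+' or whitespace.
std::pair<std::int64_t, bool> parse_int64(std::string_view s) {
    std::int64_t value = 0;
    const char* last = s.data() + s.size();
    const auto [ptr, ec] = std::from_chars(s.data(), last, value);
    // Success requires no error (e.g. overflow) and full consumption.
    const bool ok = (ec == std::errc{}) && (ptr == last);
    return {ok ? value : 0, ok};
}

// Hypothetical column result: parsed values plus a validity (null) mask.
struct ParsedColumn {
    std::vector<std::int64_t> values;
    std::vector<bool> valid;
};

// Rebuild the null mask from the per-row success flags.
ParsedColumn parse_column(const std::vector<std::string_view>& rows) {
    ParsedColumn out;
    out.values.reserve(rows.size());
    out.valid.reserve(rows.size());
    for (const auto row : rows) {
        const auto [value, ok] = parse_int64(row);
        out.values.push_back(value);
        out.valid.push_back(ok);
    }
    return out;
}
```

With this shape, `parse_int64("9223372036854775808")` reports failure rather than returning a wrapped value, and the caller decides whether a failed row becomes null.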
I think previous discussion pointed to the approach in #7080 where we have a libcudf should not match Spark behavior or Pandas behavior, but instead should provide the necessary primitives for downstream usage to reasonably implement their desired behavior. |
We will have that API separately, just to check if a string is a valid integer. That API is nearly as expensive as parsing an integer. |
That complicates the implementation of the existing |
Thanks, Jake.
|
…eger conversion (#7642)
This PR addresses #5110, #7080, and reworks #7094. It adds the function `cudf::strings::is_integer` that can check if strings can be correctly converted into integer values. Underflow and overflow are also taken into account. Note that this `cudf::strings::is_integer` is different from the existing `cudf::strings::string::is_integer`, which only checks for pattern and does not care about under/overflow.
Examples:
```
s                    = { "eee", "-200", "-100", "127", "128", "1.5", NULL}
is_integer(s, INT8)  = {     0,      0,      1,     1,     0,     0, NULL}
is_integer(s, INT32) = {     0,      1,      1,     1,     1,     0, NULL}
```
Authors:
- Nghia Truong (@ttnghia)
Approvers:
- David (@davidwendt)
- Jake Hemstad (@jrhemstad)
- Mark Harris (@harrism)
URL: #7642
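The range-aware check described in the commit message can be approximated on the host in standard C++17. This is only an illustration of the semantics in the examples, not the libcudf implementation: `fits_integer` and `is_integer_column` are hypothetical names, and `std::optional` stands in for null rows.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical check: true only if the whole string parses as an integer
// that fits in T (so overflow and underflow count as "not an integer").
template <typename T>
bool fits_integer(const std::string& s) {
    T value{};
    const char* last = s.data() + s.size();
    const auto [ptr, ec] = std::from_chars(s.data(), last, value);
    return ec == std::errc{} && ptr == last;
}

// Apply the check to a column; null rows stay null in the output.
template <typename T>
std::vector<std::optional<bool>> is_integer_column(
    const std::vector<std::optional<std::string>>& rows) {
    std::vector<std::optional<bool>> out;
    out.reserve(rows.size());
    for (const auto& row : rows) {
        if (row) {
            out.push_back(fits_integer<T>(*row));
        } else {
            out.push_back(std::nullopt);
        }
    }
    return out;
}
```

Calling `is_integer_column<std::int8_t>` and `is_integer_column<std::int32_t>` on the strings from the examples should give the same 0/1 pattern as the two `is_integer` rows there.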
@andygrove does PR #7642 resolve this issue? |
Yes, from the plugin we can call the new |
Fixed by #7642 |
Describe the bug
When casting from string to long (signed 64-bit in Java), I would expect the cast to return null if the provided string is less than Long.MIN_VALUE or greater than Long.MAX_VALUE. Currently, it returns an incorrect value.
For example, casting "9223372036854775808" (Long.MAX_VALUE + 1) to a long returns -9223372036854775808 (Long.MIN_VALUE).
Steps/Code to reproduce bug
Add the following test to ColumnVectorTest.
Expected behavior
Casting a string to long where the string is not a valid long in the range Long.MIN_VALUE to Long.MAX_VALUE should return null rather than overflow.
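For illustration only, the wraparound can be reproduced with a deliberately unchecked digit-accumulating parser in standard C++. This is an assumption about the mechanism, not libcudf's actual code; `naive_parse_int64` is a hypothetical name, and the final cast relies on two's-complement conversion (guaranteed since C++20, and the behavior of mainstream compilers before that).

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical unchecked parser: accumulates digits with no range check.
// Assumes the input is an optional '-' followed by decimal digits only.
std::int64_t naive_parse_int64(std::string_view s) {
    std::uint64_t acc = 0;  // unsigned arithmetic wraps modulo 2^64
    bool negative = false;
    std::size_t i = 0;
    if (!s.empty() && s[0] == '-') {
        negative = true;
        i = 1;
    }
    for (; i < s.size(); ++i) {
        acc = acc * 10 + static_cast<std::uint64_t>(s[i] - '0');
    }
    if (negative) acc = 0 - acc;
    // Reinterpreting the bits as signed is where 2^63 turns into INT64_MIN.
    return static_cast<std::int64_t>(acc);
}
```

Here `naive_parse_int64("9223372036854775808")` yields the minimum signed 64-bit value, matching the incorrect result reported above, whereas a range-checked parser would flag the row as invalid so it can be set to null.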
Environment overview (please complete the following information)
Environment details
Additional context
None.