Convert String to DecimalType without casting to FloatType [databricks] #4081
What are the known edge cases? It would be very nice to know why we have to include this, as it is expensive. Looking at the code, it appears that is_fixed_point cuts off early if it sees something it does not expect, so it might be nice to have a follow-on issue to actually fix that, either in cuDF or in Spark-specific code.
Where is it cutting off early?
Are you saying that if I pass c = ["", "1.2", "3", ""] and the boolean vector is initialized to true, then
d = c.is_fixed_point() = [false, true, true, true]
and basically everything after the first value in d is bogus?
Sorry, I was wrong. Reading through the code, it looked like the check ignored anything after it saw something it didn't expect, but that is not true.
It looks like "1.5ABC" will result in false being returned. If that is true, then I don't think we need the regular expression check at all any more. Is that what triggered this? Why do we need the regexp? What "edge cases" does it cover that are not covered by the existing type check code?
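To make the point about per-row validation concrete, here is a minimal sketch in plain Python. It is not cuDF's implementation: `looks_like_fixed_point` and `fixed_point_mask` are hypothetical helpers, and the accepted grammar (optional sign, digits, at most one decimal point) is an assumption chosen for illustration. It only shows that a check which scans the whole string rejects trailing garbage such as "1.5ABC" on its own, and that each row's result is independent of the rows before it.

```python
def looks_like_fixed_point(s: str) -> bool:
    """Return True only if the entire string is an optionally signed decimal.

    Hypothetical stand-in for a per-row validity check; the accepted
    grammar here is an assumption, not cuDF's documented behavior.
    """
    i = 0
    if i < len(s) and s[i] in "+-":
        i += 1  # skip a single leading sign
    digits = 0
    seen_point = False
    while i < len(s):
        ch = s[i]
        if "0" <= ch <= "9":
            digits += 1
        elif ch == "." and not seen_point:
            seen_point = True
        else:
            # Any unexpected character (letters, whitespace, a second '.')
            # invalidates this row only.
            return False
        i += 1
    return digits > 0  # rejects "", "+", "." and similar


def fixed_point_mask(values):
    """Evaluate every row independently and return one boolean per row."""
    return [looks_like_fixed_point(v) for v in values]


mask = fixed_point_mask(["", "1.2", "3", "", "1.5ABC"])
```

With a check shaped like this, a separate regular expression pass over the same column would only repeat work the validity mask already does.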
You are right, we don't need the regex check anymore, since cuDF reports everything we need. The check is still relevant for floats, because it needs to convert "infinity" => "inf".
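The float-only normalization mentioned here could look roughly like the sketch below. `normalize_infinity` is a hypothetical helper, and the sign handling and case-insensitive matching are assumptions that the thread does not spell out.

```python
def normalize_infinity(s: str) -> str:
    """Rewrite the long spelling "infinity" to "inf", keeping any sign.

    Hypothetical helper: case-insensitive matching and sign preservation
    are assumptions made for this sketch.
    """
    sign = ""
    body = s
    if body[:1] in ("+", "-"):
        sign, body = body[0], body[1:]
    if body.lower() == "infinity":
        return sign + "inf"
    return s  # every other input passes through unchanged


normalized = [normalize_infinity(v) for v in ["infinity", "-Infinity", "1.5", ""]]
```

Decimal casts have no infinity to represent, which is consistent with this step being needed only on the float path.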
So in ANSI mode, is this not an error? Does the regular expression not match this? It sure looks like the regexp would error out on anything that has any whitespace in it at all.
This is a very good point. ANSI mode doesn't like spaces and throws an ANSI exception. I will file an issue for floats as well.
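As a rough illustration of the distinction being discussed, the sketch below contrasts the two modes in plain Python. It takes the statement above at face value (whitespace is invalid and ANSI mode raises) and assumes that invalid rows become null when ANSI mode is off. `cast_strings_to_decimal` and its `ansi_enabled` flag are hypothetical names, not the plugin's API, and Python's `Decimal` parser accepts forms (exponents, underscores) that the real cast may not.

```python
from decimal import Decimal, InvalidOperation


def _try_parse(value: str):
    """Return a finite Decimal, or None if the string is not acceptable."""
    # Decimal() silently strips surrounding whitespace, so reject it first.
    if any(ch.isspace() for ch in value):
        return None
    try:
        parsed = Decimal(value)
    except InvalidOperation:
        return None
    # Decimal() also accepts "NaN" and "Infinity"; treat those as invalid.
    return parsed if parsed.is_finite() else None


def cast_strings_to_decimal(values, ansi_enabled: bool):
    """Cast each string; invalid rows raise in ANSI mode, else become None."""
    out = []
    for v in values:
        parsed = _try_parse(v)
        if parsed is None and ansi_enabled:
            raise ValueError(f"invalid input for decimal cast: {v!r}")
        out.append(parsed)
    return out


lenient = cast_strings_to_decimal(["1.5", " 2", "abc"], ansi_enabled=False)
```

Calling the same function with `ansi_enabled=True` on `[" 2"]` raises `ValueError` instead of producing a null.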
This is actually an unnecessary check, as \r is being checked as a string, which would be caught by the regex check.