Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert String to DecimalType without casting to FloatType [databricks] #4081

Merged
merged 9 commits into from
Nov 18, 2021

Conversation

razajafri
Copy link
Collaborator

@razajafri razajafri commented Nov 11, 2021

  • Added a castStringToDecimal method

Fixes #2019
depends on /rapidsai/cudf#9658

Signed-off-by: Raza Jafri <[email protected]>
Signed-off-by: Raza Jafri <[email protected]>
@razajafri razajafri changed the title Sp 2019 string decimal Convert String to DecimalType without casting to FloatType Nov 11, 2021
@razajafri razajafri self-assigned this Nov 11, 2021
@razajafri razajafri marked this pull request as draft November 11, 2021 05:17
@razajafri razajafri changed the base branch from branch-21.12 to branch-22.02 November 15, 2021 18:59
Signed-off-by: Raza Jafri <[email protected]>
Signed-off-by: Raza Jafri <[email protected]>
@razajafri
Copy link
Collaborator Author

@revans2 PTAL

@sameerz sameerz added the feature request New feature or request label Nov 16, 2021
@sameerz sameerz added this to the Nov 15 - Nov 26 milestone Nov 16, 2021
input: ColumnView,
ansiEnabled: Boolean): ColumnVector = {

// This regex gets applied to filter out known edge cases that would result in incorrect
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the known edge cases? would be very nice to know what we have to include this as it is expensive. Looking at the code it appears that is_fixed_point cuts off early if it sees something that it does not expect, so it might be nice to have a follow on issue to actually fix that, either in CUDF or in Spark specific code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is it cutting off early?

Are you saying if I pass c = ["", "1.2", "3", ""] and if the boolean vector is initialized to true
d = c.is_fixed_point() = [false, true, true, true]

basically everything after the first value in d is bogus?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was wrong. Reading through the code it looked like the check ignored anything after it saw something it didn't expect, but that is not true.

It looks like "1.5ABC" will result in a false being returned. Which if that is true, then I don't think we need the regular expression check at all any more. That is what triggered this? Why do we need the regexp. What "edge cases" does it cover that are not covered by the existing type check code?

Copy link
Collaborator Author

@razajafri razajafri Nov 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right we don't need the regex check anymore as the cudf is reporting everything we need. This check is still relevant in case of a float because it needs to convert the "infinity" => "inf"

withResource(input.strip()) { stripped =>
withResource(GpuScalar.from(null, DataTypes.StringType)) { nullString =>
// filter out strings containing breaking whitespace
val withoutWhitespace = withResource(ColumnVector.fromStrings("\r", "\n")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in ANSI mode is this not an error? Does the regular expression not match this, because it sure looks like the regexp would error out on anything that has any white space in it at all?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very good point. ANSI doesn't like spaces, and throws an ansi exception. I will file an issue for Floats as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually an unnecessary check as \r is being checked as a string which would be caught by the regex check.

@razajafri razajafri marked this pull request as ready for review November 16, 2021 16:59
@razajafri
Copy link
Collaborator Author

build

@razajafri razajafri changed the title Convert String to DecimalType without casting to FloatType Convert String to DecimalType without casting to FloatType [databricks] Nov 17, 2021
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

build

@revans2 revans2 merged commit 94d6e36 into NVIDIA:branch-22.02 Nov 18, 2021
@razajafri razajafri deleted the SP-2019-string-decimal branch November 18, 2021 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] The result of casting string to decimal is not identical to Apache Spark
4 participants