Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Refactor string conversion check #7557

Closed
ttnghia opened this issue Mar 10, 2021 · 10 comments · Fixed by #7599
Closed

[FEA] Refactor string conversion check #7557

ttnghia opened this issue Mar 10, 2021 · 10 comments · Fixed by #7599
Assignees
Labels
0 - Blocked Cannot progress due to external reasons feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Mar 10, 2021

Currently, there are functions to check whether a string is a valid representation of a number (integer/fixed point/float etc). However, those functions are scattered around and their purposes are inconsistent.

  • In cudf/strings/string.cuh and cudf/strings/char_types.hpp, there are is_integer and is_float functions, which check whether a string has the correct pattern so it can be converted into a valid number. However, those functions do not do bound check.
  • In strings/convert/convert_integer.hpp, there is function is_hex to check if a string can be converted to a hex number. Again, no bound check.
  • In cudf/strings/convert/convert_fixed_point.hpp, there is function is_fixed_point which does both pattern check and bound check.

I want to refactor/reorganize those functions to enforce consistency. We should either group them together in char_types.hpp, or should put them in their corresponding strings/conver/convert_xxx places. In addition, since we do bound check for fixed point numbers, we should also support bound check for the other types. If not, we should add something to indicate whether a function supports bound check or not. Otherwise, by simply calling is_integer or is_fixed_point we cannot know which function does bound check and which one does not.

@ttnghia ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels Mar 10, 2021
@ttnghia ttnghia added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Mar 10, 2021
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 10, 2021

There are other feature requests such as "add is_valid_integer" (#7094, #7080, #5110). As such, refactoring and adding bound check support for is_integer is necessary. If we don't do that but instead add a new function like is_valid_element (#7094) then things will become more and more diverge.

@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Mar 10, 2021
@ttnghia ttnghia added the 0 - Blocked Cannot progress due to external reasons label Mar 10, 2021
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 10, 2021

I need to work on a function to check if a string is a valid integer. That means, combining bound check with the current is_integer function. Thus, I need your comments/suggestions/instructions to resolve this out before I can move forward.

@davidwendt
Copy link
Contributor

I would like to move the cudf::strings::is_integer() code to the existing convert/convert_integers.hpp/cu and the cudf::strings::is_float() code to the existing convert/convert_floats.hpp/cu.

I would also like to remove the cudf::strings::all_integer() and cudf::strings::all_float(). These are not being used.

If the cudf::strings::is_integer() is to check for overlfow, then that check would be done outside of the device function cudf::strings::string::is_integer() (defined in string.cuh). The same goes for cudf::strings::is_float() the overflow check would be done outside of the device function cudf::strings::string::is_float().

If not, we should add something to indicate whether a function supports bound check or not.

The documentation should indicate if overflow checking is done or not. I don't believe we should be changing all the is_ functions to do overflow checking arbitrarily.

@ttnghia ttnghia self-assigned this Mar 10, 2021
@davidwendt
Copy link
Contributor

Since the current cudf::strings::to_float() converts any overflow to infinity (or -infinity) does Spark actually need cudf::strings::is_float() to check for overflow?

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 15, 2021

Good point. Checking overflow for float is more difficult than for integers. @revans2, @andygrove?

@ttnghia ttnghia changed the title [FEA] Refactor char_types and strings/convert [FEA] Refactor string conversion check Mar 15, 2021
@ttnghia ttnghia linked a pull request Mar 15, 2021 that will close this issue
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 15, 2021

A PR for this has been submitted---just move the is_integer and is_float function around (#7599) according to @davidwendt 's suggestion. The new functions for bound checking will be added later in a separate PR.

@revans2
Copy link
Contributor

revans2 commented Mar 15, 2021

I just confirmed that for the Spark use case we don't care about overflow checking on floating point values. Even in ANSI operation is enabled when you overflow a floating point value Inf or -Inf is returned. Similar for numbers that are too small for a float 0.0 is returned. The only checking we care about is making sure that the format is correct.

@davidwendt
Copy link
Contributor

The current cudf::strings::is_integer() does not take an data_type parameter.
Would it be important to check for overflow on all integer types/sizes or could we just hardcode the overflow check to INT64 ?

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 16, 2021

I planned to rewrite the is_integer function to take a data_type parameter similar to is_fixed_point(). My implementation is also based on it.

@davidwendt
Copy link
Contributor

I planned to rewrite the is_integer function to take a data_type parameter similar to is_fixed_point(). My implementation is also based on it.

This will effect the cython/python code that currently uses it. Also, there are 8 integer types which would need to be type-dispatched. (Fixed-point only has two). That is alot of extra generated code if Spark only cares about INT64 say.

rapids-bot bot pushed a commit that referenced this issue Mar 17, 2021
This addresses #7557.

In summary:
 * Move `cudf::strings::is_integer()` code from `strings/chars_types.*` to `strings/convert/convert_integers.hpp/cu`
 * Move `cudf::strings::is_float()` code from `strings/chars_types.*` to `strings/convert/convert_floats.hpp/cu`
 * Remove `cudf::strings::all_integer()` and `cudf::strings::all_float()`

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - GALI PREM SAGAR (@galipremsagar)
  - Jason Lowe (@jlowe)
  - Jake Hemstad (@jrhemstad)
  - David (@davidwendt)

URL: #7599
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Blocked Cannot progress due to external reasons feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants