[FEA] Refactor string conversion check #7557

ttnghia · 2021-03-10T21:12:31Z

Currently, there are several functions that check whether a string is a valid representation of a number (integer, fixed point, float, etc.). However, those functions are scattered across different headers and their behavior is inconsistent.

- In cudf/strings/string.cuh and cudf/strings/char_types.hpp, there are is_integer and is_float functions, which check whether a string has the correct pattern to be converted into a valid number. However, those functions do not do bounds checking.
- In strings/convert/convert_integer.hpp, there is an is_hex function that checks whether a string can be converted to a hex number. Again, no bounds checking.
- In cudf/strings/convert/convert_fixed_point.hpp, there is an is_fixed_point function that does both the pattern check and the bounds check.

I want to refactor/reorganize those functions to enforce consistency. We should either group them together in char_types.hpp, or put them in their corresponding strings/convert/convert_xxx places. In addition, since we do bounds checking for fixed-point numbers, we should also support bounds checking for the other types. If not, we should add something to indicate whether a function supports bounds checking. Otherwise, a caller of is_integer or is_fixed_point cannot tell from the name alone which function does bounds checking and which one does not.
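To make the distinction between a pattern check and a bounds check concrete, here is a minimal host-side sketch in plain C++. It is not the libcudf implementation: `has_integer_format` and `fits_in_int64` are hypothetical names, and the accepted pattern (an optional sign followed by decimal digits) is an assumption made for illustration.

```cpp
#include <cstdint>
#include <limits>
#include <string_view>

// Hypothetical helper: pattern check only (optional sign, then one or more digits).
// It accepts "99999999999999999999" even though that value does not fit in 64 bits.
bool has_integer_format(std::string_view s)
{
  if (!s.empty() && (s.front() == '+' || s.front() == '-')) { s.remove_prefix(1); }
  if (s.empty()) { return false; }
  for (char c : s) {
    if (c < '0' || c > '9') { return false; }
  }
  return true;
}

// Hypothetical helper: pattern check plus a bounds check against int64_t.
bool fits_in_int64(std::string_view s)
{
  if (!has_integer_format(s)) { return false; }
  bool const negative = s.front() == '-';
  if (s.front() == '+' || s.front() == '-') { s.remove_prefix(1); }

  // Accumulate the magnitude in unsigned arithmetic. The largest allowed
  // magnitude is 2^63 for negative values and 2^63 - 1 otherwise.
  auto const max_pos =
    static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max());
  std::uint64_t const limit = negative ? max_pos + 1 : max_pos;

  std::uint64_t value = 0;
  for (char c : s) {
    auto const digit = static_cast<std::uint64_t>(c - '0');
    // Reject before computing value * 10 + digit if that would exceed limit.
    if (value > (limit - digit) / 10) { return false; }
    value = value * 10 + digit;
  }
  return true;
}
```

With these two helpers, a string such as "9223372036854775808" passes the format check but fails the bounds check, which is exactly the difference a caller cannot currently infer from the function names.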


ttnghia · 2021-03-10T21:27:45Z

There are other feature requests such as "add is_valid_integer" (#7094, #7080, #5110). As such, refactoring is_integer and adding bounds-check support to it is necessary. If we don't do that but instead add a new function like is_valid_element (#7094), the APIs will diverge further.

ttnghia · 2021-03-10T22:08:40Z

I need to work on a function to check if a string is a valid integer. That means combining a bounds check with the current is_integer function. Thus, I need your comments/suggestions/instructions to resolve this before I can move forward.

davidwendt · 2021-03-10T22:29:42Z

I would like to move the cudf::strings::is_integer() code to the existing convert/convert_integers.hpp/cu and the cudf::strings::is_float() code to the existing convert/convert_floats.hpp/cu.

I would also like to remove the cudf::strings::all_integer() and cudf::strings::all_float(). These are not being used.

If cudf::strings::is_integer() is to check for overflow, then that check would be done outside of the device function cudf::strings::string::is_integer() (defined in string.cuh). The same goes for cudf::strings::is_float(): the overflow check would be done outside of the device function cudf::strings::string::is_float().

> If not, we should add something to indicate whether a function supports bound check or not.

The documentation should indicate if overflow checking is done or not. I don't believe we should be changing all the is_ functions to do overflow checking arbitrarily.

davidwendt · 2021-03-15T15:12:46Z

Since the current cudf::strings::to_float() converts any overflow to infinity (or -infinity), does Spark actually need cudf::strings::is_float() to check for overflow?

ttnghia · 2021-03-15T15:42:17Z

Good point. Checking overflow for float is more difficult than for integers. @revans2, @andygrove?

ttnghia · 2021-03-15T16:00:00Z

A PR for this has been submitted (#7599). It just moves the is_integer and is_float functions around, according to @davidwendt's suggestion. The new functions for bounds checking will be added later in a separate PR.

revans2 · 2021-03-15T20:13:58Z

I just confirmed that for the Spark use case we don't care about overflow checking on floating point values. Even when ANSI operation is enabled, overflowing a floating point value returns Inf or -Inf. Similarly, for numbers that are too small for a float, 0.0 is returned. The only checking we care about is making sure that the format is correct.
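As an analogy for the saturating behavior described above, the following sketch uses the C++ standard library's `std::strtof`, not libcudf or Spark. `parse_float_saturating` is a hypothetical name, and the results noted in the comments assume an IEEE 754 `float`.

```cpp
#include <cmath>
#include <cstdlib>
#include <string>

// Illustrative only: standard-library parsing, not cudf::strings::to_float().
// Out-of-range input is not treated as an error here. On IEEE 754 platforms,
// std::strtof returns +/-infinity on overflow and a value close to zero on
// underflow, so a well-formed string always produces some float.
float parse_float_saturating(std::string const& s)
{
  return std::strtof(s.c_str(), nullptr);
}
```

Under that model, a validity check for floats only needs to look at the format of the string, because every well-formed input maps to some representable result.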

davidwendt · 2021-03-15T23:58:20Z

The current cudf::strings::is_integer() does not take a data_type parameter.
Would it be important to check for overflow on all integer types/sizes, or could we just hardcode the overflow check to INT64?

ttnghia · 2021-03-16T00:28:43Z

I planned to rewrite the is_integer function to take a data_type parameter similar to is_fixed_point(). My implementation is also based on it.

davidwendt · 2021-03-16T12:46:51Z

> I planned to rewrite the is_integer function to take a data_type parameter similar to is_fixed_point(). My implementation is also based on it.

This will affect the cython/python code that currently uses it. Also, there are 8 integer types which would need to be type-dispatched. (Fixed-point only has two.) That is a lot of extra generated code if Spark only cares about, say, INT64.
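The code-size concern can be illustrated with a small sketch of runtime type dispatch in plain C++. This is not libcudf's type dispatcher: the `int_type` enum, `fits_in`, and `fits_in_type` are hypothetical names, and only four signed types are shown rather than all eight integer types.

```cpp
#include <cstdint>
#include <limits>
#include <string_view>

// Hypothetical stand-in for a runtime type tag; not the libcudf type_id enum.
enum class int_type { INT8, INT16, INT32, INT64 };

// Bounds check for a signed integer type T. Assumes `s` already has a valid
// integer format (optional sign followed by decimal digits).
template <typename T>
bool fits_in(std::string_view s)
{
  bool const negative = !s.empty() && s.front() == '-';
  if (!s.empty() && (s.front() == '+' || s.front() == '-')) { s.remove_prefix(1); }

  // Largest allowed magnitude: max + 1 for negative values, max otherwise.
  auto const max_pos = static_cast<std::uint64_t>(std::numeric_limits<T>::max());
  std::uint64_t const limit = negative ? max_pos + 1 : max_pos;

  std::uint64_t value = 0;
  for (char c : s) {
    auto const digit = static_cast<std::uint64_t>(c - '0');
    if (value > (limit - digit) / 10) { return false; }  // would exceed limit
    value = value * 10 + digit;
  }
  return true;
}

// Runtime dispatch. Each case instantiates fits_in<T> separately, so the
// amount of generated code grows with the number of supported types.
bool fits_in_type(int_type t, std::string_view s)
{
  switch (t) {
    case int_type::INT8: return fits_in<std::int8_t>(s);
    case int_type::INT16: return fits_in<std::int16_t>(s);
    case int_type::INT32: return fits_in<std::int32_t>(s);
    case int_type::INT64: return fits_in<std::int64_t>(s);
  }
  return false;
}
```

Hardcoding the check to INT64 would correspond to keeping only the `fits_in<std::int64_t>` instantiation, which is the trade-off raised in the comment above.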

@ttnghia

This addresses #7557. In summary:

* Move `cudf::strings::is_integer()` code from `strings/chars_types.*` to `strings/convert/convert_integers.hpp/cu`
* Move `cudf::strings::is_float()` code from `strings/chars_types.*` to `strings/convert/convert_floats.hpp/cu`
* Remove `cudf::strings::all_integer()` and `cudf::strings::all_float()`

Authors:

* Nghia Truong (@ttnghia)

Approvers:

* GALI PREM SAGAR (@galipremsagar)
* Jason Lowe (@jlowe)
* Jake Hemstad (@jrhemstad)
* David (@davidwendt)

URL: #7599

ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels Mar 10, 2021

ttnghia added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Mar 10, 2021

ttnghia mentioned this issue Mar 10, 2021

add is_valid_integer format check API #7094

Closed

kkraus14 removed the Needs Triage Need team to review and classify label Mar 10, 2021

ttnghia added the 0 - Blocked Cannot progress due to external reasons label Mar 10, 2021

ttnghia self-assigned this Mar 10, 2021

ttnghia changed the title ~~[FEA] Refactor char_types and strings/convert~~ [FEA] Refactor string conversion check Mar 15, 2021

ttnghia mentioned this issue Mar 15, 2021

Refactor string conversion check #7599

Merged

ttnghia linked a pull request Mar 15, 2021 that will close this issue

Refactor string conversion check #7599

Merged

rapids-bot bot closed this as completed in #7599 Mar 17, 2021
