[REVIEW] Create is_integer/is_float functions for checking characters before calling to_integers/to_floats #4863
Codecov Report
@@           Coverage Diff            @@
##        branch-0.14    #4863   +/-  ##
======================================
  Coverage     88.50%   88.50%
======================================
  Files            54       54
  Lines         10124    10124
======================================
  Hits           8960     8960
  Misses         1164     1164
Continue to review full report at Codecov.
Looks Good! I just had a couple of questions.
This code is really nice, David! 😃
My only suggestion is a few small improvements to the markdown in the Doxygen comments.
APIs added to help fix #2707 and tangentially #4850 and similar issues.
Generally, most libcudf APIs operate in bad-data-in/bad-data-out mode and do not check individual column elements, since doing so can be unnecessarily costly for performance. Even operations that do validate data values normally provide a parameter to skip that checking. The Python cudf layer tries to report invalid data rather than just return bad data, for consistency with Pandas and other Python libraries.
This PR adds two APIs (is_integer() and is_float()) that can optionally be called before to_integers() and to_floats(), respectively, to check that the characters are valid for conversion. This allows the Python code to report an appropriate error and even identify the failing strings.
Note: The existing isnumeric() and isdecimal() are insufficient because neither checks for a sign character, decimal point, or scientific notation as appropriate.
The new APIs will return a column of booleans that indicate which strings are invalid for conversion.
This PR also includes tests for these APIs.
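To illustrate the check-then-convert pattern described above, here is a minimal pure-Python sketch. It is not the libcudf implementation and does not call cudf; the two regular expressions are an assumed grammar (optional sign, digits, optional decimal point, optional exponent), and to_integers_checked() is a hypothetical helper showing how a Python layer might report the failing strings. The exact set of strings accepted by the PR may differ.

```python
import re

# Assumed grammars for illustration only; the PR's actual rules may differ.
_INTEGER = re.compile(r"[+-]?[0-9]+")
_FLOAT = re.compile(r"[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?")


def is_integer(strings):
    """Return one boolean per string: True if it looks like a valid integer."""
    return [s is not None and _INTEGER.fullmatch(s) is not None for s in strings]


def is_float(strings):
    """Return one boolean per string: True if it looks like a valid float."""
    return [s is not None and _FLOAT.fullmatch(s) is not None for s in strings]


def to_integers_checked(strings):
    """Hypothetical wrapper: validate first, then convert or report failures."""
    valid = is_integer(strings)
    bad = [s for s, ok in zip(strings, valid) if not ok]
    if bad:
        raise ValueError(f"cannot convert to integer: {bad!r}")
    return [int(s) for s in strings]


if __name__ == "__main__":
    data = ["123", "-45", "1.5", "12a"]
    print(is_integer(data))
    print(is_float(data))
    # Python's built-in str.isdecimal() rejects a leading sign,
    # which mirrors why isnumeric()/isdecimal() are insufficient here.
    print("-45".isdecimal())
```

Keeping validation separate from conversion means callers that trust their data can skip the check and pay no extra cost, while the Python layer can opt in and name the offending strings in its error.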