Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Create is_integer/is_float functions for checking characters before calling to_integers/to_floats #4863

Merged
merged 17 commits into from
Apr 20, 2020

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Apr 9, 2020

APIs added to help fix #2707 and tangentially #4850 and similar issues.
Generally, most libcudf APIs operate in bad-data-in/bad-data-out mode and do not perform checking on individual column elements since this can be unnecessarily costly to performance. And even operations that do validate data values normally provide a parameter to skip that checking. The Python cudf layer tries to report invalid data rather than just return bad data per consistency with Pandas and other Python libraries.

This PR adds two APIs (is_integer() and is_float()) that can be used optionally before calling to_integers() and to_floats() respectively to check the characters are valid for conversion. This allows the Python code to report an appropriate error and even identify the failing strings.

Note: The existing isnumeric() and isdecimal() are insufficient because neither checks for sign character, decimal pointer, scientific notation as appropriate.

The new APIs will return a column of booleans that indicate which strings are invalid for conversion.
This PR will also includes tests for these APIs.

@davidwendt davidwendt requested a review from a team as a code owner April 9, 2020 17:22
@davidwendt davidwendt self-assigned this Apr 9, 2020
@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. labels Apr 9, 2020
@davidwendt davidwendt changed the title [WIP] Create is_integer/is_float functions for checking characters before calling to_integers/to_floats [REVIEW] Create is_integer/is_float functions for checking characters before calling to_integers/to_floats Apr 9, 2020
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Apr 9, 2020
@codecov
Copy link

codecov bot commented Apr 9, 2020

Codecov Report

Merging #4863 into branch-0.14 will not change coverage.
The diff coverage is 80.76%.

Impacted file tree graph

@@             Coverage Diff              @@
##           branch-0.14    #4863   +/-   ##
============================================
  Coverage        88.50%   88.50%           
============================================
  Files               54       54           
  Lines            10124    10124           
============================================
  Hits              8960     8960           
  Misses            1164     1164           
Impacted Files Coverage Δ
python/cudf/cudf/utils/gpu_utils.py 54.54% <ø> (ø)
python/cudf/cudf/_lib/nvtx/nvtx.py 75.86% <55.55%> (ø)
python/cudf/cudf/_lib/nvtx/__init__.py 75.00% <75.00%> (ø)
python/cudf/cudf/core/indexing.py 95.65% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 683391f...271b601. Read the comment docs.

Copy link
Contributor

@cwharris cwharris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good! I just had a couple of questions.

cpp/src/strings/char_types/char_types.cu Outdated Show resolved Hide resolved
cpp/src/strings/char_types/char_types.cu Outdated Show resolved Hide resolved
cpp/src/strings/char_types/char_types.cu Outdated Show resolved Hide resolved
cpp/src/strings/char_types/char_types.cu Outdated Show resolved Hide resolved
cpp/src/strings/char_types/char_types.cu Outdated Show resolved Hide resolved
cpp/src/strings/char_types/char_types.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is really nice David! 😃
Small improvements to markdown of Doxygen comments is only suggestion.

cpp/include/cudf/strings/char_types/char_types.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/char_types/char_types.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/string.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/strings/string.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/strings/string.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/strings/string.cuh Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Astype on String and CategoricalColumns coerces all values to be 0
5 participants