Add sequence_type parameter to cudf::strings::title function #8602

Merged

Conversation

davidwendt
Contributor

Closes #8596 #8597

Adds a sequence_type parameter to help control how words are delimited within a string when processing the title() logic.

std::unique_ptr<column> title(
  strings_column_view const& input,
  string_character_types sequence_type = string_character_types::ALPHA,
  rmm::mr::device_memory_resource* mr  = rmm::mr::get_current_device_resource());

The default ALPHA type preserves the original behavior, which matches Pandas str.title(): the first character after a non-ALPHA character is upper-cased and the rest are lower-cased. Likewise, specifying ALPHANUM treats a sequence of alphanumeric characters as a word and upper-cases only the first character after a non-ALPHANUM character (e.g. whitespace).
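To make the difference between the two sequence types concrete, here is a minimal host-side sketch of the rule described above. It is not the libcudf implementation: it runs on the CPU, handles ASCII only, and the names `sequence_kind`, `in_sequence` and `title_sketch` are hypothetical stand-ins invented for this illustration.

```cpp
#include <cctype>
#include <string>

// Hypothetical host-side stand-in for cudf::string_character_types (illustration only).
enum class sequence_kind { ALPHA, ALPHANUM };

// Returns true if `ch` belongs to the chosen sequence type (ASCII only).
inline bool in_sequence(unsigned char ch, sequence_kind kind)
{
  return kind == sequence_kind::ALPHA ? std::isalpha(ch) != 0 : std::isalnum(ch) != 0;
}

// Upper-cases the first character of each sequence and lower-cases the rest.
// Characters outside the sequence type are copied unchanged and end the current word.
inline std::string title_sketch(std::string const& input,
                                sequence_kind kind = sequence_kind::ALPHA)
{
  std::string output = input;
  bool start_of_word = true;  // the next in-sequence character starts a word
  for (auto& c : output) {
    auto const ch = static_cast<unsigned char>(c);
    if (in_sequence(ch, kind)) {
      c = static_cast<char>(start_of_word ? std::toupper(ch) : std::tolower(ch));
      start_of_word = false;
    } else {
      start_of_word = true;  // a character outside the sequence delimits words
    }
  }
  return output;
}
```

Under these assumptions, `title_sketch("hello 2nd wORLD")` gives `"Hello 2Nd World"` because the digit ends the word under ALPHA, while `title_sketch("hello 2nd wORLD", sequence_kind::ALPHANUM)` gives `"Hello 2nd World"` because the digit is part of the word.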

The string_character_types type used for the sequence_type parameter is declared in the char_types.hpp header, so its values can be reused in this API.

The following usage should satisfy the #8597 feature request.

   cudf::strings_column_view view(input); // input strings column
   auto result = cudf::strings::title(view, cudf::string_character_types::ALPHANUM);

The ALPHANUM set contains all alphanumeric character types, leaving out only SPACE, the set of whitespace characters.

The gtests for cudf::strings::title() have been updated to include test cases using ALPHANUM.

@davidwendt davidwendt added feature request New feature or request 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Jun 24, 2021
@davidwendt davidwendt self-assigned this Jun 24, 2021
@davidwendt davidwendt requested a review from a team as a code owner June 24, 2021 17:40

codecov bot commented Jun 24, 2021

Codecov Report

Merging #8602 (fa32c39) into branch-21.08 (58438c0) will increase coverage by 0.36%.
The diff coverage is n/a.

❗ Current head fa32c39 differs from pull request most recent head 95a82fc. Consider uploading reports for the commit 95a82fc to get more accurate results

@@               Coverage Diff                @@
##           branch-21.08    #8602      +/-   ##
================================================
+ Coverage         82.63%   83.00%   +0.36%     
================================================
  Files               109      109              
  Lines             17869    18222     +353     
================================================
+ Hits              14766    15125     +359     
+ Misses             3103     3097       -6     
Impacted Files Coverage Δ
python/cudf/cudf/core/abc.py 86.36% <0.00%> (-1.45%) ⬇️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/applyutils.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/hash_vocab_utils.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <0.00%> (ø)
... and 44 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 58438c0...95a82fc. Read the comment docs.

Contributor

@rgsl888prabhu rgsl888prabhu left a comment

A small suggestion, rest looks good.

cpp/include/cudf/strings/capitalize.hpp (outdated, resolved)
@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit a2863b9 into rapidsai:branch-21.08 Jun 28, 2021
@davidwendt davidwendt deleted the strings-title-with-spaces branch June 28, 2021 11:52
@davidwendt davidwendt added breaking Breaking change and removed non-breaking Non-breaking change labels Jun 28, 2021
Labels
3 - Ready for Review Ready for review by team breaking Breaking change feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Development

Successfully merging this pull request may close these issues.

[BUG] title function documentation does not match implemenation
4 participants