Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use single kernel to extract all groups in cudf::strings::extract #9358

Merged
merged 11 commits into from
Oct 13, 2021

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Oct 4, 2021

This is a less ambitious version of #8460 which had to be reverted in #8575 because it did not work with greedy quantifiers. The change here involves calling the underlying reprog_device::extract to retrieve each group result within a single kernel rather than launching a kernel for each group. The output is placed contiguously in a 2d span (wrapped uvector) and a permutation iterator is used to build the output columns (one column per group).

Like it's predecessor, the performance improvement is mostly when specifying more than 1 group in the regex pattern. The benchmark results showed no change for single groups but was 2x faster for multiple groups over long (8K) strings and up to 4x faster for multiple groups over many (16M) strings.

The benchmark test for extract was also updated to better report the number of groups being used when measuring results.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 4, 2021
@davidwendt davidwendt self-assigned this Oct 4, 2021
@codecov
Copy link

codecov bot commented Oct 4, 2021

Codecov Report

Merging #9358 (b4ef52b) into branch-21.12 (ab4bfaa) will decrease coverage by 0.04%.
The diff coverage is 0.00%.

❗ Current head b4ef52b differs from pull request most recent head f43c546. Consider uploading reports for the commit f43c546 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9358      +/-   ##
================================================
- Coverage         10.79%   10.74%   -0.05%     
================================================
  Files               116      116              
  Lines             18869    19082     +213     
================================================
+ Hits               2036     2051      +15     
- Misses            16833    17031     +198     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/lists.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/numerical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/struct.py 0.00% <0.00%> (ø)
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a743ce8...f43c546. Read the comment docs.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Oct 6, 2021
@davidwendt davidwendt marked this pull request as ready for review October 6, 2021 15:59
@davidwendt davidwendt requested a review from a team as a code owner October 6, 2021 15:59
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just a couple suggestions.

cpp/src/strings/extract.cu Outdated Show resolved Hide resolved
cpp/src/strings/extract.cu Outdated Show resolved Hide resolved
@harrism
Copy link
Member

harrism commented Oct 13, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit a4f6c6d into rapidsai:branch-21.12 Oct 13, 2021
@davidwendt davidwendt deleted the strings-extract-opt2 branch October 13, 2021 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants