Use single kernel to extract all groups in cudf::strings::extract #9358
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9358 +/- ##
================================================
- Coverage 10.79% 10.74% -0.05%
================================================
Files 116 116
Lines 18869 19082 +213
================================================
+ Hits 2036 2051 +15
- Misses 16833 17031 +198
Continue to review full report at Codecov.
Looks good. Just a couple suggestions.
@gpucibot merge
This is a less ambitious version of #8460, which had to be reverted in #8575 because it did not work with greedy quantifiers. The change here involves calling the underlying `reprog_device::extract` to retrieve each group result within a single kernel rather than launching a kernel for each group. The output is placed contiguously in a 2d span (wrapped uvector) and a permutation iterator is used to build the output columns (one column per group).

Like its predecessor, the performance improvement is mostly when specifying more than 1 group in the regex pattern. The benchmark results showed no change for single groups, but extraction was 2x faster for multiple groups over long (8K) strings and up to 4x faster for multiple groups over many (16M) strings.
The benchmark test for extract was also updated to better report the number of groups being used when measuring results.
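
For readers unfamiliar with the layout described above, here is a minimal host-side sketch of the idea: match each row once, store all group results contiguously in a row-major `rows x groups` buffer, then build each output column with a strided gather. This is not the libcudf implementation. It uses `std::regex` on the CPU as a stand-in for `reprog_device::extract`, and the names `group_table`, `extract_all_groups`, and `column_for_group` are hypothetical, introduced only for illustration. Rows that do not match are left as empty strings here, which is a simplification.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for the 2d span: a flat, row-major rows x groups buffer.
struct group_table {
  std::size_t rows;
  std::size_t groups;
  std::vector<std::string> data;  // element (row, g) lives at data[row * groups + g]
};

// Single pass over the input: each row is matched once and every group
// result is written next to its siblings (the analogue of one kernel launch
// instead of one launch per group).
group_table extract_all_groups(std::vector<std::string> const& input,
                               std::regex const& pattern)
{
  std::size_t const groups = pattern.mark_count();
  group_table out{input.size(), groups,
                  std::vector<std::string>(input.size() * groups)};
  for (std::size_t row = 0; row < input.size(); ++row) {
    std::smatch m;
    if (!std::regex_search(input[row], m, pattern)) { continue; }  // no match: leave empty
    for (std::size_t g = 0; g < groups; ++g) {
      out.data[row * groups + g] = m[g + 1].str();  // m[0] is the whole match
    }
  }
  return out;
}

// Strided gather for one group. The index map row -> row * groups + g plays
// the role the permutation iterator plays when building each output column.
std::vector<std::string> column_for_group(group_table const& t, std::size_t g)
{
  std::vector<std::string> col(t.rows);
  for (std::size_t row = 0; row < t.rows; ++row) {
    col[row] = t.data[row * t.groups + g];
  }
  return col;
}
```

Calling `extract_all_groups` once and then `column_for_group` for each group index yields one column per group from a single matching pass, which is where the savings for multi-group patterns would be expected to come from.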