-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a move_to function to cudf::string_view::const_iterator #13428
Merged
rapids-bot
merged 17 commits into
rapidsai:branch-23.08
from
davidwendt:move-to-string-iterator
Jun 20, 2023
Merged
Add a move_to function to cudf::string_view::const_iterator #13428
rapids-bot
merged 17 commits into
rapidsai:branch-23.08
from
davidwendt:move-to-string-iterator
Jun 20, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
davidwendt
added
2 - In Progress
Currently a work in progress
libcudf
Affects libcudf (C++/CUDA) code.
strings
strings issues (C++ and Python)
improvement
Improvement / enhancement to an existing function
non-breaking
Non-breaking change
labels
May 24, 2023
3 tasks
davidwendt
changed the title
Add move_to function to cudf::string_view iterator
Add move_to function to cudf::string_view::const_iterator
May 24, 2023
davidwendt
added
3 - Ready for Review
Ready for review by team
and removed
2 - In Progress
Currently a work in progress
labels
May 25, 2023
davidwendt
changed the title
Add move_to function to cudf::string_view::const_iterator
Add a move_to function to cudf::string_view::const_iterator
Jun 13, 2023
3 tasks
bdice
approved these changes
Jun 14, 2023
karthikeyann
approved these changes
Jun 14, 2023
/merge |
rapids-bot bot
pushed a commit
that referenced
this pull request
Jun 23, 2023
…ings (#13322) Changes the internal regex logic to minimize character counting to help performance with longer strings. The improvement applies mainly to libcudf regex functions that return strings (i.e. extract, replace, split). The changes here also improve the internal device APIs for clarity to improve maintenance. The most significant change makes the position variables input-only and returning an optional pair to indicate a successful match. There are some more optimizations that are possible here where character positions are passed back and forth that could be replaced with byte positions to further reduce counting. Initial measurements showed this noticeably slowed down small strings so more analysis is required before continuing this optimization. Reference: #13480 ### More Detail First, there is a change to some internal regex function signatures. Notable the `reprog_device::find()` and `reprog_device::extract()` member functions declared in `cpp/src/strings/regex/regex.cuh` that are used by all the libcudf regex functions. The in/out parameters are now input-only parameters (pass by value) and the return is an optional pair that includes the match result. Also, the `begin` parameter is now an iterator and the `end` parameter now has a default. This change requires updating all the definitions and uses of the `find` and `extract` member functions. Using an iterator as the `begin` parameter allows for some optimizations in the calling code to minimize character counting that may be needed for processing multi-byte UTF-8 characters. Rather than using the `cudf::string_view::byte_offset()` member function to convert character positions to byte positions, an iterator can be incremented as we traverse through the string which helps reduce some character counting. So the changes here involve removing some calls to `byte_offset()` and incrementing (really moving) iterators with a pattern like `itr += (new_pos - itr.position());` There is another PR #13428 to make a `move_to` iterator member function. It is possible to reduce the character counting even more as mentioned above but further optimization requires some deeper analysis. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Mark Harris (https://github.com/harrism) - MithunR (https://github.com/mythrocks) URL: #13322
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
3 - Ready for Review
Ready for review by team
improvement
Improvement / enhancement to an existing function
libcudf
Affects libcudf (C++/CUDA) code.
non-breaking
Non-breaking change
strings
strings issues (C++ and Python)
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Adds a
move_to()
function thecudf::string_view::const_iterator
class to help minimize character counting when creating and incrementing the iterator on multi-byte UTF8 characters.The function simply moves the iterator from the current character position to the given one. This is just a shortcut for the form
This pattern is repeated many times in #13322 and likely future PRs that require the same behavior.
The PR also includes an update to the
string_view::begin()
to set the byte-offset directly rather than waste instructions calculating it.Checklist