Add a move_to function to cudf::string_view::const_iterator #13428

Merged
merged 17 commits into from
Jun 20, 2023

davidwendt
Contributor

@davidwendt davidwendt commented May 24, 2023

Description

Adds a move_to() function to the cudf::string_view::const_iterator class to help minimize character counting when creating and incrementing the iterator on multi-byte UTF-8 characters.
The function simply moves the iterator from the current character position to the given one. This is just a shortcut for the form

itr += (new_position - itr.position());

This pattern is repeated many times in #13322 and likely future PRs that require the same behavior.
The PR also updates string_view::begin() to set the byte offset directly rather than waste instructions calculating it.
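To illustrate the shortcut described above, here is a minimal host-only sketch. It is not the libcudf implementation: the class name `utf8_iterator`, its members, and the use of `std::string_view` are assumptions made for illustration, and the real `cudf::string_view::const_iterator` is device code with its own internals. The sketch tracks a character position and a byte offset together, so moving to a new character position only walks the bytes between the old and new positions.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a UTF-8 character iterator (not the libcudf class).
class utf8_iterator {
 public:
  explicit utf8_iterator(std::string_view s) : str_{s} {}

  std::ptrdiff_t position() const { return char_pos_; }     // character index
  std::ptrdiff_t byte_offset() const { return byte_pos_; }  // byte index

  // Move forward (n > 0) or backward (n < 0) by n characters,
  // visiting only the bytes between the old and new positions.
  utf8_iterator& operator+=(std::ptrdiff_t n)
  {
    for (; n > 0 && byte_pos_ < size(); --n) { step_forward(); }
    for (; n < 0 && byte_pos_ > 0; ++n) { step_backward(); }
    return *this;
  }

  // Shortcut for: itr += (new_pos - itr.position());
  utf8_iterator& move_to(std::ptrdiff_t new_pos)
  {
    assert(new_pos >= 0);
    return *this += (new_pos - position());
  }

 private:
  // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
  static bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

  std::ptrdiff_t size() const { return static_cast<std::ptrdiff_t>(str_.size()); }

  void step_forward()
  {
    ++byte_pos_;
    while (byte_pos_ < size() && is_continuation(str_[byte_pos_])) { ++byte_pos_; }
    ++char_pos_;
  }

  void step_backward()
  {
    --byte_pos_;
    while (byte_pos_ > 0 && is_continuation(str_[byte_pos_])) { --byte_pos_; }
    --char_pos_;
  }

  std::string_view str_;
  std::ptrdiff_t byte_pos_{0};
  std::ptrdiff_t char_pos_{0};
};
```

With a string whose characters are 1, 2, 3, and 1 bytes wide, calling `move_to(3)` from the start lands on byte offset 6, and a later `move_to(1)` steps back to byte offset 1 without recounting from the beginning of the string.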

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 24, 2023
@davidwendt davidwendt self-assigned this May 24, 2023
@github-actions github-actions bot added the Python Affects Python cuDF API. label May 24, 2023
@github-actions github-actions bot removed the Python Affects Python cuDF API. label May 24, 2023
@davidwendt davidwendt changed the title Add move_to function to cudf::string_view iterator Add move_to function to cudf::string_view::const_iterator May 24, 2023
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 25, 2023
@davidwendt davidwendt marked this pull request as ready for review June 7, 2023 12:16
@davidwendt davidwendt requested a review from a team as a code owner June 7, 2023 12:16
@davidwendt davidwendt requested a review from vyasr June 7, 2023 12:16
@davidwendt davidwendt changed the title Add move_to function to cudf::string_view::const_iterator Add a move_to function to cudf::string_view::const_iterator Jun 13, 2023
@davidwendt
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit ee14056 into rapidsai:branch-23.08 Jun 20, 2023
@davidwendt davidwendt deleted the move-to-string-iterator branch June 20, 2023 14:41
rapids-bot bot pushed a commit that referenced this pull request Jun 23, 2023
…ings (#13322)

Changes the internal regex logic to minimize character counting to help performance with longer strings. The improvement applies mainly to libcudf regex functions that return strings (i.e. extract, replace, split). The changes here also clarify the internal device APIs to ease maintenance. The most significant change makes the position variables input-only and returns an optional pair to indicate a successful match.

Further optimization is possible here: character positions that are passed back and forth could be replaced with byte positions to further reduce counting. Initial measurements showed this noticeably slowed down processing of small strings, so more analysis is required before continuing this optimization.

Reference: #13480

### More Detail

First, some internal regex function signatures change, notably the `reprog_device::find()` and `reprog_device::extract()` member functions declared in `cpp/src/strings/regex/regex.cuh`, which are used by all the libcudf regex functions. The in/out parameters are now input-only parameters (pass by value) and the return is an optional pair that includes the match result. Also, the `begin` parameter is now an iterator and the `end` parameter now has a default. This change requires updating all the definitions and uses of the `find` and `extract` member functions.
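The shape of that signature change can be sketched on the host as follows. This is only an illustration under stated assumptions: `find_digits`, `match_pair`, the toy digit matcher, and the use of `std::optional` and `std::string_view` are all hypothetical stand-ins, not the actual `reprog_device` API. The point is that the inputs are passed by value, `begin` is an iterator, `end` has a default, and the result carries both the success flag and the match range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical alias: [first, second) positions of a match.
using match_pair = std::pair<std::ptrdiff_t, std::ptrdiff_t>;

// Toy matcher: finds the first run of ASCII digits at or after `begin`.
// Inputs are by value; `end` defaults to "end of string" (-1).
// Returns std::nullopt when nothing matches.
std::optional<match_pair> find_digits(std::string_view text,
                                      std::string_view::const_iterator begin,
                                      std::ptrdiff_t end = -1)
{
  assert(begin >= text.begin() && begin <= text.end());
  auto const last =
    (end < 0) ? text.end() : text.begin() + std::min<std::ptrdiff_t>(end, text.size());
  if (begin >= last) { return std::nullopt; }

  auto is_digit    = [](char c) { return c >= '0' && c <= '9'; };
  auto const first = std::find_if(begin, last, is_digit);
  if (first == last) { return std::nullopt; }
  auto const stop = std::find_if_not(first, last, is_digit);
  return match_pair{first - text.begin(), stop - text.begin()};
}
```

A caller can then loop with `while (auto m = find_digits(text, itr)) { ... itr = text.begin() + m->second; }`, with no in/out position variables to reset between calls.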

Using an iterator as the `begin` parameter allows for some optimizations in the calling code to minimize character counting that may be needed for processing multi-byte UTF-8 characters. Rather than using the `cudf::string_view::byte_offset()` member function to convert character positions to byte positions, an iterator can be incremented as we traverse through the string, which helps reduce some character counting. So the changes here involve removing some calls to `byte_offset()` and incrementing (really moving) iterators with a pattern like `itr += (new_pos - itr.position());`. Another PR, #13428, adds a `move_to` iterator member function.
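The saving can be made concrete with a small host-only comparison. Everything in this sketch is hypothetical (`byte_offset_from_start`, `cursor`, and the `bytes_examined` counters are illustration only and do not mirror libcudf internals). It contrasts converting each character position to a byte offset by counting from the start of the string against advancing a single cursor that remembers where it is.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// A UTF-8 lead byte is any byte that is not a continuation byte (10xxxxxx).
inline bool is_lead_byte(unsigned char b) { return (b & 0xC0) != 0x80; }

// Converts a character position to a byte offset by counting from the start.
// `bytes_examined` accumulates how many bytes were inspected.
std::size_t byte_offset_from_start(std::string_view s,
                                   std::size_t char_pos,
                                   std::size_t& bytes_examined)
{
  std::size_t chars = 0;
  std::size_t byte  = 0;
  for (; byte < s.size(); ++byte) {
    ++bytes_examined;
    if (is_lead_byte(s[byte])) {
      if (chars == char_pos) { break; }
      ++chars;
    }
  }
  return byte;
}

// Forward-only cursor that keeps the character and byte positions in step.
struct cursor {
  std::string_view s;
  std::size_t byte{0};
  std::size_t chars{0};
  std::size_t bytes_examined{0};

  // Advances to `char_pos`, inspecting only the bytes not yet visited.
  std::size_t advance_to(std::size_t char_pos)
  {
    assert(char_pos >= chars);  // this sketch only moves forward
    while (chars < char_pos && byte < s.size()) {
      ++byte;
      ++bytes_examined;
      while (byte < s.size() && !is_lead_byte(s[byte])) {
        ++byte;
        ++bytes_examined;
      }
      ++chars;
    }
    return byte;
  }
};
```

Visiting several increasing positions with the cursor inspects each byte at most once, while the count-from-start helper re-reads the prefix on every call, so its cost grows with both the string length and the number of lookups.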

It is possible to reduce the character counting even more as mentioned above but further optimization requires some deeper analysis.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - MithunR (https://github.com/mythrocks)

URL: #13322
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)