Add support for single-line regex anchors ^/$ in contains_re #9482
Codecov Report
| | branch-21.12 | #9482 | +/- |
|---|---|---|---|
| Coverage | 10.79% | 10.66% | -0.13% |
| Files | 116 | 117 | +1 |
| Lines | 18869 | 19729 | +860 |
| Hits | 2036 | 2104 | +68 |
| Misses | 16833 | 17625 | +792 |
Approving ops-codeowner file changes
Mostly looks good to me. A few minor suggestions and a couple questions due to my unfamiliarity with this code.
Nice work. I have some suggestions for improvement, but overall this looks good.
Looks good! Thanks @davidwendt.
@gpucibot merge
Closes #9439
The `^` (begin anchor) and `$` (end anchor) currently apply to beginning of line (BOL) and end of line (EOL), respectively. This means they cannot be used on strings containing embedded new-line ('\n') characters when the anchors should match only the beginning and end of the string as a whole.

Many regex engines support a flag for overriding the behavior of the BOL/EOL anchors: Python, Java, C++. This PR introduces a similar flag parameter to the `cudf::strings::contains_re`, `cudf::strings::matches_re` and `cudf::strings::count_re` APIs to tell the regex engine how to interpret the anchor characters in the given regex pattern.

Additional information about these anchors can also be found here: https://www.regular-expressions.info/anchors.html
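For reference, here is a minimal sketch of the anchor behavior in Python's `re` module, one of the engines mentioned above. It only illustrates the semantics this PR aims to mirror; it is not libcudf code, and the sample string and variable names are made up for illustration.

```python
import re

text = "abc\ndef"  # one string with an embedded new-line

# Default flags: ^ and $ anchor to the string as a whole
# ($ may also match just before a final trailing new-line),
# so neither "abc" nor "def" is matched on its own.
default_hits = re.findall(r"^\w+$", text)

# re.MULTILINE: ^ and $ also match at each embedded new-line,
# so each line is matched separately.
multiline_hits = re.findall(r"^\w+$", text, flags=re.MULTILINE)

print(default_hits)    # []
print(multiline_hits)  # ['abc', 'def']
```

The libcudf default before this PR corresponds to the second call, and the new default corresponds to the first.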
The current default behavior of the libcudf regex is to interpret BOL/EOL as if the `MULTILINE` flag were set. This behavior doesn't match the engines/languages listed above. So for consistency the default is reversed, which makes this PR a breaking change. The new `flags` parameter added to the above APIs also makes this a breaking change.

An additional flag (`DOTALL`) is included in this PR since the internal regex code already supports it and only needed a path for the caller to specify the behavior. The `DOTALL` flag is also a feature of the above languages. When specified, the dot '.' pattern includes embedded new-line characters in its matching character set.
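The equivalent `DOTALL` semantics can be sketched the same way with Python's `re` module. Again, this is only an illustration of the behavior in one of the referenced engines, not the libcudf API, and the sample string and variable names are assumptions.

```python
import re

text = "abc\ndef"  # one string with an embedded new-line

# Default flags: '.' matches any character except a new-line,
# so the pattern cannot bridge the two lines.
default_match = re.search(r"abc.def", text)

# re.DOTALL: '.' also matches the embedded new-line.
dotall_match = re.search(r"abc.def", text, flags=re.DOTALL)

print(default_match is None)          # True
print(dotall_match.group(0) == text)  # True
```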