[FEA] Regex: Provide option for ^ and $ to only match beginning and end of input string #9439
Labels
- feature request: New feature or request
- libcudf: Affects libcudf (C++/CUDA) code.
- Python: Affects Python cuDF API.
- strings: strings issues (C++ and Python)
Comments
andygrove added the feature request and Needs Triage labels on Oct 14, 2021
Python regex library includes a MULTILINE flag that can be passed on various regex functions (e.g. This feature looks like it may be possible to add into libcudf by adding a similar flags parameter to the |
davidwendt added the Python, libcudf, and strings labels on Oct 14, 2021
rapids-bot bot pushed a commit that referenced this issue on Nov 1, 2021
Closes #9439

The `^` (begin anchor) and `$` (end anchor) apply to beginning of line (BOL) and end of line (EOL) respectively. This means that they cannot be used on strings containing embedded new-line ('\n') characters when the anchors should match only the beginning and end of the string as a whole. Many regex engines support a flag for overriding the behavior of the BOL/EOL anchors: [Python](https://docs.python.org/3/library/re.html#re.MULTILINE), [Java](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#MULTILINE), [C++](https://en.cppreference.com/w/cpp/regex/basic_regex/constants).

This PR introduces a similar flag parameter to the `cudf::strings::contains_re`, `cudf::strings::matches_re` and `cudf::strings::count_re` APIs to tell the regex engine how to interpret the anchor characters in the given regex pattern. Additional information about these anchors can be found here: https://www.regular-expressions.info/anchors.html

The current default behavior of the libcudf regex is to interpret BOL/EOL as if the `MULTILINE` flag were set. This behavior doesn't match the engines/languages listed above, so for consistency the default is reversed, which makes this PR a breaking change. The new `flags` parameter added to the above APIs is also a breaking change.

An additional flag (`DOTALL`) is included in this PR since the internal regex code already supports it and only needed a path for the caller to specify the behavior. The `DOTALL` flag is also a feature of the above languages. When specified, the dot '.' pattern includes embedded new-line characters in its matching character set.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9482
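The flag semantics referenced in this commit message can be illustrated with Python's standard `re` module, one of the engines the message links to. This is a minimal sketch of the reference behavior only; it does not call libcudf, and the sample string is made up for illustration.

```python
import re

# Illustrative input with one embedded newline (not from the issue).
text = "abc\ndef"

# Default: `$` matches only at the very end of this string.
default_ends = [m.start() for m in re.finditer(r"$", text)]

# MULTILINE: `$` also matches just before the embedded newline.
multiline_ends = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]

# Default: `.` does not match '\n', so "c.d" cannot span the line break.
dot_default = re.search(r"c.d", text) is not None

# DOTALL: `.` includes '\n' in its matching character set.
dot_all = re.search(r"c.d", text, re.DOTALL) is not None

print(default_ends, multiline_ends, dot_default, dot_all)
```

With the default reversed as described, the unflagged Python results are the ones libcudf is meant to agree with.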
rapids-bot bot pushed a commit that referenced this issue on Nov 2, 2021
Closes #7904
Depends on #9439

Adds support for accepting a pattern argument in cudf `str.replace` that is built using `re.compile`. The resulting `re.Pattern` has a member `pattern`, which is the regex string that can be passed to libcudf. There is also a `flags` member that can be used for libcudf APIs that accept a `regex_flags` parameter along with the `pattern` string.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- https://github.com/brandon-b-miller
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9573
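The two `re.Pattern` members mentioned here can be inspected with plain Python. The sketch below shows only the standard-library side; the helper name `split_compiled` is hypothetical, and how cuDF maps the extracted flags onto `regex_flags` is not shown in this thread.

```python
import re


def split_compiled(compiled: re.Pattern) -> tuple[str, bool, bool]:
    """Hypothetical helper: return (pattern string, MULTILINE set?, DOTALL set?)."""
    # `flags` is a bitmask and may contain other bits (e.g. re.UNICODE for str
    # patterns), so each flag is tested individually rather than compared whole.
    multiline = bool(compiled.flags & re.MULTILINE)
    dotall = bool(compiled.flags & re.DOTALL)
    return compiled.pattern, multiline, dotall


compiled = re.compile(r"^a.c$", re.MULTILINE | re.DOTALL)
pattern_string, has_multiline, has_dotall = split_compiled(compiled)

plain = split_compiled(re.compile(r"abc"))
print(pattern_string, has_multiline, has_dotall, plain)
```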
Is your feature request related to a problem? Please describe.

I would like the ability to use `contains_re` in a way that is compatible with Java and Python's default handling of multi-line inputs. The default cuDF behavior is that `^` will match at the beginning of the input and also match after each newline in the input. The default Python and Java/Spark behavior is to only match at the start of the input.

Python default behavior:

cuDF default behavior:
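A minimal sketch of the difference described above, using only Python's standard `re` module. The cuDF side is not executed here; `re.MULTILINE` is used as a stand-in for the cuDF default described in this issue, which is an assumption of the sketch, and the input string is made up.

```python
import re

# Illustrative multi-line input (not from the issue).
text = "a\nb\nc"

# Python default: `^` matches only at the start of the input.
default_starts = [m.start() for m in re.finditer(r"^", text)]
default_contains_b = re.search(r"^b", text) is not None

# Stand-in for the cuDF default described in this issue:
# `^` also matches after each newline.
multiline_starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]
multiline_contains_b = re.search(r"^b", text, re.MULTILINE) is not None

print(default_starts, default_contains_b)
print(multiline_starts, multiline_contains_b)
```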
Describe the solution you'd like

I would like to be able to specify how `^` and `$` behave with multi-line inputs.

Describe alternatives you've considered

In the RAPIDS Accelerator for Apache Spark, we could potentially parse the regex and translate `^` and `$` to `\A` and `\Z`.

Additional context

This requirement is being driven by NVIDIA/spark-rapids#3797
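The alternative of translating `^` and `$` to `\A` and `\Z` could look roughly like the sketch below. It is written in Python for illustration and is not the RAPIDS Accelerator's implementation. The function name `translate_anchors` is hypothetical, and the parser is deliberately small: it handles only backslash escapes and character classes, and ignores other constructs a real translator would need to handle.

```python
import re


def translate_anchors(pattern: str) -> str:
    """Hypothetical sketch: rewrite unescaped ^ and $ outside [...] as \\A and \\Z."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escaped character (e.g. \$ or \^) unchanged.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
            out.append(ch)
        elif ch == "[":
            in_class = True
            out.append(ch)
            # A leading ^ (negation) and a leading ] are literal parts of the class.
            if i + 1 < len(pattern) and pattern[i + 1] == "^":
                out.append("^")
                i += 1
            if i + 1 < len(pattern) and pattern[i + 1] == "]":
                out.append("]")
                i += 1
        elif ch == "^":
            out.append(r"\A")
        elif ch == "$":
            out.append(r"\Z")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


translated = translate_anchors(r"^b")
# re.MULTILINE stands in for an engine whose ^ matches after every newline.
before = re.search(r"^b", "a\nb", re.MULTILINE) is not None
after = re.search(translated, "a\nb", re.MULTILINE) is not None
print(translated, before, after)
```

One caveat for any such translation: `\Z` is not identical across engines. In Python it matches only at the very end of the string, while in Java it also matches before a final line terminator, so the target engine's definition would need to be checked.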