[FEA] Regex: Provide option for ^ and $ to only match beginning and end of input string #9439

Closed
andygrove opened this issue Oct 14, 2021 · 1 comment · Fixed by #9482
Labels: feature request, libcudf, Python, strings

Comments

@andygrove
Contributor

Is your feature request related to a problem? Please describe.
I would like the ability to use contains_re in a way that is compatible with Java and Python's default handling of multi-line inputs. The default cuDF behavior is that ^ will match at the beginning of the input and also match after each newline in the input. The default Python and Java/Spark behavior is to only match at the start of the input.

Python default behavior:

>>> import re
>>> print(re.compile('^A').search("A\nB"))
<re.Match object; span=(0, 1), match='A'>
>>> print(re.compile('^B').search("A\nB"))
None

cuDF default behavior:

>>> import cudf
>>> print(cudf.Series(['A\nB']).str.contains('^A'))
0    True
dtype: bool
>>> print(cudf.Series(['A\nB']).str.contains('^B'))
0    True
dtype: bool

Describe the solution you'd like
I would like to be able to specify how ^ and $ behave with multi-line inputs.

Describe alternatives you've considered
In the RAPIDS Accelerator for Apache Spark, we could potentially parse the regex and translate ^ and $ to \A and \Z.
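A minimal sketch of that translation idea, written in Python purely for illustration. The helper name `anchor_to_string_bounds` is hypothetical and is not part of cuDF or the RAPIDS Accelerator. It rewrites unescaped `^` and `$` outside character classes and ignores finer points such as flag groups or quoted sections. Note also that `\Z` means slightly different things across engines (in Java it still permits a final line terminator, while `\z` does not), so a real translation would need to pick the target escape with care.

```python
import re


def anchor_to_string_bounds(pattern: str) -> str:
    """Rewrite unescaped ^ and $ (outside character classes) as \\A and \\Z."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escape sequence such as \^ or \$ unchanged.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
            out.append(ch)
        elif ch == "[":
            in_class = True
            out.append(ch)
            # A leading ^ negates the class; a ] right after that is a literal.
            if i + 1 < len(pattern) and pattern[i + 1] == "^":
                out.append("^")
                i += 1
            if i + 1 < len(pattern) and pattern[i + 1] == "]":
                out.append("]")
                i += 1
        elif ch == "^":
            out.append(r"\A")
        elif ch == "$":
            out.append(r"\Z")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# Even with MULTILINE enabled, the rewritten pattern only matches at the
# start of the whole input, which mimics the non-multiline default.
translated = anchor_to_string_bounds("^B")
no_match = re.search(translated, "A\nB", re.MULTILINE)
```

The cost of this approach is that every caller has to parse patterns correctly, which is why a flag on the regex APIs themselves is the cleaner option.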

Additional context
This requirement is being driven by NVIDIA/spark-rapids#3797

@andygrove andygrove added feature request New feature or request Needs Triage Need team to review and classify labels Oct 14, 2021
@davidwendt
Contributor

davidwendt commented Oct 14, 2021

Python's re library includes a MULTILINE flag that can be passed to various regex functions (e.g. re.match) to specify how the ^ and $ anchors are interpreted.
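For reference, here is how the flag changes anchor behavior in Python's standard `re` module. This only illustrates the semantics a libcudf flag would mirror; it does not show any cuDF API.

```python
import re

text = "A\nB"

# Default: ^ matches only at the start of the input.
default_begin = re.search("^B", text)

# MULTILINE: ^ also matches immediately after each newline.
multiline_begin = re.search("^B", text, flags=re.MULTILINE)

# Default: $ matches only at the end of the input (or before a final newline).
default_end = re.search("A$", text)

# MULTILINE: $ also matches immediately before each newline.
multiline_end = re.search("A$", text, flags=re.MULTILINE)
```

With the default flags, `default_begin` and `default_end` are both `None`; with `re.MULTILINE`, both searches find a match.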

This feature looks like it may be possible to add into libcudf by adding a similar flags parameter to the cudf::strings APIs (and Python cudf strings APIs) that accept regex patterns.

@davidwendt davidwendt self-assigned this Oct 14, 2021
@davidwendt davidwendt added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Oct 14, 2021
@beckernick beckernick removed the Needs Triage Need team to review and classify label Oct 25, 2021
@rapids-bot rapids-bot bot closed this as completed in #9482 Nov 1, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 1, 2021
Closes #9439 

The `^` (begin anchor) and `$` (end anchor) apply to beginning of line (BOL) and end of line (EOL), respectively. This means they cannot be restricted to match only the beginning and end of the whole string when that string contains embedded new-line ('\n') characters.

Many regex engines support a flag for overriding the behavior of the BOL/EOL anchors: [Python](https://docs.python.org/3/library/re.html#re.MULTILINE), [Java](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#MULTILINE), [C++](https://en.cppreference.com/w/cpp/regex/basic_regex/constants). This PR introduces a similar flag parameter to the `cudf::strings::contains_re`, `cudf::strings::matches_re` and `cudf::strings::count_re` APIs to tell the regex engine how to interpret the anchor characters in the given regex pattern. 

Additional information about these anchors can also be found here: https://www.regular-expressions.info/anchors.html

The current default behavior of the libcudf regex engine is to interpret BOL/EOL as if the `MULTILINE` flag were set. This does not match the default behavior of the engines/languages listed above. For consistency, this PR reverses the default, which makes it a breaking change.

The new `flags` parameter added to the above APIs also makes this a breaking change. An additional flag (`DOTALL`) is included in this PR because the internal regex code already supports it and only needed a way for the caller to request it. The `DOTALL` flag is also a feature of the above languages. When specified, the dot '.' pattern includes embedded new-line characters in its matching character set.
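As a point of comparison, the `DOTALL` semantics described above look like this in Python's `re` module. This is an illustration of the equivalent Python flag only, not of the libcudf call itself.

```python
import re

text = "A\nB"

# Default: '.' matches any character except a newline, so this fails.
dot_default = re.search("A.B", text)

# DOTALL: '.' also matches the embedded newline.
dot_all = re.search("A.B", text, flags=re.DOTALL)
matched_text = dot_all.group() if dot_all else None
```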

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9482
rapids-bot bot pushed a commit that referenced this issue Nov 2, 2021
Closes #7904 

Depends on  #9439 

Adds support for accepting a pattern argument in cudf `str.replace` that is built using `re.compile`. The resulting `re.Pattern` has a `pattern` member holding the regex string that can be passed to libcudf. It also has a `flags` member that can be used for libcudf APIs that accept a `regex_flags` parameter along with the `pattern` string.
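A short sketch of the two `re.Pattern` members mentioned above, using only the standard library. The helper `split_compiled` is hypothetical and exists only to show how the string and the flag bits could be pulled apart; how cuDF maps them onto `regex_flags` internally is not shown here.

```python
import re


def split_compiled(compiled: re.Pattern) -> tuple:
    """Return (pattern string, multiline?, dotall?) for a compiled regex."""
    # compiled.flags is a bitmask; for str patterns it also includes
    # re.UNICODE, so test individual bits rather than comparing for equality.
    multiline = bool(compiled.flags & re.MULTILINE)
    dotall = bool(compiled.flags & re.DOTALL)
    return compiled.pattern, multiline, dotall


compiled = re.compile(r"^B.", re.MULTILINE | re.DOTALL)
pattern_text, is_multiline, is_dotall = split_compiled(compiled)

plain = re.compile(r"^B.")
_, plain_multiline, plain_dotall = split_compiled(plain)
```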

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - https://github.com/brandon-b-miller
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9573