Add support for single-line regex anchors ^/$ in contains_re #9482

Merged on Nov 1, 2021 (23 commits)

@davidwendt davidwendt commented Oct 20, 2021

Closes #9439

In libcudf, the ^ (begin anchor) and $ (end anchor) currently match at the beginning of line (BOL) and end of line (EOL), respectively. As a result, they cannot be used on strings containing embedded new-line ('\n') characters when the anchors should match only the beginning and end of the string as a whole.

Many regex engines (Python, Java, C++) support a flag for overriding the behavior of the BOL/EOL anchors. This PR introduces a similar flags parameter to the cudf::strings::contains_re, cudf::strings::matches_re, and cudf::strings::count_re APIs that tells the regex engine how to interpret the anchor characters in the given regex pattern.

Additional information about these anchors can also be found here: https://www.regular-expressions.info/anchors.html

The current default behavior of the libcudf regex engine is to interpret BOL/EOL as if the MULTILINE flag were set. This does not match the engines/languages listed above, so for consistency the default is reversed, which makes this PR a breaking change.
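
For reference, the anchor behavior this PR aligns with can be illustrated with Python's standard `re` module. This is only a sketch of the reference semantics, not libcudf code; the helper names `contains` and `count` are hypothetical stand-ins for the contains/count style of operation.

```python
import re

# A string with an embedded new-line, as described above.
text = "abc\ndef"


def contains(pattern, s, flags=0):
    """Hypothetical helper: True if the pattern is found anywhere in s."""
    return re.search(pattern, s, flags) is not None


def count(pattern, s, flags=0):
    """Hypothetical helper: number of non-overlapping matches in s."""
    return len(re.findall(pattern, s, flags))


# Default: ^ and $ anchor to the beginning and end of the whole string.
default_contains = contains(r"^def$", text)
default_count = count(r"^\w+$", text)

# MULTILINE: ^ and $ also match at the beginning and end of each line.
multiline_contains = contains(r"^def$", text, re.MULTILINE)
multiline_count = count(r"^\w+$", text, re.MULTILINE)

print(default_contains, multiline_contains)  # False True
print(default_count, multiline_count)        # 0 2
```

Without the flag, `^def$` does not match because "def" starts after an embedded new-line rather than at the start of the string; with `re.MULTILINE` each line is anchored separately.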

The new flags parameter added to the above APIs is also a breaking change. An additional flag (DOTALL) is included in this PR because the internal regex code already supports it and only needed a way for the caller to specify the behavior. The DOTALL flag is also a feature of the above languages. When specified, the dot '.' pattern includes embedded new-line characters in its matching character set.
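
The DOTALL semantics can likewise be sketched with Python's `re` module. Again this shows the reference behavior in Python rather than the libcudf API itself.

```python
import re

# A string with an embedded new-line between 'c' and 'd'.
text = "abc\ndef"

# Default: '.' matches any character except a new-line.
dot_default = re.search(r"c.d", text)
runs_default = re.findall(r".+", text)

# DOTALL: '.' also matches the embedded new-line.
dot_all = re.search(r"c.d", text, re.DOTALL)
runs_dotall = re.findall(r".+", text, re.DOTALL)

print(dot_default is None)   # True
print(dot_all is not None)   # True
print(runs_default)          # ['abc', 'def']
print(runs_dotall)           # ['abc\ndef']
```

With DOTALL set, a single `.+` spans the whole string, so patterns that rely on `.` stopping at a line boundary behave differently once the flag is passed.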

@davidwendt added the feature request, 2 - In Progress, libcudf, strings, and breaking labels Oct 20, 2021
@davidwendt self-assigned this Oct 20, 2021
@github-actions bot added the conda label Oct 20, 2021
codecov bot commented Oct 20, 2021

Codecov Report

Merging #9482 (3e9c390) into branch-21.12 (ab4bfaa) will decrease coverage by 0.12%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-21.12    #9482      +/-   ##
================================================
- Coverage         10.79%   10.66%   -0.13%     
================================================
  Files               116      117       +1     
  Lines             18869    19729     +860     
================================================
+ Hits               2036     2104      +68     
- Misses            16833    17625     +792     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 66 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@github-actions bot added the Python label Oct 20, 2021
@davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Oct 27, 2021
@davidwendt marked this pull request as ready for review October 27, 2021 17:26

@ajschmidt8 ajschmidt8 left a comment

Approving ops-codeowner file changes

@vyasr vyasr left a comment

Mostly looks good to me. A few minor suggestions and a couple questions due to my unfamiliarity with this code.

@davidwendt davidwendt requested a review from vyasr October 29, 2021 11:53

@bdice bdice left a comment

Nice work. I have some suggestions for improvement, but overall this looks good.

@davidwendt davidwendt requested a review from bdice October 29, 2021 15:47

@bdice bdice left a comment

Looks good! Thanks @davidwendt.

vyasr commented Nov 1, 2021

@gpucibot merge

Labels
3 - Ready for Review, breaking, feature request, libcudf, Python, strings
Development

Successfully merging this pull request may close these issues.

[FEA] Regex: Provide option for ^ and $ to only match beginning and end of input string
4 participants