Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple new-line characters in regex APIs #15961

Merged
merged 65 commits into from
Sep 17, 2024

Conversation

davidwendt
Copy link
Contributor

Description

Add support for multiple new-line characters for BOL (^ / \A) and EOL ($ / \Z):

  • \n line-feed (already supported)
  • \r carriage-return
  • \u0085 next line (NEL)
  • \u2028 line separator
  • \u2029 paragraph separator

Reference #15746

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 10, 2024
@davidwendt davidwendt self-assigned this Jun 10, 2024
@NVnavkumar
Copy link
Contributor

Benchmark numbers from Spark-RAPIDS

Spark-RAPIDS uses a transpiler to handle the Java-based line-terminators

(based on methodology I used here #15746 (comment)).

Basically, using regexp_extract from Spark, I made 4 test runs. The first 2 test runs use cuDF 24.08, and the second 2 test runs using this branch. Between the 2 runs for each branch, the first run tests with the Spark-RAPIDS transpiler turned on (to measure the performance of the cudf regex engine with an expanded regex), and the second run tests with the transpiler turned off (to measure the performance of the cudf regex engine with just the original regex).

I chose a relatively simple pattern (boo:+)$ (which involves a single capture group and the $ to match the end of the string. I chose extract since that's the case where the Spark-RAPIDS regexp transpiler code since it tests how extract handles the potential non-capturing group for $ transpilation. The strings that this regex is run against all end in newline (\n), so the results would match for each of these runs.


1) Spark-RAPIDS Transpiled Regex (cuDF 24.08)
Name	Start	Duration	TID	Category
extract	20.8801s	6.062 ms	3368239	
extract	20.8801s	6.084 ms 3368238	
extract	20.9176s	5.029 ms	3368236	
extract	20.9165s	4.797 ms	3368237	
----------------------------------------
avg = 5.493 ms

2) Spark-RAPIDS Non-transpiled Regex (cuDF 24.08) 
Name	Start	Duration	TID	Category
extract	20.8900s    2.755 ms	3373723	
extract	20.8901s	    2.727 ms	3373722	
extract	20.9348s	2.099 ms	3373721	
extract	20.9346s	1.930 ms	3373724	
----------------------------------------
avg = 2.37775 ms

3) Spark-RAPIDS Simplified Transpiled Regex (cuDF regex-new-lines) 
   - estimated 
Name	Start	Duration	TID	Category
extract	20.2914s	4.051 ms	3383707	
extract	20.2914s	4.045 ms	3383706	
extract	20.3376s	2.788 ms	3383704	
extract	20.3377s	2.754 ms	3383705	
----------------------------------------
avg = 3.4095 ms

4) Non-transpiled Regex (cuDF regex-new-lines) 
Name	Start	Duration	TID	Category
extract	20.2504s	3.832 ms	3395376	
extract	20.2506s	3.711 ms	3395379	
extract	20.2844s	2.797 ms	3395378	
extract	20.2843s	2.726 ms	3395377	
----------------------------------------
avg = 3.2665 ms

Note when using this branch, I currently simplified the the $ transpilation from (?:\r|\u0085|\u2028|\u2029|\r\n)?$ to \r?$. This is a tentative fix, but it will require more testing. Ideally if this mode is put behind a configuration flag, that would make this transition easiest, since the Spark-RAPIDS fix will need significant correctness testing in order to confirm that this transpilation will work.

I can conclude that from these measurements, this branch would most likely improve Spark performance: (5.493 ms -> 3.4095 ms is about a 38% performance improvement) if we can use the more simplified $ transpilation. The original cudf regex engine performance would take a 37% performance hit in this mode, so this would encourage the use of flag here.

Let me know if any more information else is needed

@davidwendt davidwendt changed the base branch from branch-24.08 to branch-24.10 July 22, 2024 22:14
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Aug 26, 2024
@davidwendt davidwendt marked this pull request as ready for review August 28, 2024 14:24
@davidwendt davidwendt requested a review from a team as a code owner August 28, 2024 14:24
cpp/doxygen/regex.md Outdated Show resolved Hide resolved
cpp/src/strings/regex/regex.inl Show resolved Hide resolved
@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit e98e109 into rapidsai:branch-24.10 Sep 17, 2024
98 checks passed
@davidwendt davidwendt deleted the regex-new-lines branch September 17, 2024 17:19
rapids-bot bot pushed a commit that referenced this pull request Oct 23, 2024
This PR introduces the necessary changes to the cuDF jni to support the issue described in [NVIDIA/spark-rapids#11554](NVIDIA/spark-rapids#11554). For further information, refer to the details in the [comment](NVIDIA/spark-rapids#11554 (comment)).

Issue #15961 adds support for handling multiple line delimiters. This PR extends that functionality to JNI, which was previously missing, and also includes a test to validate the changes.

Authors:
  - Suraj Aralihalli (https://github.com/SurajAralihalli)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #17139
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants