Support multiple new-line characters in regex APIs #15961
Benchmark numbers from Spark-RAPIDS
Spark-RAPIDS uses a transpiler to handle the Java-based line terminators (based on the methodology I used in #15746 (comment)). Using `regexp_extract` from Spark, I made 4 test runs. The first 2 runs use cuDF 24.08, and the second 2 use this branch. Within each pair, the first run has the Spark-RAPIDS transpiler turned on (to measure the performance of the cudf regex engine with an expanded regex), and the second run has the transpiler turned off (to measure the performance of the cudf regex engine with just the original regex). I chose a relatively simple pattern
Note that when using this branch, I currently simplified the `$` transpilation from
From these measurements, I can conclude that this branch would most likely improve Spark performance (5.493 ms -> 3.4095 ms is about a 38% performance improvement) if we can use the more simplified
Let me know if any more information is needed.
/merge
This PR introduces the necessary changes to the cuDF JNI to support the issue described in NVIDIA/spark-rapids#11554. For further information, refer to the details in the comment (NVIDIA/spark-rapids#11554 (comment)).

Issue #15961 adds support for handling multiple line delimiters. This PR extends that functionality to JNI, where it was previously missing, and also includes a test to validate the changes.

Authors:
- Suraj Aralihalli (https://github.com/SurajAralihalli)

Approvers:
- MithunR (https://github.com/mythrocks)
- Robert (Bobby) Evans (https://github.com/revans2)

URL: #17139
Description
Add support for multiple new-line characters for BOL (`^` / `\A`) and EOL (`$` / `\Z`):
- `\n` line-feed (already supported)
- `\r` carriage-return
- `\u0085` next line (NEL)
- `\u2028` line separator
- `\u2029` paragraph separator

Reference #15746
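To illustrate the intended matching behavior for the characters listed above, here is a minimal sketch using Python's standard `re` module rather than the cudf API. The names `BOL`, `EOL`, `find_line_starts`, and `find_line_ends` are hypothetical helpers invented for this example, and the sketch treats each terminator as a single character (it does not address how a `\r\n` pair should be handled). It only emulates multi-line `^` and `$` with lookarounds and is not how this PR implements the feature.

```python
import re

# The five characters from the description, treated as line terminators.
LINE_TERMINATORS = "\n\r\u0085\u2028\u2029"

# Hypothetical emulation of multi-line anchors (assumption, not the cudf engine):
# BOL matches at the start of the string or right after any terminator,
# EOL matches right before any terminator or at the end of the string.
BOL = rf"(?:\A|(?<=[{LINE_TERMINATORS}]))"
EOL = rf"(?=[{LINE_TERMINATORS}]|\Z)"


def find_line_starts(text: str, word: str) -> list[int]:
    """Return offsets where `word` occurs at the beginning of a line."""
    pattern = re.compile(BOL + re.escape(word))
    return [m.start() for m in pattern.finditer(text)]


def find_line_ends(text: str, word: str) -> list[int]:
    """Return offsets where `word` occurs at the end of a line."""
    pattern = re.compile(re.escape(word) + EOL)
    return [m.start() for m in pattern.finditer(text)]


# One "line" per terminator; the third line has a leading "x".
text = "abc\rabc\u0085xabc\u2028abc\u2029abc\nabc"

extended_starts = find_line_starts(text, "abc")
extended_ends = find_line_ends(text, "abc")

# For contrast: Python's built-in MULTILINE mode only recognizes "\n".
default_starts = [m.start() for m in re.finditer(r"^abc", text, re.MULTILINE)]

print(extended_starts)  # [0, 4, 13, 17, 21]
print(extended_ends)    # [0, 4, 9, 13, 17, 21]
print(default_starts)   # [0, 21]
```

The contrast with `default_starts` shows the gap this change targets: an engine that only recognizes `\n` misses line boundaries formed by the other four characters.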
Checklist