Add additional regexp function regexp_count() #12080

xinlifoobar · 2024-08-20T13:49:36Z

Which issue does this PR close?

Closes #12079 and part of #11946

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

datafusion/functions/src/regex/regexpcount.rs

xinlifoobar · 2024-08-22T06:18:40Z

datafusion/sqllogictest/test_files/regexp.slt

+
+# NULL tests
+
+query I


This differs slightly from PostgreSQL: DataFusion treats a NULL literal as a StringArray with a single NULL element rather than as a null array or an empty array.

xinlifoobar · 2024-08-29T05:39:51Z

Finalizing the details takes a lot more effort than expected. Would you like to take a look? Thanks! @alamb @jayzhan211

Benchmark

regexp_count_1000 string
                        time:   [6.6158 ms 6.6634 ms 6.7108 ms]

regexp_count_1000 utf8view
                        time:   [6.7117 ms 6.7647 ms 6.8183 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

regexp_like_1000        time:   [3.7056 ms 3.7170 ms 3.7289 ms]
                        change: [-5.9843% -5.0861% -4.1952%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

regexp_match_1000       time:   [4.4132 ms 4.4287 ms 4.4466 ms]
                        change: [-9.8318% -8.4331% -7.0280%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

regexp_replace_1000     time:   [3.4351 ms 3.4697 ms 3.5142 ms]
                        change: [-13.734% -11.848% -10.019%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

xinlifoobar · 2024-08-29T05:50:56Z

datafusion/functions/Cargo.toml

@@ -52,7 +52,7 @@ encoding_expressions = ["base64", "hex"]
 # enable math functions
 math_expressions = []
 # enable regular expressions
-regex_expressions = ["regex"]
+regex_expressions = ["regex", "string_expressions"]


Have to do this for StringArrayType. Maybe we should consider relocating it to a common package?

I would be very open to moving StringArrayType, StringArrayBuilder, StringViewArrayBuilder, etc. to a file in functions such as string_array.rs, as they are used in multiple modules (unicode, regex, likely datetime in the future, etc.) and are quite useful in general for any external UDF that might need them.

xinlifoobar · 2024-08-29T05:53:05Z

datafusion/functions/src/regex/regexpcount.rs

+                )));
+            }
+
+            let mut regex_cache = HashMap::new();


Do we want a global regex cache?

I would be quite hesitant about adding any global cache for a UDF.
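For context, a cache scoped to a single invocation (as the `HashMap::new()` in the diff suggests) might look roughly like the sketch below. This is not the PR's code: `get_or_compile` and `count_rows` are hypothetical names, and a plain substring search stands in for a compiled regex so the example needs only the standard library.

```rust
use std::collections::HashMap;

/// Returns the cached value for `pattern`, calling `compile` only on a miss.
/// `T` is a hypothetical stand-in for a compiled regex type.
fn get_or_compile<'a, T>(
    cache: &'a mut HashMap<String, T>,
    pattern: &str,
    compile: impl FnOnce(&str) -> T,
) -> &'a T {
    if !cache.contains_key(pattern) {
        let compiled = compile(pattern);
        cache.insert(pattern.to_string(), compiled);
    }
    &cache[pattern]
}

/// Counts matches row by row using a cache that lives only for this call.
/// Returns the per-row counts and how many times `compile` actually ran.
fn count_rows(values: &[&str], patterns: &[&str]) -> (Vec<usize>, usize) {
    let mut cache: HashMap<String, String> = HashMap::new();
    let mut compile_calls = 0;
    let mut counts = Vec::with_capacity(values.len());
    for (value, pattern) in values.iter().zip(patterns) {
        let compiled = get_or_compile(&mut cache, pattern, |p| {
            compile_calls += 1;
            // Assumption: "compiling" is just copying the literal pattern.
            p.to_string()
        });
        // Assumption: literal substring counting stands in for regex matching.
        counts.push(value.matches(compiled.as_str()).count());
    }
    (counts, compile_calls)
}
```

Because the map is dropped when `count_rows` returns, repeated patterns within one batch are compiled once, and no state is shared between invocations.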

Omega359 · 2024-09-28T22:00:41Z

datafusion/functions/src/regex/regexpcount.rs

+            signature: Signature::one_of(
+                vec![
+                    Uniform(2, vec![Utf8, LargeUtf8, Utf8View]),
+                    Exact(vec![Utf8, Utf8, Int64]),


You may want to invert the order of these:

The planner attempts coercion to the target types in order, starting with the most preferred candidate. For example, given input `(Utf8View, Utf8)`, it first tries coercing to `(Utf8View, Utf8View)`. If that fails, it proceeds to `(Utf8, Utf8)`.
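To illustrate why the order matters, here is a toy model of first-match selection over an ordered candidate list. It is only a sketch, not DataFusion's coercion logic: the `Ty` enum, `can_coerce`, and `pick_uniform` are hypothetical, and the rule that any string type can be coerced to any other string type is an assumption made for the example.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Ty {
    Utf8,
    LargeUtf8,
    Utf8View,
    Int64,
}

fn is_string(t: Ty) -> bool {
    matches!(t, Ty::Utf8 | Ty::LargeUtf8 | Ty::Utf8View)
}

/// Assumption for this toy model: identical types always coerce, and any
/// string type coerces to any other string type.
fn can_coerce(from: Ty, to: Ty) -> bool {
    from == to || (is_string(from) && is_string(to))
}

/// Tries each candidate type in the order given and returns the first one
/// that every argument can be coerced to, repeated once per argument.
fn pick_uniform(input: &[Ty], preferred: &[Ty]) -> Option<Vec<Ty>> {
    preferred
        .iter()
        .copied()
        .find(|&target| input.iter().all(|&arg| can_coerce(arg, target)))
        .map(|target| vec![target; input.len()])
}
```

Under these assumptions, listing `Utf8View` first keeps a `(Utf8View, Utf8)` input as views, while listing `Utf8` first would resolve the same input to `(Utf8, Utf8)`.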

Omega359 · 2024-09-28T22:05:07Z

docs/source/user-guide/sql/scalar_functions.md

 - [regexp_like](#regexp_like)
 - [regexp_match](#regexp_match)
 - [regexp_replace](#regexp_replace)

 [pcre-like]: https://en.wikibooks.org/wiki/Regular_Expressions/Perl-Compatible_Regular_Expressions
 [syntax]: https://docs.rs/regex/latest/regex/#syntax

+### `regexp_count`
+
+Returns the number of matchs that a [regular expression] has in a string.


Omega359 · 2024-09-28T22:06:49Z

Thanks for the very nice PR @xinlifoobar ! I'll try and take the time to do a full review of this PR next week if no one beats me to it.

Omega359 · 2024-09-30T21:55:55Z

datafusion/functions/src/regex/regexpcount.rs

+            if values.len() != regex_array.len() {
+                return Err(ArrowError::ComputeError(format!(
+                    "regex_array must be the same length as values array; got {} and {}",
+                    values.len(),


I would suggest aligning the parameters with the text by having regex_array.len() first

Omega359 · 2024-09-30T22:03:36Z

datafusion/functions/src/regex/regexpcount.rs

+                    Exact(vec![LargeUtf8, LargeUtf8, Int64, LargeUtf8]),
+                    Exact(vec![Utf8View, Utf8View, Int64]),
+                    Exact(vec![Utf8View, Utf8View, Int64, Utf8View]),
+                ],


A note on this signature: the PostgreSQL version of this function treats start as optional, and I think it would be nice to allow that in this UDF as well. The UDF could just use a scalar of 1 when start is not present, so the existing count functions work as is.
Sorry, I missed the Uniform portion of the signature. I see the tests cover this as well :)
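The "default to 1" idea can be sketched as a small helper. The name `resolve_start` and the error text are hypothetical, and rejecting values below 1 is an assumption about the validation that precedes the counting code in the diff.

```rust
/// Resolves the optional 1-based `start` argument.
/// A missing value defaults to 1; values below 1 are rejected
/// (assumed behavior, not confirmed by the PR).
fn resolve_start(start: Option<i64>) -> Result<usize, String> {
    match start {
        None => Ok(1),
        Some(s) if s >= 1 => Ok(s as usize),
        Some(s) => Err(format!("start must be at least 1, got {s}")),
    }
}
```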

Omega359 · 2024-09-30T22:05:25Z

datafusion/functions/src/regex/regexpcount.rs

+            ));
+        }
+
+        let find_slice = value.chars().skip(start as usize - 1).collect::<String>();


I wonder if it would be worth it to have a fast path if start is 1? Untested suggestion:

Suggested change

-        let find_slice = value.chars().skip(start as usize - 1).collect::<String>();
+        let count = if start == 1 {
+            pattern.find_iter(value).count()
+        } else {
+            let find_slice = value.chars().skip(start as usize - 1).collect::<String>();
+            pattern.find_iter(find_slice.as_str()).count()
+        };
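A possible variant of this fast path, sketched below, also avoids the intermediate `String` when start is greater than 1 by slicing at the byte offset of the start character. This is untested against the PR itself: `count_from` is a hypothetical name, and the `count_matches` closure stands in for `pattern.find_iter(s).count()` so the example depends only on the standard library.

```rust
/// Counts matches in `value` beginning at the 1-based character position
/// `start`. `count_matches` is a hypothetical stand-in for
/// `pattern.find_iter(s).count()`.
fn count_from(value: &str, start: usize, count_matches: impl Fn(&str) -> usize) -> usize {
    // Fast path: nothing to skip, so search the original string directly.
    // `start` is assumed to have been validated as >= 1 by the caller.
    if start <= 1 {
        return count_matches(value);
    }
    // Locate the byte offset of the start character and borrow the tail,
    // rather than collecting the remaining chars into a new String.
    match value.char_indices().nth(start - 1) {
        Some((byte_offset, _)) => count_matches(&value[byte_offset..]),
        // `start` is past the end: search the empty string, which is what
        // skipping every character would have produced.
        None => count_matches(""),
    }
}
```

For example, `count_from("banana", 3, |s| s.matches('a').count())` searches only `"nana"`. Like the original `skip(...).collect()`, slicing means the pattern cannot see the text before the start position.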

Omega359 · 2024-10-02T22:06:36Z

I think this is a good PR and worthy of merging into main. My only concerns are the small things noted in my comments and the fact that regexp_count seems to be about twice as slow as regexp_like and regexp_replace. I was able to reproduce the benchmark, but unfortunately I cannot run flamegraph on my machine (perf on WSL is a black art), so I wasn't able to narrow down the cause.

Omega359 · 2024-10-10T15:51:49Z

@alamb - since @xinlifoobar seems to be dormant, my thought on this PR is to merge it and file a couple of tickets to improve it, primarily for the performance discrepancy.

alamb · 2024-10-11T14:38:32Z

Makes sense to me -- thank you @Omega359 -- would you be willing to make a new PR (merged/rebased to fix the conflicts on main)? I can help with the review, the merge, and filing follow-on tickets.

Omega359 · 2024-10-11T14:42:45Z

Sure thing. I'll hopefully work on it this weekend after #12149

alamb · 2024-10-18T20:20:26Z

Merged in #12970

xinlifoobar added 5 commits August 20, 2024 21:39

Implement regexp_ccount

f72c11f

Merge branch 'main' of github.com:apache/datafusion into dev/xinli/re…

682a50a

…gexp_count

Update document

ee23b97

fix check

d5b63f4

add more tests

2acd148

github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) functions labels Aug 20, 2024

Merge branch 'main' of github.com:apache/datafusion into dev/xinli/re…

a3563ee

…gexp_count

xinlifoobar commented Aug 21, 2024

Update the world to 1.80

27a6fc6

github-actions bot added development-process Related to development process of DataFusion core Core DataFusion crate substrait proto Related to proto crate labels Aug 21, 2024

xinlifoobar added 2 commits August 21, 2024 21:33

Fix doc format

d17e45d

Add null tests

ee14adf

xinlifoobar commented Aug 22, 2024

Add uft8 support and bench

08343dd

xinlifoobar force-pushed the dev/xinli/regexp_count branch from 45afcf0 to 08343dd Compare August 22, 2024 06:53

xinlifoobar marked this pull request as draft August 23, 2024 09:33

xinlifoobar added 3 commits August 28, 2024 22:27

Refactoring regexp_count

218ff7b

Merge branch 'main' of github.com:apache/datafusion into dev/xinli/re…

0333ec4

…gexp_count

Refactoring regexp_count

07312be

github-actions bot removed core Core DataFusion crate substrait proto Related to proto crate labels Aug 29, 2024

Revert ci change

4eb7e6b

xinlifoobar marked this pull request as ready for review August 29, 2024 04:15

github-actions bot removed the development-process Related to development process of DataFusion label Aug 29, 2024

Fix ci

cb13556

xinlifoobar commented Aug 29, 2024

This was referenced Sep 9, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 9, 2024 #12391

Closed

DataFusion weekly project plan (Andrew Lamb) - Sep 2, 2024 #12336

Closed

alamb mentioned this pull request Sep 16, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 16, 2024 #12494

Closed

8 tasks

Omega359 reviewed Sep 28, 2024

Omega359 reviewed Sep 30, 2024

This was referenced Oct 12, 2024

Move StringArrayType, StringViewArrayBuilder, StringArrayBuilder & LargeStringArrayBuilder from functions/string/common.rs to a more common crate/module #12898

Closed

feat: Add regexp_count function #12970

Merged

alamb closed this Oct 18, 2024

alamb mentioned this pull request Oct 18, 2024

Improve performance of regexp_count #13011

Open
