-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add additional regexp function regexp_count() #12080
Conversation
|
||
# NULL tests | ||
|
||
query I |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slightly different from PostgreSQL. Datafusion treat NULL literary as StringArray of 1 element NULL
instead of null array or empty array
45afcf0
to
08343dd
Compare
Finalizing the details takes a lot more effort than expected. Would you like to take a look? Thanks! @alamb @jayzhan211 Benchmark regexp_count_1000 string
time: [6.6158 ms 6.6634 ms 6.7108 ms]
regexp_count_1000 utf8view
time: [6.7117 ms 6.7647 ms 6.8183 ms]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
regexp_like_1000 time: [3.7056 ms 3.7170 ms 3.7289 ms]
change: [-5.9843% -5.0861% -4.1952%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
regexp_match_1000 time: [4.4132 ms 4.4287 ms 4.4466 ms]
change: [-9.8318% -8.4331% -7.0280%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
regexp_replace_1000 time: [3.4351 ms 3.4697 ms 3.5142 ms]
change: [-13.734% -11.848% -10.019%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) high mild
6 (6.00%) high severe |
@@ -52,7 +52,7 @@ encoding_expressions = ["base64", "hex"] | |||
# enable math functions | |||
math_expressions = [] | |||
# enable regular expressions | |||
regex_expressions = ["regex"] | |||
regex_expressions = ["regex", "string_expressions"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have to do this for StringArrayType
. Maybe we should consider relocating it to a common package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be very open to moving the StringArrayType, StringArrayBuilder, StringViewArrayBuilder, etc to a file in functions such as string_array.rs as they are used in multiple modules (unicode, regex, likely datetime in the future, etc) and are quite useful in general for any external UDF that might need them.
))); | ||
} | ||
|
||
let mut regex_cache = HashMap::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want a global regex cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be quite hesitant about adding any global cache for a UDF.
signature: Signature::one_of( | ||
vec![ | ||
Uniform(2, vec![Utf8, LargeUtf8, Utf8View]), | ||
Exact(vec![Utf8, Utf8, Int64]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to invert the order of these:
Planner attempts coercion to the target type starting with the most preferred candidate.
For example, given input `(Utf8View, Utf8)`, it first tries coercing to `(Utf8View, Utf8View)`.
If that fails, it proceeds to `(Utf8, Utf8)`.
- [regexp_like](#regexp_like) | ||
- [regexp_match](#regexp_match) | ||
- [regexp_replace](#regexp_replace) | ||
|
||
[pcre-like]: https://en.wikibooks.org/wiki/Regular_Expressions/Perl-Compatible_Regular_Expressions | ||
[syntax]: https://docs.rs/regex/latest/regex/#syntax | ||
|
||
### `regexp_count` | ||
|
||
Returns the number of matchs that a [regular expression] has in a string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
matches*
Thanks for the very nice PR @xinlifoobar ! I'll try and take the time to do a full review of this PR next week if no one beats me to it. |
if values.len() != regex_array.len() { | ||
return Err(ArrowError::ComputeError(format!( | ||
"regex_array must be the same length as values array; got {} and {}", | ||
values.len(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest aligning the parameters with the text by having regex_array.len() first
Exact(vec![LargeUtf8, LargeUtf8, Int64, LargeUtf8]), | ||
Exact(vec![Utf8View, Utf8View, Int64]), | ||
Exact(vec![Utf8View, Utf8View, Int64, Utf8View]), | ||
], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A note on this signature. The postgresql version of this has start as optional - I think it would be nice to allow that as well in this UDF. The UDF could just use a scalar of 1 if it's not present to make the existing count functions work as is.
Sorry, I missed the Uniform portion of the signature - I see the tests cover this as well :)
)); | ||
} | ||
|
||
let find_slice = value.chars().skip(start as usize - 1).collect::<String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be worth it to have a fast path if start is 1? Untested suggestion:
let find_slice = value.chars().skip(start as usize - 1).collect::<String>(); | |
let count = if start == 1 { pattern.find_iter(value).count() } else { | |
let find_slice = value.chars().skip(start as usize - 1).collect::<String>() | |
pattern.find_iter(find_slice.as_str()).count() | |
} ; |
I think this is a good PR and worthy of merging into main. My only thoughts are some small things noted in my comments and the fact that counts seems to be twice as slow as like and replace. I was able to reproduce the benchmark but unfortunately I cannot run flamegraph on my machine (perf and WSL is a black art) so I wasn't really able to narrow down the cause. |
@alamb - since @xinlifoobar seems to be dormant my thoughts on this PR is to merge it in and file a couple of tickets to improve it, primarily the performance discrepancy. |
Makes sense to me -- thank you @Omega359 -- would you be willing to make a new PR (merged/ rebased to fix the conflicts on main)? I can help with the review / merge / file follow on tickets. |
Sure thing. I'll hopefully work on it this weekend after #12149 |
Merged in #12970 |
Which issue does this PR close?
Closes #12079 and part of #11946
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?