-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: precompile literal regex pattern #11455
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution 👍🏼
Before regex is compiled once a batch (default 8k rows), this PR makes it once per execution.
I'm not familiar with the overhead of regular expression, if there is some case compiling a complicated regex is way more expensive than evaluating it, this PR would be necessary.
I left some minor suggestions.
let right = lit($B_SCALAR); | ||
|
||
// verify that we can construct the expression | ||
let expression = binary(left, $OP, right, &schema).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do an extra check: the regex is compiled inside the expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @zhuliquan and @2010YOUY01 for the review
I am concerned about special casing BinaryExpr in this way. I left some thoughts. Let me know what you think
pub struct BinaryExpr { | ||
left: Arc<dyn PhysicalExpr>, | ||
op: Operator, | ||
right: Arc<dyn PhysicalExpr>, | ||
/// Specifies whether an error is returned on overflow or not | ||
fail_on_overflow: bool, | ||
/// Only used when evaluating literal regex expressions. Example regex expression: c1 ~ '^a' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why you proceeded down this route. However, I think it is somewhat strange to have something so regexp specific attached to BinaryExpr
I wonder if it is possible to convert regex expressions like ~
to function calls and then use the existing function call API https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.simplify to rewrite to a new regular expression if the argument is a constant
Here is the regexp_like function: https://github.com/apache/datafusion/blob/main/datafusion/functions/src/regex/regexplike.rs
Though now that I look at this I wonder why there are separate implementations for ~
and regexp_like
🤔
@alamb I thought a idea that we can attach lru-cache (key is regex, value is result of compiled regex) to struct #[derive(Debug)]
pub struct RegexpLikeFunc {
lru_cache: LruCache<String, regex::Regex>,
signature: Signature,
}
impl Default for RegexpLikeFunc {
fn default() -> Self {
Self::new()
}
}
impl RegexpLikeFunc {
pub fn new() -> Self {
use DataType::*;
Self {
signature: Signature::one_of(
vec![
Exact(vec![Utf8, Utf8]),
Exact(vec![LargeUtf8, Utf8]),
Exact(vec![Utf8, Utf8, Utf8]),
Exact(vec![LargeUtf8, Utf8, Utf8]),
],
Volatility::Immutable,
),
lru_cache: LruCache::new(NonZeroUsize::new(1024).unwrap()),
}
}
} We can use cache like way in it's make full use of result of compiled regex. and this way can be applied for scalar or array two cases. |
de5735b
to
a0d3598
Compare
pub struct BinaryExpr { | ||
left: Arc<dyn PhysicalExpr>, | ||
op: Operator, | ||
right: Arc<dyn PhysicalExpr>, | ||
/// Specifies whether an error is returned on overflow or not | ||
fail_on_overflow: bool, | ||
/// Only used when evaluating literal regex expressions. Example regex expression: c1 ~ '^a' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a bit of a leaky abstraction or hack just for this one special case (= partial evaluation of regexes). If wonder if we could do something more elegant that would also work for other similar operators:
- introduce a new struct
CompiledRegex
that holds the complied regex. - implement
PhysicalExpr
forCompiledRegex
- change the regex evaluation of the binary expression to check if the RHS (=
BinaryExpr::right
isCompiledExpr
and use this pre-compiled value - change the optimizer pass to just replace the RHS instead attaching the cached data to
BinaryExpr
Another approach would be to lower/pre-compile the regex expressions to a custom data type, but currently arrow-rs doesn't support that, see apache/arrow-rs#4472
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look |
cb2a242
to
7db4213
Compare
it's help for saving compile time for each evaluating of record batch
Which issue does this PR close?
#11146
Closes #.
Rationale for this change
it's help for saving compile time for each evaluating of record batch
What changes are included in this PR?
1、precompile literal regex pattern when build binary expr
2、using precompiled regex for match when evaluating.
Are these changes tested?
yes
Are there any user-facing changes?
no