-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve regexp_match performance by avoiding cloning Regex #8631
Changes from 2 commits
a47e771
190fa60
467905b
adde667
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,7 +25,8 @@ use arrow::array::{ | |
new_null_array, Array, ArrayDataBuilder, ArrayRef, BufferBuilder, GenericStringArray, | ||
OffsetSizeTrait, | ||
}; | ||
use arrow::compute; | ||
use arrow_array::builder::{GenericStringBuilder, ListBuilder}; | ||
use arrow_schema::ArrowError; | ||
use datafusion_common::{arrow_datafusion_err, plan_err}; | ||
use datafusion_common::{ | ||
cast::as_generic_string_array, internal_err, DataFusionError, Result, | ||
|
@@ -58,7 +59,7 @@ pub fn regexp_match<T: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> { | |
2 => { | ||
let values = as_generic_string_array::<T>(&args[0])?; | ||
let regex = as_generic_string_array::<T>(&args[1])?; | ||
compute::regexp_match(values, regex, None).map_err(|e| arrow_datafusion_err!(e)) | ||
_regexp_match(values, regex, None).map_err(|e| arrow_datafusion_err!(e)) | ||
} | ||
3 => { | ||
let values = as_generic_string_array::<T>(&args[0])?; | ||
|
@@ -69,7 +70,7 @@ pub fn regexp_match<T: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> { | |
Some(f) if f.iter().any(|s| s == Some("g")) => { | ||
plan_err!("regexp_match() does not support the \"global\" option") | ||
}, | ||
_ => compute::regexp_match(values, regex, flags).map_err(|e| arrow_datafusion_err!(e)), | ||
_ => _regexp_match(values, regex, flags).map_err(|e| arrow_datafusion_err!(e)), | ||
} | ||
} | ||
other => internal_err!( | ||
|
@@ -78,6 +79,82 @@ pub fn regexp_match<T: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> { | |
} | ||
} | ||
|
||
/// TODO: Remove this once it is included in arrow-rs new release. | ||
viirya marked this conversation as resolved.
Show resolved
Hide resolved
|
||
fn _regexp_match<OffsetSize: OffsetSizeTrait>( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I will go to submit this fix and benchmark to arrow-rs later. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
array: &GenericStringArray<OffsetSize>, | ||
regex_array: &GenericStringArray<OffsetSize>, | ||
flags_array: Option<&GenericStringArray<OffsetSize>>, | ||
) -> std::result::Result<ArrayRef, ArrowError> { | ||
let mut patterns: std::collections::HashMap<String, Regex> = | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is awesome -- I think we can do better still (this will still recompile each regex per batch I think -- we can do it once per plan) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, it is per batch (not per row). Definitely if we do the regex compilation per query plan, it will be better. 🚀 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Btw, to clarify it if the long PR description is overlooked by reviewers. This hash map is already in the arrow-rs kernel. The actual fix here is to avoid expensive cloning of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @alamb are you thinking to have some static shared memory pool per query to make static values to be calculated only once per query? we can prob do that with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
std::collections::HashMap::new(); | ||
let builder: GenericStringBuilder<OffsetSize> = | ||
GenericStringBuilder::with_capacity(0, 0); | ||
let mut list_builder = ListBuilder::new(builder); | ||
|
||
let complete_pattern = match flags_array { | ||
Some(flags) => Box::new(regex_array.iter().zip(flags.iter()).map( | ||
|(pattern, flags)| { | ||
pattern.map(|pattern| match flags { | ||
Some(value) => format!("(?{value}){pattern}"), | ||
None => pattern.to_string(), | ||
}) | ||
}, | ||
)) as Box<dyn Iterator<Item = Option<String>>>, | ||
None => Box::new( | ||
regex_array | ||
.iter() | ||
.map(|pattern| pattern.map(|pattern| pattern.to_string())), | ||
), | ||
}; | ||
|
||
array | ||
.iter() | ||
.zip(complete_pattern) | ||
.map(|(value, pattern)| { | ||
match (value, pattern) { | ||
// Required for Postgres compatibility: | ||
// SELECT regexp_match('foobarbequebaz', ''); = {""} | ||
(Some(_), Some(pattern)) if pattern == *"" => { | ||
list_builder.values().append_value(""); | ||
list_builder.append(true); | ||
} | ||
(Some(value), Some(pattern)) => { | ||
let existing_pattern = patterns.get(&pattern); | ||
let re = match existing_pattern { | ||
Some(re) => re, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is actual fix. Getting rid of cloning |
||
None => { | ||
let re = Regex::new(pattern.as_str()).map_err(|e| { | ||
ArrowError::ComputeError(format!( | ||
"Regular expression did not compile: {e:?}" | ||
)) | ||
})?; | ||
patterns.insert(pattern.clone(), re); | ||
patterns.get(&pattern).unwrap() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And here too. |
||
} | ||
}; | ||
match re.captures(value) { | ||
Some(caps) => { | ||
let mut iter = caps.iter(); | ||
if caps.len() > 1 { | ||
iter.next(); | ||
} | ||
for m in iter.flatten() { | ||
list_builder.values().append_value(m.as_str()); | ||
} | ||
|
||
list_builder.append(true); | ||
} | ||
None => list_builder.append(false), | ||
} | ||
} | ||
_ => list_builder.append(false), | ||
} | ||
Ok(()) | ||
}) | ||
.collect::<std::result::Result<Vec<()>, ArrowError>>()?; | ||
Ok(Arc::new(list_builder.finish())) | ||
} | ||
|
||
/// replace POSIX capture groups (like \1) with Rust Regex group (like ${1}) | ||
/// used by regexp_replace | ||
fn regex_replace_posix_groups(replacement: &str) -> String { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we're also are not supporting the scalar version (I.e. regex being a literatal value)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you did some testing already apache/arrow-rs#5235 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have bigger "plans" -- #8051 to specialize the constant arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, for the ScalarUDF in DataFusion, we probably can rely on further optimization like #8051. Per kernel perspective, I will probably submit a Datum version later as mentioned apache/arrow-rs#5235 (comment).