-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization: make the most of Hint::AcceptsSingular when call make_scalar_function to Improve performance #10054
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JasonLi-cn -- this looks neat.
In terms of performance, I left a sugestion about how to improve this specific PR -- let me know what you think
@@ -88,10 +105,13 @@ pub fn overlay<T: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> { | |||
let characters_array = as_generic_string_array::<T>(&args[1])?; | |||
let pos_num = as_int64_array(&args[2])?; | |||
|
|||
let characters_array_iter = adaptive_array_iter(characters_array.iter()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the benchmark, there can be about a 12% performance improvement 🚀 when there are 3 scalar args; however, when all args are arrays, there is a 2% ~ 3% performance loss 😔. Does anyone have any ideas on how to avoid this loss?
One reason it may be getting slower is that there is overhead now during iteration (where it has to switch on Left/Right
in each element
One thing to try would be to avoid the Left/Right
iterator by instantiating different loops depending on the input types
So something like
fn<CI, PI> overlay3(string_array: &StringArray, characters_array_iter: CI, pos_num_iter: PI) -> Result<ArrayRef>
where
CI: Iterator<Item = char>,
PI: Iterator<Item =usize>
{
let result = string_array
.iter()
.zip(characters_array_iter)
.zip(pos_num_iter)
.map(|((string, characters), start_pos)| {
...
}
And then you need to instantiate versions based on lengths:
match (character_array.len(), pos_num.len()) {
(1, 1) => {
let c_value = character_array.next().expect("Contains a value");
let p_value = pos_num.next().expect("Contains a value");
overlay3(character_array, std::iter::repeat(c_value, len), std;:iter::repeat(p_value, len))
(1, _) => {
let c_value = character_array.next().expect("Contains a value");
overlay3(character_array, std::iter::repeat(c_value, len), pos_num.iter())
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb for your suggestion. However, if we were to make changes according to your suggestion, many functions would require significant modifications, such as lpad
, rpad
, strpos
, and so on.
Here, I am seeking a universal and elegant method: based on the current function implementations, with only minimal modifications, we can optimize for the Scalar case without degrading the performance for Arrays.
(Maybe my thinking is not correct, and I need your help to provide suggestions for modification 🙏)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb for your suggestion. However, if we were to make changes according to your suggestion, many functions would require significant modifications, such as
lpad
,rpad
,strpos
, and so on.
Yes, I agree with this assesment
Here, I am seeking a universal and elegant method: based on the current function implementations, with only minimal modifications, we can optimize for the Scalar case without degrading the performance for Arrays.
Thank you -- this makes sense.
I think the only way to do this is to have compile time specialized implementations for the cases you want to optimize.
(Maybe my thinking is not correct, and I need your help to provide suggestions for modification 🙏)
I suspect the solution will look something like a scalar (rust) function that actually implements the the operation and then a macro / generic that instantiates the function in different ways depending on argument type
For example
/// do the actual operation, calling `output` for each string produced
fn overlay3<F: Fn(&str)> (output: F, input: &str, len: usize, pos: usize) {...}
And then make a macro or other templated functin that instantiates overlay3
in different loops depending on the arguments (columns, scalars, etc).
You should be able to avoid code duplication in the source code, though we will need a copy a runtime
Hoepfully that helps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Referring to the suggestions you provided, I attempted to make modifications, and according to the benchmark results, there is no significant change before and after optimization when the arg is an Array.
However, I think the current implementation of the macro is not very good. Please help provide better suggestions for modification.
New benchmark
Gnuplot not found, using plotters backend
4args_with_3scalars/overlay/1024
time: [36.730 µs 36.969 µs 37.206 µs]
change: [-12.359% -11.851% -11.372%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
4args_with_3scalars/overlay/4096
time: [139.93 µs 140.35 µs 140.81 µs]
change: [-13.155% -12.784% -12.412%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
4args_with_3scalars/overlay/8192
time: [284.61 µs 285.42 µs 286.43 µs]
change: [-12.164% -11.692% -11.219%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
4args_without_scalar/overlay/1024
time: [36.599 µs 36.683 µs 36.765 µs]
change: [-0.1637% +0.1015% +0.3769%] (p = 0.48 > 0.05)
No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
4args_without_scalar/overlay/4096
time: [142.05 µs 142.41 µs 142.84 µs]
change: [-1.5247% -1.2324% -0.9312%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
4args_without_scalar/overlay/8192
time: [291.79 µs 292.42 µs 293.19 µs]
change: [-2.9181% -1.9757% -1.1922%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @JasonLi-cn -- I will try and take another look later this week
.zip(characters_array.iter()) | ||
.zip(pos_num.iter()) | ||
.zip(characters_array_iter) | ||
.zip(pos_num_iter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While looking at the innter loop of this function, you can probably get a major boost by avoiding String::with_capacoty
and instead reuse the same String
on each element. It might also be possible to skip the copy by using StringBuilder directly rather than copying to a temp string
0ac14ba
to
ad1a80d
Compare
Marking as a draft to make it clear this PR isn't waiting on feedback anymore (I am trying to reduce the outstanding PR review queue) |
Is this PR not satisfactory enough for performance? Do I need to make further improvements to this PR? 🤔 |
Sorry @JasonLi-cn -- marking as ready for review so we can give it another look |
It seems like this PR is failing a CI check: https://github.com/apache/datafusion/actions/runs/8771259487/job/24068788239?pr=10054 |
I originally intended to modify the Cargo.lock after it was approved. 😅 |
It doesn't matter. For this PR, I personally think it is valuable, but in terms of code implementation, I am not sure if there is a better solution, which needs the help of friends in the community. 🙏 |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #10053
Rationale for this change
Benchmark
overlay
functionDiscussion🤔
According to the benchmark, there can be about a 12% performance improvement 🚀 when there are 3 scalar args; however, when all args are arrays, there is a 2% ~ 3% performance loss 😔. Does anyone have any ideas on how to avoid this loss?
What changes are included in this PR?
Currently, this approach has only been validated based on the
overlay
function. If everyone thinks it's appropriate, it can be applied to other functions as well.Are these changes tested?
Are there any user-facing changes?