Optimization: make the most of Hint::AcceptsSingular when call make_scalar_function to Improve performance #10054

JasonLi-cn · 2024-04-12T08:17:28Z

Which issue does this PR close?

Rationale for this change

Benchmark

`overlay` function

Gnuplot not found, using plotters backend
4args_with_3scalars/overlay/1024
                        time:   [36.540 µs 36.560 µs 36.587 µs]
                        change: [-12.476% -12.385% -12.284%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

4args_with_3scalars/overlay/4096
                        time:   [141.54 µs 141.65 µs 141.82 µs]
                        change: [-12.584% -12.425% -12.231%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

4args_with_3scalars/overlay/8192
                        time:   [282.77 µs 283.42 µs 284.20 µs]
                        change: [-13.658% -13.353% -12.997%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

4args_without_scalar/overlay/1024
                        time:   [37.682 µs 37.717 µs 37.757 µs]
                        change: [+1.8225% +1.9984% +2.1754%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

4args_without_scalar/overlay/4096
                        time:   [147.68 µs 147.77 µs 147.91 µs]
                        change: [+2.7595% +3.0052% +3.2531%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe

4args_without_scalar/overlay/8192
                        time:   [298.84 µs 299.63 µs 300.27 µs]
                        change: [+1.6336% +2.0972% +2.5438%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Discussion🤔

According to the benchmark, there can be about a 12% performance improvement 🚀 when there are 3 scalar args; however, when all args are arrays, there is a 2% ~ 3% performance loss 😔. Does anyone have any ideas on how to avoid this loss?

What changes are included in this PR?

Currently, this approach has only been validated based on the overlay function. If everyone thinks it's appropriate, it can be applied to other functions as well.

Are these changes tested?

Are there any user-facing changes?

alamb

Thanks @JasonLi-cn -- this looks neat.

In terms of performance, I left a sugestion about how to improve this specific PR -- let me know what you think

alamb · 2024-04-14T15:37:11Z

datafusion/functions/src/string/overlay.rs

@@ -88,10 +105,13 @@ pub fn overlay<T: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> {
            let characters_array = as_generic_string_array::<T>(&args[1])?;
            let pos_num = as_int64_array(&args[2])?;

+            let characters_array_iter = adaptive_array_iter(characters_array.iter());


According to the benchmark, there can be about a 12% performance improvement 🚀 when there are 3 scalar args; however, when all args are arrays, there is a 2% ~ 3% performance loss 😔. Does anyone have any ideas on how to avoid this loss?

One reason it may be getting slower is that there is overhead now during iteration (where it has to switch on Left/Right in each element

One thing to try would be to avoid the Left/Right iterator by instantiating different loops depending on the input types

So something like

fn<CI, PI> overlay3(string_array: &StringArray, characters_array_iter: CI, pos_num_iter: PI) -> Result<ArrayRef> where CI: Iterator<Item = char>, PI: Iterator<Item =usize> { let result = string_array .iter() .zip(characters_array_iter) .zip(pos_num_iter) .map(|((string, characters), start_pos)| { ... }

And then you need to instantiate versions based on lengths:

match (character_array.len(), pos_num.len()) { (1, 1) => { let c_value = character_array.next().expect("Contains a value"); let p_value = pos_num.next().expect("Contains a value"); overlay3(character_array, std::iter::repeat(c_value, len), std;:iter::repeat(p_value, len)) (1, _) => { let c_value = character_array.next().expect("Contains a value"); overlay3(character_array, std::iter::repeat(c_value, len), pos_num.iter()) ... }

Thank you @alamb for your suggestion. However, if we were to make changes according to your suggestion, many functions would require significant modifications, such as lpad, rpad, strpos, and so on.
Here, I am seeking a universal and elegant method: based on the current function implementations, with only minimal modifications, we can optimize for the Scalar case without degrading the performance for Arrays.
(Maybe my thinking is not correct, and I need your help to provide suggestions for modification 🙏)

Thank you @alamb for your suggestion. However, if we were to make changes according to your suggestion, many functions would require significant modifications, such as lpad, rpad, strpos, and so on.

Yes, I agree with this assesment

Here, I am seeking a universal and elegant method: based on the current function implementations, with only minimal modifications, we can optimize for the Scalar case without degrading the performance for Arrays.

Thank you -- this makes sense.

I think the only way to do this is to have compile time specialized implementations for the cases you want to optimize.

(Maybe my thinking is not correct, and I need your help to provide suggestions for modification 🙏)

I suspect the solution will look something like a scalar (rust) function that actually implements the the operation and then a macro / generic that instantiates the function in different ways depending on argument type

For example

/// do the actual operation, calling `output` for each string produced fn overlay3<F: Fn(&str)> (output: F, input: &str, len: usize, pos: usize) {...}

And then make a macro or other templated functin that instantiates overlay3 in different loops depending on the arguments (columns, scalars, etc).

You should be able to avoid code duplication in the source code, though we will need a copy a runtime

Hoepfully that helps

Referring to the suggestions you provided, I attempted to make modifications, and according to the benchmark results, there is no significant change before and after optimization when the arg is an Array.
However, I think the current implementation of the macro is not very good. Please help provide better suggestions for modification.

New benchmark

Gnuplot not found, using plotters backend 4args_with_3scalars/overlay/1024 time: [36.730 µs 36.969 µs 37.206 µs] change: [-12.359% -11.851% -11.372%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild 4args_with_3scalars/overlay/4096 time: [139.93 µs 140.35 µs 140.81 µs] change: [-13.155% -12.784% -12.412%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe 4args_with_3scalars/overlay/8192 time: [284.61 µs 285.42 µs 286.43 µs] change: [-12.164% -11.692% -11.219%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild 4args_without_scalar/overlay/1024 time: [36.599 µs 36.683 µs 36.765 µs] change: [-0.1637% +0.1015% +0.3769%] (p = 0.48 > 0.05) No change in performance detected. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild 4args_without_scalar/overlay/4096 time: [142.05 µs 142.41 µs 142.84 µs] change: [-1.5247% -1.2324% -0.9312%] (p = 0.00 < 0.05) Change within noise threshold. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe 4args_without_scalar/overlay/8192 time: [291.79 µs 292.42 µs 293.19 µs] change: [-2.9181% -1.9757% -1.1922%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) high mild 2 (2.00%) high severe

thanks @JasonLi-cn -- I will try and take another look later this week

alamb · 2024-04-14T15:43:33Z

datafusion/functions/src/string/overlay.rs

-                .zip(characters_array.iter())
-                .zip(pos_num.iter())
+                .zip(characters_array_iter)
+                .zip(pos_num_iter)


While looking at the innter loop of this function, you can probably get a major boost by avoiding String::with_capacoty and instead reuse the same String on each element. It might also be possible to skip the copy by using StringBuilder directly rather than copying to a temp string

alamb · 2024-05-15T19:08:32Z

Marking as a draft to make it clear this PR isn't waiting on feedback anymore (I am trying to reduce the outstanding PR review queue)

JasonLi-cn · 2024-05-16T04:59:52Z

Marking as a draft to make it clear this PR isn't waiting on feedback anymore (I am trying to reduce the outstanding PR review queue)

Is this PR not satisfactory enough for performance? Do I need to make further improvements to this PR? 🤔

alamb · 2024-05-16T18:32:15Z

Sorry @JasonLi-cn -- marking as ready for review so we can give it another look

alamb · 2024-05-16T18:32:42Z

It seems like this PR is failing a CI check: https://github.com/apache/datafusion/actions/runs/8771259487/job/24068788239?pr=10054

JasonLi-cn · 2024-05-17T06:33:11Z

It seems like this PR is failing a CI check: https://github.com/apache/datafusion/actions/runs/8771259487/job/24068788239?pr=10054

I originally intended to modify the Cargo.lock after it was approved. 😅

JasonLi-cn · 2024-05-17T06:36:45Z

Sorry @JasonLi-cn -- marking as ready for review so we can give it another look

It doesn't matter. For this PR, I personally think it is valuable, but in terms of code implementation, I am not sure if there is a better solution, which needs the help of friends in the community. 🙏

github-actions · 2024-07-17T01:51:24Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions · 2024-09-16T02:00:28Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

alamb reviewed Apr 14, 2024

View reviewed changes

JasonLi-cn added 4 commits April 17, 2024 10:02

feat: add create_adaptive_array_iter macro

c8b2ac5

feat: add adaptive_array_iter function

f7e6102

chore: pass clippy

d1a17ba

feat: update overlay

ad1a80d

JasonLi-cn force-pushed the feature/make_the_most_of_Hint_AcceptsSingular branch from 0ac14ba to ad1a80d Compare April 21, 2024 08:13

This was referenced Apr 23, 2024

DataFusion weekly project plan (Andrew Lamb) - April 22, 2024 #10172

Closed

DataFusion weekly project plan (Andrew Lamb) - April 29, 2024 #10283

Closed

alamb marked this pull request as draft May 15, 2024 19:08

alamb marked this pull request as ready for review May 16, 2024 18:32

github-actions bot added the Stale PR has not had any activity for some time label Jul 17, 2024

JasonLi-cn marked this pull request as draft July 17, 2024 02:13

github-actions bot removed the Stale PR has not had any activity for some time label Jul 18, 2024

github-actions bot added the Stale PR has not had any activity for some time label Sep 16, 2024

github-actions bot closed this Sep 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization: make the most of Hint::AcceptsSingular when call make_scalar_function to Improve performance #10054

Optimization: make the most of Hint::AcceptsSingular when call make_scalar_function to Improve performance #10054

JasonLi-cn commented Apr 12, 2024

alamb left a comment

alamb Apr 14, 2024

JasonLi-cn Apr 15, 2024

alamb Apr 16, 2024

JasonLi-cn Apr 21, 2024

alamb Apr 23, 2024

alamb Apr 14, 2024

alamb commented May 15, 2024

JasonLi-cn commented May 16, 2024

alamb commented May 16, 2024

alamb commented May 16, 2024

JasonLi-cn commented May 17, 2024

JasonLi-cn commented May 17, 2024

github-actions bot commented Jul 17, 2024

github-actions bot commented Sep 16, 2024

Optimization: make the most of Hint::AcceptsSingular when call make_scalar_function to Improve performance #10054

Optimization: make the most of Hint::AcceptsSingular when call make_scalar_function to Improve performance #10054

Conversation

JasonLi-cn commented Apr 12, 2024

Which issue does this PR close?

Rationale for this change

Benchmark

overlay function

Discussion🤔

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb left a comment

Choose a reason for hiding this comment

alamb Apr 14, 2024

Choose a reason for hiding this comment

JasonLi-cn Apr 15, 2024

Choose a reason for hiding this comment

alamb Apr 16, 2024

Choose a reason for hiding this comment

JasonLi-cn Apr 21, 2024

Choose a reason for hiding this comment

New benchmark

alamb Apr 23, 2024

Choose a reason for hiding this comment

alamb Apr 14, 2024

Choose a reason for hiding this comment

alamb commented May 15, 2024

JasonLi-cn commented May 16, 2024

alamb commented May 16, 2024

alamb commented May 16, 2024

JasonLi-cn commented May 17, 2024

JasonLi-cn commented May 17, 2024

github-actions bot commented Jul 17, 2024

github-actions bot commented Sep 16, 2024

`overlay` function