
Speed up the substring kernel by about 2x #1512

Merged
merged 6 commits into from
Apr 7, 2022

Conversation

HaoYang670
Contributor

@HaoYang670 HaoYang670 commented Mar 31, 2022

Which issue does this PR close?

Closes #1511.

Rationale for this change

Improve the performance and do some clean-up work.

What changes are included in this PR?

  1. Calculate new_offsets first, to estimate the memory needed for the values buffer.
  2. Clean up the code: remove redundant comparisons.
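As a rough illustration of change (1), here is a simplified model using plain Rust types (String / Vec<i32>) instead of the arrow buffer types; the clamping rule is a toy stand-in for the kernel's, and all names are illustrative:

```rust
// Two-pass sketch: pass 1 computes the new offsets, whose last element
// is exactly the number of bytes the values buffer needs; pass 2 then
// copies into a single, exactly-sized allocation.
// Toy rule: drop `start` bytes (start >= 0) from the front of each entry.
fn substring_all(values: &str, offsets: &[i32], start: i32) -> (String, Vec<i32>) {
    // pass 1: new offsets
    let mut new_offsets = vec![0];
    for pair in offsets.windows(2) {
        let kept = (pair[1] - pair[0] - start).max(0);
        new_offsets.push(*new_offsets.last().unwrap() + kept);
    }
    // the final offset is the total output size: allocate once, exactly
    let mut new_values = String::with_capacity(*new_offsets.last().unwrap() as usize);
    // pass 2: copy the kept slice of each entry
    for pair in offsets.windows(2) {
        let s = (pair[0] + start).min(pair[1]) as usize;
        new_values.push_str(&values[s..pair[1] as usize]);
    }
    (new_values, new_offsets)
}

fn main() {
    // two entries: "hello" and "world!" → offsets [0, 5, 11]
    let (vals, offs) = substring_all("helloworld!", &[0, 5, 11], 2);
    println!("{vals} {offs:?}"); // llorld! [0, 3, 7]
}
```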

Benchmark

chip: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz

substring               time:   [19.806 ms 19.824 ms 19.844 ms]                      
                        change: [-44.909% -44.829% -44.743%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe

About 2X speed-up

Are there any user-facing changes?

No.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 31, 2022
@HaoYang670 HaoYang670 changed the title Speed up substring Speed up the substring kernel Mar 31, 2022
offsets
    .windows(2)
    .map(|pair| (pair[1] + start).max(pair[0]))
    .collect()
Contributor

It might be better to compute new_starts/new_length in one iteration over offsets and use unzip to materialize them into two Vecs?
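That unzip suggestion might look roughly like this (a sketch with i32 offsets and a made-up clamping rule, not the kernel's actual code):

```rust
// One pass over the offsets, producing the clamped start and the
// resulting length per entry, split into two Vecs by unzip.
fn starts_and_lengths(offsets: &[i32], start: i32) -> (Vec<i32>, Vec<i32>) {
    offsets
        .windows(2)
        .map(|pair| {
            let new_start = (pair[0] + start).max(pair[0]).min(pair[1]);
            (new_start, pair[1] - new_start)
        })
        .unzip()
}

fn main() {
    let (starts, lengths) = starts_and_lengths(&[0, 5, 11], 2);
    println!("{starts:?} {lengths:?}"); // [2, 7] [3, 4]
}
```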

Contributor

@Dandandan Dandandan Mar 31, 2022

Thinking about it, materializing new_length into a Vec is actually not needed. It is probably faster (and saves allocating the Vec) to compute the length again when calculating new_values.

Contributor Author

@HaoYang670 HaoYang670 Mar 31, 2022

I tried to remove the .collect, actually. But returning closures from the branches of an if-else hits the error `no two closures, even if identical, have the same type`, and I did not find a good solution other than adding .collect.
rust-lang/rust#87961
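For context: the error arises because every closure literal has its own anonymous type, so two if-else arms cannot both return "a closure"; boxing erases the concrete type behind a trait object. A minimal sketch with unrelated toy closures:

```rust
// This does NOT compile, even though the closures are identical:
//
//     let f = if cond { |x: i32| x + 1 } else { |x: i32| x + 1 };
//     // error: no two closures, even if identical, have the same type
//
// Boxing both arms behind a trait object gives them one common type:
fn pick(cond: bool) -> Box<dyn Fn(i32) -> i32> {
    if cond {
        Box::new(|x| x + 1)
    } else {
        Box::new(|x| x * 2)
    }
}

fn main() {
    println!("{} {}", pick(true)(10), pick(false)(10)); // 11 20
}
```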

Contributor

Yes, that is not possible in Rust (I think only by boxing, or by manually inlining / using macros).

Calculating the length in two places, with something like let length = *end - *start in both the new_offsets and new_values computations, doesn't seem too bad in terms of duplication?

Contributor Author

@HaoYang670 HaoYang670 Apr 1, 2022

> Yes, that is not possible in Rust (I think only by boxing, or by manually inlining / using macros).
>
> Calculating the length in two places, with something like let length = *end - *start in both the new_offsets and new_values computations, doesn't seem too bad in terms of duplication?

One way is to make the closure capture nothing by passing the needed values as extra parameters, so the compiler can coerce the closure to a function pointer. This works for calculating new_starts, but doesn't work for new_length, where we must capture the unwrapped Some(length).
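The fn-pointer coercion mentioned here can be sketched as follows (a toy example; in the real kernel the parameters would be offsets, and all names are illustrative):

```rust
// A closure that captures nothing coerces to a plain function pointer,
// so both if-else arms can share the single type fn(i32, i32) -> i32.
// Here `start` is passed as a parameter instead of being captured.
fn choose_new_start(negative_start: bool) -> fn(i32, i32) -> i32 {
    if negative_start {
        |entry_end, start| entry_end + start
    } else {
        |_entry_end, start| start
    }
}

fn main() {
    // entry ends at byte 10; start = -3 counts from the end
    println!("{}", choose_new_start(true)(10, -3)); // 7

    // A capturing closure, however, cannot coerce to a fn pointer:
    //     let length = 3;
    //     let f: fn(i32) -> i32 = |x| x + length;
    //     // error: closures can only be coerced to `fn` types
    //     // if they do not capture any variables
}
```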

Contributor

My later comment suggested not just avoiding the new Vec, but not allocating the array at all (by moving the calculation into the iteration in new_values). Is that something you want to try?

> Thinking about it, materializing new_length into a Vec is actually not needed. It is probably faster (and saves allocating the Vec) to compute the length again when calculating new_values.

let length_i: OffsetSize = offsets[i + 1] - offsets[i];
// compute where we should start slicing this entry
let start = offsets[i]
    + if start >= OffsetSize::zero() {
Contributor Author

We don't need to compare start with zero multiple times, so I moved this comparison outside the for loop.

let start = start.max(offsets[i]).min(offsets[i + 1]);
// compute the length of the slice
let length: OffsetSize = length
    .unwrap_or(length_i)
Contributor Author

Same reason: we don't need to unwrap length on every iteration, so I moved it outside the loop.
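The two hoists described in these comments can be sketched together (a simplified i32 version, assuming the Box<dyn Fn> approach the PR later adopted; names are illustrative):

```rust
// Resolve both run-time decisions once, before the loop:
// the sign of `start` and whether `length` is Some or None.
fn substring_ranges(offsets: &[i32], start: i32, length: Option<i32>) -> Vec<(i32, i32)> {
    // hoisted: `start >= 0` is checked once, not per entry
    let start_of: Box<dyn Fn(i32, i32) -> i32> = if start >= 0 {
        Box::new(move |entry_start, entry_end| (entry_start + start).min(entry_end))
    } else {
        Box::new(move |entry_start, entry_end| (entry_end + start).max(entry_start))
    };
    // hoisted: `length` is unwrapped once, not per entry
    let end_of: Box<dyn Fn(i32, i32) -> i32> = match length {
        Some(l) => Box::new(move |s, entry_end| (s + l).min(entry_end)),
        None => Box::new(|_s, entry_end| entry_end),
    };
    offsets
        .windows(2)
        .map(|pair| {
            let s = start_of(pair[0], pair[1]);
            (s, end_of(s, pair[1]))
        })
        .collect()
}

fn main() {
    println!("{:?}", substring_ranges(&[0, 5, 11], 1, Some(3))); // [(1, 4), (6, 9)]
}
```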

Comment on lines 88 to 91
.map(|(start, length)| {
    (start.to_usize().unwrap(), length.to_usize().unwrap())
})
.map(|(start, length)| &data[start..start + length])
Contributor

Suggested change
.map(|(start, length)| {
    (start.to_usize().unwrap(), length.to_usize().unwrap())
})
.map(|(start, length)| &data[start..start + length])

.map(|(start, length)| {
    let start = start.to_usize().unwrap();
    let length = length.to_usize().unwrap();
    &data[start..start + length]
})

Contributor Author

Done!

@codecov-commenter

codecov-commenter commented Apr 1, 2022

Codecov Report

Merging #1512 (f66990a) into master (15c87ae) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1512      +/-   ##
==========================================
+ Coverage   82.72%   82.75%   +0.02%     
==========================================
  Files         189      190       +1     
  Lines       54561    54707     +146     
==========================================
+ Hits        45137    45274     +137     
- Misses       9424     9433       +9     
Impacted Files Coverage Δ
arrow/src/compute/kernels/substring.rs 98.31% <100.00%> (+0.07%) ⬆️
arrow/src/array/equal/utils.rs 74.45% <0.00%> (-3.24%) ⬇️
parquet/src/encodings/encoding.rs 93.37% <0.00%> (-0.19%) ⬇️
arrow/src/array/builder.rs 86.68% <0.00%> (-0.05%) ⬇️
arrow/src/array/equal/union.rs 100.00% <0.00%> (ø)
arrow/src/array/transform/mod.rs 86.46% <0.00%> (+0.11%) ⬆️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.22%) ⬆️
arrow/src/array/equal/mod.rs 93.98% <0.00%> (+0.65%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 15c87ae...f66990a. Read the comment docs.

Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
@HaoYang670
Contributor Author

HaoYang670 commented Apr 1, 2022

Hi @Dandandan ! These are some changes since your last review:

  1. Use a dynamic Fn trait object with Box, so that there is no repeated code.
  2. Give up the purely functional style, because it is not suitable in this context; a mut Vec plus for_each is more straightforward.
  3. We have to materialize new_offsets, new_length (new_end now), and new_start before calculating new_values, because we need new_offsets to estimate the size of the array, and new_offsets depends on new_length, which depends on new_start.
  4. dyn Fn may introduce a small performance penalty. I have updated the benchmark.

    length_i + start
};
let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> =
    if start >= zero
Contributor Author

cargo fmt likes this weird format.

length_so_far += length;
// start and end offsets for each substring
let mut new_starts_ends: Vec<(OffsetSize, OffsetSize)> =
    Vec::with_capacity(array.len());
Contributor Author

If we don't materialize new_starts_ends here, we have to calculate them twice when calculating new_values.

Contributor Author

It is a time-space trade-off, but it is not the major factor affecting performance.


new_offsets.push(length_so_far);
offsets.windows(2).for_each(|pair| {
Contributor Author

The purely functional style is not appropriate here because new_offsets and new_starts_ends have different numbers of elements. After trying several approaches, I found that a mut Vec plus for_each is more readable and efficient.
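The element-count mismatch can be made concrete: for n entries there are n ranges but n + 1 offsets, so a single map/collect over offsets.windows(2) cannot produce both outputs. A sketch of the mut-Vec + for_each shape (illustrative toy transform, not the kernel's exact code):

```rust
// For n entries there are n (start, end) ranges but n + 1 offsets
// (the leading 0 plus one running total per entry), so one
// map/collect over windows(2) cannot fill both Vecs.
fn build(offsets: &[i32]) -> (Vec<i32>, Vec<(i32, i32)>) {
    let n = offsets.len() - 1;
    let mut new_offsets = Vec::with_capacity(n + 1);
    let mut new_starts_ends = Vec::with_capacity(n);
    let mut length_so_far = 0;
    new_offsets.push(length_so_far); // the extra leading element
    offsets.windows(2).for_each(|pair| {
        // toy transform: drop the first byte of each entry
        let (start, end) = ((pair[0] + 1).min(pair[1]), pair[1]);
        length_so_far += end - start;
        new_offsets.push(length_so_far);
        new_starts_ends.push((start, end));
    });
    (new_offsets, new_starts_ends)
}

fn main() {
    let (offs, ranges) = build(&[0, 5, 11]);
    println!("{offs:?} {ranges:?}"); // [0, 4, 9] [(1, 5), (6, 11)]
}
```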

@HaoYang670
Contributor Author

Hi @Dandandan! What's your opinion?

Contributor

@alamb alamb left a comment

Thank you @HaoYang670

I spent some time reading this code but I ran out of time to properly review it. I am worried there are corner cases in here that might not be covered by tests -- did you review test coverage and is this properly covered?

I think some comments might help me understand the code more quickly as well.

I will try and find some more time later this week or next to review it more

@HaoYang670
Contributor Author

Thank you very much @alamb. I have reviewed the Codecov report, and all code branches are covered.

@alamb alamb changed the title Speed up the substring kernel Speed up the substring kernel by about 2x Apr 7, 2022
@alamb
Contributor

alamb commented Apr 7, 2022

Thanks @HaoYang670 and @Dandandan

@alamb alamb merged commit 497521f into apache:master Apr 7, 2022
@HaoYang670 HaoYang670 deleted the speed_up_substring branch April 7, 2022 23:13