Improve `take` kernel performance on primitive arrays, fix bad null index handling (#4404) #4405

tustvold · 2023-06-13T11:13:44Z

Which issue does this PR close?

Rationale for this change

Fixes #4404 and improves performance significantly

Using the benchmarks in #4403

take i32 512            time:   [259.96 ns 260.19 ns 260.43 ns]
                        change: [-33.771% -33.654% -33.538%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

take i32 1024           time:   [411.64 ns 411.83 ns 412.05 ns]
                        change: [-31.098% -31.057% -31.012%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

take i32 null indices 1024
                        time:   [556.36 ns 556.65 ns 556.94 ns]
                        change: [-14.529% -14.467% -14.407%] (p = 0.00 < 0.05)
                        Performance has improved.

take i32 null values 1024
                        time:   [1.3158 µs 1.3167 µs 1.3176 µs]
                        change: [-41.607% -41.530% -41.446%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

take i32 null values null indices 1024
                        time:   [1.6743 µs 1.6761 µs 1.6775 µs]
                        change: [-43.953% -43.710% -43.467%] (p = 0.00 < 0.05)
                        Performance has improved.

take check bounds i32 512
                        time:   [381.76 ns 381.99 ns 382.22 ns]
                        change: [-25.746% -25.670% -25.599%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

take check bounds i32 1024
                        time:   [662.78 ns 663.02 ns 663.26 ns]
                        change: [-22.074% -21.942% -21.848%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 512           time:   [525.34 ns 525.72 ns 526.11 ns]
                        change: [-6.8248% -6.5847% -6.3665%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low severe
  1 (1.00%) high mild
  1 (1.00%) high severe

take bool 1024          time:   [885.39 ns 886.30 ns 887.29 ns]
                        change: [-12.935% -12.734% -12.536%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

take bool null indices 1024
                        time:   [1.0396 µs 1.0401 µs 1.0406 µs]
                        change: [-51.963% -51.923% -51.885%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

take bool null values 1024
                        time:   [1.7840 µs 1.7853 µs 1.7871 µs]
                        change: [-2.8921% -2.7540% -2.6122%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

take bool null values null indices 1024
                        time:   [2.1861 µs 2.1878 µs 2.1897 µs]
                        change: [-45.707% -45.587% -45.456%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)

What changes are included in this PR?

Are there any user-facing changes?

Previously a negative index would return an error even when TakeOptions::check_bounds was false. The code will now consistently panic on out-of-bounds indices, regardless of whether the index is the result of a wrapping conversion of a negative number. This yields a non-trivial speedup, as the additional branch seemed to cause LLVM some issues.
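To illustrate the wrapping conversion mentioned above, here is a minimal sketch using only the standard library. `take_values` is a hypothetical helper, not the arrow-rs kernel, and the plain `as usize` cast stands in for whatever index conversion the kernel performs. A negative `i32` wraps to a very large `usize`, so the ordinary slice bounds check panics without a separate "is it negative?" branch.

```rust
/// Hypothetical helper for illustration only; not the arrow-rs `take` kernel.
/// Gathers `values[i]` for every `i` in `indices`.
fn take_values(values: &[i32], indices: &[i32]) -> Vec<i32> {
    indices
        .iter()
        // `as usize` wraps negative numbers: `-1i32 as usize` is `usize::MAX`.
        // The slice index below then panics through the normal bounds check,
        // so no dedicated negative-index branch is needed.
        .map(|&i| values[i as usize])
        .collect()
}
```

With this sketch, `take_values(&[10, 20, 30], &[2, 0])` gathers the third and first elements, while an index of `-1` panics with an out-of-bounds error instead of returning an `Err`.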

tustvold · 2023-06-13T11:14:05Z

arrow-buffer/src/buffer/boolean.rs

@@ -128,6 +128,7 @@ impl BooleanBuffer {
    /// # Panics
    ///
    /// Panics if `i >= self.len()`
+    #[inline]


This is vital for the performance of take_bits

tustvold · 2023-06-13T11:14:29Z

arrow-select/src/take.rs

-    let output_slice = output_buffer.as_slice_mut();
-
-    let indices_has_nulls = indices.null_count() > 0;
+#[inline(never)]


I had issues with LLVM inlining this and then not optimising it correctly; this just forces it to not be stupid

tustvold · 2023-06-13T11:14:52Z

arrow-select/src/take.rs

@@ -2155,4 +1977,19 @@ mod tests {
            UInt32Array::from(vec![9, 10, 11, 6, 7, 8, 3, 4, 5, 6, 7, 8, 0, 1, 2])
        );
    }
+
+    #[test]
+    fn test_take_null_indices() {


Test for #4404

tustvold · 2023-06-13T11:15:46Z

arrow-buffer/src/buffer/scalar.rs

@@ -140,6 +140,12 @@ impl<T: ArrowNativeType> From<Vec<T>> for ScalarBuffer<T> {
    }
 }

+impl<T: ArrowNativeType> FromIterator<T> for ScalarBuffer<T> {
+    fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> Self {
+        iter.into_iter().collect::<Vec<_>>().into()


An important thing to note is that Vec: FromIterator has a specialization for TrustedLen iterators, such as those from slices. This means we no longer need Buffer::try_from_trusted_len_iter here.
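A minimal sketch of the same delegation pattern, using only the standard library. `TypedBuffer` and `gather` are hypothetical names for illustration and are not arrow-rs types. The point is that collecting through `Vec` lets the standard library size the allocation up front when the iterator reports an exact length, as slice iterators and maps over them do.

```rust
use std::iter::FromIterator;

/// Hypothetical stand-in for a typed buffer; not arrow-rs's `ScalarBuffer`.
#[derive(Debug, PartialEq)]
struct TypedBuffer<T>(Vec<T>);

impl<T> From<Vec<T>> for TypedBuffer<T> {
    fn from(v: Vec<T>) -> Self {
        TypedBuffer(v)
    }
}

impl<T> FromIterator<T> for TypedBuffer<T> {
    fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> Self {
        // Delegate to `Vec`'s `FromIterator`, then convert. Any optimisation
        // `Vec` applies for exact-length iterators is inherited for free.
        iter.into_iter().collect::<Vec<_>>().into()
    }
}

/// Hypothetical gather: a map over a slice iterator, collected directly.
fn gather(values: &[i32], indices: &[u32]) -> TypedBuffer<i32> {
    indices.iter().map(|&i| values[i as usize]).collect()
}
```

Callers can then write `.collect()` directly into the buffer type, with no `unsafe` or trusted-length constructor at the call site.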

Dandandan · 2023-06-13T13:16:04Z

arrow-select/src/take.rs

+            }
+        }),
+        None => indices.values().iter().enumerate().for_each(|(i, index)| {
+            if values.value(index.as_usize()) {


Wasn't there a faster method to create a bitmap rather than doing set_bit in sequence?

In theory BooleanBuffer::collect_bool should be faster, but in this case it turned out to be slower for some reason, likely something to do with what LLVM is doing with the bounds checks

Yeah collect_bool

ok, thanks @tustvold
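For readers unfamiliar with the two approaches being compared, here is a rough standard-library sketch. Both functions are hypothetical illustrations, not the arrow-rs implementations, and the sketch assumes the chunked approach packs 64 results into a `u64` before writing them out. Which one is faster in practice depends on what the optimiser does with the closure, as the discussion above shows.

```rust
/// Sets bits one at a time: one read-modify-write of a byte per set bit.
fn pack_set_bit(len: usize, f: impl Fn(usize) -> bool) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        if f(i) {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}

/// Accumulates up to 64 results in a register, then writes the word once.
fn pack_chunked(len: usize, f: impl Fn(usize) -> bool) -> Vec<u8> {
    let chunks = (len + 63) / 64;
    let mut out = Vec::with_capacity(chunks * 8);
    for chunk in 0..chunks {
        let start = chunk * 64;
        let end = (start + 64).min(len);
        let mut packed = 0u64;
        for i in start..end {
            packed |= (f(i) as u64) << (i - start);
        }
        // Little-endian bytes keep bit `i` at byte `i / 8`, position `i % 8`.
        out.extend_from_slice(&packed.to_le_bytes());
    }
    // Drop the padding bytes of the final partial word.
    out.truncate((len + 7) / 8);
    out
}
```

Both produce the same LSB-first bitmap; they differ only in how often memory is touched.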

alamb · 2023-06-13T13:29:33Z

arrow-buffer/src/buffer/scalar.rs

@@ -140,6 +140,12 @@ impl<T: ArrowNativeType> From<Vec<T>> for ScalarBuffer<T> {
    }
 }

+impl<T: ArrowNativeType> FromIterator<T> for ScalarBuffer<T> {
+    fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> Self {
+        iter.into_iter().collect::<Vec<_>>().into()


Can you please add this (very interesting) information as a comment inline?

alamb · 2023-06-13T13:31:48Z

arrow-select/src/take.rs

-    let output_slice = output_buffer.as_slice_mut();
-
-    let indices_has_nulls = indices.null_count() > 0;
+#[inline(never)]


Likewise this may be worth a comment in the code as well

alamb · 2023-06-13T13:34:32Z

arrow-select/src/take.rs

+    values: Option<&NullBuffer>,
+    indices: &PrimitiveArray<I>,
+) -> Option<NullBuffer> {
+    match values.filter(|n| n.null_count() > 0) {


it is certainly neat to see nice Rust code like this and then know rustc / LLVM did the right thing to make it fast
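A small sketch of why the `Option::filter` line reads so cleanly. This is not the code from this PR: `Validity` is a hypothetical `Vec<bool>` mask rather than arrow-rs's `NullBuffer`, and the match arms are an assumption about how output validity could be derived. The filter collapses "no null buffer" and "null buffer with zero nulls" into a single `None` case.

```rust
/// Hypothetical validity mask (`true` = valid); not arrow-rs's `NullBuffer`.
struct Validity(Vec<bool>);

impl Validity {
    fn null_count(&self) -> usize {
        self.0.iter().filter(|v| !**v).count()
    }
}

/// Sketch: validity of the output of a take, given the values' validity and
/// nullable indices (`None` = null index).
fn take_validity(values: Option<&Validity>, indices: &[Option<usize>]) -> Option<Vec<bool>> {
    match values.filter(|n| n.null_count() > 0) {
        // Values contain nulls: a slot is valid only if the index is valid
        // and the value it points to is valid.
        Some(n) => Some(indices.iter().map(|i| i.map_or(false, |i| n.0[i])).collect()),
        // No value nulls, but some null indices: mirror the index validity.
        None if indices.iter().any(|i| i.is_none()) => {
            Some(indices.iter().map(|i| i.is_some()).collect())
        }
        // Nothing is null, so no mask is needed at all.
        None => None,
    }
}
```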

Improve take primitive performance (apache#4404)

a60f9c8

github-actions bot added the arrow label (Changes to the arrow crate) Jun 13, 2023

tustvold commented Jun 13, 2023

Remove unnecessary trait bounds

f581339

Dandandan reviewed Jun 13, 2023

alamb changed the title ~~Improve take primitive performance (#4404)~~ Improve take kernel performance on primitive arrays, fix bad null index handling (#4404) Jun 13, 2023

Dandandan approved these changes Jun 13, 2023

tustvold mentioned this pull request Jun 13, 2023

Remove Binary Dictionary Arithmetic Support #4407

Merged

tustvold merged commit 700bd33 into apache:master Jun 13, 2023

alamb reviewed Jun 13, 2023

tustvold mentioned this pull request Jun 14, 2023

Faster PrimitiveArray::from_iter_values #4413

Closed

alamb mentioned this pull request Jun 16, 2023

Take Kernel Handles Nullable Indices Incorrectly #4404

Closed
