Improve take kernel performance on primitive arrays, fix bad null index handling (#4404) #4405

Merged
tustvold merged 2 commits into apache:master on Jun 13, 2023

Conversation

@tustvold (Contributor) commented Jun 13, 2023

Which issue does this PR close?

Closes #4404

Rationale for this change

Fixes #4404 and significantly improves performance: the benchmarks below improve by between roughly 3% and 52%.

Using the benchmarks in #4403

take i32 512            time:   [259.96 ns 260.19 ns 260.43 ns]
                        change: [-33.771% -33.654% -33.538%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

take i32 1024           time:   [411.64 ns 411.83 ns 412.05 ns]
                        change: [-31.098% -31.057% -31.012%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

take i32 null indices 1024
                        time:   [556.36 ns 556.65 ns 556.94 ns]
                        change: [-14.529% -14.467% -14.407%] (p = 0.00 < 0.05)
                        Performance has improved.

take i32 null values 1024
                        time:   [1.3158 µs 1.3167 µs 1.3176 µs]
                        change: [-41.607% -41.530% -41.446%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

take i32 null values null indices 1024
                        time:   [1.6743 µs 1.6761 µs 1.6775 µs]
                        change: [-43.953% -43.710% -43.467%] (p = 0.00 < 0.05)
                        Performance has improved.

take check bounds i32 512
                        time:   [381.76 ns 381.99 ns 382.22 ns]
                        change: [-25.746% -25.670% -25.599%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

take check bounds i32 1024
                        time:   [662.78 ns 663.02 ns 663.26 ns]
                        change: [-22.074% -21.942% -21.848%] (p = 0.00 < 0.05)
                        Performance has improved.

take bool 512           time:   [525.34 ns 525.72 ns 526.11 ns]
                        change: [-6.8248% -6.5847% -6.3665%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low severe
  1 (1.00%) high mild
  1 (1.00%) high severe

take bool 1024          time:   [885.39 ns 886.30 ns 887.29 ns]
                        change: [-12.935% -12.734% -12.536%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

take bool null indices 1024
                        time:   [1.0396 µs 1.0401 µs 1.0406 µs]
                        change: [-51.963% -51.923% -51.885%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

take bool null values 1024
                        time:   [1.7840 µs 1.7853 µs 1.7871 µs]
                        change: [-2.8921% -2.7540% -2.6122%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

take bool null values null indices 1024
                        time:   [2.1861 µs 2.1878 µs 2.1897 µs]
                        change: [-45.707% -45.587% -45.456%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)

What changes are included in this PR?

Are there any user-facing changes?

Previously, a negative index would return an error even when TakeOptions::check_bounds was false. The code now consistently panics on out-of-bounds indices, regardless of whether the out-of-bounds value results from the wrapping conversion of a negative number. This yields a non-trivial speedup, as the additional branch seemed to cause LLVM some issues.
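
A minimal sketch of the behaviour described above, using plain slices rather than Arrow arrays. The function name `take_wrapping` is hypothetical and is not part of arrow-rs; it only illustrates how a negative index wraps to a very large `usize` and then fails the ordinary slice bounds check, with no separate branch for negative values.

```rust
/// Hypothetical stand-in for a `take` without up-front bounds validation,
/// written over plain slices (not the arrow-rs implementation).
///
/// Each index is converted with a wrapping `as usize` cast, so a negative
/// index such as -1 becomes a very large value and the normal slice bounds
/// check panics. No dedicated "is it negative?" branch is needed.
pub fn take_wrapping(values: &[i32], indices: &[i32]) -> Vec<i32> {
    indices.iter().map(|&i| values[i as usize]).collect()
}
```

With this shape, `take_wrapping(&[10, 20, 30], &[2, 0, 1])` returns the selected values, while an index of `-1` panics in the same way as any other out-of-bounds index.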

@github-actions bot added the "arrow" label (Changes to the arrow crate) on Jun 13, 2023
@@ -128,6 +128,7 @@ impl BooleanBuffer {
/// # Panics
///
/// Panics if `i >= self.len()`
#[inline]
Contributor Author

This is vital for the performance of take_bits
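
To illustrate why an `#[inline]` hint on a tiny accessor can matter, here is a sketch of a bit-packed boolean buffer. `BitBuffer` and its methods are hypothetical simplifications, not the arrow-rs `BooleanBuffer`; the point is only that `value` is a few instructions long and is called once per element in a take loop, so a real function call per bit would dominate the cost.

```rust
/// Hypothetical bit-packed boolean buffer (a simplified stand-in, not the
/// arrow-rs `BooleanBuffer`).
pub struct BitBuffer {
    bytes: Vec<u8>,
    len: usize,
}

impl BitBuffer {
    /// Packs `bools` least-significant-bit first, eight per byte.
    pub fn from_bools(bools: &[bool]) -> Self {
        let mut bytes = vec![0u8; (bools.len() + 7) / 8];
        for (i, &b) in bools.iter().enumerate() {
            if b {
                bytes[i / 8] |= 1 << (i % 8);
            }
        }
        Self { bytes, len: bools.len() }
    }

    /// Returns the bit at position `i`.
    ///
    /// # Panics
    ///
    /// Panics if `i >= self.len`
    #[inline] // makes the body available for inlining into callers' loops, including from other crates
    pub fn value(&self, i: usize) -> bool {
        assert!(i < self.len, "index {} out of bounds for length {}", i, self.len);
        (self.bytes[i / 8] >> (i % 8)) & 1 == 1
    }
}
```

Whether the hint changes the generated code depends on the compiler version and build settings, so it is worth confirming with a benchmark, as was done in this PR.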

let output_slice = output_buffer.as_slice_mut();

let indices_has_nulls = indices.null_count() > 0;
#[inline(never)]
Contributor Author

I had issues with LLVM inlining this and then not optimising it correctly; the attribute just forces it not to inline.

Contributor

Likewise, this may be worth a comment in the code as well.
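
As a rough illustration of the pattern under discussion, the sketch below keeps one specialised loop out of line with `#[inline(never)]` and documents the reason next to the attribute, as the reviewer suggests. The function `gather` and its nested helper are hypothetical and work on plain slices; they are not the code from this PR.

```rust
/// Hypothetical sketch (not the arrow-rs code): gathers `values[index]` for
/// each index, choosing a loop based on whether a validity mask is present.
pub fn gather(values: &[i32], indices: &[usize], index_valid: Option<&[bool]>) -> Vec<i32> {
    // Kept out of line so the optimizer compiles this loop as its own unit
    // instead of merging it into the caller. Whether this helps depends on
    // the workload and compiler version, so measure before relying on it.
    #[inline(never)]
    fn gather_no_nulls(values: &[i32], indices: &[usize]) -> Vec<i32> {
        indices.iter().map(|&i| values[i]).collect()
    }

    match index_valid {
        // Null index slots produce a default value and are never dereferenced.
        Some(valid) => indices
            .iter()
            .zip(valid)
            .map(|(&i, &ok)| if ok { values[i] } else { 0 })
            .collect(),
        None => gather_no_nulls(values, indices),
    }
}
```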

@@ -2155,4 +1977,19 @@ mod tests {
UInt32Array::from(vec![9, 10, 11, 6, 7, 8, 3, 4, 5, 6, 7, 8, 0, 1, 2])
);
}

#[test]
fn test_take_null_indices() {
Contributor Author

Test for #4404
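
The body of the test is not shown in this excerpt. As a sketch of the property such a test would plausibly check, assuming the requirement is that a null index yields a null output regardless of the value stored in the null slot, here is a plain-slice version. `take_nullable` and its parameters are hypothetical names, not the arrow-rs API.

```rust
/// Hypothetical sketch of taking with nullable indices over plain slices
/// (not the arrow-rs kernel). `index_valid[i] == false` marks slot `i` of
/// `indices` as null; the value stored in a null slot is arbitrary.
pub fn take_nullable(values: &[i32], indices: &[u32], index_valid: &[bool]) -> Vec<Option<i32>> {
    assert_eq!(indices.len(), index_valid.len());
    indices
        .iter()
        .zip(index_valid)
        .map(|(&index, &valid)| match values.get(index as usize) {
            // In bounds: the output is null only if the index itself is null.
            Some(&v) => {
                if valid {
                    Some(v)
                } else {
                    None
                }
            }
            // Out of bounds is acceptable for a null slot, whose value is meaningless.
            None if !valid => None,
            // A valid index that is out of bounds is a caller error.
            None => panic!("Out-of-bounds index {}", index),
        })
        .collect()
}
```

The lookup is attempted first and the validity mask is only consulted to decide what the result means, so the common in-bounds path stays a single `get` per element.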

@@ -140,6 +140,12 @@ impl<T: ArrowNativeType> From<Vec<T>> for ScalarBuffer<T> {
}
}

impl<T: ArrowNativeType> FromIterator<T> for ScalarBuffer<T> {
fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> Self {
iter.into_iter().collect::<Vec<_>>().into()
Contributor Author

An important thing to note is that Vec: FromIterator has a specialization for TrustedLen iterators, such as those from slices. This means we don't need Buffer::try_from_trusted_len_iter here.

Contributor

Can you please add this (very interesting) information as a comment inline?
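
A sketch of what such an inline comment could look like, using a hypothetical `TypedBuffer` newtype over `Vec<T>` in place of the real `ScalarBuffer`. The delegation pattern is the same as in the diff above; the type itself is an assumption made to keep the example self-contained.

```rust
/// Hypothetical newtype over `Vec<T>` standing in for a typed buffer
/// (not the arrow-rs `ScalarBuffer`).
#[derive(Debug, PartialEq)]
pub struct TypedBuffer<T>(pub Vec<T>);

impl<T> From<Vec<T>> for TypedBuffer<T> {
    fn from(values: Vec<T>) -> Self {
        TypedBuffer(values)
    }
}

impl<T> FromIterator<T> for TypedBuffer<T> {
    fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> Self {
        // Delegating to `Vec`'s `FromIterator` lets the standard library pick
        // its own fast paths (for example for iterators with a trusted exact
        // length, such as those derived from slices), so no dedicated
        // trusted-length constructor is needed here.
        iter.into_iter().collect::<Vec<_>>().into()
    }
}
```

Callers can then write `let buf: TypedBuffer<i32> = slice.iter().map(|v| v * 2).collect();` and get the benefit without any unsafe code at the call site.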

}
}),
None => indices.values().iter().enumerate().for_each(|(i, index)| {
if values.value(index.as_usize()) {
Contributor

Isn't there a faster way to create a bitmap than calling set_bit in sequence?

Contributor Author

In theory BooleanBuffer::collect_bool should be faster, but in this case it turned out to be slower for some reason, likely something to do with what LLVM is doing with the bounds checks.

Contributor

Yeah collect_bool

Contributor

ok, thanks @tustvold
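
For readers unfamiliar with the two approaches compared in this thread, here is a sketch of both over a plain `Vec<u8>` bitmap. `pack_set_bit` and `pack_chunked` are hypothetical reimplementations of the general ideas (bit-at-a-time writes versus accumulating 64 bits before storing), not the arrow-rs functions, and the sketch makes no claim about which is faster in a given kernel.

```rust
/// Sets bit `i` in a least-significant-bit-first packed bitmap.
fn set_bit(bytes: &mut [u8], i: usize) {
    bytes[i / 8] |= 1 << (i % 8);
}

/// Approach 1: start from zeroed bytes and set bits one at a time.
pub fn pack_set_bit(len: usize, f: impl Fn(usize) -> bool) -> Vec<u8> {
    let mut bytes = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        if f(i) {
            set_bit(&mut bytes, i);
        }
    }
    bytes
}

/// Approach 2: accumulate up to 64 bits in a `u64`, then write them out at
/// once (the general idea behind a `collect_bool`-style builder).
pub fn pack_chunked(len: usize, f: impl Fn(usize) -> bool) -> Vec<u8> {
    let mut bytes = Vec::with_capacity((len + 63) / 64 * 8);
    for chunk_start in (0..len).step_by(64) {
        let mut packed = 0u64;
        for bit in 0..(len - chunk_start).min(64) {
            packed |= (f(chunk_start + bit) as u64) << bit;
        }
        // Little-endian bytes keep bit `k` of the chunk at byte `k / 8`, bit `k % 8`.
        bytes.extend_from_slice(&packed.to_le_bytes());
    }
    // Drop the padding bytes of the final partial chunk.
    bytes.truncate((len + 7) / 8);
    bytes
}
```

Both functions produce the same bitmap for the same predicate; which one is faster in a given loop depends on how the compiler handles the predicate and its bounds checks, which is why the choice here was settled by benchmarking.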

@alamb changed the title from "Improve take primitive performance (#4404)" to "Improve take kernel performance on primitive arrays, fix bad null index handling (#4404)" on Jun 13, 2023
@tustvold tustvold merged commit 700bd33 into apache:master Jun 13, 2023
values: Option<&NullBuffer>,
indices: &PrimitiveArray<I>,
) -> Option<NullBuffer> {
match values.filter(|n| n.null_count() > 0) {
Contributor

it is certainly neat to see nice Rust code like this and then know rustc / LLVM did the right thing to make it fast
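
The `values.filter(|n| n.null_count() > 0)` line relies on `Option::filter` to treat "a null mask is present but contains no nulls" the same as "no null mask at all". A self-contained sketch of that idiom follows; `NullMask` and `take_validity` are hypothetical simplifications, not the arrow-rs `NullBuffer` or the function from this PR, and null indices are ignored to keep the example short.

```rust
/// Hypothetical validity mask (a simplified stand-in, not the arrow-rs `NullBuffer`).
pub struct NullMask {
    valid: Vec<bool>,
}

impl NullMask {
    pub fn new(valid: Vec<bool>) -> Self {
        Self { valid }
    }

    /// Number of null (invalid) slots.
    pub fn null_count(&self) -> usize {
        self.valid.iter().filter(|v| !**v).count()
    }

    pub fn is_valid(&self, i: usize) -> bool {
        self.valid[i]
    }
}

/// Computes the output validity of a take; `None` means "no nulls".
pub fn take_validity(values: Option<&NullMask>, indices: &[usize]) -> Option<Vec<bool>> {
    // `Option::filter` folds "mask present but all valid" into the same
    // `None` arm as "no mask", so only two cases remain to handle.
    match values.filter(|n| n.null_count() > 0) {
        Some(n) => Some(indices.iter().map(|&i| n.is_valid(i)).collect()),
        None => None,
    }
}
```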

Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Take Kernel Handles Nullable Indices Incorrectly
3 participants