From 773a0b0792e7d9023f27b9f9259e9b91eda7b606 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 13 Oct 2024 09:01:55 -0400 Subject: [PATCH 1/8] Minor: Document SIMD rationale and tips --- arrow/CONTRIBUTING.md | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 0c795d6b9cb..64ef87bf04f 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -109,6 +109,36 @@ specific JIRA issues and reference them in these code comments. For example: // This is not sound because .... see https://issues.apache.org/jira/browse/ARROW-nnnnn ``` +### Usage if SIMD / Auto vectorization + +This create does not use SIMD intrinsics (e.g. [`std::simd`] directly, but +instead relies on LLVM's auto-vectorization. + +SIMD intrinsics are difficult to maintain and can be difficult to reason about. +The auto-vectorizer in LLVM is quite good and often produces better code than +hand-written manual uses of SIMD. In fact, this crate used to to have a fair +amount of manual SIMD, and over time we've removed it as the auto-vectorized +code was faster. + +[`std::simd`]: https://doc.rust-lang.org/std/simd/index.html + +LLVM is relatively good at vectorizing vertical operations provided: + +1. No conditionals within the loop body +2. Not too much inlining , as the vectorizer gives up if the code is too complex +3. No bitwise horizontal reductions or masking +4. You've enabled SIMD instructions in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag) + +The last point is especially important as the default `target-cpu` doesn't +support many SIMD instructions. See the Performance Tips section at the +end of + +To ensure your code is fully vectorized, we recommend getting familiar with +tools like (again being sure to set `RUSTFLAGS`) and +only once you've exhausted that avenue think of reaching for manual SIMD. 
+Generally the hard part is getting the algorithm structured in such a way that +it can be vectorized, regardless of what goes and generates those instructions. + # Releases and publishing to crates.io Please see the [release](../dev/release/README.md) for details on how to create arrow releases From aefbd7f392638c6731b9beacbebe50356f01a0e4 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 06:29:00 -0400 Subject: [PATCH 2/8] Apply suggestions from code review Co-authored-by: Ed Seidl Co-authored-by: Piotr Findeisen --- arrow/CONTRIBUTING.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 64ef87bf04f..93444f3f9fb 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -111,7 +111,7 @@ specific JIRA issues and reference them in these code comments. For example: ### Usage if SIMD / Auto vectorization -This create does not use SIMD intrinsics (e.g. [`std::simd`] directly, but +This crate does not use SIMD intrinsics (e.g. [`std::simd`] directly, but instead relies on LLVM's auto-vectorization. SIMD intrinsics are difficult to maintain and can be difficult to reason about. @@ -133,11 +133,11 @@ The last point is especially important as the default `target-cpu` doesn't support many SIMD instructions. See the Performance Tips section at the end of -To ensure your code is fully vectorized, we recommend getting familiar with +To ensure your code is fully vectorized, we recommend becoming familiar with tools like (again being sure to set `RUSTFLAGS`) and only once you've exhausted that avenue think of reaching for manual SIMD. Generally the hard part is getting the algorithm structured in such a way that -it can be vectorized, regardless of what goes and generates those instructions. +it can be vectorized, regardless of what generates those instructions. 
# Releases and publishing to crates.io From 881d2cde41ca65e1a20b0a98807447795151a0cb Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 06:42:36 -0400 Subject: [PATCH 3/8] More review feedback --- arrow/CONTRIBUTING.md | 34 ++++++++++++++++++++-------------- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 93444f3f9fb..9bbdd953d4f 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -112,32 +112,38 @@ specific JIRA issues and reference them in these code comments. For example: ### Usage if SIMD / Auto vectorization This crate does not use SIMD intrinsics (e.g. [`std::simd`] directly, but -instead relies on LLVM's auto-vectorization. +instead relies on the Rust compiler's auto-vectorization capabilities (which are +built on LLVM). SIMD intrinsics are difficult to maintain and can be difficult to reason about. -The auto-vectorizer in LLVM is quite good and often produces better code than -hand-written manual uses of SIMD. In fact, this crate used to to have a fair -amount of manual SIMD, and over time we've removed it as the auto-vectorized -code was faster. +The auto-vectorizer in LLVM is quite good and often produces faster code than +using hand-written SIMD intrinsics. In fact, this crate used to contain several +kenels that used hand-written SIMD instructions, which were removed after +discovering the auto-vectorized code was faster. [`std::simd`]: https://doc.rust-lang.org/std/simd/index.html +#### Tips for auto-vectorization + LLVM is relatively good at vectorizing vertical operations provided: -1. No conditionals within the loop body -2. Not too much inlining , as the vectorizer gives up if the code is too complex -3. No bitwise horizontal reductions or masking -4. You've enabled SIMD instructions in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag) +1. No conditionals within the loop body (e.g no checking for nulls on each row) +2. 
Not too much inlining, as the vectorizer gives up if the code is too complex +3. No [horizontal reductions] or data dependencies +4. Suitable SIMD instructions available in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag) + +[horizontal reductions]: https://rust-lang.github.io/packed_simd/perf-guide/vert-hor-ops.html The last point is especially important as the default `target-cpu` doesn't support many SIMD instructions. See the Performance Tips section at the end of -To ensure your code is fully vectorized, we recommend becoming familiar with -tools like (again being sure to set `RUSTFLAGS`) and -only once you've exhausted that avenue think of reaching for manual SIMD. -Generally the hard part is getting the algorithm structured in such a way that -it can be vectorized, regardless of what generates those instructions. +To ensure your code is fully vectorized, we recommend using tools like + (again being sure `RUSTFLAGS` is set appropriately) +to analyze the resulting code, and only once you've exhausted auto vectorization +think of reaching for manual SIMD. Generally the hard part of vectorizing code +is structuring the algorithm in such a way that it can be vectorized, regardless +of what generates those instructions. # Releases and publishing to crates.io From 853987651fc58c15a52d70df4bf8a99b8677b61c Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 06:46:03 -0400 Subject: [PATCH 4/8] tweak --- arrow/CONTRIBUTING.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 9bbdd953d4f..591e98ff57f 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -109,26 +109,26 @@ specific JIRA issues and reference them in these code comments. For example: // This is not sound because .... 
see https://issues.apache.org/jira/browse/ARROW-nnnnn ``` -### Usage if SIMD / Auto vectorization +### Usage if SIMD / auto vectorization -This crate does not use SIMD intrinsics (e.g. [`std::simd`] directly, but -instead relies on the Rust compiler's auto-vectorization capabilities (which are -built on LLVM). +This crate does not use SIMD intrinsics (e.g. [`std::simd`]) directly, but +instead relies on the Rust compiler's auto-vectorization capabilities, which are +built on LLVM. SIMD intrinsics are difficult to maintain and can be difficult to reason about. -The auto-vectorizer in LLVM is quite good and often produces faster code than -using hand-written SIMD intrinsics. In fact, this crate used to contain several -kenels that used hand-written SIMD instructions, which were removed after +The auto-vectorizer in LLVM is quite good and often produces kernels that are +faster than using hand-written SIMD intrinsics. This crate used to contain +several kernels with hand-written SIMD instructions, which were removed after discovering the auto-vectorized code was faster. [`std::simd`]: https://doc.rust-lang.org/std/simd/index.html -#### Tips for auto-vectorization +#### Tips for auto vectorization LLVM is relatively good at vectorizing vertical operations provided: 1. No conditionals within the loop body (e.g no checking for nulls on each row) -2. Not too much inlining, as the vectorizer gives up if the code is too complex +2. Not too much inlining (as the vectorizer gives up if the code is too complex) 3. No [horizontal reductions] or data dependencies 4. Suitable SIMD instructions available in the target ISA (e.g. 
`target-cpu` `RUSTFLAGS` flag) From b442d52dddc60a562686b627971afa4220886853 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 06:48:55 -0400 Subject: [PATCH 5/8] Update arrow/CONTRIBUTING.md --- arrow/CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 591e98ff57f..e4e7f34c42d 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -128,7 +128,7 @@ discovering the auto-vectorized code was faster. LLVM is relatively good at vectorizing vertical operations provided: 1. No conditionals within the loop body (e.g no checking for nulls on each row) -2. Not too much inlining (as the vectorizer gives up if the code is too complex) +2. Not too much `#[inline]` (as the vectorizer gives up if the code is too complex) 3. No [horizontal reductions] or data dependencies 4. Suitable SIMD instructions available in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag) From cb5627e7d7eb9f10c3003f95f6e071326f3f5360 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 06:49:54 -0400 Subject: [PATCH 6/8] Update arrow/CONTRIBUTING.md --- arrow/CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index e4e7f34c42d..c1ad9943bee 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -109,7 +109,7 @@ specific JIRA issues and reference them in these code comments. For example: // This is not sound because .... see https://issues.apache.org/jira/browse/ARROW-nnnnn ``` -### Usage if SIMD / auto vectorization +### Usage of SIMD / auto vectorization This crate does not use SIMD intrinsics (e.g. 
[`std::simd`]) directly, but instead relies on the Rust compiler's auto-vectorization capabilities, which are From b32679a7c9907159ad08f6cb37dc0346120c611f Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 08:59:48 -0400 Subject: [PATCH 7/8] clarify inlining more --- arrow/CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index c1ad9943bee..5a0a123b47d 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -128,7 +128,7 @@ discovering the auto-vectorized code was faster. LLVM is relatively good at vectorizing vertical operations provided: 1. No conditionals within the loop body (e.g no checking for nulls on each row) -2. Not too much `#[inline]` (as the vectorizer gives up if the code is too complex) +2. Not too much inlining (judicious use of #[inline] and #[inline(never)]) as the vectorizer gives up if the code is too complex 3. No [horizontal reductions] or data dependencies 4. Suitable SIMD instructions available in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag) From 1b5b7b03d239656b6ef1b3015c2056506edb5b2d Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 09:00:04 -0400 Subject: [PATCH 8/8] formating --- arrow/CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 5a0a123b47d..a9a9426a42a 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -128,7 +128,7 @@ discovering the auto-vectorized code was faster. LLVM is relatively good at vectorizing vertical operations provided: 1. No conditionals within the loop body (e.g no checking for nulls on each row) -2. Not too much inlining (judicious use of #[inline] and #[inline(never)]) as the vectorizer gives up if the code is too complex +2. Not too much inlining (judicious use of `#[inline]` and `#[inline(never)]`) as the vectorizer gives up if the code is too complex 3. 
No [horizontal reductions] or data dependencies 4. Suitable SIMD instructions available in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag)
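

---

The guidance these patches add — keep the loop body branch-free, avoid horizontal reductions, and enable wider SIMD via `RUSTFLAGS` — can be illustrated with a minimal sketch. This example is not part of the patches; the function name is ours, and it is only one possible shape of a "vertical" kernel that LLVM's auto-vectorizer handles well:

```rust
// Illustrative sketch only (not from the patch series above): a
// branch-free "vertical" kernel of the kind the guidance describes.
// Build with something like
//   RUSTFLAGS="-C target-cpu=native" cargo build --release
// so the auto-vectorizer can use SIMD instructions beyond the
// conservative default target ISA.

/// Element-wise addition with no conditionals in the loop body.
/// Iterating with `zip` lets the compiler elide bounds checks,
/// leaving a straight-line loop that LLVM can auto-vectorize.
fn add_vertical(a: &[i32], b: &[i32], out: &mut [i32]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

fn main() {
    let a: Vec<i32> = (0..1024).collect();
    let b = vec![1i32; 1024];
    let mut out = vec![0i32; 1024];
    add_vertical(&a, &b, &mut out);
    assert_eq!(out[0], 1);
    assert_eq!(out[1023], 1024);
    println!("ok");
}
```

Whether the loop actually vectorizes still has to be confirmed by inspecting the generated assembly (with the same `RUSTFLAGS` you intend to ship with); a bounds check, an early `return`, or a data dependency between iterations is enough to make LLVM fall back to scalar code.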