From 773a0b0792e7d9023f27b9f9259e9b91eda7b606 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 13 Oct 2024 09:01:55 -0400 Subject: [PATCH 1/8] Minor: Document SIMD rationale and tips --- arrow/CONTRIBUTING.md | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 0c795d6b9cb..64ef87bf04f 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -109,6 +109,36 @@ specific JIRA issues and reference them in these code comments. For example: // This is not sound because .... see https://issues.apache.org/jira/browse/ARROW-nnnnn ``` +### Usage if SIMD / Auto vectorization + +This create does not use SIMD intrinsics (e.g. [`std::simd`] directly, but +instead relies on LLVM's auto-vectorization. + +SIMD intrinsics are difficult to maintain and can be difficult to reason about. +The auto-vectorizer in LLVM is quite good and often produces better code than +hand-written manual uses of SIMD. In fact, this crate used to to have a fair +amount of manual SIMD, and over time we've removed it as the auto-vectorized +code was faster. + +[`std::simd`]: https://doc.rust-lang.org/std/simd/index.html + +LLVM is relatively good at vectorizing vertical operations provided: + +1. No conditionals within the loop body +2. Not too much inlining , as the vectorizer gives up if the code is too complex +3. No bitwise horizontal reductions or masking +4. You've enabled SIMD instructions in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag) + +The last point is especially important as the default `target-cpu` doesn't +support many SIMD instructions. See the Performance Tips section at the +end of + +To ensure your code is fully vectorized, we recommend getting familiar with +tools like (again being sure to set `RUSTFLAGS`) and +only once you've exhausted that avenue think of reaching for manual SIMD. 
+Generally the hard part is getting the algorithm structured in such a way that +it can be vectorized, regardless of what goes and generates those instructions. + # Releases and publishing to crates.io Please see the [release](../dev/release/README.md) for details on how to create arrow releases From aefbd7f392638c6731b9beacbebe50356f01a0e4 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 06:29:00 -0400 Subject: [PATCH 2/8] Apply suggestions from code review Co-authored-by: Ed Seidl Co-authored-by: Piotr Findeisen --- arrow/CONTRIBUTING.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 64ef87bf04f..93444f3f9fb 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -111,7 +111,7 @@ specific JIRA issues and reference them in these code comments. For example: ### Usage if SIMD / Auto vectorization -This create does not use SIMD intrinsics (e.g. [`std::simd`] directly, but +This crate does not use SIMD intrinsics (e.g. [`std::simd`] directly, but instead relies on LLVM's auto-vectorization. SIMD intrinsics are difficult to maintain and can be difficult to reason about. @@ -133,11 +133,11 @@ The last point is especially important as the default `target-cpu` doesn't support many SIMD instructions. See the Performance Tips section at the end of -To ensure your code is fully vectorized, we recommend getting familiar with +To ensure your code is fully vectorized, we recommend becoming familiar with tools like (again being sure to set `RUSTFLAGS`) and only once you've exhausted that avenue think of reaching for manual SIMD. Generally the hard part is getting the algorithm structured in such a way that -it can be vectorized, regardless of what goes and generates those instructions. +it can be vectorized, regardless of what generates those instructions. 
# Releases and publishing to crates.io From 881d2cde41ca65e1a20b0a98807447795151a0cb Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 06:42:36 -0400 Subject: [PATCH 3/8] More review feedback --- arrow/CONTRIBUTING.md | 34 ++++++++++++++++++++-------------- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 93444f3f9fb..9bbdd953d4f 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -112,32 +112,38 @@ specific JIRA issues and reference them in these code comments. For example: ### Usage if SIMD / Auto vectorization This crate does not use SIMD intrinsics (e.g. [`std::simd`] directly, but -instead relies on LLVM's auto-vectorization. +instead relies on the Rust compiler's auto-vectorization capabilities (which are +built on LLVM). SIMD intrinsics are difficult to maintain and can be difficult to reason about. -The auto-vectorizer in LLVM is quite good and often produces better code than -hand-written manual uses of SIMD. In fact, this crate used to to have a fair -amount of manual SIMD, and over time we've removed it as the auto-vectorized -code was faster. +The auto-vectorizer in LLVM is quite good and often produces faster code than +using hand-written SIMD intrinsics. In fact, this crate used to contain several +kenels that used hand-written SIMD instructions, which were removed after +discovering the auto-vectorized code was faster. [`std::simd`]: https://doc.rust-lang.org/std/simd/index.html +#### Tips for auto-vectorization + LLVM is relatively good at vectorizing vertical operations provided: -1. No conditionals within the loop body -2. Not too much inlining , as the vectorizer gives up if the code is too complex -3. No bitwise horizontal reductions or masking -4. You've enabled SIMD instructions in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag) +1. No conditionals within the loop body (e.g no checking for nulls on each row) +2. 
Not too much inlining, as the vectorizer gives up if the code is too complex +3. No [horizontal reductions] or data dependencies +4. Suitable SIMD instructions available in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag) + +[horizontal reductions]: https://rust-lang.github.io/packed_simd/perf-guide/vert-hor-ops.html The last point is especially important as the default `target-cpu` doesn't support many SIMD instructions. See the Performance Tips section at the end of -To ensure your code is fully vectorized, we recommend becoming familiar with -tools like (again being sure to set `RUSTFLAGS`) and -only once you've exhausted that avenue think of reaching for manual SIMD. -Generally the hard part is getting the algorithm structured in such a way that -it can be vectorized, regardless of what generates those instructions. +To ensure your code is fully vectorized, we recommend using tools like + (again being sure `RUSTFLAGS` is set appropriately) +to analyze the resulting code, and only once you've exhausted auto vectorization +think of reaching for manual SIMD. Generally the hard part of vectorizing code +is structuring the algorithm in such a way that it can be vectorized, regardless +of what generates those instructions. # Releases and publishing to crates.io From 853987651fc58c15a52d70df4bf8a99b8677b61c Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 06:46:03 -0400 Subject: [PATCH 4/8] tweak --- arrow/CONTRIBUTING.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 9bbdd953d4f..591e98ff57f 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -109,26 +109,26 @@ specific JIRA issues and reference them in these code comments. For example: // This is not sound because .... 
see https://issues.apache.org/jira/browse/ARROW-nnnnn ``` -### Usage if SIMD / Auto vectorization +### Usage if SIMD / auto vectorization -This crate does not use SIMD intrinsics (e.g. [`std::simd`] directly, but -instead relies on the Rust compiler's auto-vectorization capabilities (which are -built on LLVM). +This crate does not use SIMD intrinsics (e.g. [`std::simd`]) directly, but +instead relies on the Rust compiler's auto-vectorization capabilities, which are +built on LLVM. SIMD intrinsics are difficult to maintain and can be difficult to reason about. -The auto-vectorizer in LLVM is quite good and often produces faster code than -using hand-written SIMD intrinsics. In fact, this crate used to contain several -kenels that used hand-written SIMD instructions, which were removed after +The auto-vectorizer in LLVM is quite good and often produces kernels that are +faster than using hand-written SIMD intrinsics. This crate used to contain +several kernels with hand-written SIMD instructions, which were removed after discovering the auto-vectorized code was faster. [`std::simd`]: https://doc.rust-lang.org/std/simd/index.html -#### Tips for auto-vectorization +#### Tips for auto vectorization LLVM is relatively good at vectorizing vertical operations provided: 1. No conditionals within the loop body (e.g no checking for nulls on each row) -2. Not too much inlining, as the vectorizer gives up if the code is too complex +2. Not too much inlining (as the vectorizer gives up if the code is too complex) 3. No [horizontal reductions] or data dependencies 4. Suitable SIMD instructions available in the target ISA (e.g. 
`target-cpu` `RUSTFLAGS` flag) From b442d52dddc60a562686b627971afa4220886853 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 06:48:55 -0400 Subject: [PATCH 5/8] Update arrow/CONTRIBUTING.md --- arrow/CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 591e98ff57f..e4e7f34c42d 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -128,7 +128,7 @@ discovering the auto-vectorized code was faster. LLVM is relatively good at vectorizing vertical operations provided: 1. No conditionals within the loop body (e.g no checking for nulls on each row) -2. Not too much inlining (as the vectorizer gives up if the code is too complex) +2. Not too much `#[inline]` (as the vectorizer gives up if the code is too complex) 3. No [horizontal reductions] or data dependencies 4. Suitable SIMD instructions available in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag) From cb5627e7d7eb9f10c3003f95f6e071326f3f5360 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 06:49:54 -0400 Subject: [PATCH 6/8] Update arrow/CONTRIBUTING.md --- arrow/CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index e4e7f34c42d..c1ad9943bee 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -109,7 +109,7 @@ specific JIRA issues and reference them in these code comments. For example: // This is not sound because .... see https://issues.apache.org/jira/browse/ARROW-nnnnn ``` -### Usage if SIMD / auto vectorization +### Usage of SIMD / auto vectorization This crate does not use SIMD intrinsics (e.g. 
[`std::simd`]) directly, but instead relies on the Rust compiler's auto-vectorization capabilities, which are From b32679a7c9907159ad08f6cb37dc0346120c611f Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 08:59:48 -0400 Subject: [PATCH 7/8] clarify inlining more --- arrow/CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index c1ad9943bee..5a0a123b47d 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -128,7 +128,7 @@ discovering the auto-vectorized code was faster. LLVM is relatively good at vectorizing vertical operations provided: 1. No conditionals within the loop body (e.g no checking for nulls on each row) -2. Not too much `#[inline]` (as the vectorizer gives up if the code is too complex) +2. Not too much inlining (judicious use of #[inline] and #[inline(never)]) as the vectorizer gives up if the code is too complex 3. No [horizontal reductions] or data dependencies 4. Suitable SIMD instructions available in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag) From 1b5b7b03d239656b6ef1b3015c2056506edb5b2d Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 16 Oct 2024 09:00:04 -0400 Subject: [PATCH 8/8] formating --- arrow/CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 5a0a123b47d..a9a9426a42a 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -128,7 +128,7 @@ discovering the auto-vectorized code was faster. LLVM is relatively good at vectorizing vertical operations provided: 1. No conditionals within the loop body (e.g no checking for nulls on each row) -2. Not too much inlining (judicious use of #[inline] and #[inline(never)]) as the vectorizer gives up if the code is too complex +2. Not too much inlining (judicious use of `#[inline]` and `#[inline(never)]`) as the vectorizer gives up if the code is too complex 3. 
No [horizontal reductions] or data dependencies 4. Suitable SIMD instructions available in the target ISA (e.g. `target-cpu` `RUSTFLAGS` flag)
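

---

The guidance these patches add — keep the loop body branch-free, avoid horizontal reductions, and enable wider SIMD via `RUSTFLAGS` — can be illustrated with a minimal sketch. This example is not part of the patches; the function name is ours, and it is only one possible shape of a "vertical" kernel that LLVM's auto-vectorizer handles well:

```rust
// Illustrative sketch only (not from the patch series above): a
// branch-free "vertical" kernel of the kind the guidance describes.
// Build with something like
//   RUSTFLAGS="-C target-cpu=native" cargo build --release
// so the auto-vectorizer can use SIMD instructions beyond the
// conservative default target ISA.

/// Element-wise addition with no conditionals in the loop body.
/// Iterating with `zip` lets the compiler elide bounds checks,
/// leaving a straight-line loop that LLVM can auto-vectorize.
fn add_vertical(a: &[i32], b: &[i32], out: &mut [i32]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

fn main() {
    let a: Vec<i32> = (0..1024).collect();
    let b = vec![1i32; 1024];
    let mut out = vec![0i32; 1024];
    add_vertical(&a, &b, &mut out);
    assert_eq!(out[0], 1);
    assert_eq!(out[1023], 1024);
    println!("ok");
}
```

Whether the loop actually vectorizes still has to be confirmed by inspecting the generated assembly (with the same `RUSTFLAGS` you intend to ship with); a bounds check, an early `return`, or a data dependency between iterations is enough to make LLVM fall back to scalar code.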