Use the improved vectorization algorithm for binary and ternary TensorPrimitives operations #93409

tannergooding · 2023-10-12T17:59:38Z

This is a continuation of #93296

ghost · 2023-10-12T17:59:52Z

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

This is a continuation of #93296

Author:	tannergooding
Assignees:	tannergooding
Labels:	`area-System.Numerics`
Milestone:	-

stephentoub · 2023-10-13T19:39:20Z

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs

+        ///     However, actually computing the amount of L3 cache per core can be tricky or error prone. Native memcpy
+        ///     algorithms use a constant threshold that is typically around 256KB and we match that here for simplicity. This
+        ///     threshold accounts for most processors in the last 10-15 years that had approx. 1MB L3 per core and support
+        ///     hyperthreading, giving a per core last level cache of approx. 512KB.


Love the comment. Thanks.
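A minimal sketch of how a constant threshold like the one in that comment might gate the choice of store strategy. The 256KB value comes from the comment; the helper name `ShouldUseNonTemporalStores` and the strict `>` comparison are assumptions made for illustration, not taken from the PR.

```csharp
using System;

static class NonTemporalSketch
{
    // 256KB, the constant threshold described in the quoted comment.
    public const nuint NonTemporalByteThreshold = 256 * 1024;

    // Hypothetical helper: returns true when the buffer is large enough
    // that bypassing the cache with non-temporal stores may pay off.
    // Overflow of the multiplication is ignored in this sketch.
    public static bool ShouldUseNonTemporalStores(nuint elementCount, nuint elementSize)
        => elementCount * elementSize > NonTemporalByteThreshold;
}
```

With these assumptions, 1,024 floats (4KB) stay on the regular store path, while 100,000 floats (about 390KB) cross the threshold.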

stephentoub · 2023-10-13T19:51:52Z

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netcore.cs

+                                vector3.Store(dPtr + (uint)(Vector512<float>.Count * 2));
+                                vector4.Store(dPtr + (uint)(Vector512<float>.Count * 3));
+
+                                // We load, process, and store the next four vectors
+
+                                vector1 = TTernaryOperator.Invoke(Vector512.Load(xPtr + (uint)(Vector512<float>.Count * 4)),
+


Is this better than just letting it be the default case?

We talked this over on Teams; the question was meant for one of the bits of code that handles `case 15: case 14: case 13: ...` all together.

My response was:

Given what the JIT generates right now, maybe. Since it currently inserts the `cmp; ja` anyway, it might be better for those cases to be `default`.

But once that's fixed and it's no longer generating any `cmp; ja`, because it knows everything is definitively in range, it'd probably be worse, since it would force the branch to exist rather than being part of the regular address lookup table.
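To make the shape under discussion concrete, here is a minimal sketch of a remainder switch that uses an explicit `case 0` instead of `default`. It is a scalar stand-in with hypothetical names (`RemainderSketch`, `AddRemainder`), not the PR's code, and it makes no claim about what any particular JIT version emits for it.

```csharp
using System;

static class RemainderSketch
{
    // Adds the final 0-3 elements of x and y into dest.
    // Assumes y and dest are at least as long as x.
    public static void AddRemainder(ReadOnlySpan<float> x, ReadOnlySpan<float> y, Span<float> dest)
    {
        int remainder = x.Length & 3;      // always in [0, 3]
        int start = x.Length - remainder;  // index of the first leftover element

        switch (remainder)
        {
            case 3:
                dest[start + 2] = x[start + 2] + y[start + 2];
                goto case 2;
            case 2:
                dest[start + 1] = x[start + 1] + y[start + 1];
                goto case 1;
            case 1:
                dest[start] = x[start] + y[start];
                goto case 0;
            case 0: // explicit case rather than `default`
                break;
        }
    }
}
```

Because every possible value of `remainder` has its own label, the switch covers a dense, fully enumerated range, which is the situation the reply above describes.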

…rPrimitives operations (dotnet#93409)
* Update InvokeSpanSpanIntoSpan<TBinaryOperator> for TensorPrimitives to use the better SIMD algorithm
* Update InvokeSpanScalarIntoSpan<TTransformOperator, TBinaryOperator> for TensorPrimitives to use the better SIMD algorithm
* Update InvokeSpanSpanSpanIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm
* Update InvokeSpanSpanScalarIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm
* Update InvokeSpanScalarSpanIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm
* Improve codegen slightly by using case 0, rather than default
* Adjust the canAlign check to be later, to reduce branch count for data under the threshold
* Add a comment explaining the NonTemporalByteThreshold
* Make sure xTransformOp.CanVectorize is checked on .NET Standard

* Use FMA in TensorPrimitives (#92205)
* Simplify TensorPrimitive's AbsoluteOperator (#92577)
  Vector{128/256/512} all provide Abs; no need to do this manually.
* Reduce some boilerplate in TensorPrimitive's IBinaryOperator (#92576)
  Change a few of the static abstract interface methods to be virtual, as most implementations throw from these methods; we can consolidate that throwing to the base.
* Minor code cleanup in TensorPrimitives tests (#92575)
* Normalize some test naming
* Alphabetize tests
* Improve mismatched length tests with all positions of the shorter tensor
* Alphabetize methods in TensorPrimitives.cs
* Vectorize TensorPrimitives.Min/Max{Magnitude} (#92618)
* Vectorize TensorPrimitives.Min/Max{Magnitude}
* Use AdvSimd.Max/Min
* Rename some parameters/locals for consistency
* Improve HorizontalAggregate
* Move a few helpers
* Avoid scalar path for returning found NaN
* Update TensorPrimitives aggregations to vectorize handling of remaining elements (#92672)
* Update TensorPrimitives.CosineSimilarity to vectorize handling of remaining elements
* Vectorize remainder handling for Aggregate helpers
* Flesh out TensorPrimitives XML docs (#92749)
* Flesh out TensorPrimitives XML docs
* Address PR feedback
  - Remove use of FusedMultiplyAdd from all but CosineSimilarity
  - Remove comments about platform/OS-specific behavior from Add/AddMultiply/Subtract/Multiply/MultiplyAdd/Divide/Negate
  - Loosen comments about NaN and which exact one is returned
* Address PR feedback
* Vectorize TensorPrimitives.ConvertToHalf (#92715)
* Enable TensorPrimitives to perform in-place operations (#92820)
  Some operations would produce incorrect results if the same span was passed as both an input and an output. When vectorization was employed but the span's length wasn't a perfect multiple of a vector, we'd do the standard trick of performing one last operation on the last vector's worth of data; however, that relies on the operation being idempotent, and if a previous operation has overwritten input with a new value due to the same memory being used for input and output, some operations won't be idempotent. This fixes that by masking off the already processed elements. It adds tests to validate in-place use works, and it updates the docs to carve out this valid overlapping.
* Vectorize TensorPrimitives.ConvertToSingle (#92779)
* Vectorize TensorPrimitives.ConvertToSingle
* Address PR feedback
* Throw exception in TensorPrimitives for unsupported span overlaps (#92838)
* This vectorizes TensorPrimitives.Log2 (#92897)
* Add a way to support operations that can't be vectorized on netstandard
* Updating TensorPrimitives.Log2 to be vectorized on .NET Core
* Update src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netstandard.cs
  Co-authored-by: Stephen Toub
* Ensure we do an arithmetic right shift in the Log2 vectorization
* Ensure the code can compile on .NET 7
* Ensure that edge cases are properly handled and don't resolve to `x`
* Ensure that Log2 special results are explicitly handled.
---------
Co-authored-by: Stephen Toub
* Adding Log2 tests covering some special values (#92946)
* [wasm] Disable `TensorPrimitivesTests.ConvertToHalf_SpecialValues` (#92953)
  Failing test: `System.Numerics.Tensors.Tests.TensorPrimitivesTests.ConvertToHalf_SpecialValues`
  Issue: #92885
* Adding a vectorized implementation of TensorPrimitives.Log (#92960)
* Adding a vectorized implementation of TensorPrimitives.Log
* Make sure to hit Ctrl+S
* Consolidate some TensorPrimitivesTests logic around special values (#92982)
* Vectorize TensorPrimitives.Exp (#93018)
* Vectorize TensorPrimitives.Exp
* Update src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netstandard.cs
* Vectorize TensorPrimitives.Sigmoid and TensorPrimitives.SoftMax (#93029)
* Vectorize TensorPrimitives.Sigmoid and TensorPrimitives.SoftMax
  - Adds a SigmoidOperator that just wraps the ExpOperator
  - Vectorizes both passes of SoftMax, on top of ExpOperator. Simplest way to do this was to augment the existing InvokeSpanScalarIntoSpan to take a transform operator.
  - In doing so, found some naming inconsistencies I'd previously introduced, so I did some automatic renaming to make things more consistent.
  - Added XML comments to all the internal/private surface area.
  - Fleshes out some tests (and test values).
* Disable tests on mono
* Address PR feedback
* Vectorize TensorPrimitives.Tanh/Cosh/Sinh (#93093)
* Vectorize TensorPrimitives.Tanh/Cosh/Sinh
  Tanh and Cosh are based on AOCL-LibM. AOCL-LibM doesn't appear to have a sinh implementation, so this Sinh is just based on the sinh formula based on exp(x). I also augmented the tests further, including:
  - Added more tests for sinh/cosh/tanh
  - Add an equality routine that supports comparing larger values with a tolerance
  - Tightened the tolerance for most functions
  - Changed some tests to be theories to be consistent with style elsewhere in the tests
  - Fixed some use of Math to be MathF
* Remove unnecessary special-handling path from cosh
* Remove unnecessary special-handling path from tanh
* Redo sinh based on cosh
* Address PR feedback
* Replace confusing new T[] { ... }
* Remove a few unnecessary `unsafe` keyword uses in TensorPrimitives (#93219)
* Consolidate a few exception throws in TensorPrimitives (#93168)
* Fix TensorPrimitives.IndexOfXx corner-case when first element is seed value (#93169)
* Fix TensorPrimitives.IndexOfXx corner-case when first element is seed value
  Found as part of adding more tests for Min/Max{Magnitude} to validate they match their IndexOfXx variants.
* Address PR feedback
* Improve a vector implementation to support alignment and non-temporal stores (#93296)
* Improve a vector implementation to support alignment and non-temporal stores
* Fix a build error and mark a couple methods as AggressiveInlining
* Fix the remaining block count computation
* Ensure overlapping for small data on the V256/512 is handled
* Ensure we only go down the vectorized path when supported for netstandard
* Mark TensorPrimitives as unsafe (#93412)
* Use the improved vectorization algorithm for binary and ternary TensorPrimitives operations (#93409)
* Update InvokeSpanSpanIntoSpan<TBinaryOperator> for TensorPrimitives to use the better SIMD algorithm
* Update InvokeSpanScalarIntoSpan<TTransformOperator, TBinaryOperator> for TensorPrimitives to use the better SIMD algorithm
* Update InvokeSpanSpanSpanIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm
* Update InvokeSpanSpanScalarIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm
* Update InvokeSpanScalarSpanIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm
* Improve codegen slightly by using case 0, rather than default
* Adjust the canAlign check to be later, to reduce branch count for data under the threshold
* Add a comment explaining the NonTemporalByteThreshold
* Make sure xTransformOp.CanVectorize is checked on .NET Standard
* Use the improved vectorization algorithm for aggregate TensorPrimitives operations (#93695)
* Improve the handling of the IAggregationOperator implementations
* Update Aggregate<TTransformOperator, TAggregationOperator> for TensorPrimitives to use the better SIMD algorithm
* Update Aggregate<TBinaryOperator, TAggregationOperator> for TensorPrimitives to use the better SIMD algorithm
* Respond to PR feedback
* [wasm] Remove more active issues for #92885 (#93596)
* adding patch from pr 93556
* Vectorizes IndexOfMin/Max/Magnitude (#93469)
* resolved merge conflicts
* net core full done
* minor code cleanup
* NetStandard and PR fixes.
* minor pr changes
* Fix IndexOfMaxMagnitudeOperator
* Fix IndexOfMaxMagnitudeOperator on netcore
* updates from PR comments
* netcore fixed
* net standard updated
* add reference assembly exclusions
* made naive approach better
* resolved PR comments
* minor comment changes
* minor formatting fixes
* added inlining
* fixes from PR comments
* comments from pr
* fixed spacing
---------
Co-authored-by: Eric StJohn
---------
Co-authored-by: Stephen Toub
Co-authored-by: Tanner Gooding
Co-authored-by: Ankit Jain
Co-authored-by: Radek Doulik
Co-authored-by: Eric StJohn
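The in-place problem described in the #92820 entry of the commit message above can be illustrated with a small scalar sketch. This is not the library's implementation: the block width of 4 stands in for a vector's element count, and `InPlaceSketch`, `AddOne`, and the `maskProcessed` flag are hypothetical names used only to contrast the two behaviours.

```csharp
using System;

static class InPlaceSketch
{
    const int Width = 4; // stand-in for the number of elements in one vector

    // Applies v => v + 1 block by block; dest may alias src.
    public static void AddOne(ReadOnlySpan<float> src, Span<float> dest, bool maskProcessed)
    {
        int i = 0;
        for (; i + Width <= src.Length; i += Width)
        {
            for (int j = 0; j < Width; j++)
            {
                dest[i + j] = src[i + j] + 1;
            }
        }

        if (i < src.Length && src.Length >= Width)
        {
            // Reprocess the final Width elements, which overlap the previous block.
            int last = src.Length - Width;
            for (int j = 0; j < Width; j++)
            {
                // With masking, elements below i were already processed and are skipped.
                if (maskProcessed && last + j < i)
                {
                    continue;
                }
                dest[last + j] = src[last + j] + 1;
            }
        }
    }
}
```

Running this in place on six zeros without masking increments elements 2 and 3 twice, because the second pass reads values the first pass already overwrote; with masking every element ends up as 1.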

tannergooding added 8 commits October 12, 2023 10:41

Update InvokeSpanSpanIntoSpan<TBinaryOperator> for TensorPrimitives to use the better SIMD algorithm

bae2b67

Update InvokeSpanScalarIntoSpan<TTransformOperator, TBinaryOperator> for TensorPrimitives to use the better SIMD algorithm

0f5fb2a

Update InvokeSpanSpanSpanIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm

5059d01

Update InvokeSpanSpanScalarIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm

520a3e1

Update InvokeSpanScalarSpanIntoSpan<TTernaryOperator> for TensorPrimitives to use the better SIMD algorithm

b1dec16

Improve codegen slightly by using case 0, rather than default

58a9047

Adjust the canAlign check to be later, to reduce branch count for data under the threshold

e3e6ae2

Add a comment explaining the NonTemporalByteThreshold

42a2014

ghost assigned tannergooding Oct 12, 2023

dotnet-issue-labeler bot added the area-System.Numerics label Oct 12, 2023

Merge remote-tracking branch 'dotnet/main' into vectorize-align-2

e327af9

build-analysis bot mentioned this pull request Oct 12, 2023

Intermittent build failure in AfterSourceBuild: "Could not write state file" #76488

Open

lewing mentioned this pull request Oct 12, 2023

System.Numerics.Tensors.Tests.TensorPrimitivesTests.SoftMax not implemented #93425

Closed

stephentoub reviewed Oct 13, 2023

stephentoub approved these changes Oct 13, 2023

tannergooding added 2 commits October 13, 2023 12:58

Merge remote-tracking branch 'dotnet/main' into vectorize-align-2

852bf98

Make sure xTransformOp.CanVectorize is checked on .NET Standard

f03dc1d

This was referenced Oct 16, 2023

[mono][tvos] OOM in System.IO.Tests.MemoryStreamTests #92467

Closed

Incorrect value in FirstDayOfWeek test #93354

Closed

tannergooding merged commit 08c08ba into dotnet:main Oct 17, 2023
106 of 109 checks passed

ilonatommy mentioned this pull request Oct 17, 2023

[browser] Disable HybridGlobalization failure #93560

Merged

tannergooding mentioned this pull request Oct 18, 2023

Use the improved vectorization algorithm for aggregate TensorPrimitives operations #93695

Merged

ghost locked as resolved and limited conversation to collaborators Nov 16, 2023
