-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add vectorized paths for Span<T>.Reverse #64412
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory Issue DetailsAdds vectorized paths to Compared against 65a5d0e. Using this microbenchmark to compare performance, modified to use more buffer sizes and types: Microbenchmark changes
Performance results:
Please let me know if I can provide any other information. Thanks!
|
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs
Outdated
Show resolved
Hide resolved
Thanks for sharing the perf tests. The results only show down to a size of 32 elements, and all of them show improvements. Is there a smaller size at which this is actually a regression? |
runtime/src/libraries/System.Private.CoreLib/src/System/Array.cs Lines 1610 to 1641 in 953fd35
|
src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs
Outdated
Show resolved
Hide resolved
@stephentoub I didn't notice that the table didn't include the results for 8 byte spans. I changed my filter to be a little more inclusive when comparing results:
There's a 2-3ns regression for the really small spans possibly due to the overhead of the extra conditional checks, or it could just be noise. Let me know if you'd like more analysis on it. |
I didn't notice this, but you're right that Array can also benefit from this. I'll try adding the paths there as well and send an update. |
@stephentoub Updated PR to include the optimizations for To verify the performance, I did a quick copy of the same benchmark and had it reverse the original array instead. Microbenchmark changes@@ -14,11 +14,20 @@ namespace System.Memory
[GenericTypeArguments(typeof(byte))]
[GenericTypeArguments(typeof(char))]
[GenericTypeArguments(typeof(int))]
+ [GenericTypeArguments(typeof(long))]
+ [GenericTypeArguments(typeof(float))]
+ [GenericTypeArguments(typeof(double))]
[BenchmarkCategory(Categories.Runtime, Categories.Libraries, Categories.Span)]
public class Span<T>
where T : struct, IComparable<T>, IEquatable<T>
{
- [Params(Utils.DefaultCollectionSize)]
+ [Params(
+ 8 /* No vectorization */,
+ 34 /* SSSE3 with leftover */,
+ 68 /* AVX2 path with leftover bytes */,
+ Utils.DefaultCollectionSize,
+ Utils.DefaultCollectionSize * 2
+ )]
public int Size;
private T[] _array, _same, _emptyWithSingleValue;
@@ -41,6 +50,9 @@ namespace System.Memory
[Benchmark]
public void Reverse() => new System.Span<T>(_array).Reverse();
+ [Benchmark]
+ public void ReverseArray() => _array.Reverse();
+
[Benchmark]
public T[] ToArray() => new System.Span<T>(_array).ToArray(); Performance resultsBenchmarkDotNet=v0.13.1.1669-nightly, OS=Windows 10 (10.0.19044.1415/21H2/November2021Update)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-preview.2.22078.1
[Host] : .NET 7.0.0 (7.0.22.7408), X64 RyuJIT
Job-FHRFIJ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-KGLSXA : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Please let me know if there is anything else I can look at. |
return; | ||
} | ||
} | ||
ReverseInner(array, index, length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to duplicate all of the above? How about instead having a single method in SpanHelpers:
public static void Reverse<T>(ref T elements, nuint length);
or something similar. That method can do all the of the delegation to the other helpers (ReverseByRef, ReverseInner, etc.), and then this method can be:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Reverse<T>(T[] array, int index, int length)
{
... // argument validation
SpanHelpers.Reverse(ref Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(array), index), length);
}
and Span.Reverse
can be:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Reverse<T>(this Span<T> span) =>
SpanHelpers.Reverse(ref MemoryMarshal.GetReference(span), span.Length);
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, yes I agree that would be a lot cleaner! I'll go ahead and make the changes and push an update.
src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
Unsafe.As<byte, Vector256<byte>>(ref last) = tempFirst; | ||
first = ref Unsafe.Add(ref first, Vector256<byte>.Count); | ||
last = ref Unsafe.Add(ref last, -Vector256<byte>.Count); | ||
numBytesWritten += Vector256<byte>.Count * 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments describing what this dense block of code is doing would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added comments to hopefully clear up what the operation is. Please let me know if I can clarify anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. Other than my remaining comments, this LGTM, but @tannergooding should sign-off as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not be alarmed. I just noted a few coding style things.
For the things where I changed the naming from CamelCase to lowerCamelCase please check again that the usage is also adjusted and not that you just take over the diff 1:1.
Otherwise it looks good to me in the first place.
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
Did you benchmark the AVX2 version against the SSSE3 version (run with |
750ddcd
to
9501a0f
Compare
@saucecontrol Haven't tried yet, I'll give it a shot and update this thread with results once finished. |
c663fa4
to
7ef62ab
Compare
ref T last = ref Unsafe.Subtract(ref Unsafe.Add(ref first, (int)length), 1); | ||
do | ||
{ | ||
(last, first) = (first, last); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This trick is neat, but it generates worse IL:
Notice that there is one local temp for the existing code and two local temps with the tuples trick.
Is the JIT able to optimize out the extra local temp in all cases, even for larger structs? We do not seem to have a coverage for larger structs in dotnet/performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tuples trick was mostly just for styling, but if explicitly using the temp generates better IL then we should probably go with that. I've updated the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a case where either Roslyn or the JIT could be updated.
Edit: I missed that the other JIT example above was for byref
which is probably something only Roslyn can fix...
At a high level....
The (last, first) = (first, last)
generates:
IL_0000: ldarg.1
IL_0001: ldarg.2
IL_0002: stloc.0
IL_0003: starg.s last
IL_0005: ldloc.0
IL_0006: starg.s first
IL_0008: ret
The var temp = last; last = first; first = temp;
generates
IL_0000: ldarg.2
IL_0001: ldarg.1
IL_0002: starg.s last
IL_0004: starg.s first
IL_0006: ret
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't the JIT optimize this to the same code? All this code does is shuffle values around. There is no memory or computation involved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed an issue in dotnet/roslyn: dotnet/roslyn#61127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the JIT optimizes it for int
:
https://sharplab.io/#v2:EYLgtghglgdgNAFxBAzmAPgAQAwAIDKAFhAE4AOAMhMAHQBKArjAlGAKYDcAsAFC8DaAKSgIA4mxhsSUAMYAKBAE8ybAPYAzObAQBKHQF1emACy4AKmxQIAjAB4zAPjkk2687nVQSVuLhduzXAAbVF1eAG9eXGj3BDYwMlwAXg8vK24eGNTvBGTg0IyskKs8uISMgF9eAWExCSlZBWU1TW09Qx4Tc0sEACZ7J393TxzfIcDisJ5IzJi5Sd8Rqx08uSWEX0mdSqA=
but not for other types: https://sharplab.io/#v2:EYLgtghglgdgNAFxBAzmAPgAQAwAIDKAFhAE4AOAMhMAHQBKArjAlGAKYDcAsAFC8DaAKSgIA4mxhsSUAMYAKBAE8ybAPYAzOZgCMNACIQEbACqs2ASnMBdXpgAsuY2xQJtAHmMA+OSTbrHuOpQJC5wuL7+xrgANqgI5rwA3ry4qQFGYGS4ALyBwS7cPGl5IQg5MXGFxbEu5RlkhQC+vALCYhJSsgrKapo6+oYmZpY2PPaOzggATB7eEQFBpWHzUTXxSSlpcmthiy7m5XJ7CGFr5k1AA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't the JIT optimize this to the same code? All this code does is shuffle values around. There is no memory or computation involved.
Likely because it can't "see" that the larger struct case is the same pattern as the smaller one. There's also the question of whether it's a worthwhile pattern to spend time trying to recognise.
@alexcovington I've merged #68493, could you please sync your branch? |
…re the same size as char, int, or long that use AVX2 or SSSE3 where possible
Co-authored-by: Theodore Tsirpanis <[email protected]>
…cit inlining and moved generic fallbacks into their own private methods.
…th Span.Reverse and Array.Reverse
…e for reverseMask variable
… reversing empty or single-element array, better shuffle for int and long Reverse using bit control mask instead of vector control mask
…te4x64 and PermuteVar8x32 for Int32 and Int64 respectively to reduce total operations.
17702cd
to
80ae8ab
Compare
@adamsitnik I've synced the PR to include your updated tests, please let me know if there is anything else I can look at. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @alexcovington !
The preview 5 perf report has detected a lot of improvement from this change on a variety of benchmarks, most notably:
I've included the raw data below in "details". Notably, while the x64 configs have speed up, we do see some slowdown on Arm64 configs. This is being tracked here: #68667 I also see some speedup in System.Memory.Span.Reverse(Size: 512)
System.Tests.Perf_Array.Reverse
System.Memory.Span.Reverse(Size: 512)
System.Memory.Span.Reverse(Size: 512)
|
Adds vectorized paths to
Span<T>.Reverse
for types that are supported. Falls back to previous behavior ifT
is not a value type or too big for a vector.Compared against 65a5d0e.
Using this microbenchmark to compare performance, modified to use more buffer sizes and types:
Microbenchmark changes
Performance results:
Please let me know if I can provide any other information. Thanks!