-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete redundant vectorization code paths #49846
Comments
Tagging subscribers to this area: @tannergooding Issue DetailsWe have a non-trivial number of implementations that duplicate loops across multiple hardware instruction sets plus a generalized Vector path, e.g. runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs Lines 266 to 434 in 9f93bcb
has a structure that's approximately: if (Avx2.IsSupported)
{
... // use Avx2
}
else if (Sse2.IsSupported)
{
... // use Sse2
}
else if (AdvSimd.Arm64.IsSupported)
{
... // use AdvSimd
}
else if (Vector.IsHardwareAccelerated)
{
... // use Vector / Vector<T>
} We have others like: runtime/src/libraries/System.Collections/src/System/Collections/BitArray.cs Lines 152 to 224 in 9f93bcb
that similarly have paths for Avx2, Sse2, and AdvSimd but without a Vector path. This is a lot of complicated code being duplicated, and it should no longer be necessary. In most cases, the path using For paths that don't have an Avx2 option, and instead just have paths for Sse2/AdvSimd/Vector, it's less clear whether we could delete the Sse2/AdvSimd implementations, as it's possible there could be performance differences (for better or worse) if the Vector path ended up using 32-byte widths. cc: @tannergooding, @benaadams
|
They aren't completely the same with for example the |
What is the additional optimization in the BitArray example? |
Looks like it needs work 😅 As an aside I was considering whether a Generator for all the SpanHelper methods would be interesting since they are all variations on a theme (as is the one in Don't know if that would be better or worse? Would be less code to look after, but also easier to generate more variants so more asm... |
I think a generator for vectorized code seems like an overall interesting idea but also a complex one. The main issue I see is that we have 3-4 code paths that are all nearly identical. They often only differ in one or two places when they do differ at all. The options for normalizing these are:
It is not so nice in that it is variable sized and so it can be difficult to accurately know the benefit.
It is not nice in that it may leave perf on the table when inputs are particularly large, but I feel this is algorithm specific and not the "default/expected" |
For span helpers while the "not here" are mostly the same (essentially runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs Lines 1024 to 1035 in 9f93bcb
whereas the Find runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs Lines 1173 to 1180 in 9f93bcb
Locate runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs Lines 1681 to 1698 in 9f93bcb
Isolate runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs Lines 1915 to 1916 in 9f93bcb
However accessing each element of the vector individually to check them isn't great... Perhaps we need to extend the apis on the platform independent vectors? |
The primary blocker here is that |
Looks like Ice Lake wasn't downclocking much and Rocket Lake isn't for AVX-512 even when used on all cores? https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html#rocket-lake |
I don't think we can rely on Ice Lake not downclocking as the basis for using or not using AVX-512. AVX-512 is relatively new still and it will likely be a while before it is commonplace enough to determine what the correct baseline is. 256-bit AVX instructions had similar issues when first released and they have improved overtime in a similar fashion. |
This is something that will likely not make 6.0.0 but will be more easily resolvable once #49397 is in. |
Closing as we're exploring other directions. |
We have a non-trivial number of implementations that duplicate loops across multiple hardware instruction sets plus a generalized Vector path, e.g.
runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Lines 266 to 434 in 9f93bcb
has a structure that's approximately:
We have others like:
runtime/src/libraries/System.Collections/src/System/Collections/BitArray.cs
Lines 152 to 224 in 9f93bcb
that similarly have paths for Avx2, Sse2, and AdvSimd but without a Vector path.
This is a lot of complicated code being duplicated, and it should no longer be necessary. In most cases, the path using
Vector<T>
should produce code identical to what we'd otherwise write manually using hardware intrinsics, automatically picking 32-bytes or 16-bytes based on what the hardware supports. So, for any of these cases that do Avx2/Sse2/AdvSimd/Vector, we should theoretically be able to simply delete the Avx2/Sse2/AdvSimd code paths, leaving only the Vector code path, because if the Avx2 path would be supported, then Vector would similarly use 32-byte widths, and if Avx2 wouldn't be supported but Sse2/AdvSimd would, Vector would similarly use 16-byte widths. And for the cases that today use Avx2/Sse2/AdvSimd without a Vector path, we should be able to add a path for Vector and then delete the others, for the same reason.For paths that don't have an Avx2 option, and instead just have paths for Sse2/AdvSimd/Vector, it's less clear whether we could delete the Sse2/AdvSimd implementations, as it's possible there could be performance differences (for better or worse) if the Vector path ended up using 32-byte widths.
cc: @tannergooding, @benaadams
The text was updated successfully, but these errors were encountered: