Accelerating Vector512.Sum() #87851

DeepakRajendrakumaran · 2023-06-20T23:16:21Z

This accelerates Vector512.Sum() using AVX512 instructions.

Vector512.Sum() ->

vmovups  zmm0, zmmword ptr [rcx]
vextracti64x4 ymm1, zmm0, 1
vpaddd   ymm0, ymm1, ymm0
vextracti128 xmm1, ymm0, 1
vpaddd   xmm0, xmm1, xmm0
vphaddd  xmm0, xmm0, xmm0
vphaddd  xmm0, xmm0, xmm0
vmovd    eax, xmm0

@dotnet/avx512-contrib

ghost · 2023-06-20T23:16:31Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This accelerates Vector512.Sum() using AVX512 instructions.

Vector512.Sum() ->

vmovups  zmm0, zmmword ptr [rcx]
vextracti64x4 ymm1, zmm0, 1
vpaddd   ymm0, ymm1, ymm0
vextracti128 xmm1, ymm0, 1
vpaddd   xmm0, xmm1, xmm0
vphaddd  xmm0, xmm0, xmm0
vphaddd  xmm0, xmm0, xmm0
vmovd    eax, xmm0

@dotnet/avx512-contrib

Author:	DeepakRajendrakumaran
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

gfoidl · 2023-06-21T13:38:32Z

Is vphaddd the instruction we want to use here?
I'm asking because of

phaddd is implemented in microcode on all supporting processors, so it's slow compared to pshufd+paddd

(from SixLabors/ImageSharp#1630 (comment))

and

phaddd is no faster than 2 shuffles + a vertical add

(from https://stackoverflow.com/a/59326242/347870

I know that Vector128/256 also uses vphaddd and I don't know if current cpus still implement it in microcode (from Agner Fog's instruction tables it looks like to be microcode).

DeepakRajendrakumaran · 2023-06-21T16:30:34Z

Is vphaddd the instruction we want to use here? I'm asking because of

phaddd is implemented in microcode on all supporting processors, so it's slow compared to pshufd+paddd

(from SixLabors/ImageSharp#1630 (comment))

and

phaddd is no faster than 2 shuffles + a vertical add

(from https://stackoverflow.com/a/59326242/347870

I know that Vector128/256 also uses vphaddd and I don't know if current cpus still implement it in microcode (from Agner Fog's instruction tables it looks like to be microcode).

That's a good question and something we can possibly consider. ICC for eg. creates the following - https://godbolt.org/z/ojGcK9a4W

I'd imagine this would mean optimizing the whole SUM() lowering if we want to change for Vector128/ vector256/vector512. It's something I'm open to since Vector*.Dot() also in theory can use this logic(currently there is little overlap between Dot() and Sum().

I'd love to hear what @tannergooding and @BruceForstall have to add here

tannergooding · 2023-06-22T14:51:37Z

Should be fine. We'd be looking at 9 instructions with hadd vs 11 with permute+add, but a more opportunities for parallel dispatch (some CPUs have 2 shuffle/permute ports) and almost half the micro-ops/latency.

Would be good to ensure that all 3 sizes for Sum are doing the "optimal" thing here.

BruceForstall · 2023-06-22T20:56:28Z

Related: #85207

DeepakRajendrakumaran · 2023-06-22T20:58:09Z

Related: #85207

You can mark 'Shuffle()' from there done. I had a PR a few weeks ago

tannergooding · 2023-07-14T15:54:13Z

src/coreclr/jit/gentree.cpp

+    if (simdSize == 64)
+    {
+        assert(IsBaselineVector512IsaSupportedDebugOnly());
+        // This is roughly the following managed code:
+        //   ...
+        //   simd64 tmp2 = tmp1;
+        //          tmp3 = tmp2.GetUpper();
+        //   simd32 tmp4 = Isa.Add(tmp1.GetLower(), tmp2);
+        //          tmp5 = tmp4;
+        //   simd16 tmp6 = tmp4.GetUpper();
+        //       tmp1 = Isa.Add(tmp1.GetLower(), tmp2);
+        //   ...
+        // From here on we can treat this as a simd16 reduction
+        GenTree* op1Dup     = fgMakeMultiUse(&op1);
+        GenTree* op1Lower32 = gtNewSimdGetUpperNode(TYP_SIMD32, op1Dup, simdBaseJitType, simdSize);
+        GenTree* op1Upper32 = gtNewSimdGetLowerNode(TYP_SIMD32, op1, simdBaseJitType, simdSize);
+
+        simdSize   = simdSize / 2;
+        op1Lower32 = gtNewSimdBinOpNode(GT_ADD, TYP_SIMD32, op1Lower32, op1Upper32, simdBaseJitType, simdSize);
+        haddCount--;
+
+        GenTree* op1Dup32   = fgMakeMultiUse(&op1Lower32);
+        GenTree* op1Lower16 = gtNewSimdGetUpperNode(TYP_SIMD16, op1Lower32, simdBaseJitType, simdSize);
+        GenTree* op1Upper16 = gtNewSimdGetLowerNode(TYP_SIMD16, op1Dup32, simdBaseJitType, simdSize);
+        simdSize            = simdSize / 2;
+        op1                 = gtNewSimdBinOpNode(GT_ADD, TYP_SIMD16, op1Lower16, op1Upper16, simdBaseJitType, simdSize);
+        haddCount--;
+    }


This is "correct", but has an interesting side effect in that can change the result for float/double.

Since float/double are not associative due to the rounding, summing accross as [0] + [1] + [2] + [3] + ... is different from summing pairwise as (([0] + [1]) + ([2] + [3])) + ..., which is different from summing per lane, then combining the lanes, etc.

Today, we're basically doing it per lane, then combining lanes. Within that lane we're typically doing pairwise because that's how addv (add across) works on Arm64, it's how hadd (horizontal add) works on x86/x64, and its trivial to emulate using shufps on older hardware.

We don't really want to have the results subtly change based on what the hardware supports, so we should probably try to ensure this keeps things overall consistent in how it operates.

I'm busy with something else but will get back to this once I'm done. Sorry for the delay

No worries. This one isn't "critical" for .NET 8 and we're already generating decent (but not amazing) code that would be similar to what a user might manually write.

JulieLeeMSFT · 2023-08-07T16:03:57Z

Markign this as .NET 9.

BruceForstall · 2023-09-13T05:03:52Z

@DeepakRajendrakumaran Can you mark this "Draft" until it is ready to review again?

DeepakRajendrakumaran · 2023-09-13T16:18:56Z

@DeepakRajendrakumaran Can you mark this "Draft" until it is ready to review again?

Done

ghost · 2023-10-13T17:01:15Z

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 20, 2023

ghost added the community-contribution Indicates that the PR has been added by a community member label Jun 20, 2023

Accelerating Vector512.Sum()

2d3c632

JulieLeeMSFT requested review from BruceForstall and tannergooding June 26, 2023 16:34

JulieLeeMSFT assigned tannergooding and DeepakRajendrakumaran Jul 10, 2023

tannergooding added the avx512 Related to the AVX-512 architecture label Jul 14, 2023

tannergooding reviewed Jul 14, 2023

View reviewed changes

JulieLeeMSFT added this to the 9.0.0 milestone Aug 7, 2023

DeepakRajendrakumaran marked this pull request as draft September 13, 2023 16:18

BruceForstall mentioned this pull request Oct 12, 2023

Intel architecture improvements for .NET 9 #93196

Closed

33 tasks

ghost closed this Oct 13, 2023

ghost locked as resolved and limited conversation to collaborators Nov 12, 2023

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accelerating Vector512.Sum() #87851

Accelerating Vector512.Sum() #87851

DeepakRajendrakumaran commented Jun 20, 2023

ghost commented Jun 20, 2023

gfoidl commented Jun 21, 2023

DeepakRajendrakumaran commented Jun 21, 2023 •

edited

Loading

tannergooding commented Jun 22, 2023

BruceForstall commented Jun 22, 2023

DeepakRajendrakumaran commented Jun 22, 2023 •

edited

Loading

tannergooding Jul 14, 2023

DeepakRajendrakumaran Jul 14, 2023

tannergooding Jul 14, 2023

JulieLeeMSFT commented Aug 7, 2023

BruceForstall commented Sep 13, 2023

DeepakRajendrakumaran commented Sep 13, 2023

ghost commented Oct 13, 2023

Accelerating Vector512.Sum() #87851

Accelerating Vector512.Sum() #87851

Conversation

DeepakRajendrakumaran commented Jun 20, 2023

ghost commented Jun 20, 2023

gfoidl commented Jun 21, 2023

DeepakRajendrakumaran commented Jun 21, 2023 • edited Loading

tannergooding commented Jun 22, 2023

BruceForstall commented Jun 22, 2023

DeepakRajendrakumaran commented Jun 22, 2023 • edited Loading

tannergooding Jul 14, 2023

Choose a reason for hiding this comment

DeepakRajendrakumaran Jul 14, 2023

Choose a reason for hiding this comment

tannergooding Jul 14, 2023

Choose a reason for hiding this comment

JulieLeeMSFT commented Aug 7, 2023

BruceForstall commented Sep 13, 2023

DeepakRajendrakumaran commented Sep 13, 2023

ghost commented Oct 13, 2023

DeepakRajendrakumaran commented Jun 21, 2023 •

edited

Loading

DeepakRajendrakumaran commented Jun 22, 2023 •

edited

Loading