JIT ARM64-SVE: Add FK_3{A,B,C}, EJ_3A, EK_3A, EY_3B, EW_3{A,B} #98187

amanasifkhalid · 2024-02-08T21:35:32Z

Part of #94549. Implements the following encodings:

IF_SVE_FK_3A
IF_SVE_FK_3B
IF_SVE_FK_3C
IF_SVE_EJ_3A
IF_SVE_EK_3A
IF_SVE_EY_3B
IF_SVE_EW_3A (SVE2, unsupported by capstone)
IF_SVE_EW_3B (SVE2, unsupported by capstone)

cstool output:

sqrdmlah      z0.h, z1.h, z1.h[1]
sqrdmlah      z2.h, z3.h, z3.h[3]
sqrdmlsh      z4.h, z5.h, z5.h[5]
sqrdmlsh      z6.h, z7.h, z7.h[7]
sqrdmlah      z8.s, z9.s, z0.s[0]
sqrdmlah      z10.s, z11.s, z2.s[1]
sqrdmlsh      z12.s, z13.s, z4.s[2]
sqrdmlsh      z14.s, z15.s, z6.s[3]
sqrdmlah      z16.d, z17.d, z0.d[0]
sqrdmlah      z18.d, z19.d, z5.d[1]
sqrdmlsh      z20.d, z21.d, z10.d[0]
sqrdmlsh      z22.d, z23.d, z15.d[1]
cdot  z0.s, z1.b, z2.b, #0
cdot  z3.s, z4.b, z5.b, #90
cdot  z6.d, z7.h, z8.h, #180
cdot  z9.d, z10.h, z11.h, #270
cmla  z0.b, z1.b, z2.b, #0
cmla  z3.h, z4.h, z5.h, #90
cmla  z6.s, z7.s, z8.s, #180
cmla  z9.d, z10.d, z11.d, #270
sqrdcmlah     z12.b, z13.b, z14.b, #0
sqrdcmlah     z15.h, z16.h, z17.h, #90
sqrdcmlah     z18.s, z19.s, z20.s, #180
sqrdcmlah     z21.d, z22.d, z23.d, #270
sdot  z0.d, z1.h, z0.h[0]
sdot  z2.d, z3.h, z5.h[1]
udot  z4.d, z5.h, z10.h[0]
udot  z6.d, z7.h, z15.h[1]

JitDisasm output:

sqrdmlah z0.h, z1.h, z1.h[1]
sqrdmlah z2.h, z3.h, z3.h[3]
sqrdmlsh z4.h, z5.h, z5.h[5]
sqrdmlsh z6.h, z7.h, z7.h[7]
sqrdmlah z8.s, z9.s, z0.s[0]
sqrdmlah z10.s, z11.s, z2.s[1]
sqrdmlsh z12.s, z13.s, z4.s[2]
sqrdmlsh z14.s, z15.s, z6.s[3]
sqrdmlah z16.d, z17.d, z0.d[0]
sqrdmlah z18.d, z19.d, z5.d[1]
sqrdmlsh z20.d, z21.d, z10.d[0]
sqrdmlsh z22.d, z23.d, z15.d[1]
cdot    z0.s, z1.b, z2.b, #0
cdot    z3.s, z4.b, z5.b, #90
cdot    z6.d, z7.h, z8.h, #180
cdot    z9.d, z10.h, z11.h, #270
cmla    z0.b, z1.b, z2.b, #0
cmla    z3.h, z4.h, z5.h, #90
cmla    z6.s, z7.s, z8.s, #180
cmla    z9.d, z10.d, z11.d, #270
sqrdcmlah z12.b, z13.b, z14.b, #0
sqrdcmlah z15.h, z16.h, z17.h, #90
sqrdcmlah z18.s, z19.s, z20.s, #180
sqrdcmlah z21.d, z22.d, z23.d, #270
sdot    z0.d, z1.h, z0.h[0]
sdot    z2.d, z3.h, z5.h[1]
udot    z4.d, z5.h, z10.h[0]
udot    z6.d, z7.h, z15.h[1]

cc @dotnet/arm64-contrib

ghost · 2024-02-08T21:35:43Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Author:	amanasifkhalid
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

amanasifkhalid · 2024-02-08T21:38:47Z

@a74nh sorry I stole one of your encodings; I thought I would get IF_SVE_EW_3A out of the way if I'm going to do IF_SVE_EW_3B.

TIHan · 2024-02-08T23:14:28Z

src/coreclr/jit/codegenarm64test.cpp

@@ -5912,6 +5912,42 @@ void CodeGen::genArm64EmitterUnitTestsSve()
    theEmitter->emitIns_R_R_R_I(INS_sve_udot, EA_SCALABLE, REG_V7, REG_V8, REG_V3, 3,
                                INS_OPTS_SCALABLE_H); // UDOT <Zda>.S, <Zn>.H, <Zm>.H[<imm>]

+    // IF_SVE_EJ_3A
+    theEmitter->emitIns_R_R_R_I(INS_sve_cdot, EA_SCALABLE, REG_V0, REG_V1, REG_V2, 0,


For these instructions that take a rotation value, shouldn't we be passing 0, 90, 180, 270 instead of 0, 1, 2, 3?

I also implemented a few instructions that need rotation values in #98141, where I pass 0, 90, 180, 270.

It looks a little weird passing 0-3 instead of 0-270, but I did this to match the bit-level representation of the rotation value, so that we don't need a helper method to encode the rr bits; when displaying the instruction in JitDisasm output, I multiply the immediate by 90 to display it correctly. I'm fine changing my approach to match yours, though I'll have to update a few encodings that are already merged in. @kunalspathak @a74nh do you have any preference?

The way I look at it, the emitIns functions are APIs that should try to match what the instructions actually are. The bit-level representation/encoding is an implementation detail.

I see. In that case, how about I wait for #98141 to be merged in, and then I'll update my encodings that use rotation values to use the helper methods you added?

That seems fair to me

I would prefer we write for readability first if we know that such an optimization would not make any difference in 99.9% of scenarios.

However, I'm fine with encoding it as 0, 1, 2, 3 on instrDesc if that is what we all want. I'll have to adjust my work as well.

> I would prefer we write for readability first

It is readable from the perspective of calling the emitIns* method as seen here: https://github.com/dotnet/runtime/pull/98187/files#diff-d4f9f119d0a321cea7e82023cb754d8abdb800d6185c8bb9464d389ebd50debcR6288

and we flip it just before saving it in instrdesc:

https://github.com/dotnet/runtime/pull/98187/files#diff-2b2c8b9011607926410624d6f81613fad7b74c6e0516d578675a8b792998fe4fR11110

I am not sure the emitOutputInstr() method is readable anyway :)

I'm sympathetic to the readability argument. I guess the silver lining with our current approach is the bitwise representation of the rotation value is abstracted away from the API surface (i.e. the emitIns methods). Maybe I'm being naive, but I don't anticipate the code for handling the rotation values in emitIns or emitDispInsHelp changing with any frequency after this is merged in, whereas the usage of emitIns will certainly increase once we start using these SVE instructions. So in the "important" case, readability isn't hindered.

It isn't necessarily about making emitOutputInstr readable, but about the display code being simple. We will have to decode the imm whose values are 0-3 to be translated to 0-270 on display.

I guess what I'm trying to say is, if we encode the values as 0-3 on the instrDesc, we will have to have an encode/decode for it, whereas if we store 0-270, we only need one encode function.
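The trade-off discussed in this thread can be sketched as a pair of conversions between the rotation in degrees (what the assembly syntax shows) and the 2-bit rr field (what the instruction encodes). This is a minimal illustration, not the code from #98141; the names `encodeRotation` and `decodeRotation` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not the JIT's actual API): map a rotation in degrees
// (0, 90, 180, 270) to the 2-bit "rr" field value (0..3).
inline uint32_t encodeRotation(int degrees)
{
    assert(degrees == 0 || degrees == 90 || degrees == 180 || degrees == 270);
    return static_cast<uint32_t>(degrees / 90);
}

// Hypothetical helper: the inverse mapping, from the 2-bit field back to
// degrees. Only needed if the stored immediate is the 0..3 form.
inline int decodeRotation(uint32_t rr)
{
    assert(rr <= 3);
    return static_cast<int>(rr) * 90;
}
```

If the stored immediate is kept in degrees, only the encode direction is needed (when emitting the instruction bits). If it is kept as 0-3, the display path also needs the decode direction, which is the point made in the last comment above.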

ryujit-bot · 2024-02-09T00:14:10Z

Diff results for #98187

Throughput diffs

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)

Collection	PDIFF
benchmarks.run.windows.arm64.checked.mch	+0.01%

Details here

kunalspathak · 2024-02-09T16:24:47Z

src/coreclr/jit/emitarm64.cpp

-            {
-                assert(isValidUimm4(imm)); // ii rr
-                assert((REG_V0 <= reg3) && (reg3 <= REG_V7));
-                fmt = IF_SVE_FA_3A;


What are these changes? I see that we moved this under emitIns_R_R_R_I_I(), and I am wondering why this was not done when we implemented SVE_FA_3A, SVE_FA_3B, etc.

When I first added emitIns_R_R_R_I_I, I was trying to minimize code duplication, so I just made it a wrapper for emitIns_R_R_R_I by bitwise OR-ing imm1 and imm2 into one imm, and then passing this along to emitIns_R_R_R_I to do the rest. Now I'm running into instructions that have encodings that take one immediate, and encodings that take two immediates, so it's easier to separate these two emitIns methods out.
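As a rough sketch of the wrapper approach described above, two small immediates can be combined into a single value with a shift and a bitwise OR. The layout below (2-bit index in the high bits, 2-bit rotation field in the low bits) is an assumption based on the `// ii rr` comment next to `isValidUimm4(imm)` in the removed code; the function names are hypothetical and are not taken from the JIT.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: pack a 2-bit element index (ii) and a 2-bit rotation field
// (rr) into one 4-bit immediate laid out as "ii rr" (assumed layout).
inline uint32_t packIndexAndRotation(uint32_t index, uint32_t rr)
{
    assert(index <= 3);
    assert(rr <= 3);
    return (index << 2) | rr;
}

// Hypothetical: recover the element index from the packed immediate.
inline uint32_t unpackIndex(uint32_t packed)
{
    assert(packed <= 0xF);
    return packed >> 2;
}

// Hypothetical: recover the rotation field from the packed immediate.
inline uint32_t unpackRotation(uint32_t packed)
{
    assert(packed <= 0xF);
    return packed & 0x3;
}
```

Packing keeps a single code path for one-immediate and two-immediate forms, but every consumer has to know the layout. Separating the two emitIns methods, as this change does, avoids that coupling.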

kunalspathak · 2024-02-09T16:26:37Z

src/coreclr/jit/emitarm64.cpp

+            else
+            {
+                assert(opt == INS_OPTS_SCALABLE_D);
+                assert((REG_V0 <= reg3) && (reg3 <= REG_V15)); // mmmm


Is it worth adding a function like isLowVectorRegister()?

Sure thing.
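A helper of the kind suggested here might look like the sketch below. The `regNumber` enum is a simplified stand-in with made-up values, not the JIT's real register enum, and `checkIndexedOperand` is a hypothetical caller added only to show the range assert being replaced.

```cpp
#include <cassert>

// Simplified stand-in for the JIT's register enum (values are illustrative).
enum regNumber
{
    REG_V0  = 0,
    REG_V7  = 7,
    REG_V15 = 15,
    REG_V16 = 16,
    REG_V31 = 31
};

// True when the register fits a 4-bit "mmmm" register field (V0..V15).
inline bool isLowVectorRegister(regNumber reg)
{
    return (REG_V0 <= reg) && (reg <= REG_V15);
}

// Hypothetical caller: replaces the open-coded range check
//   assert((REG_V0 <= reg3) && (reg3 <= REG_V15));
inline void checkIndexedOperand(regNumber reg3)
{
    assert(isLowVectorRegister(reg3)); // mmmm
}
```

Naming the check documents the intent at each call site. Encodings with a 3-bit register field (V0..V7, as in the removed IF_SVE_FA_3A assert) would still need their own range check.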

amanasifkhalid · 2024-02-09T19:46:39Z

@kunalspathak thanks for the review -- I applied your feedback

ryujit-bot · 2024-02-09T20:16:44Z

Diff results for #98187

Throughput diffs

Throughput diffs for linux/arm64 ran on linux/x64

MinOpts (+0.00% to +0.01%)

Collection	PDIFF
libraries.crossgen2.linux.arm64.checked.mch	+0.01%
libraries_tests.run.linux.arm64.Release.mch	+0.01%
coreclr_tests.run.linux.arm64.checked.mch	+0.01%
benchmarks.run.linux.arm64.checked.mch	+0.01%
benchmarks.run_pgo.linux.arm64.checked.mch	+0.01%
smoke_tests.nativeaot.linux.arm64.checked.mch	+0.01%
benchmarks.run_tiered.linux.arm64.checked.mch	+0.01%

Details here

kunalspathak

LGTM. Thanks!

amanasifkhalid · 2024-02-10T01:41:31Z

I merged in #98141, and replaced my rotation value fixups with the helper methods @TIHan added.

ryujit-bot · 2024-02-10T03:17:59Z

Diff results for #98187

Throughput diffs

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)

Collection	PDIFF
libraries.pmi.windows.arm64.checked.mch	+0.01%

Details here

Throughput diffs for linux/arm64 ran on linux/x64

MinOpts (+0.00% to +0.01%)

Collection	PDIFF
coreclr_tests.run.linux.arm64.checked.mch	+0.01%
libraries.crossgen2.linux.arm64.checked.mch	+0.01%
smoke_tests.nativeaot.linux.arm64.checked.mch	+0.01%
benchmarks.run_pgo.linux.arm64.checked.mch	+0.01%
benchmarks.run_tiered.linux.arm64.checked.mch	+0.01%
libraries_tests.run.linux.arm64.Release.mch	+0.01%
benchmarks.run.linux.arm64.checked.mch	+0.01%

Details here

ryujit-bot · 2024-02-11T09:21:21Z

Diff results for #98187

Throughput diffs

Throughput diffs for linux/arm64 ran on linux/x64

MinOpts (+0.00% to +0.01%)

Collection	PDIFF
benchmarks.run_pgo.linux.arm64.checked.mch	+0.01%
libraries.crossgen2.linux.arm64.checked.mch	+0.01%
smoke_tests.nativeaot.linux.arm64.checked.mch	+0.01%
benchmarks.run_tiered.linux.arm64.checked.mch	+0.01%
benchmarks.run.linux.arm64.checked.mch	+0.01%
coreclr_tests.run.linux.arm64.checked.mch	+0.01%
libraries_tests.run.linux.arm64.Release.mch	+0.01%

Details here

amanasifkhalid added 5 commits February 8, 2024 11:35

Add IF_SVE_FK_3{A,B,C}

e7c7068

Add IF_SVE_EJ_3A

f8f7143

Add IF_SVE_EK_3A

f7c0820

Add IF_SVE_EY_3B

554b98c

Add IF_SVE_EW_3{A,B} (unsupported)

03b3a04

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 8, 2024

ghost assigned amanasifkhalid Feb 8, 2024

amanasifkhalid added the arm-sve Work related to arm64 SVE/SVE2 support label Feb 8, 2024

amanasifkhalid mentioned this pull request Feb 8, 2024

Arm64: Implement SVE encodings #94549

Closed

TIHan reviewed Feb 8, 2024

amanasifkhalid added 3 commits February 8, 2024 18:42

temp

a6686ca

Merge from main

ef67c88

Fix merge

4d6ccd8

kunalspathak mentioned this pull request Feb 9, 2024

JIT: ARM64 SVE format encodings, SVE_GP_3A to SVE_HV_4A #98141

Merged

kunalspathak requested changes Feb 9, 2024

ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Feb 9, 2024

amanasifkhalid added 2 commits February 9, 2024 12:57

Add isLowVectorRegister

5516f7d

Fix rotation value

f75c3ce

kunalspathak approved these changes Feb 9, 2024

Merge from main

84b3c52

Merge branch 'main' into sve-sqrdmlah

22ccd3f

amanasifkhalid merged commit 35562ee into dotnet:main Feb 11, 2024
6 of 16 checks passed

This was referenced Feb 11, 2024

Checkout failure: "Git fetch failed with exit code 128" dotnet/arcade#9009

Open

DataContractSerializerTests.DCS_MyPersonSurrogate_Stress failing in CI #35066

Open

amanasifkhalid deleted the sve-sqrdmlah branch February 12, 2024 03:37

github-actions bot locked and limited conversation to collaborators Mar 13, 2024
