-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JIT] More ARM64 comparison instruction optimizations with Vector.Zero #64783
Conversation
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsDescription ARM64 diffs based on the tests - movi v16.2s, #0x00
- fcmgt v16.2s, v0.2s, v16.2s
+ fcmgt v16.2s, v0.2s, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=aecda9b5) for method Program:AdvSimd_CompareGreaterThan_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=aecda9b5) for method Program:AdvSimd_CompareGreaterThan_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
; ============================================================
- movi v16.4s, #0x00
- fcmgt v16.4s, v0.4s, v16.4s
+ fcmgt v16.4s, v0.4s, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=2b4f1b8c) for method Program:AdvSimd_CompareGreaterThan_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=2b4f1b8c) for method Program:AdvSimd_CompareGreaterThan_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
; ============================================================
- movi v16.4s, #0x00
- fcmgt v16.2d, v0.2d, v16.2d
+ fcmgt v16.2d, v0.2d, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=d2d38400) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=d2d38400) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
; ============================================================
- movi v16.4s, #0x00
- cmgt v16.2d, v0.2d, v16.2d
+ cmgt v16.2d, v0.2d, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=7f83abe4) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=7f83abe4) for method Program:AdvSimd_Arm64_CompareGreaterThan_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
; ============================================================
- movi v16.2s, #0x00
- fcmgt d16, d0, d16
+ fcmgt d16, d0, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=f2f05ab7) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=f2f05ab7) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
; ============================================================
- movi v16.2s, #0x00
- cmgt d16, d0, d16
+ cmgt d16, d0, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=73a0d0d3) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=73a0d0d3) for method Program:AdvSimd_Arm64_CompareGreaterThanScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
; ============================================================
- movi v16.2s, #0x00
- fcmge v16.2s, v0.2s, v16.2s
+ fcmge v16.2s, v0.2s, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=38c243c4) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=38c243c4) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector64_Single_Zero(System.Runtime.Intrinsics.Vector64`1[Single]):System.Runtime.Intrinsics.Vector64`1[Single]
; ============================================================
- movi v16.4s, #0x00
- fcmge v16.4s, v0.4s, v16.4s
+ fcmge v16.4s, v0.4s, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=a5f3879d) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=a5f3879d) for method Program:AdvSimd_CompareGreaterThanOrEqual_Vector128_Single_Zero(System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
; ============================================================
- movi v16.4s, #0x00
- fcmge v16.2d, v0.2d, v16.2d
+ fcmge v16.2d, v0.2d, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=71184af1) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=71184af1) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Double_Zero(System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
; ============================================================
- movi v16.4s, #0x00
- cmge v16.2d, v0.2d, v16.2d
+ cmge v16.2d, v0.2d, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=795b52b5) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=795b52b5) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqual_Vector128_Int64_Zero(System.Runtime.Intrinsics.Vector128`1[Int64]):System.Runtime.Intrinsics.Vector128`1[Int64]
; ============================================================
- movi v16.2s, #0x00
- fcmge d16, d0, d16
+ fcmge d16, d0, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=cf991a66) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=cf991a66) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Double_Zero(System.Runtime.Intrinsics.Vector64`1[Double]):System.Runtime.Intrinsics.Vector64`1[Double]
; ============================================================
- movi v16.2s, #0x00
- cmge d16, d0, d16
+ cmge d16, d0, #0
- ;; bbWeight=1 PerfScore 2.00
+ ;; bbWeight=1 PerfScore 1.50
-; Total bytes of code 28, prolog size 8, PerfScore 8.30, instruction count 7, allocated bytes for code 28 (MethodHash=3213b422) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
+; Total bytes of code 24, prolog size 8, PerfScore 7.40, instruction count 6, allocated bytes for code 24 (MethodHash=3213b422) for method Program:AdvSimd_Arm64_CompareGreaterThanOrEqualScalar_Vector64_Int64_Zero(System.Runtime.Intrinsics.Vector64`1[Int64]):System.Runtime.Intrinsics.Vector64`1[Int64]
; ============================================================
Acceptance Criteria
|
@@ -499,29 +537,6 @@ void CodeGen::genHWIntrinsic(GenTreeHWIntrinsic* node) | |||
GetEmitter()->emitIns_R_R_R(ins, emitSize, targetReg, op1Reg, op2Reg, opt); | |||
break; | |||
|
|||
case NI_AdvSimd_CompareEqual: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part was removed in favor of the table driven approach as it was simpler to handle the instructions explicitly rather than the intrinsics.
@dotnet/jit-contrib This is ready. |
@@ -12995,7 +12995,7 @@ void emitter::emitDispIns( | |||
emitDispVectorReg(id->idReg1(), id->idInsOpt(), true); | |||
emitDispVectorReg(id->idReg2(), id->idInsOpt(), false); | |||
} | |||
if (ins == INS_cmeq) | |||
if (ins == INS_cmeq || ins == INS_cmge || ins == INS_cmgt || ins == INS_cmle || ins == INS_cmlt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's worth a small helper to cover these 5 instructions since they get grouped together a few times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, having the helper would be nice.
But this can be done in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, this is the jit coding convention around subexpressions (including Boolean expressions)
https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/clr-jit-coding-conventions.md#11.1
There should be parentheses around all unary and binary expressions if they are contained within other expressions.
if ((numOperands == 2) && ((intrin.op2->IsVectorZero() && intrin.op2->isContained()) || | ||
(intrin.op1->IsVectorZero() && intrin.op1->isContained() && | ||
HWIntrinsicInfo::IsCommutative(intrin.id)))) | ||
{ | ||
assert(HWIntrinsicInfo::SupportsContainment(intrin.id)); | ||
|
||
if (intrin.op1->IsVectorZero() && intrin.op1->isContained() && | ||
HWIntrinsicInfo::IsCommutative(intrin.id)) | ||
{ | ||
// The intrinsic is commutative, swap the registers. | ||
assert(op1Reg == REG_NA); | ||
op1Reg = op2Reg; | ||
op2Reg = REG_NA; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For x86/x64 we actually do the operand swap in lowering to help simplify codegen a bit.
For example:
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L6113-L6119
Which then means in codegen we only ever have to consider intrin.op2->isContained()
It might be good to do that here too, but I'll defer to @echesakovMSFT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind where we do the swap, but the one positive thing about doing it in codegen is that it isn't dependent on any specific intrinsic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think moving the logic to be table-driven complicates things.
I would rather leave it as a special handling.
For x86/x64 we actually do the operand swap in lowering to help simplify codegen a bit.
I would do this too - this would slightly simplify the codegen part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think moving the logic to be table-driven complicates things.
I actually think the opposite. I find that it greatly simplifies things here and makes it a lot more trivial to light up instructions with similar behavior.
Rather than needing to specially handle the same 'n' comparison in lowering and codegen, and ensuring they stay in sync/etc; we just annotate with a flag that it supports containment of zero and it implicitly lights up.
This has allowed a lot of code and special handling to not exist for x86/x64 and to a similar extent Arm64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving the swap to lowering is fine; it will simplify the logic in codegen though make it more dependent on specific intrinsics, but it only benefits CompareEqual*
intrinsics so it makes sense to do it lowering.
I think moving the logic to be table-driven complicates things.
Can you explain why this complicates things? It makes it easier to focus on the specific instruction rather than the intrinsic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than needing to specially handle the same 'n' comparison in lowering and codegen, and ensuring they stay in sync/etc; we just annotate with a flag that it supports containment of zero and it implicitly lights up.
This is not what I was objecting to. Having a flag is fine. Moving the intrinsics to a dedicated category SIMDCompare
would also be fine. Checking for a specific instruction opcode in a code (that was supposed to be general) is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@echesakovMSFT, you're suggesting that something like:
switch (numOperands)
{
// ... existing code
case 2:
{
if (intrin.SupportsContainmentZero() && intrin.op2->isContained())
{
assert(intrin.op2->isVectorZero());
}
else if (isRMW)
// ... existingCode
}
// ... existing code
}
would be better here, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not necessarily exactly that, but something similar or along those lines)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but now it depends on ins (i.e. on a particular instruction opcode). This is not what I would think of as table-driven implementation.
That's fair and even in the xarch path, there is no branching on specific instructions as well.
I added a flag called "SupportsContainmentZero" and appropriately used the flag on the instrinsics we care about; so no more acting on specific instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@echesakovMSFT, you're suggesting that something like:
@tannergooding Yes.
Actually, I wouldn't focus that much on that the contained operand is 'zero`. So something like
switch (numOperands)
{
// ... existing code
case 2:
{
// Lowering ensures that only op2 need to be checked for containment and will swap operands if needed.
if (intrin.SupportsContainment() && intrin.op2->isContained())
{
}
else if (isRMW)
// ... existingCode
}
// ... existing code
}
would also work.
That's fair and even in the xarch path, there is no branching on specific instructions as well.
I added a flag called "SupportsContainmentZero" and appropriately used the flag on the instrinsics we care about; so no more acting on specific instructions.
@TIHan Thanks!
// - cmhi | ||
// - cmhs | ||
// require both operands; they do not have a 'with zero'. | ||
if (intrin.op2->IsVectorZero() && !varTypeIsUnsigned(intrin.baseType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you look at #64785, there is probably an opportunity to also handle the op1->IsVectorZero()
. For example: CompareGreaterThan(0, x)
can be CompareLessThan(x, 0)
Looks good/correct to me. Just a couple small comments/suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall - left some comments/suggestions.
Description
Addresses the most of the instructions listed in #33972 (comment) except for
cmle
,cmlt
,fcmle
,fcmlt
as we do not emit them anywhere yet - we do not have hardware intrinsic APIs that correspond to those instructions.ARM64 diffs based on the tests
Acceptance Criteria