Arm64: Generate conditional comparison and selection instructions #55364
Comments
I've been taking a look at this. Starting with:
Gives the CIL:
Which becomes:
When running tier1
When running tier0
Delete LowerJTrue(), then run tier0
At first glance, this looks correct. However, the LowerJTrue feels wrong. Would it make sense to
There are probably some subtleties I'm missing (I'm not sure if OptimizeConstCompare catches all the cases LowerJTrue does). And I've not run any performance testing on any of the above. If neither of the above holds, then can this issue be closed? |
That's correct, we do generate optimized code, so I have updated the PR description to reflect the current problem.
G_M9565_IG01:
stp fp, lr, [sp,#-16]!
mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M9565_IG02:
tst w0, w1
bne G_M9565_IG04
;; bbWeight=1 PerfScore 1.50
G_M9565_IG03:
mov w0, w1
b G_M9565_IG05
;; bbWeight=0.50 PerfScore 0.75
G_M9565_IG04:
mov w1, w0
;; bbWeight=0.50 PerfScore 0.25
G_M9565_IG05:
bl _12219:Consume(int,int)
;; bbWeight=1 PerfScore 1.00
G_M9565_IG06:
ldp fp, lr, [sp],#16
ret lr
;; bbWeight=1 PerfScore 2.00
We try to do minimum optimization in tier0 for speedup and simplicity, so we mostly focus on improving the tier1 code. |
Ok, so trying to break this down into a first step:
Ends up as two basic blocks:
That needs optimising to:
If done correctly, then that should hopefully get rid of all the branches in all the examples above, which should get a large portion of the performance. We can then look at generating the other conditionals and combining instructions in the other examples. |
I've been speaking to various people within Arm, and the benefit of switching to csel (and friends) on AArch64 isn't obvious, because of modern branch prediction. Branches are predicted many cycles before the condition is evaluated (and before the branch itself is even fetched), so when the prediction is accurate, branches give significant speedups. In addition, dependency chains through the csel, especially when the result of the csel is required in the next iteration, can make csel significantly slower than branches. Note that GCC and LLVM make cost-based choices on when to use csel. LLVM is considering changing their approach too ( https://discourse.llvm.org/t/rfc-cmov-vs-branch-optimization/6040 ). Of course, for a JIT, the cost model itself needs to be lightweight to compute. The current advice is:
The performance impact of the above is likely to be small. However, every use of csel will reduce code size. AIUI, this is a concern for .NET, so where performance between the two options is the same, csel should be preferred. AIUI, x86 has similar behaviour, but I'm not sure how close. @TamarChristinaArm for reference |
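The distinction drawn above between a select whose result feeds the next iteration and one that does not can be illustrated with two small C# loops. This is only a sketch of the two source shapes: the class and method names are made up for illustration, and whether the JIT emits csel or a branch for either loop is not asserted here.

```csharp
using System;

// Hypothetical names, for illustration only.
public static class SelectShapes
{
    // Loop-carried select: each iteration's result is an input to the
    // next iteration's select, so the selects form a dependency chain.
    public static int RunningMax(int[] values)
    {
        int max = int.MinValue;
        foreach (int v in values)
        {
            max = v > max ? v : max;
        }
        return max;
    }

    // Independent selects: each result depends only on the current
    // element, so there is no chain from one iteration to the next.
    public static int[] ClampNegativesToZero(int[] values)
    {
        var result = new int[values.Length];
        for (int i = 0; i < values.Length; i++)
        {
            result[i] = values[i] < 0 ? 0 : values[i];
        }
        return result;
    }

    public static void Main()
    {
        int[] data = { 3, -7, 12, 0, -1, 9 };
        Console.WriteLine(RunningMax(data));
        Console.WriteLine(string.Join(", ", ClampNegativesToZero(data)));
    }
}
```

The first loop is the shape where the comment above expects csel to be at a disadvantage relative to a well-predicted branch; the second is the shape where the two options are more likely to perform alike.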
Moved to .NET 8 to finish the remaining work. |
With the merging of #77728, some of the examples are now looking much better. There are still missing bits.
The op2 check should generate a CMP and CSEL. This should be fully fixed by #79283. It won't catch the case where the else case has a different target from the if case, e.g.:
This now will do:
To make this ideal, we'd have to detect the 6 is 1 greater than the 5:
Should be fairly straightforward to do via lowering/containing.
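The "6 is 1 greater than 5" observation can be checked at the source level. The sketch below assumes the ternary in question is the thread's `op1 > 0 ? 5 : 6`; the method names are hypothetical, and the second method only models what a conditional increment computes rather than showing JIT code.

```csharp
using System;

// Hypothetical names, for illustration only.
public static class ConditionalIncrement
{
    // The original ternary with two separate constants.
    public static int Ternary(int op1) => op1 > 0 ? 5 : 6;

    // Same result from one constant plus a conditional increment:
    // keep 5, and add 1 only when the inverted condition (op1 <= 0) holds.
    public static int ViaIncrement(int op1)
    {
        int value = 5;
        return op1 <= 0 ? value + 1 : value;
    }

    public static void Main()
    {
        foreach (int op1 in new[] { int.MinValue, -1, 0, 1, int.MaxValue })
        {
            Console.WriteLine($"{op1}: {Ternary(op1)} {ViaIncrement(op1)}");
        }
    }
}
```

Only one constant has to be materialised in a register in the second form, which is where the saving comes from.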
We'll generate:
Switching the CSEL to a CSET would get rid of the MOV. I suspect this would need changes in the If Conversion pass.
This is ideal now. CMP followed by CSEL.
Looking good here.
|
@a74nh are we going to address examples 3-5 as well in the upcoming months? |
Example 3 - Needs work
Having a quick look at example 3, it only saves a single mov, but it should be fairly easy to implement as it can fit into the existing csel work. I'll get @SwapnilGaikwad to look at this in Q2 so that we can close this issue. There is also the option of using CINV for the equivalent of:
I suspect instances of this are low. Currently it generates |
I verified all the examples and they generate the expected code. Thank you @a74nh , @SwapnilGaikwad and @jakobbotsch !
C# examples
[MethodImpl(MethodImplOptions.NoInlining)]
static int Example1(int op1, int op2) {
if (op1 > 0 && op2 > 0) {
op1 = 5;
}
else {
op1 = 10;
}
return op1;
}
[MethodImpl(MethodImplOptions.NoInlining)]
static int Example2(int op1, int op2) {
return op1 > 0 ? 5 : 6;
}
[MethodImpl(MethodImplOptions.NoInlining)]
static int Example3(int op1, int op2) {
return (op1 > 5) ? 0 : 1;
}
[MethodImpl(MethodImplOptions.NoInlining)]
static int Example4(int op1, int op2, int xyz, int def) {
return op1 > 0 ? xyz : def;
}
[MethodImpl(MethodImplOptions.NoInlining)]
static int Example5(int op1, int op2, int xyz, int def) {
return ((op1 & op2) == 0) ? 5 : def;
}
Assembly code
Inside TLS()
; Assembly listing for method helloworld.TLS:Example1(int,int):int
; Emitting BLENDED_CODE for generic ARM64 - Windows
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 5, 5 ) int -> x0
; V01 arg1 [V01,T01] ( 3, 3 ) int -> x1 single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [sp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M60152_IG01: ;; offset=0000H
stp fp, lr, [sp, #-0x10]!
mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M60152_IG02: ;; offset=0008H
mov w2, #10
mov w3, #5
cmp w0, #0
ccmp w1, #0, nzc, gt
csel w0, w2, w3, le
;; size=20 bbWeight=1 PerfScore 2.50
G_M60152_IG03: ;; offset=001CH
ldp fp, lr, [sp], #0x10
ret lr
;; size=8 bbWeight=1 PerfScore 2.00
; Total bytes of code 36, prolog size 8, PerfScore 9.60, instruction count 9, allocated bytes for code 36 (MethodHash=5f6e1507) for method helloworld.TLS:Example1(int,int):int
; ============================================================
; Assembly listing for method helloworld.TLS:Example2(int,int):int
; Emitting BLENDED_CODE for generic ARM64 - Windows
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) int -> x0 single-def
;* V01 arg1 [V01 ] ( 0, 0 ) int -> zero-ref single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [sp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M31387_IG01: ;; offset=0000H
stp fp, lr, [sp, #-0x10]!
mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M31387_IG02: ;; offset=0008H
mov w1, #5
cmp w0, #0
cinc w0, w1, le
;; size=12 bbWeight=1 PerfScore 1.50
G_M31387_IG03: ;; offset=0014H
ldp fp, lr, [sp], #0x10
ret lr
;; size=8 bbWeight=1 PerfScore 2.00
; Total bytes of code 28, prolog size 8, PerfScore 7.80, instruction count 7, allocated bytes for code 28 (MethodHash=8c118564) for method helloworld.TLS:Example2(int,int):int
; ============================================================
; Assembly listing for method helloworld.TLS:Example3(int,int):int
; Emitting BLENDED_CODE for generic ARM64 - Windows
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) int -> x0 single-def
;* V01 arg1 [V01 ] ( 0, 0 ) int -> zero-ref single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [sp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M43834_IG01: ;; offset=0000H
stp fp, lr, [sp, #-0x10]!
mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M43834_IG02: ;; offset=0008H
cmp w0, #5
cset x0, le
;; size=8 bbWeight=1 PerfScore 1.00
G_M43834_IG03: ;; offset=0010H
ldp fp, lr, [sp], #0x10
ret lr
;; size=8 bbWeight=1 PerfScore 2.00
; Total bytes of code 24, prolog size 8, PerfScore 6.90, instruction count 6, allocated bytes for code 24 (MethodHash=993b54c5) for method helloworld.TLS:Example3(int,int):int
; ============================================================
; Assembly listing for method helloworld.TLS:Example4(int,int,int,int):int
; Emitting BLENDED_CODE for generic ARM64 - Windows
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) int -> x0 single-def
;* V01 arg1 [V01 ] ( 0, 0 ) int -> zero-ref single-def
; V02 arg2 [V02,T01] ( 3, 3 ) int -> x2 single-def
; V03 arg3 [V03,T02] ( 3, 3 ) int -> x3 single-def
;# V04 OutArgs [V04 ] ( 1, 1 ) struct ( 0) [sp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M62429_IG01: ;; offset=0000H
stp fp, lr, [sp, #-0x10]!
mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M62429_IG02: ;; offset=0008H
cmp w0, #0
csel w0, w2, w3, gt
;; size=8 bbWeight=1 PerfScore 1.00
G_M62429_IG03: ;; offset=0010H
ldp fp, lr, [sp], #0x10
ret lr
;; size=8 bbWeight=1 PerfScore 2.00
; Total bytes of code 24, prolog size 8, PerfScore 6.90, instruction count 6, allocated bytes for code 24 (MethodHash=79970c22) for method helloworld.TLS:Example4(int,int,int,int):int
; ============================================================
; Assembly listing for method helloworld.TLS:Example5(int,int,int,int):int
; Emitting BLENDED_CODE for generic ARM64 - Windows
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) int -> x0 single-def
; V01 arg1 [V01,T01] ( 3, 3 ) int -> x1 single-def
;* V02 arg2 [V02 ] ( 0, 0 ) int -> zero-ref single-def
; V03 arg3 [V03,T02] ( 3, 3 ) int -> x3 single-def
;# V04 OutArgs [V04 ] ( 1, 1 ) struct ( 0) [sp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M9340_IG01: ;; offset=0000H
stp fp, lr, [sp, #-0x10]!
mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M9340_IG02: ;; offset=0008H
mov w2, #5
tst w0, w1
csel w0, w2, w3, eq
;; size=12 bbWeight=1 PerfScore 1.50
G_M9340_IG03: ;; offset=0014H
ldp fp, lr, [sp], #0x10
ret lr
;; size=8 bbWeight=1 PerfScore 2.00
; Total bytes of code 28, prolog size 8, PerfScore 7.80, instruction count 7, allocated bytes for code 28 (MethodHash=1640db83) for method helloworld.TLS:Example5(int,int,int,int):int
; ============================================================
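To see why the `cmp` / `ccmp` / `csel` sequence in the Example1 listing matches `op1 > 0 && op2 > 0`, the flag behaviour can be modelled in C#. This is a rough sketch: the `Flags` type and helper names are invented for illustration, and only the NZCV behaviour needed for the signed `gt` and `le` conditions is modelled.

```csharp
using System;

// Hypothetical model of the NZCV flags; not part of the JIT.
public static class FlagModel
{
    public readonly struct Flags
    {
        public readonly bool N, Z, C, V;
        public Flags(bool n, bool z, bool c, bool v) { N = n; Z = z; C = c; V = v; }
        public bool Gt => !Z && N == V;   // signed greater than
        public bool Le => Z || N != V;    // signed less than or equal
    }

    // cmp a, b: set the flags from the subtraction a - b.
    public static Flags Cmp(int a, int b)
    {
        int r = unchecked(a - b);
        bool n = r < 0;
        bool z = r == 0;
        bool c = unchecked((uint)a) >= unchecked((uint)b); // no borrow
        bool v = ((a ^ b) & (a ^ r)) < 0;                  // signed overflow
        return new Flags(n, z, c, v);
    }

    // ccmp a, b, #nzcv, cond: compare only if cond held, else load the immediate flags.
    public static Flags Ccmp(bool cond, int a, int b, Flags immediate) =>
        cond ? Cmp(a, b) : immediate;

    public static int Example1Model(int op1, int op2)
    {
        Flags f = Cmp(op1, 0);                                       // cmp  w0, #0
        f = Ccmp(f.Gt, op2, 0, new Flags(true, true, true, false));  // ccmp w1, #0, nzc, gt
        return f.Le ? 10 : 5;                                        // csel w0, w2, w3, le
    }

    // The C# source from the thread, for comparison.
    public static int Example1Source(int op1, int op2) => (op1 > 0 && op2 > 0) ? 5 : 10;

    public static void Main()
    {
        int[] samples = { int.MinValue, -1, 0, 1, int.MaxValue };
        foreach (int a in samples)
            foreach (int b in samples)
                Console.WriteLine($"{a}, {b}: {Example1Model(a, b)} {Example1Source(a, b)}");
    }
}
```

When the first comparison fails `gt`, the `nzc` immediate sets Z, which forces `le` to hold, so the `csel` picks 10 without evaluating the second comparison.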
|
Arm64 provides branchless conditional selection and comparison instructions that should be utilized by RyuJIT in the code it generates.
Reference: https://eclecticlight.co/2021/07/20/code-in-arm-assembly-conditions-without-branches/
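As a quick orientation, the results these instructions produce can be written as one-line C# expressions. The sketch below is a model of the architectural semantics only; the class and method names are hypothetical and have nothing to do with the JIT's emitter API.

```csharp
using System;

// Hypothetical names; each method models the value written to the
// destination register, with the condition already evaluated to a bool.
public static class CondSelectModel
{
    // csel  rd, rn, rm, cond : rd = cond ? rn : rm
    public static int Csel(bool cond, int rn, int rm) => cond ? rn : rm;

    // csinc rd, rn, rm, cond : rd = cond ? rn : rm + 1
    public static int Csinc(bool cond, int rn, int rm) => cond ? rn : unchecked(rm + 1);

    // csinv rd, rn, rm, cond : rd = cond ? rn : ~rm
    public static int Csinv(bool cond, int rn, int rm) => cond ? rn : ~rm;

    // csneg rd, rn, rm, cond : rd = cond ? rn : -rm
    public static int Csneg(bool cond, int rn, int rm) => cond ? rn : unchecked(-rm);

    // cset  rd, cond         : rd = cond ? 1 : 0
    public static int Cset(bool cond) => cond ? 1 : 0;

    // cinc  rd, rn, cond     : rd = cond ? rn + 1 : rn
    public static int Cinc(bool cond, int rn) => cond ? unchecked(rn + 1) : rn;

    public static void Main()
    {
        Console.WriteLine(Csel(true, 5, 10));
        Console.WriteLine(Csinc(false, 5, 10));
        Console.WriteLine(Csneg(false, 5, 10));
        Console.WriteLine(Cset(true));
        Console.WriteLine(Cinc(true, 5));
    }
}
```

Each of these replaces a compare-and-branch diamond with straight-line code, which is the pattern the examples in this issue are after.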
RyuJIT already has support for them as seen below:
runtime/src/coreclr/jit/instrsarm64.h
Lines 1353 to 1375 in f0b7773
runtime/src/coreclr/jit/instrsarm64.h
Lines 633 to 639 in f0b7773
Currently, the methods emitIns_R_R_R_COND and emitIns_R_I_FLAGS_COND that produce these instructions are not utilized at all. emitIns_R_R_R_COND was recently used in #66407 to generate the csneg instruction. Once these instructions are used, we could produce much better code. Below are some examples:
Example# 1:
Ideal code: https://godbolt.org/z/5ov9TKx6P
Current code:
Example# 2:
Ideal code: https://godbolt.org/z/GTnc4jjfG
Current code:
Example# 3:
Ideal code: https://godbolt.org/z/GoqcsM1Tf
Current code:
Example# 4:
Ideal code: https://godbolt.org/z/1EfxPn48q
Current code:
Example# 5:
Ideal code: https://godbolt.org/z/fc3eddPx3
Current code:
@TamarChristinaArm
Some related issues:
Presumably, some parts of the analysis can be implemented in platform agnostic way and benefit both Arm64 and X86 platforms.
category:cq
theme:intrinsics
skill-level:expert
cost:large
impact:medium