RyuJIT call optimization and aggressive inlining with known generic types #4489
Comments
I'm not quite sure what this has to do with generics, this looks more like a devirtualization problem. |
If I remember correctly, if ClassCalls was a struct and not a class, the Execute method would be inlined |
@panost Yes, in the case of structs the JIT usually does devirtualization. It's practically forced to do so, making an interface call on a value type would require boxing and you'd end up calling a method on a copy of the original value. |
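The boxing point above can be seen in a small sketch. The names here (`ICounter`, `StructCounter`, `CounterDemo`) are hypothetical and not taken from the gist; only the general interface-plus-struct shape follows the thread. Calling through a constrained generic mutates the original struct, while passing the struct as an interface boxes it and mutates a copy:

```csharp
using System;

// Hypothetical types for illustration; not from the original gist.
public interface ICounter
{
    void Increment();
    int Value { get; }
}

public struct StructCounter : ICounter
{
    private int _value;
    public void Increment() => _value++;
    public int Value => _value;
}

public static class CounterDemo
{
    // For a value-type T the JIT compiles a dedicated instantiation,
    // so this constrained call needs no boxing.
    public static void IncrementGeneric<T>(ref T counter) where T : ICounter
        => counter.Increment();

    // Taking the interface forces a box: the call mutates the boxed copy.
    public static void IncrementBoxed(ICounter counter)
        => counter.Increment();

    public static (int viaGeneric, int viaInterface) Run()
    {
        var a = new StructCounter();
        IncrementGeneric(ref a);   // mutates 'a' in place
        var b = new StructCounter();
        IncrementBoxed(b);         // boxes 'b'; the original is untouched
        return (a.Value, b.Value);
    }
}
```

`CounterDemo.Run()` returns `(1, 0)`: the generic path updated the original value, the interface path updated only the boxed copy.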
@mikedn yeah, but my benchmarks and the emitted assembly suggest that the JIT will perform some limited devirtualization if no generic type is involved. I agree that the topic could probably be changed to "RyuJIT support aggressive call devirtualization over constrained generic types", or something better if we can think of one. |
Do you have some sample code? |
@mikedn Sure. https://gist.github.com/redknightlois/5bafa47ee9835605da26 Just don't execute the naked call versions along with the generic ones. The difference in instruction count between them will skew the results, probably because I am not counting the source instructions properly and so am not passing the right number to BenchmarkDotNet for it to adjust correctly. In there you will see that all the naked calls (sealed, unsealed and interface) have essentially the same cost. The assembly emitted for the 3 is identical as far as I remember. This suggests that some limited devirtualization is happening. @CarolEidt can you provide some insight here? |
I haven't measured the time but the naked interface variant certainly generates different code from the other 2 naked variants. As for actually doing devirtualization in this case - it isn't that simple. For example, the call in |
@mikedn And then I remember that I have RyuJIT disabled :) These are the legacy JIT calls: Not a huge timing difference in between the alternatives. EDIT: In the 64bits version there is an indirection on the call and a "lea" operation over the r11 register that looks like padding. Naked class call:
Interface call:
|
That's the 32 bit JIT, not the legacy (aka JIT64) JIT. Though on my machine JIT32 does inline the first two calls... |
@mikedn I hate the "Prefer 32bits" option set by default of Visual Studio. See the edit. |
That |
@mikedn I know, that's why I said: "And then I remember that I have RyuJIT disabled :)" ... the limited devirtualization I've seen was JIT64 not RyuJIT making the whole argument moot. |
But there's no kind of devirtualization going on in JIT64 either. |
@mikedn OK now I see what you mean. For all intents and purposes those 2 call opcodes are equivalent. The performance profile of that interface call is the same as that of the register call on every processor from Sandy Bridge onward (and maybe a couple before). But that's an artifact introduced by my code, because I isolated the 3 calls in their own methods. When they are called one after another (even creating the object on the line before), it can be seen that no devirtualization happens for the interface, even though that would have been insanely safe. However, it can be argued that devirtualization of the form:
    ICall instance = new ClassCall();
    instance.Execute();
could be done at the compiler level without much hassle. In the constrained generic types case that doesn't seem to be true.
EDIT: Even if the only devirtualization that happens works for the following code, I would be glad:
    public class Executer<T> where T : ICalls
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static void Execute(T instance)
        {
            instance.Execute();
        }
    }
Where the calling code would look like:
    ClassCalls _nakedClassCalls = new ClassCalls();
    ....
    Executer<ClassCalls>.Execute(_nakedClassCalls); |
Yes, that's one case where devirtualization is possible. In itself it is a rather useless case, as there's little reason to write such code to begin with (the only practical use for that kind of code is to access explicitly implemented members). But such opportunities can show up in real code as the result of inlining of either the
Generics don't play any part in this except for the fact that they introduce the interface call. For reference types your generic
    public class Executer
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static void Execute(ICalls instance)
        {
            instance.Execute();
        }
    } |
Devirtualization of enumerators called via Interfaces back to structs would be nice... |
@redknightlois can you look this over and update if you still think there is anything actionable here, or close if not? For generics instantiated over ref types we're unlikely to do devirtualization anytime soon, as the jit only sees the shared version. This might change down the road, if we somehow enabled unshared ref type instantiations or started looking into speculative devirtualization. If the generic can get inlined into a context where the types are known then things open up a little and if the jit can put enough pieces together or see sealed types, it can do a lot of optimization. |
@AndyAyersMS given that there are a few workarounds that could be found with sealed types and the actual solution for this is devirtualization of generic ref types I would say that the criteria for closing could be:
If all are yes, I would say that this is done. |
There's no way for the jit to deduce the exact type of the instance members at jit time; all the jit knows is that the type is one of the exact instantiations of the shared type Executer`1. If it turns out that the instance member is always just one or a handful of types, then via profiling the jit can discover which type is most likely, guess for that, and perform guarded devirtualization and subsequent inlining. This can be seen with the changes for class profiling linked above; e.g.
; Assembly listing for method Runtime4489:UseSealedCalls():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; partially interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 15122173
; invoked as altjit
; Final local variable assignments
;
; V00 this [V00,T00] ( 4, 4 ) ref -> rcx this class-hnd
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T02] ( 2, 4 ) ref -> rcx ld-addr-op class-hnd "Inlining Arg"
; V03 tmp2 [V03,T01] ( 3, 4 ) ref -> rcx "guarded devirt this temp"
;* V04 tmp3 [V04 ] ( 0, 0 ) ref -> zero-ref class-hnd exact "guarded devirt this exact temp"
; V05 tmp4 [V05,T03] ( 2, 4 ) ref -> r11 class-hnd "Inlining Arg"
;
; Lcl frame size = 40
G_M52712_IG01: ;; offset=0000H
4883EC28 sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M52712_IG02: ;; offset=0004H
4C8B5930 mov r11, gword ptr [rcx+48]
488B4918 mov rcx, gword ptr [rcx+24]
45391B cmp dword ptr [r11], r11d
49BBE8C335C4F87F0000 mov r11, 0x7FF8C435C3E8
4C3919 cmp qword ptr [rcx], r11
7517 jne SHORT G_M52712_IG04
48B9ACA232C4F87F0000 mov rcx, 0x7FF8C432A2AC
4533DB xor r11d, r11d
448919 mov dword ptr [rcx], r11d
FF01 inc dword ptr [rcx]
;; bbWeight=1 PerfScore 15.75
G_M52712_IG03: ;; offset=0030H
4883C428 add rsp, 40
C3 ret
;; bbWeight=1 PerfScore 1.25
G_M52712_IG04: ;; offset=0035H
49BB580505C4F87F0000 mov r11, 0x7FF8C4050558
48B8580505C4F87F0000 mov rax, 0x7FF8C4050558
FF10 call qword ptr [rax]ICalls:Execute():this
EBE3 jmp SHORT G_M52712_IG03
;; bbWeight=0 PerfScore 0.00
In an AOT scenario without PGO, and if one can impose suitable restrictions (no reflection, etc) it might be possible for RTA or similar to deduce that only one type can possibly be assigned to the instance members. Going to keep this open and in future, but once PGO is a bit further along may come back and close this one. |
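For readers who prefer source to assembly, the guarded devirtualization in the listing above corresponds roughly to the hand-written pattern below. This is only a sketch: `SealedCalls`, `OtherCalls` and `GuardedCallDemo` are assumed names (the thread only confirms `ICalls` and `Execute`), and the JIT inserts its guard based on profile data rather than on anything written in the source.

```csharp
using System;

public interface ICalls { void Execute(); }

// Assumed implementation types, for illustration only.
public sealed class SealedCalls : ICalls
{
    public int Count;
    public void Execute() => Count++;
}

public class OtherCalls : ICalls
{
    public int Count;
    public void Execute() => Count += 10;
}

public static class GuardedCallDemo
{
    public static void ExecuteGuarded(ICalls instance)
    {
        // Guard: test for the likely type. Because SealedCalls is sealed,
        // a match means the exact type is known and the call is direct.
        if (instance is SealedCalls known)
            known.Execute();      // direct call, a candidate for inlining
        else
            instance.Execute();   // fallback: ordinary interface dispatch
    }

    public static (int sealedCount, int otherCount) Run()
    {
        var s = new SealedCalls();
        var o = new OtherCalls();
        ExecuteGuarded(s);   // takes the guarded fast path
        ExecuteGuarded(o);   // takes the interface fallback
        return (s.Count, o.Count);
    }
}
```

In the listing, the `cmp qword ptr [rcx], r11` / `jne` pair plays the role of the `is` test, the inlined increment is the fast path, and block `IG04` is the interface-call fallback.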
Now that dynamic PGO is on by default, I think we can indeed close this. |
This will probably end up in the future releases wishlist, but it is something I have been looking forward to for a long time already.
Let's say that we have this code:
And we have the following instances:
Now we would expect the call for _classCalls.Execute(x) to be different from the one for _interfaceCalls.Execute(x). Apparently that is not the case: the JIT stops at the first level even if it has the complete information to emit highly optimized code for that call site.
Now, suppose the implementation is:
There is no way that the JIT would inline that code, even if for all purposes it is safe to do so.
The scenario for this pattern is pretty common in high performance code where the calls are very small, in tight loops but must be able to handle more than a single type... An example is a BitVector with variants for MemoryMappedBitVector, UnsafeBitVector, LongBitVector and so on. Operations tend to be very small and executed in very tight loops.
Today we either need a different codepath for each one, or pay the call tax.
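One way to avoid both the duplicated codepaths and the call tax, consistent with the struct behavior described in the comments, is to make the variants structs and pass them through a constrained generic. The sketch below is an assumption about how that might look, not code from this issue: `IBitStore`, `LongBitStore`, `ArrayBitStore` and `BitOps` are hypothetical names.

```csharp
using System;

// Hypothetical storage abstraction for a bit vector.
public interface IBitStore
{
    bool Get(int index);
}

// Up to 64 bits packed into one ulong.
public readonly struct LongBitStore : IBitStore
{
    private readonly ulong _bits;
    public LongBitStore(ulong bits) => _bits = bits;
    public bool Get(int index) => ((_bits >> index) & 1UL) != 0;
}

// Arbitrary length, 64 bits per array element.
public readonly struct ArrayBitStore : IBitStore
{
    private readonly ulong[] _words;
    public ArrayBitStore(ulong[] words) => _words = words;
    public bool Get(int index) => ((_words[index >> 6] >> (index & 63)) & 1UL) != 0;
}

public static class BitOps
{
    // One source-level codepath. Each struct TStore gets its own
    // instantiation, so store.Get is a direct call the JIT may inline.
    public static int CountSet<TStore>(TStore store, int length)
        where TStore : struct, IBitStore
    {
        int count = 0;
        for (int i = 0; i < length; i++)
        {
            if (store.Get(i)) count++;
        }
        return count;
    }
}
```

For example, `BitOps.CountSet(new LongBitStore(0xBUL), 64)` counts the three set bits of `0xB`. The trade-off is that the variants must be value types, which does not fit every design.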
category:cq
theme:inlining
skill-level:expert
cost:large