-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[macOS] Potential regression in delegates invocation #59152
Comments
Tagging subscribers to this area: @dotnet/area-system-runtime Issue DetailsIt seems to be affecting only macOS (cc @Lxiamail @jeffhandley) git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net5.0 net6.0 --filter PerfLabTests.DelegatePerf.DelegateInvoke PerfLabTests.DelegatePerf.DelegateInvoke
cc @AndyAyersMS @kunalspathak @EgorBo
|
@AndyAyersMS @kunalspathak @EgorBo What would you recommend as the first step of investigating this? |
@AndyAyersMS - Is this the example that involves struct arguments passing and we hit some struct promotion limit that you were referring to few days back? |
I don't think so -- the test doesn't look like it involves structs. As far as investigating goes, we should see if we can repro locally, then we can drill in (perhaps start with jitted codegen for the benchmark method). What HW was this...? |
@adamsitnik I need to redirect this back to you for initial investigation per Andy's comments. |
@AndyAyersMS it was reproducible on all three x64 mac books that we have used (mine is 4 year old mac book pro, not sure about @jeffhandley or @carlossanlop laptops who provided the other results).
I am able to reproduce it on my macBook, the problem is that there is no easy way to get disassembly on macOS so I can't just share it. That is why it would be better if someone from the JIT Team took a look at this. |
My MacBook Pro is a mid-2014 13" retina with 2.6GHz dual-core Intel Core i5 and 8GB RAM. |
I'm moving this to 7.0.0, but we'll still want to get to the root cause to make sure we understand the impact and tradeoff. |
I can see if this repros on my Mac Mini, but it may take me a while to get around to it... |
I can repro this. Will see what I can uncover...
|
Thanks, @AndyAyersMS. Unless you think this should be considered for 6.0.1 (first servicing release) as a potential issue to address, it's not urgent. Since it can be repro'd, I'll remove the needs further triage label. |
From what I can tell the jitted codegen for 5.0 and 6.0 is identical. So the perf issue is either in the native part of the runtime (seems unlikely) or some stub I can't yet see, or...? Going to try and get a more general profile, but I am not that familiar with how to do this on MacOS. |
Ok, think this is related to loop alignment. In 5.0 we only 32 byte aligned Tier1 methods with loops. This was fixed in 6.0 (#42909) to handle 32 byte aligning all optimized method with loops. We later went on to add alignment padding for loops in 6.0. But we bypass padding if a loop contains a call. In this test the jitted codegen is identical in 5.0 and 6.0, and the key inner loop in G_M6345_IG03: ;; offset=003EH
488BC3 mov rax, rbx
488B7808 mov rdi, gword ptr [rax+8]
498BF6 mov rsi, r14
BA64000000 mov edx, 100
B964000000 mov ecx, 100
FF5018 call qword ptr [rax+24]DelegateLong:Invoke(System.Object,long,long):long:this
4C8BF8 mov r15, rax
41FFC4 inc r12d
48B88067BF0801000000 mov rax, 0x108BF6780
443B20 cmp r12d, dword ptr [rax]
7CD4 jl SHORT G_M6345_IG03
;; bbWeight=4 PerfScore 39.00
G_M6345_IG04: ;; offset=006AH For 6.0 the method is 32 byte aligned, and so this loop body runs from 0x3E .. 0x6A and needs 3 fetch windows (plus presumably one more for the call). For 5.0 the method ends up being 16 byte aligned, this is favorable for the key loop which now spans (effectively) 0x2E..0x5A or two fetch windows (plus perhaps one more for the call). As a result 5.0 runs faster:
But if I fix the "bug" in the 5.0 alignment code by setting
If you modify the benchmark to use a param instead of a static for the loop limit At any rate, here's a case where aligning a loop with a call seems to have a noticeable impact on perf, because the callee is trivial. |
We should revisit this now that OSR is enabled and see what that does (will wait for the 7P4 perf runs...) |
Current data with 7.0 looks better but still not quite as good as 5.0
|
Procrastination not paying off here. 7p6 is not any faster. BenchmarkDotNet=v0.13.1.1786-nightly, OS=macOS Monterey 12.4 (21F79) [Darwin 21.5.0] PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
|
FWIW TieredPGO does nicely here in .NET 7, thanks to @jakobbotsch
|
Some non-OSX data on Intel CPUs
PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
|
From the above we can see 5.0 is consistently fastest, with 6.0 or 7.0 sometimes similar to 5.0 and sometimes slower depending on the particular processor. 7.0 is generally faster than 6.0 though not always. I modified the jit to align small (single block) loops with calls (as in the benchmark) and didn't see any improvement, so at this point I suspect the perf differences are related to the alignment of the delegate or its precode. I think this sort of thing is only going to show up prominently when we have frequently executed delegates that do little or no computation. So, I'm going to close this as won't fix. |
It seems to be affecting only macOS (cc @Lxiamail @jeffhandley)
PerfLabTests.DelegatePerf.DelegateInvoke
cc @AndyAyersMS @kunalspathak @EgorBo
The text was updated successfully, but these errors were encountered: