Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[macOS] Potential regression in delegates invocation #59152

Closed
Tracked by #64603
adamsitnik opened this issue Sep 15, 2021 · 19 comments
Closed
Tracked by #64603

[macOS] Potential regression in delegates invocation #59152

adamsitnik opened this issue Sep 15, 2021 · 19 comments
Assignees
Labels
Milestone

Comments

@adamsitnik
Copy link
Member

It seems to be affecting only macOS (cc @Lxiamail @jeffhandley)

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net5.0 net6.0 --filter PerfLabTests.DelegatePerf.DelegateInvoke

PerfLabTests.DelegatePerf.DelegateInvoke

Result Ratio Operating System
Same 1.01 Windows 10.0.19043.1165
Same 0.99 Windows 10.0.20348
Same 1.00 Windows 10.0.20348
Same 1.00 Windows 10.0.18363.1621
Same 0.97 Windows 8.1
Same 1.00 Windows 10.0.19042.685
Same 1.00 Windows 10.0.19043.1165
Same 0.99 Windows 10.0.22454
Same 0.99 Windows 10.0.22451
Same 1.00 Windows 10.0.19042.1165
Slower 0.29 Windows 7 SP1
Same 0.99 centos 8
Same 1.00 debian 10
Same 0.99 rhel 7
Same 1.02 sles 15
Same 1.01 opensuse-leap 15.3
Same 1.00 ubuntu 18.04
Same 1.00 ubuntu 18.04
Same 1.00 alpine 3.13
Same 1.00 ubuntu 16.04
Faster 1.33 Windows 10.0.19043.1165
Faster 1.43 Windows 10.0.22000
Same 1.00 Windows 10.0.19043.1165
Same 1.00 Windows 10.0.18363.1621
Same 0.99 Windows 10.0.19043.1165
Slower 0.89 macOS Big Sur 11.5.2
Slower 0.72 macOS Big Sur 11.5.2
Slower 0.78 macOS Big Sur 11.4

cc @AndyAyersMS @kunalspathak @EgorBo

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Sep 15, 2021
@ghost
Copy link

ghost commented Sep 15, 2021

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

It seems to be affecting only macOS (cc @Lxiamail @jeffhandley)

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net5.0 net6.0 --filter PerfLabTests.DelegatePerf.DelegateInvoke

PerfLabTests.DelegatePerf.DelegateInvoke

Result Ratio Operating System
Same 1.01 Windows 10.0.19043.1165
Same 0.99 Windows 10.0.20348
Same 1.00 Windows 10.0.20348
Same 1.00 Windows 10.0.18363.1621
Same 0.97 Windows 8.1
Same 1.00 Windows 10.0.19042.685
Same 1.00 Windows 10.0.19043.1165
Same 0.99 Windows 10.0.22454
Same 0.99 Windows 10.0.22451
Same 1.00 Windows 10.0.19042.1165
Slower 0.29 Windows 7 SP1
Same 0.99 centos 8
Same 1.00 debian 10
Same 0.99 rhel 7
Same 1.02 sles 15
Same 1.01 opensuse-leap 15.3
Same 1.00 ubuntu 18.04
Same 1.00 ubuntu 18.04
Same 1.00 alpine 3.13
Same 1.00 ubuntu 16.04
Faster 1.33 Windows 10.0.19043.1165
Faster 1.43 Windows 10.0.22000
Same 1.00 Windows 10.0.19043.1165
Same 1.00 Windows 10.0.18363.1621
Same 0.99 Windows 10.0.19043.1165
Slower 0.89 macOS Big Sur 11.5.2
Slower 0.72 macOS Big Sur 11.5.2
Slower 0.78 macOS Big Sur 11.4

cc @AndyAyersMS @kunalspathak @EgorBo

Author: adamsitnik
Assignees: -
Labels:

area-System.Runtime, tenet-performance

Milestone: -

@jeffhandley jeffhandley added this to the 6.0.0 milestone Sep 30, 2021
@jeffhandley jeffhandley added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Sep 30, 2021
@jeffhandley
Copy link
Member

@AndyAyersMS @kunalspathak @EgorBo What would you recommend as the first step of investigating this?

@kunalspathak
Copy link
Member

@AndyAyersMS - Is this the example that involves struct arguments passing and we hit some struct promotion limit that you were referring to few days back?

@AndyAyersMS
Copy link
Member

I don't think so -- the test doesn't look like it involves structs.

https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/runtime/perflab/DelegatePerf.cs#L32-L44

As far as investigating goes, we should see if we can repro locally, then we can drill in (perhaps start with jitted codegen for the benchmark method).

What HW was this...?

@jeffhandley
Copy link
Member

@adamsitnik I need to redirect this back to you for initial investigation per Andy's comments.

@adamsitnik
Copy link
Member Author

What HW was this...?

@AndyAyersMS it was reproducible on all three x64 mac books that we have used (mine is 4 year old mac book pro, not sure about @jeffhandley or @carlossanlop laptops who provided the other results).

I need to redirect this back to you for initial investigation per Andy's comments.

I am able to reproduce it on my macBook, the problem is that there is no easy way to get disassembly on macOS so I can't just share it. That is why it would be better if someone from the JIT Team took a look at this.

@adamsitnik adamsitnik removed their assignment Oct 11, 2021
@jeffhandley
Copy link
Member

My MacBook Pro is a mid-2014 13" retina with 2.6GHz dual-core Intel Core i5 and 8GB RAM.

@jeffhandley
Copy link
Member

I'm moving this to 7.0.0, but we'll still want to get to the root cause to make sure we understand the impact and tradeoff.

@jeffhandley jeffhandley modified the milestones: 6.0.0, 7.0.0 Oct 18, 2021
@AndyAyersMS
Copy link
Member

I can see if this repros on my Mac Mini, but it may take me a while to get around to it...

@AndyAyersMS
Copy link
Member

I can repro this. Will see what I can uncover...

BenchmarkDotNet=v0.13.1.1611-nightly, OS=macOS High Sierra 10.13.6 (17G14042) [Darwin 17.7.0]
Intel Core i5-4278U CPU 2.60GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-IIHWUZ : .NET 5.0.11 (5.0.1121.47308), X64 RyuJIT
  Job-SZCDQB : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  

|         Method |        Job |  Runtime | Toolchain |     Mean |   Error |  StdDev |   Median |      Min |      Max | Ratio | Allocated |
|--------------- |----------- |--------- |---------- |---------:|--------:|--------:|---------:|---------:|---------:|------:|----------:|
| DelegateInvoke | Job-IIHWUZ | .NET 5.0 |    net5.0 | 458.4 us | 1.22 us | 1.08 us | 458.3 us | 457.1 us | 461.2 us |  1.00 |       1 B |
| DelegateInvoke | Job-SZCDQB | .NET 6.0 |    net6.0 | 523.7 us | 1.02 us | 0.85 us | 523.8 us | 522.3 us | 524.8 us |  1.14 |       2 B |

@jeffhandley
Copy link
Member

Thanks, @AndyAyersMS. Unless you think this should be considered for 6.0.1 (first servicing release) as a potential issue to address, it's not urgent. Since it can be repro'd, I'll remove the needs further triage label.

@jeffhandley jeffhandley removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Oct 19, 2021
@AndyAyersMS
Copy link
Member

From what I can tell the jitted codegen for 5.0 and 6.0 is identical. So the perf issue is either in the native part of the runtime (seems unlikely) or some stub I can't yet see, or...?

Going to try and get a more general profile, but I am not that familiar with how to do this on MacOS.

@AndyAyersMS
Copy link
Member

Ok, think this is related to loop alignment.

In 5.0 we only 32 byte aligned Tier1 methods with loops. This was fixed in 6.0 (#42909) to handle 32 byte aligning all optimized method with loops. We later went on to add alignment padding for loops in 6.0. But we bypass padding if a loop contains a call.

In this test the jitted codegen is identical in 5.0 and 6.0, and the key inner loop in InvokeDelegate is

G_M6345_IG03:              ;; offset=003EH
       488BC3               mov      rax, rbx
       488B7808             mov      rdi, gword ptr [rax+8]
       498BF6               mov      rsi, r14
       BA64000000           mov      edx, 100
       B964000000           mov      ecx, 100
       FF5018               call     qword ptr [rax+24]DelegateLong:Invoke(System.Object,long,long):long:this
       4C8BF8               mov      r15, rax
       41FFC4               inc      r12d
       48B88067BF0801000000 mov      rax, 0x108BF6780
       443B20               cmp      r12d, dword ptr [rax]
       7CD4                 jl       SHORT G_M6345_IG03
						;; bbWeight=4    PerfScore 39.00
G_M6345_IG04:              ;; offset=006AH

For 6.0 the method is 32 byte aligned, and so this loop body runs from 0x3E .. 0x6A and needs 3 fetch windows (plus presumably one more for the call).

For 5.0 the method ends up being 16 byte aligned, this is favorable for the key loop which now spans (effectively) 0x2E..0x5A or two fetch windows (plus perhaps one more for the call).

As a result 5.0 runs faster:

BenchmarkDotNet=v0.13.1.1611-nightly, OS=macOS Catalina 10.15.7 (19H1419) [Darwin 19.6.0]
Intel Core i5-4278U CPU 2.60GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-AIDUIT : .NET 5.0.11 (5.0.1121.47308), X64 RyuJIT
  Job-NBIBBZ : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio Allocated
DelegateInvoke Job-AIDUIT .NET 5.0 net5.0 462.6 us 2.25 us 2.11 us 462.6 us 460.1 us 467.9 us 1.00 1 B
DelegateInvoke Job-NBIBBZ .NET 6.0 net6.0 530.0 us 3.86 us 3.61 us 528.2 us 525.8 us 536.9 us 1.15 2 B

But if I fix the "bug" in the 5.0 alignment code by setting COMPlus_TC_QuickJitForLoops=1 then 5.0 also gets 32 byte method alignment and so poor loop alignment, and the performance equalizes:

BenchmarkDotNet=v0.13.1.1611-nightly, OS=macOS Catalina 10.15.7 (19H1419) [Darwin 19.6.0]
Intel Core i5-4278U CPU 2.60GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-CFHBWL : .NET 5.0.11 (5.0.1121.47308), X64 RyuJIT
  Job-SQMUMX : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT

EnvironmentVariables=COMPlus_TC_QuickJitForLoops=1  PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
WarmupCount=1  
Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio Allocated
DelegateInvoke Job-CFHBWL .NET 5.0 net5.0 540.9 us 5.78 us 4.83 us 540.5 us 533.7 us 550.2 us 1.00 -
DelegateInvoke Job-SQMUMX .NET 6.0 net6.0 540.7 us 2.75 us 2.58 us 541.0 us 534.7 us 544.4 us 1.00 2 B

If you modify the benchmark to use a param instead of a static for the loop limit InnerIterationCount200000 we see similar swings in perf, as removing the class init check modifies the loop alignment.

At any rate, here's a case where aligning a loop with a call seems to have a noticeable impact on perf, because the callee is trivial.

cc @kunalspathak

@AndyAyersMS
Copy link
Member

We should revisit this now that OSR is enabled and see what that does (will wait for the 7P4 perf runs...)

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jun 16, 2022

Current data with 7.0 looks better but still not quite as good as 5.0

BenchmarkDotNet=v0.13.1.1786-nightly, OS=macOS Monterey 12.4 (21F79) [Darwin 21.5.0]
Intel Core i5-4278U CPU 2.60GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET SDK=7.0.100-preview.5.22307.18
  [Host]     : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  Job-TXZCIP : .NET 5.0.11 (5.0.1121.47308), X64 RyuJIT
  Job-NIIKGA : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  Job-HAVFNP : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio Allocated Alloc Ratio
DelegateInvoke Job-KVFMUW .NET 5.0 net5.0 452.8 us 0.22 us 0.20 us 452.8 us 452.5 us 453.2 us 1.00 1 B 1.00
DelegateInvoke Job-HRZMKX .NET 6.0 net6.0 613.1 us 6.84 us 6.40 us 612.8 us 604.2 us 624.9 us 1.35 2 B 2.00
DelegateInvoke Job-PUBOIJ .NET 7.0 net7.0 518.0 us 0.85 us 0.75 us 518.2 us 517.0 us 519.5 us 1.14 2 B 2.00

@AndyAyersMS
Copy link
Member

Procrastination not paying off here. 7p6 is not any faster.

BenchmarkDotNet=v0.13.1.1786-nightly, OS=macOS Monterey 12.4 (21F79) [Darwin 21.5.0]
Intel Core i5-4278U CPU 2.60GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
Job-SVXENR : .NET 5.0.11 (5.0.1121.47308), X64 RyuJIT
Job-AZCEHU : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
Job-XJBOAY : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio Allocated Alloc Ratio
DelegateInvoke Job-SVXENR .NET 5.0 net5.0 452.4 us 0.49 us 0.41 us 452.2 us 452.0 us 453.3 us 1.00 1 B 1.00
DelegateInvoke Job-AZCEHU .NET 6.0 net6.0 594.4 us 4.51 us 4.21 us 595.7 us 586.6 us 601.1 us 1.31 2 B 2.00
DelegateInvoke Job-XJBOAY .NET 7.0 net7.0 516.7 us 0.14 us 0.12 us 516.6 us 516.5 us 516.9 us 1.14 2 B 2.00

@AndyAyersMS
Copy link
Member

FWIW TieredPGO does nicely here in .NET 7, thanks to @jakobbotsch

Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio Allocated Alloc Ratio
DelegateInvoke Job-KIGZAI .NET 5.0 net5.0 452.4 us 0.56 us 0.47 us 452.2 us 452.04 us 453.5 us 1.00 - NA
DelegateInvoke Job-GEGJCA .NET 6.0 net6.0 516.8 us 0.21 us 0.18 us 516.8 us 516.49 us 517.1 us 1.14 3 B NA
DelegateInvoke Job-RJIHFR .NET 7.0 net7.0 102.0 us 1.50 us 1.33 us 102.1 us 99.52 us 104.1 us 0.23 17 B NA

@AndyAyersMS
Copy link
Member

Some non-OSX data on Intel CPUs

BenchmarkDotNet=v0.13.1.1786-nightly, OS=ubuntu 18.04
Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.5.22307.18
  [Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
  Job-ROQCLI : .NET 5.0.5 (5.0.521.16609), X64 RyuJIT
  Job-RENGIG : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
  Job-STLHTD : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20  
MinIterationCount=15  WarmupCount=1  
Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio Allocated Alloc Ratio
DelegateInvoke Job-ROQCLI .NET 5.0 net5.0 551.5 us 0.51 us 0.45 us 551.4 us 550.7 us 552.2 us 1.00 1 B 1.00
DelegateInvoke Job-RENGIG .NET 6.0 net6.0 614.2 us 2.18 us 1.70 us 613.5 us 612.5 us 618.0 us 1.11 2 B 2.00
DelegateInvoke Job-STLHTD .NET 7.0 net7.0 552.0 us 1.43 us 1.19 us 552.0 us 550.4 us 554.1 us 1.00 2 B 2.00
BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 10 (10.0.19044.1739/21H2/November2021Update)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
  Job-TDISHI : .NET 5.0.17 (5.0.1722.21314), X64 RyuJIT
  Job-DIKFLT : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  Job-AWWZXD : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
DelegateInvoke Job-TDISHI .NET 5.0 net5.0 319.3 us 6.46 us 6.91 us 319.3 us 309.5 us 336.4 us 1.00 0.00 - NA
DelegateInvoke Job-DIKFLT .NET 6.0 net6.0 323.0 us 6.29 us 6.99 us 323.0 us 310.2 us 335.2 us 1.01 0.03 1 B NA
DelegateInvoke Job-AWWZXD .NET 7.0 net7.0 432.0 us 6.38 us 5.33 us 430.2 us 422.6 us 444.7 us 1.35 0.03 1 B NA
BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 11 (10.0.22000.795/21H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.5.22307.18
  [Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
  Job-COVRZU : .NET 5.0.17 (5.0.1722.21314), X64 RyuJIT
  Job-IADVQC : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  Job-MJQVRZ : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
DelegateInvoke Job-COVRZU .NET 5.0 net5.0 625.9 us 75.56 us 87.02 us 587.0 us 527.3 us 750.7 us 1.00 0.00 - NA
DelegateInvoke Job-IADVQC .NET 6.0 net6.0 813.8 us 98.12 us 104.99 us 768.8 us 701.2 us 1,086.4 us 1.29 0.15 1 B NA
DelegateInvoke Job-MJQVRZ .NET 7.0 net7.0 684.8 us 64.32 us 74.07 us 660.4 us 600.7 us 851.8 us 1.12 0.24 2 B NA
BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 11 (10.0.22000.739/21H2)
Intel Core i9-9900T CPU 2.10GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
  Job-AHPPXT : .NET 5.0.17 (5.0.1722.21314), X64 RyuJIT
  Job-AXCIJO : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  Job-WWEVZP : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
DelegateInvoke Job-AHPPXT .NET 5.0 net5.0 325.9 us 13.47 us 15.51 us 329.9 us 304.1 us 354.2 us 1.00 0.00 - NA
DelegateInvoke Job-AXCIJO .NET 6.0 net6.0 452.2 us 4.05 us 3.79 us 452.0 us 444.9 us 458.8 us 1.40 0.07 1 B NA
DelegateInvoke Job-WWEVZP .NET 7.0 net7.0 385.0 us 11.07 us 12.75 us 388.1 us 361.1 us 400.0 us 1.18 0.05 1 B NA
BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 11 (10.0.22000.795/21H2)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-preview.5.22307.18
  [Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
  Job-MCKRNJ : .NET 5.0.17 (5.0.1722.21314), X64 RyuJIT
  Job-UAXSFU : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  Job-MYYFBF : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
Method Job Runtime Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
DelegateInvoke Job-MCKRNJ .NET 5.0 net5.0 269.4 us 4.89 us 4.57 us 269.2 us 263.6 us 278.9 us 1.00 0.00 - NA
DelegateInvoke Job-UAXSFU .NET 6.0 net6.0 268.4 us 3.66 us 3.43 us 267.7 us 263.5 us 275.4 us 1.00 0.01 1 B NA
DelegateInvoke Job-MYYFBF .NET 7.0 net7.0 313.1 us 1.98 us 1.85 us 313.2 us 310.0 us 317.3 us 1.16 0.02 1 B NA

@AndyAyersMS
Copy link
Member

From the above we can see 5.0 is consistently fastest, with 6.0 or 7.0 sometimes similar to 5.0 and sometimes slower depending on the particular processor. 7.0 is generally faster than 6.0 though not always.

I modified the jit to align small (single block) loops with calls (as in the benchmark) and didn't see any improvement, so at this point I suspect the perf differences are related to the alignment of the delegate or its precode.

I think this sort of thing is only going to show up prominently when we have frequently executed delegates that do little or no computation. So, I'm going to close this as won't fix.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants