Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: enable devirtualization/inlining of other array interface methods #109209

Merged
merged 10 commits into from
Oct 28, 2024

Conversation

AndyAyersMS
Copy link
Member

The JIT recently enabled devirtualization of GetEnumerator, but other methods were inhibited from devirtualization because the runtime was returning an instantiating stub instead of the actual method. This blocked inlining and the JIT currently will not GDV unless it can also inline.

So for instance ICollection<T>.Count would not devirtualize.

We think we know enough to pass the right inst parameter (the exact method desc) so enable this for the array case, at least for normal jitting.

For NAOT array devirtualization happens via normal paths thanks to Array<T> so should already fpr these cases. For R2R we don't do array interface devirt (yet).

There was an existing field on CORINFO_DEVIRTUALIZATION_INFO to record the need for an inst parameter, but it was unused and so I renamed it and use it for this case.

Contributes to #108913.

The JIT recently enabled devirtualization of `GetEnumerator`, but other methods
were inhibited from devirtualization because the runtime was returning an
instantiating stub instead of the actual method. This blocked inlining and
the JIT currently will not GDV unless it can also inline.

So for instance `ICollection<T>.Count` would not devirtualize.

We think we know enough to pass the right inst parameter (the exact method
desc) so enable this for the array case, at least for normal jitting.

For NAOT array devirtualization happens via normal paths thanks to `Array<T>` so
should already fpr these cases. For R2R we don't do array interface devirt (yet).

There was an existing field on `CORINFO_DEVIRTUALIZATION_INFO` to record the
need for an inst parameter, but it was unused and so I renamed it and use it
for this case.

Contributes to dotnet#108913.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 25, 2024
@AndyAyersMS
Copy link
Member Author

FYI @EgorBo

@jkotas is this hopelessly naive? It seems to work but I suspect most of the SZArrayHelper methods ignore the inst param.

SPMI will not see diffs from this; it needs to see new behavior from the VM side.

@EgorBo
Copy link
Member

EgorBo commented Oct 25, 2024

@MihuBot

@AndyAyersMS
Copy link
Member Author

@EgorBot

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    static byte[] Data = new byte[512];
    [Benchmark]
    public int Test() => TestInner(Data);
    [MethodImpl(MethodImplOptions.NoInlining)]
    int TestInner(ICollection<byte> c) => c.Count;
}

@AndyAyersMS
Copy link
Member Author

EgorBot results:

BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
AMD EPYC 9R14, 1 CPU, 8 logical and 8 physical cores
Job-RSCWGO : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job-MXOKVG : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Method Toolchain Mean Error Ratio
Test Main 0.5454 ns 0.0002 ns 1.00
Test PR 0.5456 ns 0.0001 ns 1.00

When I run locally though:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2033)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-rc.2.24474.11
[Host] : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2
Job-BZVIXB : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-TXYRTJ : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2

Runtime=.NET 9.0

Method Job Toolchain Mean Error StdDev Ratio RatioSD
Test Job-BZVIXB CoreRun 0.5370 ns 0.0405 ns 0.0666 ns 0.21 0.03
Test Job-TXYRTJ net9.0 2.5593 ns 0.0303 ns 0.0269 ns 1.00 0.01

Maybe my baselines are out of date...

@EgorBo
Copy link
Member

EgorBo commented Oct 25, 2024

@EgorBot -intel -arm64 -profiler --envvars DOTNET_JitDisasm:TestInner

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    static byte[] Data = new byte[512];

    [Benchmark]
    public int Test() => TestInner(Data);

    [MethodImpl(MethodImplOptions.NoInlining)]
    int TestInner(ICollection<byte> c) => c.Count;
}

@jkotas
Copy link
Member

jkotas commented Oct 25, 2024

It seems to work but I suspect most of the SZArrayHelper methods ignore the inst param.

Most of them ignore inst param, except GetEnumerator. We take advantage of it to reduce number of methods that need to be instantiated:

// OPTIMIZATION: For any method other than GetEnumerator(), we can safely substitute
// "Object" for reference-type theT's. This causes fewer methods to be instantiated.
if (startingMethod[inheritanceDepth] != METHOD__SZARRAYHELPER__GETENUMERATOR &&
!theT.IsValueType())
{
theT = TypeHandle(g_pObjectClass);
}

I think this approach may work for non-shared code, but it looks too simple to handle shared generic code properly.

@EgorBo
Copy link
Member

EgorBo commented Oct 25, 2024

When I run locally though:

@AndyAyersMS hm.. weird, it feels like it's already expanded in main according to bot...

@AndyAyersMS
Copy link
Member Author

@EgorBot -intel -arm64 -profiler --envvars DOTNET_JitDisasm:TestInner

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    static string[] Data = new string[512];

    [Benchmark]
    public int Test() => TestInner(Data);

    [MethodImpl(MethodImplOptions.NoInlining)]
    int TestInner(ICollection<string> c) => c.Count;
}

@AndyAyersMS
Copy link
Member Author

When I run locally though:

@AndyAyersMS hm.. weird, it feels like it's already expanded in main according to bot...

The issue was with array devirt for ref-type (or shared) array elements... updated the benchmark to use string[], and hackily fixed things to pass the right method desc as dictionary source. If this works out, I'll need to update the jit interface to do it properly.

@AndyAyersMS
Copy link
Member Author

still need to update JIT to expect a method handle instead of class handle in a few places.

@AndyAyersMS
Copy link
Member Author

@EgorBot -intel -arm64 -profiler --envvars DOTNET_JitDisasm:TestInner

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    static string[] Data = new string[512];

    [Benchmark]
    public int Test() => TestInner(Data);

    [MethodImpl(MethodImplOptions.NoInlining)]
    int TestInner(ICollection<string> c) => c.Count;
}

@AndyAyersMS
Copy link
Member Author

@MihuBot

Comment on lines 6389 to 6390
THROWS;
GC_TRIGGERS;
Copy link
Member

@jkotas jkotas Oct 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
THROWS;
GC_TRIGGERS;
NOTHROW;
GC_NOTRIGGER;

LEAF JIT/EE interface methods should be nothrow/notrigger.

@AndyAyersMS

This comment was marked as outdated.

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Oct 27, 2024

@EgorBot -intel -arm64 -profiler --envvars DOTNET_JitDisasm:Foreach Count

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    string[] s_ro_str_array = new string[512];

    [Benchmark]
    public int Foreach()
    {
        IEnumerable<string> e = s_ro_str_array;
        int sum = 0;
        foreach (string s in e) sum += s == null ? 0 : s.Length;
        return sum;
    }

    [Benchmark]
    public int Count() => CountInner(s_ro_str_array);

    [MethodImpl(MethodImplOptions.NoInlining)]
    int CountInner(ICollection<string> c) => c.Count;
}

The Foreach test should now devirtualize and stack allocate the enumerator (but not yet promote, needs #109237).

@AndyAyersMS
Copy link
Member Author

@EgorBot -intel -arm64

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    string[] s_ro_str_array = new string[512];

    [Benchmark]
    public int Foreach()
    {
        IEnumerable<string> e = s_ro_str_array;
        int sum = 0;
        foreach (string s in e) sum += s == null ? 0 : s.Length;
        return sum;
    }

    [Benchmark]
    public int Count() => CountInner(s_ro_str_array);

    [MethodImpl(MethodImplOptions.NoInlining)]
    int CountInner(ICollection<string> c) => c.Count;
}

@EgorBo
Copy link
Member

EgorBo commented Oct 27, 2024

@AndyAyersMS the bot didn't report anything because of --envvars DOTNET_JitDisasm:Foreach Count (namely, because of whitespace)

I guess it has to be wrapped with quotes, but not sure it will work either

@AndyAyersMS
Copy link
Member Author

BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Intel Xeon Platinum 8488C, 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-OEFINY : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-VDHZTA : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method Toolchain Mean Error Ratio
Foreach /core_root_base/corerun 397.2479 ns 2.3665 ns 1.00
Foreach /core_root_diff/corerun 442.7244 ns 4.5621 ns 1.11
Count /core_root_base/corerun 3.8438 ns 0.0332 ns 1.00
Count /core_root_diff/corerun 0.5762 ns 0.0139 ns 0.15

Locally I have intel as faster (at least on windows):

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2033)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2
  Job-GVMQPB : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-DZYGUG : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-LLHZLG : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2

Runtime=.NET 9.0
Method Job Toolchain Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
foreach_static_readonly_ref_array_via_interface Job-GVMQPB base 738.4 ns 4.63 ns 4.34 ns 0.77 0.0048 32 B 1.00
foreach_static_readonly_ref_array_via_interface Job-DZYGUG diff 622.7 ns 3.43 ns 3.04 ns 0.65 - - 0.00
foreach_static_readonly_ref_array_via_interface Job-LLHZLG net9.0 959.1 ns 11.22 ns 10.50 ns 1.00 0.0038 32 B 1.00

Loop code is similar too:

;;; base inner loop (heap allocated enumerator)

G_M47389_IG03:  ;; offset=0x0026
       mov      edx, dword ptr [rcx+0x08]
						;; size=3 bbWeight=0.91 PerfScore 1.82
G_M47389_IG04:  ;; offset=0x0029
       add      ebx, edx
						;; size=2 bbWeight=516.86 PerfScore 129.22
G_M47389_IG05:  ;; offset=0x002B
       mov      ecx, dword ptr [rax+0x08]
       inc      ecx
       mov      edx, ecx
       mov      esi, dword ptr [rax+0x0C]
       cmp      edx, esi
       jae      SHORT G_M47389_IG08
						;; size=14 bbWeight=517.86 PerfScore 2977.70
G_M47389_IG06:  ;; offset=0x0039
       mov      dword ptr [rax+0x08], edx
       cmp      ecx, esi
       jae      SHORT G_M47389_IG10
       mov      rdx, gword ptr [rax+0x10]
       cmp      ecx, dword ptr [rdx+0x08]
       jae      SHORT G_M47389_IG11
       mov      ecx, ecx
       mov      rcx, gword ptr [rdx+8*rcx+0x10]
       test     rcx, rcx
       jne      SHORT G_M47389_IG03

;;; diff inner loop (stack allocated enumerator / not promoted)

G_M47389_IG03:  ;; offset=0x0053
       mov      r8d, dword ptr [rcx+0x08]
						;; size=4 bbWeight=0.76 PerfScore 1.52
G_M47389_IG04:  ;; offset=0x0057
       add      eax, r8d
						;; size=3 bbWeight=511.71 PerfScore 127.93
G_M47389_IG05:  ;; offset=0x005A
       mov      ecx, dword ptr [rdx+0x08]
       inc      ecx
       mov      ebx, dword ptr [rdx+0x0C]
       cmp      ecx, ebx
       jae      SHORT G_M47389_IG08
						;; size=12 bbWeight=512.71 PerfScore 2819.93
G_M47389_IG06:  ;; offset=0x0066
       mov      dword ptr [rdx+0x08], ecx
       mov      ecx, dword ptr [rdx+0x08]
       cmp      ecx, dword ptr [rdx+0x0C]
       jae      SHORT G_M47389_IG10
       mov      r8, gword ptr [rdx+0x10]
       cmp      ecx, dword ptr [r8+0x08]
       jae      SHORT G_M47389_IG11
       mov      rcx, gword ptr [r8+8*rcx+0x10]
       test     rcx, rcx
       jne      SHORT G_M47389_IG03

@AndyAyersMS
Copy link
Member Author

Now that #109237 is merged, the perf of Foreach with this PR should be significantly better.

@AndyAyersMS
Copy link
Member Author

@EgorBot -intel -arm64

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    string[] s_ro_str_array = new string[512];

    [Benchmark]
    public int Foreach()
    {
        IEnumerable<string> e = s_ro_str_array;
        int sum = 0;
        foreach (string s in e) sum += s == null ? 0 : s.Length;
        return sum;
    }

    [Benchmark]
    public int Count() => CountInner(s_ro_str_array);

    [MethodImpl(MethodImplOptions.NoInlining)]
    int CountInner(ICollection<string> c) => c.Count;
}

@AndyAyersMS
Copy link
Member Author

BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Unknown processor
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), Arm64 RyuJIT AdvSIMD
  Job-UXBTJA : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-VJTWZI : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Method Toolchain Mean Error Ratio
Foreach /core_root_base/corerun 1,297.675 ns 0.0915 ns 1.00
Foreach /core_root_diff/corerun 282.593 ns 0.8442 ns 0.22
Count /core_root_base/corerun 1.111 ns 0.0003 ns 1.00
Count /core_root_diff/corerun 1.071 ns 0.0003 ns 0.96
BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Intel Xeon Platinum 8488C, 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-GNMMMF : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-ZDWMNG : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method Toolchain Mean Error Ratio
Foreach /core_root_base/corerun 382.5067 ns 1.2694 ns 1.00
Foreach /core_root_diff/corerun 298.1094 ns 1.5452 ns 0.78
Count /core_root_base/corerun 3.7048 ns 0.0246 ns 1.00
Count /core_root_diff/corerun 0.5671 ns 0.0118 ns 0.15

@AndyAyersMS
Copy link
Member Author

NAOT failure looks like #104340.

Unhandled exception. System.Threading.SynchronizationLockException: Object synchronization method was called from an unsynchronized block of code.
   at System.Threading.Monitor.Exit(Object) + 0x170
   at System.Runtime.CompilerServices.ConditionalWeakTable`2.Container.Finalize() + 0xdc
   at System.Runtime.__Finalizer.DrainQueue() + 0x4c
   at System.Runtime.__Finalizer.ProcessFinalizers() + 0x8c
./RunTests.sh: line 176:    23 Aborted   

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Oct 28, 2024

Updated diff inner loop for Foreach above. Enumerator is gone, now we just need to make this loop shape fit in with our normal loop opts.

       jmp      SHORT G_M47389_IG05
       align    [12 bytes for IG03]
						;; size=34 bbWeight=1.84 PerfScore 8.74
G_M47389_IG03:  ;; offset=0x0030
       mov      r10d, dword ptr [r8+0x08]
						;; size=4 bbWeight=4.58 PerfScore 9.16
G_M47389_IG04:  ;; offset=0x0034
       add      eax, r10d
						;; size=3 bbWeight=944.53 PerfScore 236.13
G_M47389_IG05:  ;; offset=0x0037
       inc      edx
       cmp      edx, 512
       jae      SHORT G_M47389_IG08
						;; size=10 bbWeight=946.37 PerfScore 1419.55
G_M47389_IG06:  ;; offset=0x0041
       mov      edx, edx
       mov      r8, gword ptr [rcx+8*rdx+0x10]
       test     r8, r8
       jne      SHORT G_M47389_IG03

@AndyAyersMS AndyAyersMS marked this pull request as ready for review October 28, 2024 14:44
@AndyAyersMS
Copy link
Member Author

@EgorBo @jkotas PTAL
cc @dotnet/jit-contrib

The JIT now "inlines" the instantiating stub and the instantiated method.

@AndyAyersMS AndyAyersMS requested a review from EgorBo October 28, 2024 14:46
@@ -14238,6 +14238,52 @@ CORINFO_CLASS_HANDLE Compiler::gtGetHelperArgClassHandle(GenTree* tree)
return result;
}

//------------------------------------------------------------------------
// gtGetHelperArgMethodHandle: find the compile time method handle from
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume the impl could be merged with gtGetHelperArgClassHandle to reduce copy-paste, but not sure it's worth it

{
*methodArg = null;
*classArg = null;
return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth leaving a note why it's no-op for NAOT/R2R?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point.

Also this should not be too hard to implement, but I don't know how to do it, and it won't get used. Currently R2R lacks the devirtualization support, and NAOT doesn't handle arrays the same way.

Copy link
Member

@EgorBo EgorBo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we kick off a PGO pipeline?

@AndyAyersMS
Copy link
Member Author

/azp run runtime-coreclr libraries-pgo, runtime-coreclr pgo, runtime-coreclr pgostress

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@AndyAyersMS AndyAyersMS merged commit 87fea60 into dotnet:main Oct 28, 2024
133 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Nov 28, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants