
Poor register allocation with hardware intrinsic (x86) #37216

Closed
ebfortin opened this issue May 31, 2020 · 14 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
ebfortin commented May 31, 2020

Description

I'm porting an algorithm from scalar double arithmetic to SIMD using the hardware intrinsics. After some testing I concluded that the performance of the SIMD version is worse. It may be that I'm just not good at using SIMD instructions, but looking at the asm produced by the JIT, I think there may be a problem.

Configuration

.NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT

Regression?

Data

Look at one example:

                 var v02 = Avx.HorizontalAdd(v00, v01); // r = (xh + yh) | - xl - yl)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 00007ffc`ca6fdca4 c5f9284dd0      vmovapd xmm1,xmmword ptr [rbp-30h]
 00007ffc`ca6fdca9 c5f17c4dc0      vhaddpd xmm1,xmm1,xmmword ptr [rbp-40h]
 00007ffc`ca6fdcae c5f9294db0      vmovapd xmmword ptr [rbp-50h],xmm1
                 var v03 = Avx.Subtract(v00, v02);
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 00007ffc`ca6fdcb3 c5f9284dd0      vmovapd xmm1,xmmword ptr [rbp-30h]
 00007ffc`ca6fdcb8 c5f15c4db0      vsubpd  xmm1,xmm1,xmmword ptr [rbp-50h]
 00007ffc`ca6fdcbd c5f9294da0      vmovapd xmmword ptr [rbp-60h],xmm1

This comes from the disassembly of BenchmarkDotNet.

Also benchmark results:

| Method          | EnvironmentVariables        | Mean       | Error     | StdDev     | Median     | Max        |
|-----------------|-----------------------------|-----------:|----------:|-----------:|-----------:|-----------:|
| AdditionDouble  | COMPlus_EnableHWIntrinsic=0 | 1.112 ns   | 0.0658 ns | 0.1187 ns  | 1.100 ns   | 1.400 ns   |
| AdditionDouble2 | COMPlus_EnableHWIntrinsic=0 | 104.808 ns | 2.8832 ns | 6.0183 ns  | 102.300 ns | 124.500 ns |
| AdditionDouble  | COMPlus_EnableHWIntrinsic=1 | 1.065 ns   | 0.0645 ns | 0.0985 ns  | 1.100 ns   | 1.200 ns   |
| AdditionDouble2 | COMPlus_EnableHWIntrinsic=1 | 196.530 ns | 9.3874 ns | 26.4772 ns | 178.950 ns | 268.100 ns |

Analysis

If you look closely, you see that each instruction seems to be handled in isolation, with its own register allocation, instead of allocation being global to the method. This means a LOT more memory loads/stores than seem necessary. There are plenty of registers to play with besides xmm1...

The documentation on hardware intrinsics states that for some time in the compilation pipeline, intrinsics are seen as methods. Maybe they are seen as methods for a bit too long, so each "method" gets register allocation only in its own local "method" context.

category:cq
theme:register-allocator
skill-level:expert
cost:medium

@ebfortin ebfortin added the tenet-performance Performance related issue label May 31, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels May 31, 2020
@AndyAyersMS
Member

@ebfortin the disassembly above looks like it is from un-optimized code. Usually BenchmarkDotNet is pretty good at ensuring that it benchmarks only optimized code, but in this case I wonder.

Can you share your test case, or some representative sample?

@AndyAyersMS
Member

Going to mark this as Future; may reconsider once we have more information.

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jun 1, 2020
@AndyAyersMS AndyAyersMS added this to the Future milestone Jun 1, 2020
@ebfortin
Author

ebfortin commented Jun 1, 2020

I will provide some more details later tonight: the method code being evaluated, the benchmark configuration used, and other details.

@ebfortin
Author

ebfortin commented Jun 1, 2020

Code from BenchmarkDotNet defining the benchmarks.

        [IterationSetup]
        public void Setup()
        {
            _a = _rnd.NextDouble();
            _b = _rnd.NextDouble();

            _a2 = new Double2(_a);
            _b2 = new Double2(_b);
        }

        [Benchmark]
        public double AdditionDouble()
        {
            var c = _a + _b;
            return c;
        }

        [Benchmark]
        public Double2 AdditionDouble2()
        {
            var c = _a2 + _b2;
            return c;
        }

Configuration for the benchmark.

    public class BenchmarkConfig : ManualConfig
    {
        private const string JitNoInline = "COMPlus_JitNoInline";
        private const string JitHardwareIntrinsics = "COMPlus_EnableHWIntrinsic";

        public BenchmarkConfig()
        {
            // Configure your benchmarks, see for more details: https://benchmarkdotnet.org/articles/configs/configs.html.
            //Add(Job.Dry);
            Add(ConsoleLogger.Default);
            Add(TargetMethodColumn.Method, StatisticColumn.Max);
            //Add(RPlotExporter.Default, CsvExporter.Default);
            Add(EnvironmentAnalyser.Default);

            
            Add(
                Job.Default.With(CoreRuntime.Core50)
                .WithId("CoreRT")
                );

            Add(
                Job.RyuJitX64
                .With(new EnvironmentVariable(JitHardwareIntrinsics, "1"))
                .WithId("RyunJITX64 : Intrinsics ENABLED")
                );

            Add(
                Job.RyuJitX64
                .With(new EnvironmentVariable(JitHardwareIntrinsics, "0"))
                .WithId("RyunJITX64 : Intrinsics DISABLED")
                );

        }
    }

Code of operator+ called from the benchmark.

        private static (double, double) Add22(double xh, double xl, double yh, double yl)
        {
            double zh;
            double zl;

            if (Avx2.IsSupported)
            {
                var v00 = Vector128.Create(xh, yh);
                var v01 = Vector128.Create(-xl, -yl);
                var v02 = Avx.HorizontalAdd(v00, v01); // r = (xh + yh) | (-xl - yl)
                var v03 = Avx.Subtract(v00, v02);
                var v05 = Avx.HorizontalAdd(v03, Vector128<double>.Zero); // s = xh - r + yh + xl + yl | 0
                var v06 = Avx.Add(v02, v05); // r + s | -xl - yl
                var v08 = Avx.Add(Avx.Subtract(v02, v06), v05);

                zh = v06.GetElement(0);
                zl = v08.GetElement(0);

                return (zh, zl);
            }

            double r, s;

            r = xh + yh;
            s = xh - r + yh + yl + xl;
            zh = r + s;
            zl = r - zh + s;

            return (zh, zl);
        }

        public static Double2 operator +(Double2 a, Double2 b)
        {
            var r = Add22(a.h, a.l, b.h, b.l);
            return new Double2(r);
        }

From the BenchmarkDotNet run, showing that no debugger is attached:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.19635
Intel Core i5-7300HQ CPU 2.50GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20216.6
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT
  Job-IDRGDI : .NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT
  Job-YDAAYA : .NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT
  Job-TWHNGR : .NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT

@EgorBo
Member

EgorBo commented Jun 1, 2020

Add22 looks pretty good to me:

G_M12720_IG01:
       vzeroupper 
						;; bbWeight=1    PerfScore 1.00

G_M12720_IG02:
       vmovsd   xmm0, qword ptr [reloc @RWD00]
       vxorps   xmm0, xmm2
       vmovsd   xmm4, qword ptr [rsp+28H]
       vmovsd   xmm2, qword ptr [reloc @RWD00]
       vxorps   xmm2, xmm4
       vmovlhps xmm0, xmm0, xmm2
       vmovlhps xmm1, xmm1, xmm3
       vhaddpd  xmm0, xmm1, xmm0
       vsubpd   xmm1, xmm1, xmm0
       vxorps   xmm2, xmm2, xmm2
       vhaddpd  xmm1, xmm1, xmm2
       vaddpd   xmm2, xmm0, xmm1
       vmovaps  xmm3, xmm2  ;; <-- ?
       vsubpd   xmm0, xmm0, xmm2
       vaddpd   xmm0, xmm0, xmm1
       vmovsd   qword ptr [rcx], xmm3
       vmovsd   qword ptr [rcx+8], xmm0
       mov      rax, rcx
G_M12720_IG03:
       ret      
RWD00  dq	8000000000000000h
; Total bytes of code: 86

The only thing it seems is this repro:

static Vector128<double> Add22(double a, double b)
{
    return Vector128.Create(-a, -b);
}

Codegen:

G_M46873_IG01:
       vzeroupper 
G_M46873_IG02:
       vmovsd   xmm0, qword ptr [reloc @RWD00]
       vxorps   xmm0, xmm1
       vmovsd   xmm1, qword ptr [reloc @RWD00]
       vxorps   xmm1, xmm2
       vmovlhps xmm0, xmm0, xmm1
       vmovupd  xmmword ptr [rcx], xmm0
       mov      rax, rcx
G_M46873_IG03:
       ret      
RWD00  dq	8000000000000000h
; Total bytes of code: 39

it loads the sign-mask constant twice to perform the negations (not to mention it could do the negation just once, on the combined vector)
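For illustration, here is a sketch of what a single-constant negation could look like in user code (this is a workaround suggestion, not what the JIT currently emits; `Sse2.Xor` and the broadcast `-0.0` sign mask are real APIs, the wrapper name is made up):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class NegateSketch
{
    // Build the vector first, then flip both sign bits with one XOR
    // against a broadcast -0.0 (0x8000000000000000) mask:
    // one constant load instead of two.
    static Vector128<double> NegatePair(double a, double b)
    {
        var v = Vector128.Create(a, b);
        var signMask = Vector128.Create(-0.0);
        return Sse2.Xor(v, signMask);
    }
}
```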

cc @tannergooding

@EgorBo
Member

EgorBo commented Jun 1, 2020

I wonder if we can expand GT_NEG to 0 - x so VN/CSE can pick the 0 up (and then optimize back to GT_NEG in lowering if necessary)

@tannergooding
Member

The loading of 0 twice is basically the same issue as #37079.

Horizontal adds aren't necessarily the most performant instructions, and it might be better to rearrange the data so you can use normal addition/subtraction instead. It looks like it might be possible at first glance, but I didn't confirm...

Constructing and deconstructing the vector also has some minimal overhead that wouldn't be great in a loop.
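As a hedged illustration of "rearrange the data to use normal addition": `haddpd(x, y)` produces `(x0 + x1, y0 + y1)`, and the same result can be obtained with unpack plus a vertical add, which is often cheaper than the horizontal-add instruction on many microarchitectures (sketch only, not benchmarked against the code above):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class HaddSketch
{
    // Equivalent of Avx.HorizontalAdd(x, y) for doubles:
    // returns (x0 + x1, y0 + y1) using only vertical operations.
    static Vector128<double> HorizontalAddViaUnpack(Vector128<double> x, Vector128<double> y)
    {
        var lo = Sse2.UnpackLow(x, y);   // (x0, y0)
        var hi = Sse2.UnpackHigh(x, y);  // (x1, y1)
        return Sse2.Add(lo, hi);         // (x0 + x1, y0 + y1)
    }
}
```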

@ebfortin
Author

ebfortin commented Jun 1, 2020

> Add22 looks pretty good to me: […] it loads 0 twice to perform negations […] (disassembly quoted from the previous comment)

And where is this disassembly coming from? Why would un-optimized code be used by BenchmarkDotNet? The code produced on my side has basically one load before and one store after every SIMD instruction, and the lower performance comes from there, not from the choice of SIMD instructions. Also, what are the conditions for the JIT to emit optimized code? I see a problem if it's optimized on some machines and not on others; it's too variable.

@ebfortin
Author

ebfortin commented Jun 2, 2020

I tested it with COMPlus_TC_QuickJit set to 0. A HUGE difference. So indeed the code was not optimized, as demonstrated by the output from the BenchmarkDotNet disassembler below:

                var v02 = Avx.HorizontalAdd(v00, v01); // r = (xh + yh) | - xl - yl)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffa`89ca7d99 c5f97cc9        vhaddpd xmm1,xmm0,xmm1
                var v03 = Avx.Subtract(v00, v02);
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffa`89ca7d9d c5f95cc1        vsubpd  xmm0,xmm0,xmm1
                var v05 = Avx.HorizontalAdd(v03, Vector128.Zero); // s = xh - r + yh + xl + yl | 0
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffa`89ca7da1 c5e857d2        vxorps  xmm2,xmm2,xmm2
00007ffa`89ca7da5 c5f97cc2        vhaddpd xmm0,xmm0,xmm2

And if I look at the benchmark, I see that my SIMD code is bad and needs improvement, along the lines of what @tannergooding was suggesting. Now that the un-optimized-code variable is out of the way.

However, it raises the question: why wouldn't tiered compilation kick in for this workload? Is the method being evaluated too small? Is the run time too low?

And a side question: what is AggressiveOptimization doing? Does it affect only C# compilation, or does it have any effect on the JIT? It would be great to be able to force JIT optimization on a per-method basis. That would make it possible to ship libraries whose critical methods are sure to get optimized right away, while letting the JIT decide for the others.

@AndyAyersMS
Member

@adamsitnik any idea why BenchmarkDotNet can't get to Tier1 code in the test case above?

> What is AggressiveOptimization doing?

AggressiveOptimization will indeed bypass Tier0 and generate Tier1 code the first time a method is jitted. But it shouldn't be needed in most cases.
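For completeness, the per-method opt-in mentioned above looks like this (the attribute and enum value are real; the method is just a placeholder):

```csharp
using System.Runtime.CompilerServices;

public static class HotPath
{
    // Bypasses Tier0: the JIT generates fully optimized (Tier1) code
    // the first time this method is compiled.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static double Add(double a, double b) => a + b;
}
```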

@adamsitnik
Member

> any idea why BenchmarkDotNet can't get to Tier1 code in the test case above?

Most probably it's a side effect of using [IterationSetup], which makes BDN invoke the benchmark only once per iteration, so it's not executed more than ~30 times during the warmup stage. Using [IterationSetup] is not recommended: https://github.com/dotnet/performance/blob/master/docs/microbenchmark-design-guidelines.md#IterationSetup

@ebfortin could you please try switching to [GlobalSetup] instead? https://github.com/dotnet/performance/blob/master/docs/microbenchmark-design-guidelines.md#globalsetup If you really need a new pair of doubles for every invocation, it would be best to generate a big array of doubles in the setup and just iterate over it in the benchmark.
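A sketch of the suggested pattern (names and sizes are illustrative): generate the inputs once in [GlobalSetup] and iterate over the pre-generated arrays inside the benchmark, so there is no per-invocation setup cost and BDN can run enough invocations to reach Tier1:

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class AdditionBenchmarks
{
    private const int N = 10_000;
    private double[] _xs, _ys;

    [GlobalSetup]
    public void Setup()
    {
        // Fixed seed so every run measures the same data.
        var rnd = new Random(42);
        _xs = new double[N];
        _ys = new double[N];
        for (int i = 0; i < N; i++)
        {
            _xs[i] = rnd.NextDouble();
            _ys[i] = rnd.NextDouble();
        }
    }

    [Benchmark]
    public double SumAll()
    {
        // The per-element cost is (total time / N); no setup inside the loop.
        double acc = 0;
        for (int i = 0; i < N; i++)
            acc += _xs[i] + _ys[i];
        return acc;
    }
}
```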

@ebfortin
Author

ebfortin commented Jun 2, 2020

I just tried with [GlobalSetup] instead. I get the same un-optimized run.

                var v02 = Avx.HorizontalAdd(v00, v01); // r = (xh + yh) | - xl - yl)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffa`8c3bdf94 c5f9284dd0      vmovapd xmm1,xmmword ptr [rbp-30h]
00007ffa`8c3bdf99 c5f17c4dc0      vhaddpd xmm1,xmm1,xmmword ptr [rbp-40h]
00007ffa`8c3bdf9e c5f9294db0      vmovapd xmmword ptr [rbp-50h],xmm1
                var v03 = Avx.Subtract(v00, v02);
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffa`8c3bdfa3 c5f9284dd0      vmovapd xmm1,xmmword ptr [rbp-30h]
00007ffa`8c3bdfa8 c5f15c4db0      vsubpd  xmm1,xmm1,xmmword ptr [rbp-50h]
00007ffa`8c3bdfad c5f9294da0      vmovapd xmmword ptr [rbp-60h],xmm1

If I use AggressiveOptimization on Add22, it optimizes that method but not the rest, so overall performance is an order of magnitude worse than with full optimization.

So no problem with the register allocator. The optimizer is clearly working. All good. But under what circumstances it gets invoked is not quite clear; I was expecting it to be used on this benchmark.

@ebfortin
Author

ebfortin commented Jun 7, 2020

I'm still struggling to understand when the runtime optimizes and when it doesn't. The only time I get clearly optimized code is with COMPlus_TC_QuickJit=0. If I use the MSBuild property or runtimeconfig.json instead, it doesn't do anything; I still get the awfully bad performance.
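For reference, the runtimeconfig.json setting I mean should look like this (assuming the documented System.Runtime.TieredCompilation.QuickJit knob, the config-file equivalent of COMPlus_TC_QuickJit=0):

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Runtime.TieredCompilation.QuickJit": false
    }
  }
}
```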

If this is something I should check with the BenchmarkDotNet team, tell me and I will close the issue here. But I have the feeling something not by design is going on, and the results seem to point to it.

@CarolEidt CarolEidt modified the milestones: Future, 6.0.0 Nov 10, 2020
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@kunalspathak kunalspathak removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Apr 9, 2021
@kunalspathak
Member

@ebfortin - I tried BenchmarkDotNet with the code you shared, and I see optimized code getting generated.

C# benchmark
using BenchmarkDotNet.Attributes;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    public class _37216
    {
        private double _a, _b;
        private Double2 _a2, _b2;
        private static Random _rnd = new Random();

        [GlobalSetup]
        public void Setup()
        {
            _a = _rnd.NextDouble();
            _b = _rnd.NextDouble();

            _a2 = new Double2(_a);
            _b2 = new Double2(_b);
        }

        [Benchmark]
        public double AdditionDouble()
        {
            var c = _a + _b;
            return c;
        }

        [Benchmark]
        public Double2 AdditionDouble2()
        {
            var c = _a2 + _b2;
            return c;
        }

        public class Double2
        {
            private double h;
            private double l;

            public Double2(double _x)
            {
                h = _x;
                l = _x;
            }

            private Double2((double, double) _x)
            {
                h = _x.Item1;
                l = _x.Item2;
            }

            private static (double, double) Add22(double xh, double xl, double yh, double yl)
            {
                double zh;
                double zl;

                if (Avx2.IsSupported)
                {
                    var v00 = Vector128.Create(xh, yh);
                    var v01 = Vector128.Create(-xl, -yl);
                    var v02 = Avx.HorizontalAdd(v00, v01); // r = (xh + yh) | - xl - yl)
                    var v03 = Avx.Subtract(v00, v02);
                    var v05 = Avx.HorizontalAdd(v03, Vector128<double>.Zero); // s = xh - r + yh + xl + yl | 0
                    var v06 = Avx.Add(v02, v05); // r + s | - xl - yl
                    var v08 = Avx.Add(Avx.Subtract(v02, v06), v05);

                    zh = v06.GetElement(0);
                    zl = v08.GetElement(0);

                    return (zh, zl);
                }

                double r, s;

                r = xh + yh;
                s = xh - r + yh + yl + xl;
                zh = r + s;
                zl = r - zh + s;

                return (zh, zl);
            }

            public static Double2 operator +(Double2 a, Double2 b)
            {
                var r = Add22(a.h, a.l, b.h, b.l);
                return new Double2(r);
            }
        }
    }
}

Ran with:

dotnet Test.dll  -filter *AdditionDouble2* --corerun %CORE_RUN%
Assembly code
OverheadJitting  1: 1 op, 1962000.00 ns, 1.9620 ms/op
; Assembly listing for method Double2:Add22(double,double,double,double):System.ValueTuple`2[Double,Double]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 RetBuf       [V00,T00] (  5,  5   )   byref  ->  rcx
;  V01 arg0         [V01,T02] (  3,  3   )  double  ->  mm1
;  V02 arg1         [V02,T03] (  3,  3   )  double  ->  mm2
;  V03 arg2         [V03,T04] (  3,  3   )  double  ->  mm3
;  V04 arg3         [V04,T13] (  1,  1   )  double  ->  [rsp+28H]
;  V05 loc0         [V05,T08] (  2,  2   )  double  ->  mm3
;  V06 loc1         [V06,T09] (  2,  2   )  double  ->  mm0
;* V07 loc2         [V07    ] (  0,  0   )  double  ->  zero-ref
;* V08 loc3         [V08    ] (  0,  0   )  double  ->  zero-ref
;  V09 loc4         [V09,T10] (  2,  2   )  simd16  ->  mm0
;  V10 loc5         [V10,T05] (  4,  4   )  simd16  ->  mm0
;  V11 loc6         [V11,T06] (  3,  3   )  simd16  ->  mm1
;  V12 loc7         [V12,T07] (  3,  3   )  simd16  ->  mm2
;# V13 OutArgs      [V13    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V14 tmp1         [V14,T01] (  3,  6   )  simd16  ->  mm1         "dup spill"
;* V15 tmp2         [V15    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;  V16 tmp3         [V16,T11] (  2,  2   )  double  ->  mm3         V15.Item1(offs=0x00) P-INDEP "field V15.Item1 (fldOffset=0x0)"
;  V17 tmp4         [V17,T12] (  2,  2   )  double  ->  mm0         V15.Item2(offs=0x08) P-INDEP "field V15.Item2 (fldOffset=0x8)"
;
; Lcl frame size = 0

G_M42769_IG01:              ;; offset=0000H
       C5F877               vzeroupper
                                                ;; bbWeight=1    PerfScore 1.00
G_M42769_IG02:              ;; offset=0003H
       C5E8570555000000     vxorps   xmm0, xmm2, qword ptr [reloc @RWD00]
       C5FB10642428         vmovsd   xmm4, qword ptr [rsp+28H]
       C5D8571547000000     vxorps   xmm2, xmm4, qword ptr [reloc @RWD00]
       C5F816C2             vmovlhps xmm0, xmm0, xmm2
       C5F016CB             vmovlhps xmm1, xmm1, xmm3
       C5F17CC0             vhaddpd  xmm0, xmm1, xmm0
       C5F15CC8             vsubpd   xmm1, xmm1, xmm0
       C5E857D2             vxorps   xmm2, xmm2, xmm2
       C5F17CCA             vhaddpd  xmm1, xmm1, xmm2
       C5F958D1             vaddpd   xmm2, xmm0, xmm1
       C5F828DA             vmovaps  xmm3, xmm2
       C5F95CC2             vsubpd   xmm0, xmm0, xmm2
       C5F958C1             vaddpd   xmm0, xmm0, xmm1
       C5FB1119             vmovsd   qword ptr [rcx], xmm3
       C5FB114108           vmovsd   qword ptr [rcx+8], xmm0
       488BC1               mov      rax, rcx
                                                ;; bbWeight=1    PerfScore 31.83
G_M42769_IG03:              ;; offset=004DH
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.00
RWD00   dq      8000000000000000h       ;           -0
        dq      8000000000000000h       ;           -0


; Total bytes of code 78, prolog size 3, PerfScore 43.13, instruction count 18, allocated bytes for code 93 (MethodHash=ec3258ee) for method Double2:Add22(double,double,double,double):System.ValueTuple`2[Double,Double]
; ============================================================

WorkloadJitting  1: 1 op, 11126300.00 ns, 11.1263 ms/op

@ghost ghost locked as resolved and limited conversation to collaborators May 9, 2021