Poor register allocation with hardware intrinsic (x86) #37216
Comments
@ebfortin the disassembly above looks like it is from un-optimized code. Usually BenchmarkDotNet is pretty good at ensuring that it benchmarks only optimized code, but in this case I wonder. Can you share your test case, or some representative sample?
Going to mark this as future, may reconsider once we have more information.
I will provide some more details later tonight. I'll include the method code being evaluated, the benchmark configuration used, and other details.
Code from BenchmarkDotNet to define the benchmarks.
Configuration for the benchmark.
Code of operator+() called from the benchmark.
Output from the BenchmarkDotNet run showing that no debugger is attached.
G_M12720_IG01:
vzeroupper
;; bbWeight=1 PerfScore 1.00
G_M12720_IG02:
vmovsd xmm0, qword ptr [reloc @RWD00]
vxorps xmm0, xmm2
vmovsd xmm4, qword ptr [rsp+28H]
vmovsd xmm2, qword ptr [reloc @RWD00]
vxorps xmm2, xmm4
vmovlhps xmm0, xmm0, xmm2
vmovlhps xmm1, xmm1, xmm3
vhaddpd xmm0, xmm1, xmm0
vsubpd xmm1, xmm1, xmm0
vxorps xmm2, xmm2, xmm2
vhaddpd xmm1, xmm1, xmm2
vaddpd xmm2, xmm0, xmm1
vmovaps xmm3, xmm2 ;; <-- ?
vsubpd xmm0, xmm0, xmm2
vaddpd xmm0, xmm0, xmm1
vmovsd qword ptr [rcx], xmm3
vmovsd qword ptr [rcx+8], xmm0
mov rax, rcx
G_M12720_IG03:
ret
RWD00 dq 8000000000000000h
; Total bytes of code: 86

The only thing that seems off is this repro:

static Vector128<double> Add22(double a, double b)
{
return Vector128.Create(-a, -b);
}

Codegen:

G_M46873_IG01:
vzeroupper
G_M46873_IG02:
vmovsd xmm0, qword ptr [reloc @RWD00]
vxorps xmm0, xmm1
vmovsd xmm1, qword ptr [reloc @RWD00]
vxorps xmm1, xmm2
vmovlhps xmm0, xmm0, xmm1
vmovupd xmmword ptr [rcx], xmm0
mov rax, rcx
G_M46873_IG03:
ret
RWD00 dq 8000000000000000h
; Total bytes of code: 39

It loads the -0.0 constant from RWD00 twice.
I wonder if we can expand GT_NEG to ...
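As a source-level sketch (not what the JIT currently emits, and not the GT_NEG expansion discussed above): packing the two doubles first and flipping both sign bits with a single vector XOR against a -0.0 mask avoids the two separate constant loads. Names are illustrative; assumes SSE2 is available.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<double> NegateBoth(double a, double b)
{
    // Pack first, then XOR with the sign-bit mask (0x8000000000000000 in both lanes),
    // so the -0.0 constant is loaded once instead of once per scalar negation.
    Vector128<double> v = Vector128.Create(a, b);
    Vector128<double> signMask = Vector128.Create(-0.0);
    return Sse2.Xor(v, signMask);
}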
The loads and Horizontal Adds aren't necessarily the most performant instructions, and it might be better to rearrange the data so you can use normal addition/subtraction instead. It looks like it might be possible at first glance, but I didn't confirm... Constructing and deconstructing the vector also has some minimal overhead that wouldn't be great in a loop.
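For context on why rearranging helps (my illustration, not from the thread): HorizontalAdd sums within each source vector, while Add sums matching lanes, so laying the data out so that matching lanes hold the operands lets cheaper vertical add/subtract replace the horizontal ones. A minimal sketch, assuming AVX (and therefore SSE3) support; the values are illustrative:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

Vector128<double> a = Vector128.Create(1.0, 2.0);
Vector128<double> b = Vector128.Create(10.0, 20.0);

// HorizontalAdd(a, b) = (a0 + a1, b0 + b1): sums within each source vector.
Vector128<double> h = Avx.HorizontalAdd(a, b); // (3.0, 30.0)

// Add(a, b) = (a0 + b0, a1 + b1): sums matching lanes, typically cheaper.
Vector128<double> v = Avx.Add(a, b);           // (11.0, 22.0)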
And where is this disassembly coming from? Why would un-optimized code be used by BenchmarkDotNet? Because the code that is produced on my side has basically one load before, and one store after, every SIMD instruction. And the lower performance comes from there, not from the choice of SIMD instruction. Also, what are the conditions in the JIT to emit optimized code? Because I see a problem if on some computers it's optimized and on others it's not. It's too variable.
I tested it with COMPlus_TC_QuickJit set to 0. A HUGE difference. So indeed the code was not optimized, which is demonstrated with the output from the BenchmarkDotNet disassembler below:
And if I look at the benchmark, I see that my SIMD code is bad and needs improvement, like what @tannergooding was suggesting, now that the un-optimized variable is out of the way. However it raises the question: why doesn't tiered compilation kick in for this workload? Is the method being evaluated too small? Is the run time too low? And a side question: what is AggressiveOptimization doing? Does it affect only C# compilation? Any effect on the JIT? It would be great to have the ability to force optimization in the JIT on a per-method basis. That would make it possible to ship libraries with critical methods that are sure to get optimized right away, while letting the JIT decide for the others.
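For reference, MethodImplOptions.AggressiveOptimization is a flag consumed by the runtime JIT rather than the C# compiler; applying it opts a single method out of the quick (Tier0) JIT so it is compiled with full optimization on first use. A minimal usage sketch (class and method names are made up):

using System.Runtime.CompilerServices;

static class FastMath
{
    // The attribute asks the runtime to skip the quick (Tier0) JIT for this
    // method only; other methods still go through tiered compilation.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static double AddFast(double a, double b) => a + b;
}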
@adamsitnik any idea why BenchmarkDotNet can't get to Tier1 code in the test case above?
Most probably it's a side effect of the setup attribute being used. @ebfortin could you please try to switch to using [GlobalSetup] instead?
I just tried with [GlobalSetup] instead. I get the same un-optimized run.
If I use AggressiveOptimization on Add22 then it optimizes that method but not the rest, so performance is one order of magnitude worse overall than with full optimization. So no problems with the register allocator. The optimizer is working, clearly. All good. But in what circumstances it gets invoked is not quite clear. I was expecting it to be used on this benchmark.
I'm still struggling to understand when the runtime optimizes and when it doesn't. The only way I get clearly optimized code is by using COMPlus_TC_QuickJit=0. If I use the MSBuild property or runtimeconfig.json it doesn't do anything; I still get the awfully bad performance. If this is something I should check with the BenchmarkDotNet team, tell me and I will close this here. But I have the feeling something not by design is going on, and the results seem to be pointing to it.
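For completeness, the project-level knobs being referred to are, as far as I know, the TieredCompilationQuickJit MSBuild property and the System.Runtime.TieredCompilation.QuickJit runtimeconfig.json setting; a sketch of the csproj form (the value shown is illustrative):

<PropertyGroup>
  <!-- Disable the quick (Tier0) JIT so methods are compiled with optimizations on first JIT. -->
  <TieredCompilationQuickJit>false</TieredCompilationQuickJit>
</PropertyGroup>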
@ebfortin - I tried BenchmarkDotNet with the code you have shared and I see optimized code getting generated.

C# benchmark:

using BenchmarkDotNet.Attributes;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Text;
using System.Threading.Tasks;
namespace Test
{
public class _37216
{
private double _a, _b;
private Double2 _a2, _b2;
private static Random _rnd = new Random();
[GlobalSetup]
public void Setup()
{
_a = _rnd.NextDouble();
_b = _rnd.NextDouble();
_a2 = new Double2(_a);
_b2 = new Double2(_b);
}
[Benchmark]
public double AdditionDouble()
{
var c = _a + _b;
return c;
}
[Benchmark]
public Double2 AdditionDouble2()
{
var c = _a2 + _b2;
return c;
}
public class Double2
{
private double h;
private double l;
public Double2(double _x)
{
h = _x;
l = _x;
}
private Double2((double, double) _x)
{
h = _x.Item1;
l = _x.Item2;
}
private static (double, double) Add22(double xh, double xl, double yh, double yl)
{
double zh;
double zl;
if (Avx2.IsSupported)
{
var v00 = Vector128.Create(xh, yh);
var v01 = Vector128.Create(-xl, -yl);
var v02 = Avx.HorizontalAdd(v00, v01); // r = (xh + yh) | (-xl - yl)
var v03 = Avx.Subtract(v00, v02);
var v05 = Avx.HorizontalAdd(v03, Vector128<double>.Zero); // s = xh - r + yh + xl + yl | 0
var v06 = Avx.Add(v02, v05); // r + s | - xl - yl
var v08 = Avx.Add(Avx.Subtract(v02, v06), v05);
zh = v06.GetElement(0);
zl = v08.GetElement(0);
return (zh, zl);
}
double r, s;
r = xh + yh;
s = xh - r + yh + yl + xl;
zh = r + s;
zl = r - zh + s;
return (zh, zl);
}
public static Double2 operator +(Double2 a, Double2 b)
{
var r = Add22(a.h, a.l, b.h, b.l);
return new Double2(r);
}
}
}
}

Ran with: dotnet Test.dll -filter *AdditionDouble2* --corerun %CORE_RUN%

Assembly code:

OverheadJitting  1: 1 op, 1962000.00 ns, 1.9620 ms/op
; Assembly listing for method Double2:Add22(double,double,double,double):System.ValueTuple`2[Double,Double]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 RetBuf [V00,T00] ( 5, 5 ) byref -> rcx
; V01 arg0 [V01,T02] ( 3, 3 ) double -> mm1
; V02 arg1 [V02,T03] ( 3, 3 ) double -> mm2
; V03 arg2 [V03,T04] ( 3, 3 ) double -> mm3
; V04 arg3 [V04,T13] ( 1, 1 ) double -> [rsp+28H]
; V05 loc0 [V05,T08] ( 2, 2 ) double -> mm3
; V06 loc1 [V06,T09] ( 2, 2 ) double -> mm0
;* V07 loc2 [V07 ] ( 0, 0 ) double -> zero-ref
;* V08 loc3 [V08 ] ( 0, 0 ) double -> zero-ref
; V09 loc4 [V09,T10] ( 2, 2 ) simd16 -> mm0
; V10 loc5 [V10,T05] ( 4, 4 ) simd16 -> mm0
; V11 loc6 [V11,T06] ( 3, 3 ) simd16 -> mm1
; V12 loc7 [V12,T07] ( 3, 3 ) simd16 -> mm2
;# V13 OutArgs [V13 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V14 tmp1 [V14,T01] ( 3, 6 ) simd16 -> mm1 "dup spill"
;* V15 tmp2 [V15 ] ( 0, 0 ) struct (16) zero-ref "NewObj constructor temp"
; V16 tmp3 [V16,T11] ( 2, 2 ) double -> mm3 V15.Item1(offs=0x00) P-INDEP "field V15.Item1 (fldOffset=0x0)"
; V17 tmp4 [V17,T12] ( 2, 2 ) double -> mm0 V15.Item2(offs=0x08) P-INDEP "field V15.Item2 (fldOffset=0x8)"
;
; Lcl frame size = 0
G_M42769_IG01: ;; offset=0000H
C5F877 vzeroupper
;; bbWeight=1 PerfScore 1.00
G_M42769_IG02: ;; offset=0003H
C5E8570555000000 vxorps xmm0, xmm2, qword ptr [reloc @RWD00]
C5FB10642428 vmovsd xmm4, qword ptr [rsp+28H]
C5D8571547000000 vxorps xmm2, xmm4, qword ptr [reloc @RWD00]
C5F816C2 vmovlhps xmm0, xmm0, xmm2
C5F016CB vmovlhps xmm1, xmm1, xmm3
C5F17CC0 vhaddpd xmm0, xmm1, xmm0
C5F15CC8 vsubpd xmm1, xmm1, xmm0
C5E857D2 vxorps xmm2, xmm2, xmm2
C5F17CCA vhaddpd xmm1, xmm1, xmm2
C5F958D1 vaddpd xmm2, xmm0, xmm1
C5F828DA vmovaps xmm3, xmm2
C5F95CC2 vsubpd xmm0, xmm0, xmm2
C5F958C1 vaddpd xmm0, xmm0, xmm1
C5FB1119 vmovsd qword ptr [rcx], xmm3
C5FB114108 vmovsd qword ptr [rcx+8], xmm0
488BC1 mov rax, rcx
;; bbWeight=1 PerfScore 31.83
G_M42769_IG03: ;; offset=004DH
C3 ret
;; bbWeight=1 PerfScore 1.00
RWD00 dq 8000000000000000h ; -0
dq 8000000000000000h ; -0
; Total bytes of code 78, prolog size 3, PerfScore 43.13, instruction count 18, allocated bytes for code 93 (MethodHash=ec3258ee) for method Double2:Add22(double,double,double,double):System.ValueTuple`2[Double,Double]
; ============================================================
WorkloadJitting 1: 1 op, 11126300.00 ns, 11.1263 ms/op
Description
I'm porting an algorithm from scalar double arithmetic to SIMD using the hardware intrinsics. After some testing I concluded that the performance of the SIMD version is worse. Now it could be that I'm just not good at using SIMD instructions. However, looking at the asm produced by the JIT, I think there may be a problem.
Configuration
.NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT
Regression?
Data
Look at one example:
This comes from the disassembly output of BenchmarkDotNet.
Also benchmark results:
Analysis
If you look closely you see that each instruction seems to be taken in isolation, with its own register allocation, instead of being handled globally across the method. This means a LOT more memory loads/stores than seem necessary. There are a lot of registers to play with besides xmm1...
The documentation on Hardware Intrinsics states that for some time in the compilation tree intrinsics are seen as methods. Maybe they are seen as methods for a bit too long, and so each "method" gets some register allocation, but only in its own local "method" context.
category:cq
theme:register-allocator
skill-level:expert
cost:medium