-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FMA codegen #11260
Comments
CC. @CarolEidt, it looks like there is some additional register shuffling going on. |
it is unnecessarily shuffling |
The first unnecessarily shuffling (ymm0 to ymm2) seems from emitter void emitter::emitIns_SIMD_R_R_R_AR(
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, regNumber op2Reg, regNumber base)
{
assert(IsFMAInstruction(ins));
assert(UseVEXEncoding());
if (op1Reg != targetReg)
{
// Ensure we aren't overwriting op2
assert(op2Reg != targetReg);
emitIns_R_R(INS_movaps, attr, targetReg, op1Reg);
}
emitIns_R_R_AR(ins, attr, targetReg, op2Reg, base, 0);
} and the second one may be from delayfree? |
The current FMA instruction selection/RA is not optimal, but we need more scenarios to improve it. |
@fiigii, I don't think the issue is with the emitter, but rather the register allocator deciding that the source and destination cannot be the same here. Maybe I missed something in LSRA or in the containment checks for one of the FMA cases? |
At least as of right now, we try to ensure at least one of the operands is contained, preferencing 3, then 2, then 1: https://github.com/dotnet/coreclr/blob/master/src/jit/lowerxarch.cpp#L3006
We then try to ensure that the correct delay free uses are setup in LSRA: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L2516 And finally just emit the appropriate encoding (based on which are is contained) in codegen: https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2410 |
Perhaps a possible fix here is telling the register allocator which operand ends up being the RMW operand.... @CarolEidt, do we have a good way to do that (is just setting |
This is really more of a register allocation (and |
Another example I recently encountered: unsafe static void MulKernel2x8(double* C,double* A,double* B,uint A_cols)
{
Vector256<double> sum00, sum01, sum10, sum11;
sum00 = sum01 = sum10 = sum11 = default;
for (uint k = 0; k < A_cols; k += 1) {
Vector256<double> va, vb0, vb1;
vb0 = Avx.LoadVector256(B);
vb1 = Avx.LoadVector256(B + 4);
va = Avx.BroadcastScalarToVector256(A);
sum00 = Fma.MultiplyAdd(va,vb0,sum00);
sum01 = Fma.MultiplyAdd(va,vb1,sum01);
va = Avx.BroadcastScalarToVector256(A + 1);
sum10 = Fma.MultiplyAdd(va,vb0,sum10);
sum11 = Fma.MultiplyAdd(va,vb1,sum11);
A += 2;
B += 8;
}
Avx.Store(C,sum00);
Avx.Store(C + 4,sum01);
Avx.Store(C + 8,sum10);
Avx.Store(C + 12,sum11);
} The compiled loop body: vmovupd ymm4,ymmword ptr [r8]
vmovupd ymm5,ymmword ptr [r8+20h]
vbroadcastsd ymm6,mmword ptr [rdx]
vmovaps ymm7,ymm6
vfmadd213pd ymm7,ymm4,ymm3
vmovaps ymm3,ymm7
vfmadd213pd ymm6,ymm5,ymm2
vmovaps ymm2,ymm6
vbroadcastsd ymm6,mmword ptr [rdx+8]
vfmadd213pd ymm4,ymm6,ymm1
vmovaps ymm1,ymm4
vfmadd213pd ymm5,ymm6,ymm0
vmovaps ymm0,ymm5
add rdx,10h
add r8,40h Ideal ouput: vmovupd ymm4,ymmword ptr [r8]
vmovupd ymm5,ymmword ptr [r8+20h]
vbroadcastsd ymm6,mmword ptr [rdx]
vfmadd231pd ymm3,ymm6,ymm4
vfmadd231pd ymm2,ymm6,ymm5
vbroadcastsd ymm6,mmword ptr [rdx+8]
vfmadd231pd ymm1,ymm6,ymm4
vfmadd231pd ymm0,ymm6,ymm5
add rdx,10h
add r8,40h It seems |
This should be at least be analyzed; it may have been fixed e.g. by dotnet/coreclr/pull/19429 |
Moving to 7.0.0 since it's not a bug. |
Un-assigning myself |
This has been mostly resolved and we now generate identical code in most cases. There is still a possibility of an extra |
These two functions are doing the same thing. The only difference is the order of parameters of
Fma.MultiplyAdd
. However, the codegen of the second function is not as good as the first one. Maybe worth looking at this to see if the JIT can optimize it.Codegen of foo2:
Codegen of foo3:
category:cq
theme:register-allocator
skill-level:expert
cost:medium
The text was updated successfully, but these errors were encountered: