FMA codegen #11260

Closed
helloguo opened this issue Oct 18, 2018 · 14 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@helloguo

helloguo commented Oct 18, 2018

These two functions do the same thing; the only difference is the order of the arguments to Fma.MultiplyAdd. However, the codegen for the second function is not as good as for the first. It may be worth looking at whether the JIT can optimize this.

static unsafe void foo2(float scale, float[] src, float[] dst, int count)
{
    fixed (float* psrc = &src[0])
    fixed (float* pdst = &dst[0])
    {
        float* pSrcCurrent = psrc;
        float* pDstCurrent = pdst;
        float* pEnd = pdst + count;

        Vector256<float> scaleVector256 = Avx.SetAllVector256(scale);

        while (pDstCurrent + 8 <= pEnd)
        {
            Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);
            dstVector = Fma.MultiplyAdd(Avx.LoadVector256(pSrcCurrent), scaleVector256, dstVector);
            Avx.Store(pDstCurrent, dstVector);

            pSrcCurrent += 8;
            pDstCurrent += 8;
        }
    }
}

static unsafe void foo3(float scale, float[] src, float[] dst, int count)
{
    fixed (float* psrc = &src[0])
    fixed (float* pdst = &dst[0])
    {
        float* pSrcCurrent = psrc;
        float* pDstCurrent = pdst;
        float* pEnd = pdst + count;

        Vector256<float> scaleVector256 = Avx.SetAllVector256(scale);

        while (pDstCurrent + 8 <= pEnd)
        {
            Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);
            dstVector = Fma.MultiplyAdd(scaleVector256, Avx.LoadVector256(pSrcCurrent), dstVector);
            Avx.Store(pDstCurrent, dstVector);

            pSrcCurrent += 8;
            pDstCurrent += 8;
        }
    }
}

Codegen of foo2:
[screenshot of the foo2 loop disassembly, attached as an image in the original issue]

Codegen of foo3:
[screenshot of the foo3 loop disassembly, attached as an image in the original issue]
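
For anyone reproducing this, a minimal driver along these lines (hypothetical, not part of the original report; it assumes foo2 and foo3 are defined in the same class and the usual System.Runtime.Intrinsics.X86 usings) exercises both kernels so their jitted loop bodies can be dumped, e.g. with COMPlus_JitDisasm on a checked JIT:

static void Main()
{
    float[] src = new float[1024];
    float[] dst = new float[1024];

    for (int i = 0; i < src.Length; i++)
    {
        src[i] = i;
        dst[i] = 2 * i;
    }

    // Only run the FMA kernels when the hardware supports them.
    if (Fma.IsSupported)
    {
        foo2(3.0f, src, dst, dst.Length);
        foo3(3.0f, src, dst, dst.Length);
    }
}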

category:cq
theme:register-allocator
skill-level:expert
cost:medium

@helloguo
Author

cc @fiigii @tannergooding

@tannergooding
Member

CC. @CarolEidt, it looks like there is some additional register shuffling going on.

@tannergooding
Member

It is unnecessarily shuffling ymm0 to ymm2 before the FMA and ymm2 to ymm1 after it.

@fiigii
Contributor

fiigii commented Oct 18, 2018

The first unnecessary shuffle (ymm0 to ymm2) seems to come from the emitter:

void emitter::emitIns_SIMD_R_R_R_AR(
    instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, regNumber op2Reg, regNumber base)
{
    assert(IsFMAInstruction(ins));
    assert(UseVEXEncoding());

    if (op1Reg != targetReg)
    {
        // Ensure we aren't overwriting op2
        assert(op2Reg != targetReg);

        emitIns_R_R(INS_movaps, attr, targetReg, op1Reg);
    }

    emitIns_R_R_AR(ins, attr, targetReg, op2Reg, base, 0);
}

and the second one may come from the delayFree handling?

@fiigii
Contributor

fiigii commented Oct 18, 2018

The current FMA instruction selection/RA is not optimal, but we need more scenarios to improve it.
@tannergooding As far as I know, you are going to use HW intrinsics in a math library, which would be a great opportunity.

@tannergooding
Member

tannergooding commented Oct 18, 2018

@fiigii, I don't think the issue is in the emitter, but rather in the register allocator deciding that the source and destination cannot be the same here. Maybe I missed something in LSRA or in the containment checks for one of the FMA cases?

@tannergooding
Member

At least as of right now, we try to ensure at least one of the operands is contained, preferring operand 3, then 2, then 1: https://github.com/dotnet/coreclr/blob/master/src/jit/lowerxarch.cpp#L3006

We then try to ensure that the correct delay-free uses are set up in LSRA: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L2516

And finally we emit the appropriate encoding (based on which operand is contained) in codegen: https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2410

@tannergooding
Member

Perhaps a fix here is telling the register allocator which operand ends up being the RMW operand... @CarolEidt, do we have a good way to do that (is just setting tgtPref the right thing to do here?)

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@CarolEidt
Contributor

This is really more of a register allocation (and Lowering) issue. I've reclassified it under the register-allocator theme, but it may already have been addressed.

@kevinjwz

kevinjwz commented Oct 21, 2020

Another example I recently encountered:

unsafe static void MulKernel2x8(double* C, double* A, double* B, uint A_cols)
{
    Vector256<double> sum00, sum01, sum10, sum11;

    sum00 = sum01 = sum10 = sum11 = default;

    for (uint k = 0; k < A_cols; k += 1) {
        Vector256<double> va, vb0, vb1;
        vb0 = Avx.LoadVector256(B);
        vb1 = Avx.LoadVector256(B + 4);
        va = Avx.BroadcastScalarToVector256(A);
        sum00 = Fma.MultiplyAdd(va, vb0, sum00);
        sum01 = Fma.MultiplyAdd(va, vb1, sum01);
        va = Avx.BroadcastScalarToVector256(A + 1);
        sum10 = Fma.MultiplyAdd(va, vb0, sum10);
        sum11 = Fma.MultiplyAdd(va, vb1, sum11);
        A += 2;
        B += 8;
    }
    Avx.Store(C, sum00);
    Avx.Store(C + 4, sum01);
    Avx.Store(C + 8, sum10);
    Avx.Store(C + 12, sum11);
}

The compiled loop body:

vmovupd     ymm4,ymmword ptr [r8]  
vmovupd     ymm5,ymmword ptr [r8+20h]
vbroadcastsd ymm6,mmword ptr [rdx]  
vmovaps     ymm7,ymm6  
vfmadd213pd ymm7,ymm4,ymm3  
vmovaps     ymm3,ymm7  
vfmadd213pd ymm6,ymm5,ymm2  
vmovaps     ymm2,ymm6  
vbroadcastsd ymm6,mmword ptr [rdx+8] 
vfmadd213pd ymm4,ymm6,ymm1  
vmovaps     ymm1,ymm4  
vfmadd213pd ymm5,ymm6,ymm0  
vmovaps     ymm0,ymm5  
add         rdx,10h  
add         r8,40h  

Ideal output:

vmovupd     ymm4,ymmword ptr [r8]  
vmovupd     ymm5,ymmword ptr [r8+20h]
vbroadcastsd ymm6,mmword ptr [rdx]  
vfmadd231pd ymm3,ymm6,ymm4  
vfmadd231pd ymm2,ymm6,ymm5  
vbroadcastsd ymm6,mmword ptr [rdx+8] 
vfmadd231pd ymm1,ymm6,ymm4  
vfmadd231pd ymm0,ymm6,ymm5  
add         rdx,10h  
add         r8,40h  

It seems vfmadd231pd is rarely selected when none of the three operands comes from memory. I think this multiply-accumulate pattern should be recognized.
BTW, why didn't we simply add Fma.MultiplyAdd231, etc.?
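
For context, my reading of the hardware forms (a summary of the ISA encodings, not a claim about JIT behavior): vfmadd132pd computes op1 = op1 * op3 + op2, vfmadd213pd computes op1 = op2 * op1 + op3, and vfmadd231pd computes op1 = op2 * op3 + op1. So for the accumulate pattern above, keeping each accumulator in the instruction's destination register maps straight onto the 231 form with no copies. A minimal sketch of that shape (the helper name is hypothetical; assumes the usual System.Runtime.Intrinsics.X86 usings):

// Hypothetical helper showing the accumulate shape: sum += a * b.
// If the register allocator keeps `sum` in the destination register,
// this is exactly vfmadd231pd (op1 = op2 * op3 + op1) and needs no vmovaps.
static Vector256<double> MulAccumulate(Vector256<double> sum,
                                       Vector256<double> a,
                                       Vector256<double> b)
{
    return Fma.MultiplyAdd(a, b, sum);
}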

@CarolEidt CarolEidt modified the milestones: Future, 6.0.0 Nov 14, 2020
@CarolEidt
Contributor

This should at least be analyzed; it may have been fixed, e.g. by dotnet/coreclr/pull/19429.

@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@JulieLeeMSFT JulieLeeMSFT removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 7, 2021
@echesakov
Contributor

Moving to 7.0.0 since it's not a bug.

@echesakov
Contributor

Un-assigning myself
cc @BruceForstall

@echesakov echesakov removed their assignment Mar 15, 2022
@BruceForstall BruceForstall removed this from the 7.0.0 milestone Mar 16, 2022
@BruceForstall BruceForstall added this to the Future milestone Mar 16, 2022
@tannergooding
Member

This has been mostly resolved, and we now generate identical code in most cases. There is still the possibility of an extra movaps in some cases, but that is a more general register allocator consideration and no longer specific to FMA.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2024