FMA codegen #11260

Closed
helloguo opened this issue Oct 18, 2018 · 14 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@helloguo

helloguo commented Oct 18, 2018

These two functions do the same thing; the only difference is the order of the arguments to Fma.MultiplyAdd. However, the codegen for the second function is not as good as for the first. It may be worth looking at whether the JIT can optimize this.

static unsafe void foo2(float scale, float[] src, float[] dst, int count)
{
    fixed (float* psrc = &src[0])
    fixed (float* pdst = &dst[0])
    {
        float* pSrcCurrent = psrc;
        float* pDstCurrent = pdst;
        float* pEnd = pdst + count;

        Vector256<float> scaleVector256 = Avx.SetAllVector256(scale);

        while (pDstCurrent + 8 <= pEnd)
        {
            Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);
            dstVector = Fma.MultiplyAdd(Avx.LoadVector256(pSrcCurrent), scaleVector256, dstVector);
            Avx.Store(pDstCurrent, dstVector);

            pSrcCurrent += 8;
            pDstCurrent += 8;
        }
    }
}

static unsafe void foo3(float scale, float[] src, float[] dst, int count)
{
    fixed (float* psrc = &src[0])
    fixed (float* pdst = &dst[0])
    {
        float* pSrcCurrent = psrc;
        float* pDstCurrent = pdst;
        float* pEnd = pdst + count;

        Vector256<float> scaleVector256 = Avx.SetAllVector256(scale);

        while (pDstCurrent + 8 <= pEnd)
        {
            Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);
            dstVector = Fma.MultiplyAdd(scaleVector256, Avx.LoadVector256(pSrcCurrent), dstVector);
            Avx.Store(pDstCurrent, dstVector);

            pSrcCurrent += 8;
            pDstCurrent += 8;
        }
    }
}

Codegen of foo2:
[screenshot of the foo2 loop disassembly, attached as an image in the original issue]

Codegen of foo3:
[screenshot of the foo3 loop disassembly, attached as an image in the original issue]
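
For anyone reproducing this, a minimal driver along these lines (hypothetical, not part of the original report; it assumes foo2 and foo3 are defined in the same class and the usual System.Runtime.Intrinsics.X86 usings) exercises both kernels so their jitted loop bodies can be dumped, e.g. with COMPlus_JitDisasm on a checked JIT:

static void Main()
{
    float[] src = new float[1024];
    float[] dst = new float[1024];

    for (int i = 0; i < src.Length; i++)
    {
        src[i] = i;
        dst[i] = 2 * i;
    }

    // Only run the FMA kernels when the hardware supports them.
    if (Fma.IsSupported)
    {
        foo2(3.0f, src, dst, dst.Length);
        foo3(3.0f, src, dst, dst.Length);
    }
}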

category:cq
theme:register-allocator
skill-level:expert
cost:medium

@helloguo
Author

cc @fiigii @tannergooding

@tannergooding
Member

CC. @CarolEidt, it looks like there is some additional register shuffling going on.

@tannergooding
Member

It is unnecessarily shuffling ymm0 to ymm2 before the FMA and ymm2 to ymm1 after it.

@fiigii
Contributor

fiigii commented Oct 18, 2018

The first unnecessary shuffle (ymm0 to ymm2) seems to come from the emitter:

void emitter::emitIns_SIMD_R_R_R_AR(
    instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, regNumber op2Reg, regNumber base)
{
    assert(IsFMAInstruction(ins));
    assert(UseVEXEncoding());

    if (op1Reg != targetReg)
    {
        // Ensure we aren't overwriting op2
        assert(op2Reg != targetReg);

        emitIns_R_R(INS_movaps, attr, targetReg, op1Reg);
    }

    emitIns_R_R_AR(ins, attr, targetReg, op2Reg, base, 0);
}

and the second one may come from the delayFree handling?

@fiigii
Contributor

fiigii commented Oct 18, 2018

The current FMA instruction selection/RA is not optimal, but we need more scenarios to improve it.
@tannergooding As far as I know, you are going to use HW intrinsics in a math library, which would be a great opportunity.

@tannergooding
Member

tannergooding commented Oct 18, 2018

@fiigii, I don't think the issue is in the emitter, but rather in the register allocator deciding that the source and destination cannot be the same here. Maybe I missed something in LSRA or in the containment checks for one of the FMA cases?

@tannergooding
Member

At least as of right now, we try to ensure at least one of the operands is contained, preferring operand 3, then 2, then 1: https://github.com/dotnet/coreclr/blob/master/src/jit/lowerxarch.cpp#L3006

We then try to ensure that the correct delay-free uses are set up in LSRA: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L2516

And finally we emit the appropriate encoding (based on which operand is contained) in codegen: https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2410

@tannergooding
Member

Perhaps a fix here is telling the register allocator which operand ends up being the RMW operand... @CarolEidt, do we have a good way to do that (is just setting tgtPref the right thing to do here?)

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@CarolEidt
Contributor

This is really more of a register allocation (and Lowering) issue. I've reclassified it under the register-allocator theme, but it may already have been addressed.

@kevinjwz

kevinjwz commented Oct 21, 2020

Another example I recently encountered:

unsafe static void MulKernel2x8(double* C, double* A, double* B, uint A_cols)
{
    Vector256<double> sum00, sum01, sum10, sum11;

    sum00 = sum01 = sum10 = sum11 = default;

    for (uint k = 0; k < A_cols; k += 1) {
        Vector256<double> va, vb0, vb1;
        vb0 = Avx.LoadVector256(B);
        vb1 = Avx.LoadVector256(B + 4);
        va = Avx.BroadcastScalarToVector256(A);
        sum00 = Fma.MultiplyAdd(va, vb0, sum00);
        sum01 = Fma.MultiplyAdd(va, vb1, sum01);
        va = Avx.BroadcastScalarToVector256(A + 1);
        sum10 = Fma.MultiplyAdd(va, vb0, sum10);
        sum11 = Fma.MultiplyAdd(va, vb1, sum11);
        A += 2;
        B += 8;
    }
    Avx.Store(C, sum00);
    Avx.Store(C + 4, sum01);
    Avx.Store(C + 8, sum10);
    Avx.Store(C + 12, sum11);
}

The compiled loop body:

vmovupd     ymm4,ymmword ptr [r8]  
vmovupd     ymm5,ymmword ptr [r8+20h]
vbroadcastsd ymm6,mmword ptr [rdx]  
vmovaps     ymm7,ymm6  
vfmadd213pd ymm7,ymm4,ymm3  
vmovaps     ymm3,ymm7  
vfmadd213pd ymm6,ymm5,ymm2  
vmovaps     ymm2,ymm6  
vbroadcastsd ymm6,mmword ptr [rdx+8] 
vfmadd213pd ymm4,ymm6,ymm1  
vmovaps     ymm1,ymm4  
vfmadd213pd ymm5,ymm6,ymm0  
vmovaps     ymm0,ymm5  
add         rdx,10h  
add         r8,40h  

Ideal output:

vmovupd     ymm4,ymmword ptr [r8]  
vmovupd     ymm5,ymmword ptr [r8+20h]
vbroadcastsd ymm6,mmword ptr [rdx]  
vfmadd231pd ymm3,ymm6,ymm4  
vfmadd231pd ymm2,ymm6,ymm5  
vbroadcastsd ymm6,mmword ptr [rdx+8] 
vfmadd231pd ymm1,ymm6,ymm4  
vfmadd231pd ymm0,ymm6,ymm5  
add         rdx,10h  
add         r8,40h  

It seems vfmadd231pd is rarely selected when none of the three operands comes from memory. I think this multiply-accumulate pattern should be recognized.
BTW, why didn't we simply add Fma.MultiplyAdd231, etc.?
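
For context, my reading of the hardware forms (a summary of the ISA encodings, not a claim about JIT behavior): vfmadd132pd computes op1 = op1 * op3 + op2, vfmadd213pd computes op1 = op2 * op1 + op3, and vfmadd231pd computes op1 = op2 * op3 + op1. So for the accumulate pattern above, keeping each accumulator in the instruction's destination register maps straight onto the 231 form with no copies. A minimal sketch of that shape (the helper name is hypothetical; assumes the usual System.Runtime.Intrinsics.X86 usings):

// Hypothetical helper showing the accumulate shape: sum += a * b.
// If the register allocator keeps `sum` in the destination register,
// this is exactly vfmadd231pd (op1 = op2 * op3 + op1) and needs no vmovaps.
static Vector256<double> MulAccumulate(Vector256<double> sum,
                                       Vector256<double> a,
                                       Vector256<double> b)
{
    return Fma.MultiplyAdd(a, b, sum);
}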

@CarolEidt CarolEidt modified the milestones: Future, 6.0.0 Nov 14, 2020
@CarolEidt
Contributor

This should at least be analyzed; it may have been fixed, e.g. by dotnet/coreclr/pull/19429.

@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@JulieLeeMSFT JulieLeeMSFT removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 7, 2021
@echesakov
Contributor

Moving to 7.0.0 since it's not a bug.

@echesakov
Contributor

Un-assigning myself
cc @BruceForstall

@echesakov echesakov removed their assignment Mar 15, 2022
@BruceForstall BruceForstall removed this from the 7.0.0 milestone Mar 16, 2022
@BruceForstall BruceForstall added this to the Future milestone Mar 16, 2022
@tannergooding
Member

This has been mostly resolved, and we now generate identical code in most cases. There is still the possibility of an extra movaps in some cases, but that is a more general register allocator consideration and no longer specific to FMA.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2024