Improve preferencing and code generation for FMA #12984

Closed
CarolEidt opened this issue Jun 26, 2019 · 21 comments
Labels
arch-x64, arch-x86, area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
Milestone

Comments

@CarolEidt
Contributor

As pointed out in dotnet/coreclr#25387, there are improvements to be made in the handling of the FMA intrinsics:

  • Lowering should bias the containment such that the overwritten operand is a last use if possible.
  • The register allocator should preference the overwritten operand.
  • If all the operands are in registers, the code generator should choose the encoding based on whether the target register matches one of the input registers.

dotnet/coreclr#25387 includes a code snippet illustrating the issue.
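As a rough standalone illustration of the third bullet above (not JIT code; the register ids, enum, and function names below are invented for this sketch), the encoding choice amounts to checking which source register already holds the destination:

// Minimal sketch: pick the FMA form that overwrites the operand already living
// in the target register, so no extra register-to-register move is needed.
// The computation is dst = op1 * op2 + op3; arguments are register ids.
#include <cstdio>

enum class FmaForm
{
    Form132,  // dst = dst  * src3 + src2
    Form213,  // dst = src2 * dst  + src3
    Form231,  // dst = src2 * src3 + dst
    NeedsCopy // no operand is in dst; copy one there first
};

FmaForm chooseFmaForm(int dstReg, int op1Reg, int op2Reg, int op3Reg)
{
    if (dstReg == op1Reg)
    {
        return FmaForm::Form213; // dst already holds a multiplicand
    }
    if (dstReg == op2Reg)
    {
        return FmaForm::Form213; // multiply is commutative, so swap op1/op2
    }
    if (dstReg == op3Reg)
    {
        return FmaForm::Form231; // dst already holds the addend
    }
    return FmaForm::NeedsCopy;
}

int main()
{
    // ymm0 = ymm1 * ymm2 + ymm0 : the addend is already in the target -> 231 form
    printf("%d\n", static_cast<int>(chooseFmaForm(0, 1, 2, 0)));
    return 0;
}

When no source register matches the destination, an extra copy is needed before the FMA, which is the overhead the preferencing work aims to avoid.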

category:cq
theme:vector-codegen
skill-level:expert
cost:large

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@BruceForstall
Member

@CarolEidt Should this remain a 5.0 issue, or should it be moved to Future?

@CarolEidt
Contributor Author

This probably doesn't take priority over the other work in progress, but I believe it is a fairly straightforward issue to fix, so it might be a good task for someone wanting to gain expertise in the backend of the JIT. But perhaps change it to Future.

@BruceForstall BruceForstall modified the milestones: 5.0, Future Apr 29, 2020
@BruceForstall
Member

I went ahead and moved it to Future, but we can always choose to bring it back.

@weilinwa
Contributor

weilinwa commented Jul 6, 2021

Hi, @CarolEidt. I'm interested in this issue and would like to work on it. Could you please confirm whether my understanding of the requirement is correct? Thanks!

What we want to improve here is generating the best-fitting FMA instruction form when the computation result is written back into one of the three input operands. To achieve this, we need to add checks and bias containment in lowering, and also make the corresponding changes in the LSRA and codegen phases.

Here are the three places I'm planning to add code in:

  1. Lowering: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L6303
  2. LSRA: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lsraxarch.cpp#L2339
  3. Codegen: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiccodegenxarch.cpp#L2135

@kunalspathak
Member

Here are the three places I'm planning to add code in:

@weilinwa - That looks right.

@weilinwa
Contributor

weilinwa commented Jul 7, 2021

Hi, @kunalspathak. Thanks for the confirmation!
I have one follow-up question: what is the usual way to check whether one of the operands is overwritten? I was able to do this by checking whether one of the operands' lclNum (GetLclNum()) equals that of the next node (node->gtNext). Is there a cleaner way to get the variable that the result is written into, instead of looking at the next node?

@tannergooding
Member

We also have #12212 and #11260 as related issues.

IIRC, most of these issues were root caused to #6358. That is, to allow the register allocator to set multiple operands as "optional" for the cases where the instruction is either commutative (e.g. Add, Multiply, etc) or where it has multiple forms (e.g. FusedMultiplyAdd).

There are several notes in both of the related issues that provide some additional analysis into the issue and potential avenues for fixing it.

@kunalspathak
Member

method to check whether one of the operands is overwritten

There might be a better way to do this, but one method I can think of is NodesAreEquivalentLeaves():

return tree1->AsLclVarCommon()->GetLclNum() == tree2->AsLclVarCommon()->GetLclNum();

@CarolEidt
Contributor Author

Sorry for the delay in response (I'm now retired). You've got the right general idea. If you've got an expression like STORE_LCL_VAR(FMA(op1,op2,op3)) where one of the operands is a LCL_VAR with the same lclNum as the STORE_LCL_VAR then you want to preference that operand to that lclVar Interval. Unfortunately, it seems that NodesAreEquivalentLeaves() doesn't handle STORE_LCL_VAR, so you'd need to adapt it to do that, or simply check those lclNums directly.
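A standalone model of that direct lclNum check (illustrative only; the real change would walk the GenTree nodes, and the struct and names here are made up for the sketch):

// Model of the suggested check for STORE_LCL_VAR(FMA(op1, op2, op3)): find the
// operand whose local number matches the store's, so its use can be preferenced
// to the store's lclVar Interval.
#include <cstdio>

struct Operand
{
    bool     isLclVar; // models a GT_LCL_VAR node
    unsigned lclNum;   // models GetLclNum()
};

// Returns the index (0..2) of the operand being overwritten, or -1 if none.
int findOverwrittenOperand(unsigned storeLclNum, const Operand (&ops)[3])
{
    for (int i = 0; i < 3; i++)
    {
        if (ops[i].isLclVar && (ops[i].lclNum == storeLclNum))
        {
            return i;
        }
    }
    return -1;
}

int main()
{
    // c = a*b + c : store targets lclNum 6, so the addend (index 2) matches.
    Operand ops[3] = {{true, 4}, {true, 5}, {true, 6}};
    printf("%d\n", findOverwrittenOperand(6, ops));
    return 0;
}

The returned index would then tell lowering/LSRA which operand's use to preference, and ultimately which 132/213/231 form codegen should emit.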

@kunalspathak
Member

Nice to see you @CarolEidt! :)

@weilinwa
Contributor

@CarolEidt, thanks for your response!
I will go ahead with the implementation then.

@weilinwa
Contributor

@tannergooding @kunalspathak, I implemented a solution that compares the lclNum of the overwritten variable with the three operands and chooses the FMA form accordingly when there is a match.
The idea is that, of the three types of computation below, only the third case should be generated in the 231 form.

  1. a = a*b+c
  2. b = a*b+c
  3. c = a*b+c

The code looks like this. Is this on the right track?

    GenTreeLclVarCommon* overwritten = this->gtNext->AsLclVarCommon();
    assert(overwritten->gtOper == GT_STORE_LCL_VAR || overwritten->gtOper == GT_LCL_VAR);

    unsigned flag = 0; // 1->op1, 2->op2, 3->op3
    if (op1->IsLocal() && op1->AsLclVarCommon()->GetLclNum() == overwritten->GetLclNum())
    {
        flag = 1; // 132 form: op1 = (op1 * op3) + [op2]
    }
    else if (op2->IsLocal() && op2->AsLclVarCommon()->GetLclNum() == overwritten->GetLclNum())
    {
        flag = 2; // 213 form: op1 = (op2 * op1) + [op3]
    }
    else if (op3->IsLocal() && op3->AsLclVarCommon()->GetLclNum() == overwritten->GetLclNum())
    {
        flag = 3; // 231 form: op1 = (op2 * op3) + [op1]
    }

    return flag;

@weilinwa
Contributor

Here are some of the results I got using the test cases from those related issues:

static unsafe float fmaTest1(float[] fb, float[] fc, int d)
{
    Vector256<float> tmp = Vector256<float>.Zero;
    fixed (float* b = &fb[0])
    fixed (float* c = &fc[0])
    {
        for (int k = 0; k + 8 <= d; k += 8)
        {
            Vector256<float> v1 = Avx.LoadVector256(b + k);
            Vector256<float> v2 = Avx.LoadVector256(c + k);
            tmp = Fma.MultiplyAdd(Avx.Multiply(v2, v2), Avx.Multiply(v1, v1), tmp);
        }
    }
    return tmp.ToScalar();
}

Original codegen:

G_M42316_IG03:        ; offs=0000A5H, size=005FH, bbWeight=4    PerfScore 147.00, gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, loop=IG03, gcvars, byref, isz

IN0018: 0000A5 mov      rax, qword ptr [V04 rsp+78H]
IN0019: 0000AA mov      edx, dword ptr [V08 rsp+5CH]
IN001a: 0000AE movsxd   rdx, edx
IN001b: 0000B1 vmovups  ymm0, ymmword ptr[rax+4*rdx]
IN001c: 0000B6 vmovupd  ymmword ptr[V09 rsp+20H], ymm0
IN001d: 0000BC mov      rax, qword ptr [V06 rsp+68H]
IN001e: 0000C1 vmovups  ymm0, ymmword ptr[rax+4*rdx]
IN001f: 0000C6 vmulps   ymm0, ymm0, ymm0
IN0020: 0000CA vmovupd  ymm1, ymmword ptr[V09 rsp+20H]
IN0021: 0000D0 vmulps   ymm1, ymm1, ymmword ptr[V09 rsp+20H]
IN0022: 0000D6 vfmadd213ps ymm0, ymm1, ymmword ptr[V03 rsp+80H]
IN0023: 0000E0 vmovupd  ymmword ptr[V03 rsp+80H], ymm0
IN0024: 0000E9 mov      eax, dword ptr [V08 rsp+5CH]
IN0025: 0000ED add      eax, 8
IN0026: 0000F0 mov      dword ptr [V08 rsp+5CH], eax
IN0027: 0000F4 mov      eax, dword ptr [V08 rsp+5CH]
IN0028: 0000F8 add      eax, 8
IN0029: 0000FB cmp      eax, dword ptr [V02 rsp+D0H]
IN002a: 000102 jle      SHORT G_M42316_IG03

New codegen:

G_M42316_IG03:        ; offs=000047H, size=0028H, bbWeight=4    PerfScore 77.00, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, loop=IG03, byref, isz

IN0010: 000047 movsxd   r9, ecx
IN0011: 00004A vmovups  ymm1, ymmword ptr[rax+4*r9]
IN0012: 000050 vmovups  ymm2, ymmword ptr[rdx+4*r9]
IN0013: 000056 vmulps   ymm2, ymm2, ymm2
IN0014: 00005A vmulps   ymm1, ymm1, ymm1
IN0015: 00005E vfmadd231ps ymm0, ymm1, ymm2
IN0016: 000063 add      ecx, 8
IN0017: 000066 lea      r9d, [rcx+8]
IN0018: 00006A cmp      r9d, r8d
IN0019: 00006D jle      SHORT G_M42316_IG03

@weilinwa
Contributor

static unsafe float fmaTest2(float scale, float[] src, float[] dst, int count)
{
    float result;
    fixed (float* psrc = &src[0])
    fixed (float* pdst = &dst[0])
    {
        float* pSrcCurrent = psrc;
        float* pDstCurrent = pdst;
        float* pEnd = pdst + count;

        Vector256<float> scaleVector256 = Vector256.Create(scale);
        Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);
        while (pDstCurrent + 8 <= pEnd)
        {
            dstVector = Avx.LoadVector256(pDstCurrent);
            dstVector = Fma.MultiplyAdd(Avx.LoadVector256(pSrcCurrent), scaleVector256, dstVector);
            Avx.Store(pDstCurrent, dstVector);

            pSrcCurrent += 8;
            pDstCurrent += 8;
        }

        result = dstVector.ToScalar();
    }
    return result;
}

Original codegen:

G_M53971_IG03:        ; offs=0000E7H, size=006FH, bbWeight=4    PerfScore 147.00, gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, loop=IG03, gcvars, byref, isz

IN001f: 0000E7 mov      rax, qword ptr [V08 rsp+78H]
IN0020: 0000EC vmovups  ymm0, ymmword ptr[rax]
IN0021: 0000F0 vmovupd  ymmword ptr[V11 rsp+20H], ymm0
IN0022: 0000F6 mov      rax, qword ptr [V07 rsp+80H]
IN0023: 0000FE vmovups  ymm0, ymmword ptr[rax]
IN0024: 000102 vmovupd  ymm1, ymmword ptr[V10 rsp+40H]
IN0025: 000108 vfmadd213ps ymm0, ymm1, ymmword ptr[V11 rsp+20H]
IN0026: 00010F vmovupd  ymmword ptr[V11 rsp+20H], ymm0
IN0027: 000115 mov      rax, qword ptr [V08 rsp+78H]
IN0028: 00011A vmovupd  ymm0, ymmword ptr[V11 rsp+20H]
IN0029: 000120 vmovups  ymmword ptr[rax], ymm0
IN002a: 000124 mov      rax, qword ptr [V07 rsp+80H]
IN002b: 00012C add      rax, 32
IN002c: 000130 mov      qword ptr [V07 rsp+80H], rax
IN002d: 000138 mov      rax, qword ptr [V08 rsp+78H]
IN002e: 00013D add      rax, 32
IN002f: 000141 mov      qword ptr [V08 rsp+78H], rax
IN0030: 000146 mov      rax, qword ptr [V08 rsp+78H]
IN0031: 00014B add      rax, 32
IN0032: 00014F cmp      rax, qword ptr [V09 rsp+70H]
IN0033: 000154 jbe      SHORT G_M53971_IG0

New codegen:

G_M53971_IG03:        ; func=00, offs=000060H, size=0020H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, loop=IG03, BB02 [0001], byref, isz
                            ; byrRegs -[rax rdx]
IN0013: 000060 C5FC100A             vmovups  ymm1, ymmword ptr[rdx] (ECS:5, ACS:4)
Instruction predicted size = 5, actual = 4
Increasing size adj 0 by 1 => 1
IN0014: 000064 C4E27DB808           vfmadd231ps ymm1, ymm0, ymmword ptr[rax]
IN0015: 000069 C5FC110A             vmovups  ymmword ptr[rdx], ymm1 (ECS:5, ACS:4)
Instruction predicted size = 5, actual = 4
Increasing size adj 1 by 1 => 2
IN0016: 00006D 4883C020             add      rax, 32
IN0017: 000071 4883C220             add      rdx, 32
IN0018: 000075 4C8D4220             lea      r8, [rdx+32]
IN0019: 000079 4C3BC1               cmp      r8, rcx
IN001a: 00007C 76E2                 jbe      SHORT G_M53971_IG03
;; bbWeight=4    PerfScore 61.00

@kunalspathak
Member

@weilinwa - Thanks, this looks pretty good.

@weilinwa
Contributor

@kunalspathak @tannergooding, I also have a question about instructions like

IN002f: 000141 mov      qword ptr [V08 rsp+78H], rax
IN0030: 000146 mov      rax, qword ptr [V08 rsp+78H]

Are these two instructions redundant when they appear in sequence? I've seen this pattern in several places. Should we consider optimizing it?

@kunalspathak
Member

kunalspathak commented Jul 28, 2021

@kunalspathak @tannergooding, I also have a question about instructions like

We should understand why these are getting generated. We detect such redundant moves in arm64 using IsRedundantLdStr():

//----------------------------------------------------------------------------------------
// IsRedundantLdStr:
//    For ldr/str pair next to each other, check if the current load or store is needed or is
//    the value already present as of previous instruction.
//
//      ldr x1, [x2, #56]
//      str x1, [x2, #56]   <-- redundant
//
//          OR
//
//      str x1, [x2, #56]
//      ldr x1, [x2, #56]   <-- redundant
//
// Arguments:
//    ins  - The current instruction
//    dst  - The current destination
//    src  - The current source
//    imm  - Immediate offset
//    size - Operand size
//    fmt  - Format of instruction
//
// Return Value:
//    true if previous instruction already has desired value in register/memory location.

We should probably extend it for xarch.
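A standalone sketch of what an xarch version of that check could look like (illustrative only; a real implementation would examine the emitter's instruction records rather than this made-up MemMov struct):

// Model: a reg<->mem mov that immediately follows the opposite mov with the same
// register and the same [baseReg + offset] address leaves machine state unchanged
// and can be dropped (e.g. "mov [rsp+0x78], rax" followed by "mov rax, [rsp+0x78]").
#include <cstdio>

struct MemMov
{
    bool isLoad;  // true: mov reg, [baseReg+offset]; false: mov [baseReg+offset], reg
    int  reg;
    int  baseReg;
    int  offset;
};

bool isRedundantMemMov(const MemMov& prev, const MemMov& curr)
{
    return (prev.isLoad != curr.isLoad) && (prev.reg == curr.reg) &&
           (prev.baseReg == curr.baseReg) && (prev.offset == curr.offset);
}

int main()
{
    MemMov store = {false, /*rax*/ 0, /*rsp*/ 4, 0x78};
    MemMov load  = {true,  /*rax*/ 0, /*rsp*/ 4, 0x78};
    printf("%s\n", isRedundantMemMov(store, load) ? "redundant" : "needed");
    return 0;
}

A real check would also have to handle the cases raised later in this thread, such as SIMD moves where the encodings differ in how they treat the upper bits.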

@tannergooding
Member

@weilinwa - Thanks, this looks pretty good.

I agree. This looks like a good improvement over the codegen we had. Kunal is probably the best person to comment on whether the LSRA changes are correct or will run into any issues.

We detect such redundant moves in arm64 using IsRedundantLdStr()

@kunalspathak, we don't have an equivalent in x86/x64 AFAIK. We only recently added minimal redundant move detection in .NET 6.

@DeepakRajendrakumaran
Contributor

DeepakRajendrakumaran commented Sep 22, 2021

We should understand why these are getting generated. We detect such redundant moves in arm64 using IsRedundantLdStr() and should probably extend it for xarch.

I have a commit to extend this to xarch. The commit will essentially mirror what we do for register-to-register movs, with slight modifications to customize it for mem<->reg movs.
There are a couple of further optimizations to be revisited, both for this and for the existing IsRedundantMov() implementation (these are already mentioned as comments in the existing implementation):

  1. Neither will work for SSE on targets with AVX encoding, since the current implementation cannot determine whether the upper bits have been modified.
  2. Currently, we check that both instructions (current and predecessor) are equivalent. We could extend this to some other cases, such as aligned/unaligned variants (e.g. movups/movaps).

@BruceForstall
Member

Issue appears fixed; closing

@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Nov 29, 2022