Improve preferencing and code generation for FMA #12984

Closed
CarolEidt opened this issue Jun 26, 2019 · 21 comments
Labels
arch-x64, arch-x86, area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
Milestone

Comments

@CarolEidt
Contributor

As pointed out in dotnet/coreclr#25387, there are improvements to be made in the handling of the FMA intrinsics:

  • Lowering should bias the containment such that the overwritten operand is a last use if possible.
  • The register allocator should preference the overwritten operand.
  • If all the operands are in registers, the code generator should choose the encoding based on whether the target register matches one of the input registers.

dotnet/coreclr#25387 includes a code snippet illustrating the issue.
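As a rough standalone illustration of the third bullet above (not JIT code; the register ids, enum, and function names below are invented for this sketch), the encoding choice amounts to checking which source register already holds the destination:

// Minimal sketch: pick the FMA form that overwrites the operand already living
// in the target register, so no extra register-to-register move is needed.
// The computation is dst = op1 * op2 + op3; arguments are register ids.
#include <cstdio>

enum class FmaForm
{
    Form132,  // dst = dst  * src3 + src2
    Form213,  // dst = src2 * dst  + src3
    Form231,  // dst = src2 * src3 + dst
    NeedsCopy // no operand is in dst; copy one there first
};

FmaForm chooseFmaForm(int dstReg, int op1Reg, int op2Reg, int op3Reg)
{
    if (dstReg == op1Reg)
    {
        return FmaForm::Form213; // dst already holds a multiplicand
    }
    if (dstReg == op2Reg)
    {
        return FmaForm::Form213; // multiply is commutative, so swap op1/op2
    }
    if (dstReg == op3Reg)
    {
        return FmaForm::Form231; // dst already holds the addend
    }
    return FmaForm::NeedsCopy;
}

int main()
{
    // ymm0 = ymm1 * ymm2 + ymm0 : the addend is already in the target -> 231 form
    printf("%d\n", static_cast<int>(chooseFmaForm(0, 1, 2, 0)));
    return 0;
}

When no source register matches the destination, an extra copy is needed before the FMA, which is the overhead the preferencing work aims to avoid.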

category:cq
theme:vector-codegen
skill-level:expert
cost:large

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@BruceForstall
Member

@CarolEidt Should this remain a 5.0 issue, or should it be moved to Future?

@CarolEidt
Contributor Author

This probably doesn't take priority over the other work in progress, but I believe it is a fairly straightforward issue to fix, so it might be a good task for someone wanting to gain expertise in the backend of the JIT. But perhaps change it to Future.

@BruceForstall BruceForstall modified the milestones: 5.0, Future Apr 29, 2020
@BruceForstall
Member

I went ahead and moved it to Future, but we can always choose to bring it back.

@weilinwa
Contributor

weilinwa commented Jul 6, 2021

Hi, @CarolEidt. I'm interested in this issue and would like to work on it. Could you please confirm whether my understanding of the requirement is correct? Thanks!

What we want to improve here is generating the best-fitting FMA instruction form when the computation result is written back into one of the three input operands. To achieve this, we need to add checks and bias containment in lowering, and also make the corresponding changes in the LSRA and codegen phases.

Here are the three places I'm planning to add code in:

  1. Lowering: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L6303
  2. LSRA: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lsraxarch.cpp#L2339
  3. Codegen: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiccodegenxarch.cpp#L2135

@kunalspathak
Member

Here are the three places I'm planning to add code in:

@weilinwa - That looks right.

@weilinwa
Contributor

weilinwa commented Jul 7, 2021

Hi, @kunalspathak. Thanks for the confirmation!
I have one follow-up question: what is the usual way to check whether one of the operands is overwritten? I was able to do this by checking whether one of the operands' lclNum (GetLclNum()) equals that of the next node (node->gtNext). Is there a cleaner way to get the variable that the result is written into, instead of looking at the next node?

@tannergooding
Member

We also have #12212 and #11260 as related issues.

IIRC, most of these issues were root caused to #6358. That is, to allow the register allocator to set multiple operands as "optional" for the cases where the instruction is either commutative (e.g. Add, Multiply, etc) or where it has multiple forms (e.g. FusedMultiplyAdd).

There are several notes in both of the related issues that provide some additional analysis into the issue and potential avenues for fixing it.

@kunalspathak
Member

method to check whether one of the operands is overwritten

There might be a better way to do this, but one method I can think of is NodesAreEquivalentLeaves():

return tree1->AsLclVarCommon()->GetLclNum() == tree2->AsLclVarCommon()->GetLclNum();

@CarolEidt
Contributor Author

Sorry for the delay in response (I'm now retired). You've got the right general idea. If you've got an expression like STORE_LCL_VAR(FMA(op1,op2,op3)) where one of the operands is a LCL_VAR with the same lclNum as the STORE_LCL_VAR then you want to preference that operand to that lclVar Interval. Unfortunately, it seems that NodesAreEquivalentLeaves() doesn't handle STORE_LCL_VAR, so you'd need to adapt it to do that, or simply check those lclNums directly.
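A standalone model of that direct lclNum check (illustrative only; the real change would walk the GenTree nodes, and the struct and names here are made up for the sketch):

// Model of the suggested check for STORE_LCL_VAR(FMA(op1, op2, op3)): find the
// operand whose local number matches the store's, so its use can be preferenced
// to the store's lclVar Interval.
#include <cstdio>

struct Operand
{
    bool     isLclVar; // models a GT_LCL_VAR node
    unsigned lclNum;   // models GetLclNum()
};

// Returns the index (0..2) of the operand being overwritten, or -1 if none.
int findOverwrittenOperand(unsigned storeLclNum, const Operand (&ops)[3])
{
    for (int i = 0; i < 3; i++)
    {
        if (ops[i].isLclVar && (ops[i].lclNum == storeLclNum))
        {
            return i;
        }
    }
    return -1;
}

int main()
{
    // c = a*b + c : store targets lclNum 6, so the addend (index 2) matches.
    Operand ops[3] = {{true, 4}, {true, 5}, {true, 6}};
    printf("%d\n", findOverwrittenOperand(6, ops));
    return 0;
}

The returned index would then tell lowering/LSRA which operand's use to preference, and ultimately which 132/213/231 form codegen should emit.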

@kunalspathak
Member

Nice to see you @CarolEidt! :)

@weilinwa
Contributor

@CarolEidt, thanks for your response!
I will go ahead with the implementation then.

@weilinwa
Contributor

@tannergooding @kunalspathak, I implemented a solution that compares the lclNum of the overwritten variable with the three operands and chooses the FMA form accordingly when there is a match.
The idea is that, of the three types of computation below, only the third case should be generated in the 231 form.

  1. a = a*b+c
  2. b = a*b+c
  3. c = a*b+c

The code looks like this. Is this on the right track?

    GenTreeLclVarCommon* overwritten = this->gtNext->AsLclVarCommon();
    assert(overwritten->gtOper == GT_STORE_LCL_VAR || overwritten->gtOper == GT_LCL_VAR);

    unsigned flag = 0; // 1->op1, 2->op2, 3->op3
    if (op1->IsLocal() && op1->AsLclVarCommon()->GetLclNum() == overwritten->GetLclNum())
    {
        flag = 1; // 132 form: op1 = (op1 * op3) + [op2]
    }
    else if (op2->IsLocal() && op2->AsLclVarCommon()->GetLclNum() == overwritten->GetLclNum())
    {
        flag = 2; // 213 form: op1 = (op2 * op1) + [op3]
    }
    else if (op3->IsLocal() && op3->AsLclVarCommon()->GetLclNum() == overwritten->GetLclNum())
    {
        flag = 3; // 231 form: op1 = (op2 * op3) + [op1]
    }

    return flag;

@weilinwa
Contributor

Here are some of the results I got using the test cases from those related issues:

static unsafe float fmaTest1(float[] fb, float[] fc, int d)
{
    Vector256<float> tmp = Vector256<float>.Zero;
    fixed (float* b = &fb[0])
    fixed (float* c = &fc[0])
    {
        for (int k = 0; k + 8 <= d; k += 8)
        {
            Vector256<float> v1 = Avx.LoadVector256(b + k);
            Vector256<float> v2 = Avx.LoadVector256(c + k);
            tmp = Fma.MultiplyAdd(Avx.Multiply(v2, v2), Avx.Multiply(v1, v1), tmp);
        }
    }
    return tmp.ToScalar();
}

Original codegen:

G_M42316_IG03:        ; offs=0000A5H, size=005FH, bbWeight=4    PerfScore 147.00, gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, loop=IG03, gcvars, byref, isz

IN0018: 0000A5 mov      rax, qword ptr [V04 rsp+78H]
IN0019: 0000AA mov      edx, dword ptr [V08 rsp+5CH]
IN001a: 0000AE movsxd   rdx, edx
IN001b: 0000B1 vmovups  ymm0, ymmword ptr[rax+4*rdx]
IN001c: 0000B6 vmovupd  ymmword ptr[V09 rsp+20H], ymm0
IN001d: 0000BC mov      rax, qword ptr [V06 rsp+68H]
IN001e: 0000C1 vmovups  ymm0, ymmword ptr[rax+4*rdx]
IN001f: 0000C6 vmulps   ymm0, ymm0, ymm0
IN0020: 0000CA vmovupd  ymm1, ymmword ptr[V09 rsp+20H]
IN0021: 0000D0 vmulps   ymm1, ymm1, ymmword ptr[V09 rsp+20H]
IN0022: 0000D6 vfmadd213ps ymm0, ymm1, ymmword ptr[V03 rsp+80H]
IN0023: 0000E0 vmovupd  ymmword ptr[V03 rsp+80H], ymm0
IN0024: 0000E9 mov      eax, dword ptr [V08 rsp+5CH]
IN0025: 0000ED add      eax, 8
IN0026: 0000F0 mov      dword ptr [V08 rsp+5CH], eax
IN0027: 0000F4 mov      eax, dword ptr [V08 rsp+5CH]
IN0028: 0000F8 add      eax, 8
IN0029: 0000FB cmp      eax, dword ptr [V02 rsp+D0H]
IN002a: 000102 jle      SHORT G_M42316_IG03

New codegen:

G_M42316_IG03:        ; offs=000047H, size=0028H, bbWeight=4    PerfScore 77.00, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, loop=IG03, byref, isz

IN0010: 000047 movsxd   r9, ecx
IN0011: 00004A vmovups  ymm1, ymmword ptr[rax+4*r9]
IN0012: 000050 vmovups  ymm2, ymmword ptr[rdx+4*r9]
IN0013: 000056 vmulps   ymm2, ymm2, ymm2
IN0014: 00005A vmulps   ymm1, ymm1, ymm1
IN0015: 00005E vfmadd231ps ymm0, ymm1, ymm2
IN0016: 000063 add      ecx, 8
IN0017: 000066 lea      r9d, [rcx+8]
IN0018: 00006A cmp      r9d, r8d
IN0019: 00006D jle      SHORT G_M42316_IG03

@weilinwa
Contributor

static unsafe float fmaTest2(float scale, float[] src, float[] dst, int count)
{
    float result;
    fixed (float* psrc = &src[0])
    fixed (float* pdst = &dst[0])
    {
        float* pSrcCurrent = psrc;
        float* pDstCurrent = pdst;
        float* pEnd = pdst + count;

        Vector256<float> scaleVector256 = Vector256.Create(scale);
        Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);
        while (pDstCurrent + 8 <= pEnd)
        {
            dstVector = Avx.LoadVector256(pDstCurrent);
            dstVector = Fma.MultiplyAdd(Avx.LoadVector256(pSrcCurrent), scaleVector256, dstVector);
            Avx.Store(pDstCurrent, dstVector);

            pSrcCurrent += 8;
            pDstCurrent += 8;
        }

        result = dstVector.ToScalar();
    }
    return result;
}

Original codegen:

G_M53971_IG03:        ; offs=0000E7H, size=006FH, bbWeight=4    PerfScore 147.00, gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, loop=IG03, gcvars, byref, isz

IN001f: 0000E7 mov      rax, qword ptr [V08 rsp+78H]
IN0020: 0000EC vmovups  ymm0, ymmword ptr[rax]
IN0021: 0000F0 vmovupd  ymmword ptr[V11 rsp+20H], ymm0
IN0022: 0000F6 mov      rax, qword ptr [V07 rsp+80H]
IN0023: 0000FE vmovups  ymm0, ymmword ptr[rax]
IN0024: 000102 vmovupd  ymm1, ymmword ptr[V10 rsp+40H]
IN0025: 000108 vfmadd213ps ymm0, ymm1, ymmword ptr[V11 rsp+20H]
IN0026: 00010F vmovupd  ymmword ptr[V11 rsp+20H], ymm0
IN0027: 000115 mov      rax, qword ptr [V08 rsp+78H]
IN0028: 00011A vmovupd  ymm0, ymmword ptr[V11 rsp+20H]
IN0029: 000120 vmovups  ymmword ptr[rax], ymm0
IN002a: 000124 mov      rax, qword ptr [V07 rsp+80H]
IN002b: 00012C add      rax, 32
IN002c: 000130 mov      qword ptr [V07 rsp+80H], rax
IN002d: 000138 mov      rax, qword ptr [V08 rsp+78H]
IN002e: 00013D add      rax, 32
IN002f: 000141 mov      qword ptr [V08 rsp+78H], rax
IN0030: 000146 mov      rax, qword ptr [V08 rsp+78H]
IN0031: 00014B add      rax, 32
IN0032: 00014F cmp      rax, qword ptr [V09 rsp+70H]
IN0033: 000154 jbe      SHORT G_M53971_IG0

New codegen:

G_M53971_IG03:        ; func=00, offs=000060H, size=0020H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, loop=IG03, BB02 [0001], byref, isz
                            ; byrRegs -[rax rdx]
IN0013: 000060 C5FC100A             vmovups  ymm1, ymmword ptr[rdx] (ECS:5, ACS:4)
Instruction predicted size = 5, actual = 4
Increasing size adj 0 by 1 => 1
IN0014: 000064 C4E27DB808           vfmadd231ps ymm1, ymm0, ymmword ptr[rax]
IN0015: 000069 C5FC110A             vmovups  ymmword ptr[rdx], ymm1 (ECS:5, ACS:4)
Instruction predicted size = 5, actual = 4
Increasing size adj 1 by 1 => 2
IN0016: 00006D 4883C020             add      rax, 32
IN0017: 000071 4883C220             add      rdx, 32
IN0018: 000075 4C8D4220             lea      r8, [rdx+32]
IN0019: 000079 4C3BC1               cmp      r8, rcx
IN001a: 00007C 76E2                 jbe      SHORT G_M53971_IG03
;; bbWeight=4    PerfScore 61.00

@kunalspathak
Member

@weilinwa - Thanks, this looks pretty good.

@weilinwa
Contributor

@kunalspathak @tannergooding, I also have a question about instructions like

IN002f: 000141 mov      qword ptr [V08 rsp+78H], rax
IN0030: 000146 mov      rax, qword ptr [V08 rsp+78H]

Are these two instructions redundant when they appear in sequence? I've seen this pattern in several places. Should we consider optimizing it?

@kunalspathak
Member

kunalspathak commented Jul 28, 2021

@kunalspathak @tannergooding, I also have a question about instructions like

We should understand why these are getting generated. We detect such redundant moves in arm64 using IsRedundantLdStr():

//----------------------------------------------------------------------------------------
// IsRedundantLdStr:
//    For ldr/str pair next to each other, check if the current load or store is needed or is
//    the value already present as of previous instruction.
//
//      ldr x1, [x2, #56]
//      str x1, [x2, #56]   <-- redundant
//
//          OR
//
//      str x1, [x2, #56]
//      ldr x1, [x2, #56]   <-- redundant
//
// Arguments:
//    ins  - The current instruction
//    dst  - The current destination
//    src  - The current source
//    imm  - Immediate offset
//    size - Operand size
//    fmt  - Format of instruction
//
// Return Value:
//    true if previous instruction already has desired value in register/memory location.

We should probably extend it for xarch.
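A standalone sketch of what an xarch version of that check could look like (illustrative only; a real implementation would examine the emitter's instruction records rather than this made-up MemMov struct):

// Model: a reg<->mem mov that immediately follows the opposite mov with the same
// register and the same [baseReg + offset] address leaves machine state unchanged
// and can be dropped (e.g. "mov [rsp+0x78], rax" followed by "mov rax, [rsp+0x78]").
#include <cstdio>

struct MemMov
{
    bool isLoad;  // true: mov reg, [baseReg+offset]; false: mov [baseReg+offset], reg
    int  reg;
    int  baseReg;
    int  offset;
};

bool isRedundantMemMov(const MemMov& prev, const MemMov& curr)
{
    return (prev.isLoad != curr.isLoad) && (prev.reg == curr.reg) &&
           (prev.baseReg == curr.baseReg) && (prev.offset == curr.offset);
}

int main()
{
    MemMov store = {false, /*rax*/ 0, /*rsp*/ 4, 0x78};
    MemMov load  = {true,  /*rax*/ 0, /*rsp*/ 4, 0x78};
    printf("%s\n", isRedundantMemMov(store, load) ? "redundant" : "needed");
    return 0;
}

A real check would also have to handle the cases raised later in this thread, such as SIMD moves where the encodings differ in how they treat the upper bits.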

@tannergooding
Member

@weilinwa - Thanks, this looks pretty good.

I agree. This looks like a good improvement over the codegen we had. Kunal is probably the best person to comment on whether the LSRA changes are correct or will run into any issues.

We detect such redundant moves in arm64 using IsRedundantLdStr()

@kunalspathak, we don't have an equivalent in x86/x64 AFAIK. We only recently added minimal redundant move detection in .NET 6.

@DeepakRajendrakumaran
Contributor

DeepakRajendrakumaran commented Sep 22, 2021

We should understand why these are getting generated. We detect such redundant moves in arm64 using IsRedundantLdStr() and should probably extend it for xarch.

I have a commit to extend this to xarch. The commit will essentially mirror what we do for register-to-register movs, with slight modifications to customize it for mem<->reg movs.
There are a couple of further optimizations to be revisited, both for this and for the existing IsRedundantMov() implementation (these are already mentioned as comments in the existing implementation):

  1. Neither will work for SSE on targets with AVX encoding, since the current implementation cannot determine whether the upper bits have been modified.
  2. Currently, we check that both instructions (current and predecessor) are equivalent. We could extend this to some other cases, such as aligned/unaligned variants (e.g. movups/movaps).

@BruceForstall
Member

Issue appears fixed; closing

@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Nov 29, 2022