Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: limitations in hoisting (loop invariant code motion) #35735

Open
AndyAyersMS opened this issue May 2, 2020 · 12 comments
Open

JIT: limitations in hoisting (loop invariant code motion) #35735

AndyAyersMS opened this issue May 2, 2020 · 12 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@AndyAyersMS
Copy link
Member

AndyAyersMS commented May 2, 2020

Have been looking into #13811 and have found that the current implementation of loop invariant code motion has some awkward limitations.

In particular if the invariant computations are distributed across statements connected by temps, only the first computation in the chain ends up getting hoisted. In the particular example from #13811 the invariant chain was:

         Vector128<byte> result = CreateScalarUnsafe(value);
         return Avx2.BroadcastScalarToVector128(result);

where value was constant. This ended up in a loop after some inlining. Only the CreateScalarUnsafe gets hoisted.

Note the chains can be arbitrary computation and involve more than two statements.

When hoisting we walk statement by statement looking for hoistable subtrees. Local assignments are not considered hoistable -- only their right hand sides. If we hoist a tree we produce an unconsumed copy in the preheader and let CSE come along later and clean things up.

When the analysis gets to the second statement in a dependent chain, it sees the def for the local conveying the value from the first statement as loop varying, and so does not hoist.

We could try fixing this in a variety of ways:

  • forward substitution might be able to glue together trees connected by single-def single use temps, however it is a big hammer, potentially tricky to get right, and costly to run in full generality
  • we could try and fuse these trees in the importer, say if we see back to back stloc/ldloc and no other references to the local
  • we could fix hoisting to handle this case, with a few options:
    • we could check if the subtree's VN is already hoisted, and so effectively do forward sub for the temp -- then let CSE clean all this up like we do now; this would potentially end up with quadratic amounts of cloning, though in practice, it might be acceptable;
    • we could hoist assignments; this requires some care and rewiring of SSA which might be risky
    • we could introduce new temps and/or modify the unconsumed hoisted tree to write to a temp, and use that to propagate the hoisted value from the first clone to second clone.

I am trying to assess how often we see this; it is a bit tricky because while I can spot the second link being blocked I can't easily tell how long the chains are so anything beyond that is harder to spot.

Rough guess based on some crude prototyping is around 2700 hoistable expressions that are second links in the usual FX diff set. There are 152 in the crossgen of SPC, including some sort and span methods.

I'm encouraged enough that I will build a more realistic prototype.

category:cq
theme:loop-opt
skill-level:expert
cost:large

@AndyAyersMS AndyAyersMS added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 2, 2020
@AndyAyersMS AndyAyersMS added this to the 5.0 milestone May 2, 2020
@AndyAyersMS AndyAyersMS self-assigned this May 2, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label May 2, 2020
@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label May 2, 2020
@EgorBo
Copy link
Member

EgorBo commented May 2, 2020

Rough guess based on some crude prototyping is around 2700 hoistable expressions that are second links in the usual FX diff set.

I guess BCL developers try to hoist such simd expressions by hands everywhere so the diff could be way higher.

Btw, is there an issue for constant folding for vectors? e,g. Sse.Add(Vector128<float>.One, Vector128<float>.One) (not sure it's profitable though)

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented May 2, 2020

I guess BCL developers try to hoist such simd expressions by hands everywhere

Perhaps? This is the kind of thing that is easy to overlook or take for granted. I haven't looked at the types of computations being hoisted.

I was curious how much hoisting was going on in general (even with current limitations) to try and so did a bit more instrumentation. There are

  • 38378 hoistable trees that look profitable
  • 19431 hoistable trees that are deemed unprofitable

So second level hoisting would increased the amount of hoistable expressions by around 5%.

The unprofitable percentage seems high and is worth deeper exploration. There are hits in some of the key BCL routines, eg (random sample from my instrumentation run):

System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000276] in Dictionary`2:FindValue(__Canon):byref:this
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000120] in Dictionary`2:FindValue(__Canon):byref:this
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000170] in Collection`1:System.Collections.ICollection.CopyTo(Array,int):this
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000170] in ReadOnlyCollection`1:System.Collections.ICollection.CopyTo(Array,int):this
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000088] in ArraySortHelper`1:InternalBinarySearch(ref,int,int,__Canon,IComparer`1):int
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000057] in ArraySortHelper`1:InternalBinarySearch(ref,int,int,ubyte,IComparer`1):int
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000064] in ArraySortHelper`1:InternalBinarySearch(ref,int,int,ubyte,IComparer`1):int

Also realized there are a few other issues to consider here:

  • the optimizer is evaluating the profitability of hoisting expression by expression. This means that if / when we do fix the dependent hoisting limitation, we may block hoisting of a profitable dependent because the hoistable expression it depends on does not look profitable on its own
  • profitability involves a register pressure computation, and evaluating profitability of expressions in isolation may mis-state their overall impact to pressure. Say we have 3 loop invariant computations A, B, and C, with A and B hoistable currently, and C dependent on both A and B. The pressure cost of hoisting A and B may be greater than the cost of hoisting A, B, and C. To "solve" this we need to know how the results of these computations are used; if the only use of A and B is in C then hoisting all 3 is cheaper; if A and B have other uses in the loop then hoisting all 3 is more costly.
  • this argues perhaps that hoisting should run as a multi-pass algorithm: for each loop first find the hoistable expressions and their inter-dependencies; then determine profitable sets of expressions to hoist and hoist them. Also track SSA use counts so we can refine cost estimates for dependent sets of expressions.
  • it looks like we only ever hoist out one loop level, so if an expression in a loop nest is invariant in all the loops, it is still executed more often then necessary (see first example below). Multi-level hosting is more complex so it would be good to assess how often this can happen before attempting any sort of serious implementation effort. You might be able to approximate this by running loop opt repeatedly (rebuilding ssa, etc in between) and watching how far expressions can migrate.
  • hoisting runs outer->inner. This seems like a bad choice for two reasons. First, it makes multi level hoisting trickier. I think multi level hoists would fall out fairly naturally if we ran hoisting from inner to outer, as invariant expressions could move out loop by loop as we work our way from inner to outer. Second, we should prioritize spending the limited "pressure budget" on improving inner loop codegen, since inner loops are likely to execute more than outer loops; as things are now, outer loop candidates are given preferential treatment.
  • simply changing hoist order to inner->outer is not sufficient. We also will need to prune away unreachable edges left behind by loop rotation (fgOptWhileLoop). In the double nest below each loop is rotated and so creates the proper conditions for one level of hoisting, but at the time we run hoisting, we still see a path in the outer loop that avoids the inner loop; this path blocks hoisting of fully invariant expressions from inner all the way to top level.

@AndyAyersMS
Copy link
Member Author

Some examples.

Computation invariant in both inner and outer loop:

[MethodImpl(MethodImplOptions.NoInlining)]
static int InvariantInBoth(int y)
{
    int r = 0;
    for (int i = 0; i < 10; i++)
    {
        for (int j = 0; j < 10; j++)
        {
            r += y * y;
        }
    }
    return r;
}

The multiply is only hoisted up one level, not two:

G_M48869_IG03:
       4533C0               xor      r8d, r8d
       448BC9               mov      r9d, ecx
       440FAFC9             imul     r9d, ecx
						;; bbWeight=4    PerfScore 10.00
G_M48869_IG04:
       4103C1               add      eax, r9d
       41FFC0               inc      r8d
       4183F80A             cmp      r8d, 10
       7CF4                 jl       SHORT G_M48869_IG04
						;; bbWeight=16    PerfScore 28.00
G_M48869_IG05:
       FFC2                 inc      edx
       83FA0A               cmp      edx, 10
       7CE3                 jl       SHORT G_M48869_IG03

Computations invariant in inner and outer loop, second one dependent on the first.

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int InvariantInBothDependent(int y)
    {
        int r = 0;
        for (int i = 0; i < 10; i++)
        {
            for (int j = 0; j < 10; j++)
            {
                int t = y * y;
                r += y * t + 1;
            }
        }
        return r;
    }

Here the independent statment is hoisted one level, the dependent statement is not hoisted.

G_M60708_IG03:
       4533C0               xor      r8d, r8d
       448BC9               mov      r9d, ecx
       440FAFC9             imul     r9d, ecx
						;; bbWeight=4    PerfScore 10.00
G_M60708_IG04:
       458BD1               mov      r10d, r9d
       440FAFD1             imul     r10d, ecx
       418D440201           lea      eax, [r10+rax+1]
       41FFC0               inc      r8d
       4183F80A             cmp      r8d, 10
       7CEB                 jl       SHORT G_M60708_IG04
						;; bbWeight=16    PerfScore 76.00
G_M60708_IG05:
       FFC2                 inc      edx
       83FA0A               cmp      edx, 10
       7CDA                 jl       SHORT G_M60708_IG03

Same invariant computation in both nested loops:

[MethodImpl(MethodImplOptions.NoInlining)]
static int InvariantInnerTwo(int y)
{
    int r = 0;
    for (int i = 0; i < 10; i++)
    {
        for (int j = 0; j < 10; j++)
        {
            r += i * y;
        }

        for (int j = 0; j < 10; j++)
        {
            r += i * y;
        }
    }
    return r;
}

Computations are hoisted, but not CSE'd:

G_M31873_IG03:
       4533C0               xor      r8d, r8d
       448BCA               mov      r9d, edx
       440FAFC9             imul     r9d, ecx
						;; bbWeight=4    PerfScore 10.00
G_M31873_IG04:
       4103C1               add      eax, r9d
       41FFC0               inc      r8d
       4183F80A             cmp      r8d, 10
       7CF4                 jl       SHORT G_M31873_IG04
						;; bbWeight=16    PerfScore 28.00
G_M31873_IG05:
       4533C0               xor      r8d, r8d
       448BCA               mov      r9d, edx
       440FAFC9             imul     r9d, ecx
						;; bbWeight=4    PerfScore 10.00
G_M31873_IG06:
       4103C1               add      eax, r9d
       41FFC0               inc      r8d
       4183F80A             cmp      r8d, 10
       7CF4                 jl       SHORT G_M31873_IG06
						;; bbWeight=16    PerfScore 28.00
G_M31873_IG07:
       FFC2                 inc      edx
       83FA0A               cmp      edx, 10
       7CCD                 jl       SHORT G_M31873_IG03

@AndyAyersMS
Copy link
Member Author

See also related issues:

@JulieLeeMSFT JulieLeeMSFT modified the milestones: 5.0, Future May 20, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall modified the milestones: Future, 6.0.0 Nov 14, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 14, 2020
@AndyAyersMS
Copy link
Member Author

Going to reassign to Bruce. Guessing we won't address this in 6.0.

@kunalspathak
Copy link
Member

Example where we don't hoist constants out of the loop - example 1 and example 2.

@am11
Copy link
Member

am11 commented Aug 19, 2021

LICM / LIH doesn't kick in for pure methods, even when invariant is self-assigned, e.g.

static int GetX()
{
    int x = 42;
    // int y = 0;
    for (int i = 0; i < 100_000; ++i)
      // y = x;
      x = x;
    // return y;
    return x;
}

@EgorBo
Copy link
Member

EgorBo commented Aug 19, 2021

@am11 do you mean you expect GetX method to be recognized as pure and all callers should hoist calls to GetX ?

@am11
Copy link
Member

am11 commented Aug 19, 2021

@am11 do you mean you expect GetX method to be recognized as pure and all callers should hoist calls to GetX ?

Inlining 42 at callsites would be an additional bonus. :)
but I was expecting that in this isolated example, the method body would first reduce to just:

GetX()
        mov     eax, 42
        ret

@EgorBo
Copy link
Member

EgorBo commented Aug 19, 2021

@am11 do you mean you expect GetX method to be recognized as pure and all callers should hoist calls to GetX ?

Inlining 42 at callsites would be an additional bonus. :)
but I was expecting that in this isolated example, the method body would first reduce to just:

GetX()
        mov     eax, 42
        ret

There is a separate issue here - we currently never remove empty loops: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunHiUGAWQAUASi492PaROncAZtAbCAlgDsMDFQF4ADAG4tAHnK6DWgNQXRHAL6yG96raA==

@am11
Copy link
Member

am11 commented Sep 4, 2021

we currently never remove empty loops

btw, we do remove empty loop if we return from it. e.g. with main (2a1eaa2):

    static int GetY(int x)
    {
        int y = 0;
        for (int i = 0; i < 100_000; ++i)
        {
            y = x;
            return y; // line 7
        }
        return y;
    }

comment out line 7, so it goes from:

G_M39264_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M39264_IG02:
       mov      eax, edi
						;; bbWeight=1    PerfScore 0.25
G_M39264_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00

; Total bytes of code 3, prolog size 0, PerfScore 1.55, instruction count 2, allocated bytes for code 3 (MethodHash=40e0669f) for method Program:GetY(int):int
; ============================================================

to

; Total bytes of code 21, prolog size 4, PerfScore 12.35, instruction count 11, allocated bytes for code 21 (MethodHash=40e0669f) for method Program:GetY(int):int

@kunalspathak
Copy link
Member

Computations are hoisted, but not CSE'd:

This is done now.

G_M28431_IG03:
       xor      r8d, r8d
       mov      r9d, edx
       imul     r9d, ecx
       align    [0 bytes for IG04]
						;; size=10 bbWeight=4    PerfScore 10.00
G_M28431_IG04:
       add      eax, r9d
       inc      r8d
       cmp      r8d, 10
       jl       SHORT G_M28431_IG04
						;; size=12 bbWeight=16    PerfScore 28.00
G_M28431_IG05:
       xor      r8d, r8d
       align    [3 bytes for IG06]
						;; size=6 bbWeight=4    PerfScore 2.00
G_M28431_IG06:
       add      eax, r9d
       inc      r8d
       cmp      r8d, 10
       jl       SHORT G_M28431_IG06
						;; size=12 bbWeight=16    PerfScore 28.00
G_M28431_IG07:
       inc      edx
       cmp      edx, 10
       jl       SHORT G_M28431_IG03
						;; size=7 bbWeight=4    PerfScore 6.00

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

7 participants