Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

JIT: null out inline gc type locals after inline body #9479

Merged
merged 1 commit into from
Feb 15, 2017

Conversation

AndyAyersMS
Copy link
Member

Inlining can sometimes end up stretching GC lifetimes for inlinee
locals past the end of the inlinee method body.

This change extends the work done for inlining methods with pinned
locals to null out all gc ref type locals. If there are any gc type
locals the return value is copied to a temp inside the inlinee body,
and then at the end of the body the gc type locals are set to null.

In most cases these null stores end up getting removed by dead code
elimination, but if the local ends up untracked, the store remains
and limits the active GC lifetime.

Closes #9218.

@AndyAyersMS AndyAyersMS added * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) area-CodeGen optimization labels Feb 10, 2017
@AndyAyersMS
Copy link
Member Author

@sivarv this is preliminary, but take a look
cc @dotnet/jit-contrib

Overall code size impact is surprisingly small, 0.02% across the jit-diff set. Most of the null stores get removed because most inlinee locals are tracked refs.

There are some good and bad RA diffs. Will post some examples subsequently.

binarytrees3 improves by about 2% in my local runs.

}

if (impInlineInfo->lclVarInfo[lclNum + impInlineInfo->argCnt].lclIsPinned)
// Look for GC refs that we will null out later on.
if (lclTyp == TYP_REF || lclTyp == TYP_BYREF)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  if (lclTyp == TYP_REF || lclTyp == TYP_BYREF) [](start = 1, length = 52)

Minor: we can replace this with varTypeIsGC()

if (!lvaTable[tmpNum].lvPinned)
// Is the local TYP_REF or TYP_BYREF?
const var_types lclTyp = lvaTable[tmpNum].lvType;
if ((lclTyp != TYP_REF) && (lclTyp != TYP_BYREF))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Here too varTypeIsGC() can be used.

@sivarv
Copy link
Member

sivarv commented Feb 10, 2017

Code changes look good to me.

Please run SPMI diffs on desktop as well.

Few questions

  1. Is BinaryTrees benchmark in parity with Legacy JIT64 after this change? May be not? when I measured on my machine the difference was close to 15%. With this change, we will be confining the gc-ness of in-linee local of bottomUpTree() but ChildTreeNodes() is still not in-lined. In-lining ChildTreeNodes() would have the added benefit of not only avoiding call overhead but also avoiding write-barrier to update its out parameter.

  2. Particularly, are there any other CqPerf benchmarks that has benefited from this change other than BinaryTrees?

@JosephTremoulet
Copy link

@dotnet-bot test Windows_NT perf

@AndyAyersMS
Copy link
Member Author

@JosephTremoulet any idea how long that takes to run? Also whose credentials does it show up under?

@JosephTremoulet
Copy link

@AndyAyersMS looks like it didn't actually trigger -- it should show up in the inline CI status list here. Maybe if you say the magic words? It worked for me in #9169

@JosephTremoulet
Copy link

(and in that case it showed up under my credentials after the CI task completed)

@AndyAyersMS
Copy link
Member Author

@dotnet-bot test Windows_NT perf

@AndyAyersMS
Copy link
Member Author

Looks like perf job triggering may be broken. @DrewScoggins any ideas?

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Feb 10, 2017

(Edit: commented on the wrong PR, leaving relevant x-ref bits behind)

It would be really nice if jit-diff would also report how many methods remained unchaged; I'm curious what % of methods this change impacts.

@AndyAyersMS
Copy link
Member Author

@sivarv locally I measure about a 20% improvement on binarytrees3 (765 -> 591) but there is a lot of run to run variance.

@DrewScoggins
Copy link
Member

Sorry, this is my fault. We recently added x86 runs and when I did I changed the CI phrases to include the architecture. So you now need Windows_NT_{arch}. I want to keep this so you can launch just the jobs you want, but I will see if I can add back the old trigger and have it trigger both x86 and x64.

@DrewScoggins
Copy link
Member

test Windows_NT_x64 perf
test Windows_NT_x86 perf

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Feb 10, 2017

Ah, the perf tests entries have shown up (edit, see Drew fixed this -- thanks!)

@AndyAyersMS
Copy link
Member Author

Some unfortunate diffs, mostly in cases where the null stores get optimized away.
Likely because we are now using the return value spill temp more often.

;; System.DateTimeOffset:.ctor(struct):this

;; Before

       mov      rcx, gword ptr [rbx+8]
       test     rcx, rcx
       jne      SHORT G_M55312_IG03
       mov      rcx, rbx
       call     CachedData:CreateLocal():ref:this
       mov      rcx, rax

;; After
;; must save/restore rbp -- only appearances below

       mov      rbp, gword ptr [rbx+8]
       test     rbp, rbp
       jne      SHORT G_M55312_IG03
       mov      rcx, rbx
       call     CachedData:CreateLocal():ref:this
       mov      rbp, rax
G_M55312_IG03:
       mov      rcx, rbp

@AndyAyersMS
Copy link
Member Author

Looks like the perf job sees 11% or so for BinaryTrees3. Nothing else moves much; csc/dataflow claims 4% regression but this is within the noise level and the instruction retired stayed flat.

Similar but slightly smaller win on x86.

@AndyAyersMS
Copy link
Member Author

Desktop SPMI: 12200 method impacted, 103768 bytes net size increase; 0.8% net size increase for the impacted methods. IIRC SPMI has a base set of nearly 2M methods, so about 1 in 200 methods is impacted.

@AndyAyersMS
Copy link
Member Author

Looks like this has a bad interaction with the inline implicit tail call work. If an inlinee introduces an implicit tail call then we don't want to place nulling stores after the inlinee body, since the subsequent tail call transformation only expects to see side-effect free operations there. Odd though that this only caused failures on non-windows tests.

Seems sensible enough to suppress nulling inlinee locals if the inlinee is in tail position, since presumably control is about to leave the frame anyways, so the lifetime trimming from nulling locals in such cases is minimal at best.

@briansull
Copy link

LGTM

@AndyAyersMS
Copy link
Member Author

Ah, windows crossgen also failing, but we don't run this CHECKED so we don't see it.

@AndyAyersMS
Copy link
Member Author

The assert firing off is:

Assertion failed '(popStmt->gtStmtExpr->gtFlags & GTF_ALL_EFFECT) == 0' in 'System.Internal:WinRT_IReadOnlyList(ref,ref,ref)' (IL size 38)
c:\repos\coreclr\src\jit\morph.cpp Line: 7946

This is paranoia in tail call expansion, trying to ensure that any post-call operation can be safely deleted without removing important work, since we only expect to see pop and nop there in the IL stream.

The root cause of this assert is actually bad flag maintenance when updating return value expression placeholders.

Return value placeholders are always given the GTF_CALL side effect (see gtNewInlineCandidateReturnExpr) because there needs to be some side effect on the return value placeholder to ensure a subsequent pop doesn't just throw the return value away, since the real return value expression (as of yet unknown) may well have a side effect.

If the inline fails, the call will be swapped in for the placeholder and the tree flags will at least be somewhat correct.

But if the inline succeeds and the post-inline return value expression doesn't have any side effect (say because it is a spilled return value temp) then the tree flags are stale. And if the inline introduces a call in tail position then the stale flags cause the above assert to fire, since there is now a tail call followed by an apparently side-effecting operation.

This is a latent issue that was introduced by allowing inlinee call sites to be recognized as tail calls, exposed now because this change makes the use of the return value spill temp much more common. It could happen even without this change (say by inlining a tail call to a method with pinned locals that itself has a non-inlined tail call) but evidently such cases are quite rare.

The behavior of the retail jit is fine, since it is actually OK to remove the post-call operation; the bug is that its side effects are overstated.

Ideally we'd get the side effects fully and properly updated once the return value placeholder is removed. Looking at how to do that now.

Inlining can sometimes end up stretching GC lifetimes for inlinee
locals past the end of the inlinee method body.

This change extends the work done for inlining methods with pinned
locals to null out all gc ref type locals. If there are any gc type
locals, the return value is copied to a temp inside the inlinee body,
and then at the end of the body the gc type locals are set to null.

In most cases these null stores end up getting removed by dead code
elimination, but if the local ends up untracked, the store remains
and limits the active GC lifetime.

This change requires more extensive use of the return value spill temp
(since we must be absolutely sure the return value expression does not
depend on any of the locals). We now use the spill temp if any local is
a gc ref type (used or not).

More widespread use of the spill temp exposed an issue in the tail call
transformation, where the side effect flags were overly pessimistic for
a post tail-call GT_COMMA statement that represented an ignored return
value from an inlined call. The root issue is that the return value
placeholder is given a GTF_CALL effect but later when the actual return
value expression is substituted in place there may be no side effect --
in particular this happens when the return value expression is the spill
temp.

So, we also update the post tail-call side-effect-free statment check to
look through the potentially pessimistic GT_COMMA flags and check for side
effects on the child nodes.

Closes #9218.
@AndyAyersMS
Copy link
Member Author

Revised this a bit:

  • look for GC ref locals up front rather than as they're used, and force use of the return spill temp early. I was worried the earlier approach might see a return before the first use of a gc local and potentially drop return values.
  • when nulling, account for all GC type locals, then screen out ones that aren't used
  • deal with pessimistic stmt flags in tail call opt.

Local testing looks good; going to run this through desktop as well.

@AndyAyersMS
Copy link
Member Author

Running some jit stress legs to try out variant inlining patterns...

@dotnet-bot test Windows_NT jitstress2
@dotnet-bot test Ubuntu jitstress2

@AndyAyersMS AndyAyersMS removed the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Feb 14, 2017
@AndyAyersMS
Copy link
Member Author

@dotnet-bot test Windows_NT_x64 perf

@AndyAyersMS
Copy link
Member Author

@sivarv think this is ready to go....

// Notes:
// If the call we're inlining is in tail position then
// we skip nulling the locals, since it can interfere
// with tail calls introduced by the local.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo? "by the call" instead of "by the local".

return;
}

JITDUMP("Unpin inlinee locals:\n");
if (inlineInfo->iciCall->IsImplicitTailCall())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there a possibility that we create a spillTemp for return value if the inlinee method has gc-ref locals but later don't append null assignments because the inlinee happens to be an implicit tail call?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can that lead to degraded CQ because of spillTemp for return value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we won't insert null assignments after a potential tail call, and it's intentional.

Since we're upstream of the tail call transform, we can't put the null assignments after since they will cause asserts during the tail call transformation.

And even if the tail call transformation doesn't happen, the method is going to return immediately anyways after the call (it is a call in tail position after all), so there's no real benefit to the null assignments, since the frame is about to be torn down.

There are some small CQ issues here and there with register shuffles; I mentioned one up thread. I'll open an issue and point out a few others.

if (inlineInfo->iciCall->IsImplicitTailCall())
{
JITDUMP("fgInlineAppendStatements: implicit tail call; skipping nulling.\n");
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A call marked as implicit tail call doesn't guarantee that it will be honored. fgMorphCall() makes further checks and could ignore implicit-tail-call flag if any of those checks fail. The fact now the same condition is used for appending un-pinning and nulling out gc-ref locals, there is a possibility that we don't append unpin stmts in some cases?

I am trying to understand is this a conservative check? If yes, can there be cases JIT misses to append unpins?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above it is intentional to skip this case; even if we could put assignments there the method is about to return.

@AndyAyersMS
Copy link
Member Author

Desktop SPMI now shows a size increase of 101905 bytes over 11998 functions, down a bit from the numbers above, presumably because we now skip some tail call cases.

@sivarv
Copy link
Member

sivarv commented Feb 15, 2017

Looks good to me.

@AndyAyersMS AndyAyersMS merged commit 995b869 into dotnet:master Feb 15, 2017
@AndyAyersMS
Copy link
Member Author

See #9612 for a few examples where we generate unnecessary extra code.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants