-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: optimizations for multi-use boxes #9118
Comments
Considering for 2.1. |
Yes. Can be removed by proper value numbering of I think that the subsequent unbox can be removed too - we need a proper VN for But that still leaves us with the box: G_M33790_IG01:
4883EC28 sub rsp, 40
G_M33790_IG02:
48B968D0F2DEFC7F0000 mov rcx, 0x7FFCDEF2D068
E83DAF805F call CORINFO_HELP_NEWSFAST
C7400804000000 mov dword ptr [rax+8], 4
83780803 cmp dword ptr [rax+8], 3
0F94C0 sete al
0FB6C0 movzx rax, al
85C0 test eax, eax
750A jne SHORT G_M33790_IG04
B864000000 mov eax, 100
G_M33790_IG03:
4883C428 add rsp, 40
C3 ret
G_M33790_IG04:
33C0 xor eax, eax
G_M33790_IG05:
4883C428 add rsp, 40
C3 ret It would be nice if the optimizer could handle this but I don't think there's any mechanism in the JIT that would allow the elimination of useless stores. So even with the fixes I mentioned above it would still be preferable to get rid of the box sooner. Another related example: struct RGB { public byte r, g, b; }
static int Main() => Test<RGB>() ? 42 : 0;
static bool Test<T>() => return default(T) == null;
G_M1515_IG01:
57 push rdi
56 push rsi
4883EC28 sub rsp, 40
G_M1515_IG02:
33F6 xor esi, esi
33FF xor edi, edi
48B9805AFC7FFC7F0000 mov rcx, 0x7FFC7FFC5A80
E827AF845F call CORINFO_HELP_NEWSFAST
488D5008 lea rdx, bword ptr [rax+8]
408832 mov byte ptr [rdx], sil
40887A01 mov byte ptr [rdx+1], dil
40887A02 mov byte ptr [rdx+2], dil
4885C0 test rax, rax
0F94C0 sete al
0FB6C0 movzx rax, al
G_M1515_IG03:
4883C428 add rsp, 40
5E pop rsi
5F pop rdi
C3 ret The null check can be eliminated but we're again stuck with the box. And this time it's completely useless, the code doesn't even attempt to read from the allocated memory. It's not clear to me why the importer is so eager to expand |
Expansion of some runtime constructs like Hence the wrapper idea from #9056 where we try and have our cake and eat it too. We at least need to do enough sanity checking in importation that we can spot the cases where inlining will fail. |
Looked a bit at the reference counting idea. For instance Having IR cross links is obviously questionable but we already have these in At any rate, assuming we can somehow track how many uses there are, |
In the RGB example the jit doesn't try optimizing the box early as it is in a return tree. The importer currently only calls We could presumably invoke the expression folder more often during importation and catch more cases. I suppose the question then is "how much more often" -- likely whether the size reduction wins counteract the extra cost of searching for foldable patterns. I'll play around with this and see how likely more eager folding is to lead to changes. We also try to fold the box tree again during morph but the struct assignment to copy the box payload has been expanded into an exception-flagged comma laden tree that the box cleanup utility does not expect, so it bails out.
In this case the exception flags on the morphed copy tree are bogus as we're copying from a local struct to the box payload. The copy tree doesn't have an exception flag initially but |
Looked at the flags issue a bit without gaining much insight. The GTF_EXCEPT flag is flipped on for GT_BLK during morph regardless of whether it was set before. So any upstream knowledge about potentially safe memory operations is seemingly just tossed out. And if we don't set the flag then |
Aha! So the trick I used for static bool Test<T>() => default(T) == null ? true : false; generates G_M8956_IG02:
33C0 xor eax, eax
G_M8956_IG03:
C3 ret Funny. |
Or in this static bool Test<T>(object value) => (value is T) || (value == null && (default(T) == null ? true : false)); generates G_M8956_IG02:
488BC1 mov rax, rcx
4885C0 test rax, rax
7411 je SHORT G_M8956_IG03
48BA585CC5F0FE7F0000 mov rdx, 0x7FFEF0C55C58
483910 cmp qword ptr [rax], rdx
7402 je SHORT G_M8956_IG03
33C0 xor rax, rax
G_M8956_IG03:
4885C0 test rax, rax
750B jne SHORT G_M8956_IG07
4885C9 test rcx, rcx
7503 jne SHORT G_M8956_IG05
33C0 xor eax, eax
G_M8956_IG04:
C3 ret
G_M8956_IG05:
33C0 xor eax, eax
G_M8956_IG06:
C3 ret
G_M8956_IG07:
B801000000 mov eax, 1
G_M8956_IG08:
C3 ret While not ideal at least it no longer does boxing. |
Seems like importer folding at relational ops is pretty cheap and effective. Will send a PR. |
Eagerly fold expression trees for non-branch conditional operations. Leads to elimination of boxes in some idiomatic uses. See notes and examples in #14472.
Think the more general opt needs more conceptual bake time, so moving out of 2.1. |
Eagerly fold expression trees for non-branch conditional operations. Leads to elimination of boxes in some idiomatic uses. See notes and examples in #14472.
Though I like this proposal I'm still not quite sure how to handle the case where there are multiple uses and the uses have different demands on the box tree. Say for some uses are type tests (and so just want the type handle) and some others are feeding an unboxed entry point (and so just want a local copy of the payload) and some others feed a null test. Say we refcount and there are N uses. Until we understand all those use contexts we can't decide what shape the box tree should take on. So in addition to the reference count we need to track the kinds of references (basically the Also since optimizations like the |
@AndyAyersMS Is it possible to also recognize + optimize box/unbox sequence? It seems this IL sequence is common in F#. |
Adjacent If there is IL in between or a |
Looks like a longshot for .NET 9 now too, so moving to future. |
Enable object stack allocation for ref classes and extend the support to include boxed value classes. Use a specialized unbox helper for stack allocated boxes, both to avoid apparent escape of the box by the helper, and to ensure all box field accesses are visible to the JIT. Update the local address visitor to rewrite trees involving address of stack allocated boxes in some cases to avoid address exposure. Disable old promotion for stack allocated boxes (since we have no field handles) and allow physical promotion to enregister the box method table and/or payload as appropriate. In OSR methods handle the fact that the stack allocation may actually have been a heap allocation by the Tier0 method. The analysis TP cost is around 0.4-0.7% (notes below). Boxes are much less likely to escape than ref classes (roughly ~90% of boxes escape, ~99.8% of ref classes escape). Codegen impact is diminished somewhat because many of the boxes are dead and were already getting optimized away. Fixes #4584, #9118, #10195, #11192, #53585, #58554, #85570 --------- Co-authored-by: Jakob Botsch Nielsen <[email protected]> Co-authored-by: Jan Kotas <[email protected]>
Fixed by #103361 |
A fairly common pattern (especially after inlining) is to see a box that feeds an
isinst
and if that succeeds, anunbox.any
. For example:We get pretty far when optimizing
Main
here -- we can devirtualize the call toEquals
, inline it and remove the null checks since we have a value type, then inline the inner call toEquals
. But along the way we have to boxy
and the innerEquals
has the following IL:With the advent of dotnet/coreclr#14420 the jit will now optimize away the
isinst
, but the box cleanup opts forunbox.any
don't fire because there is usually a temp in the way, and so we generate the following code forMain
:If when optimizing a successful cast we copy the result to a new more strongly typed temp (see #9117) we might be able to optimize away the type equality check in the downstream
unbox.any
(see dotnet/coreclr#14473). And perhaps if we are lucky and the box is simple we might be able to propagate the value to be boxed through the box/unbox to the ultimate use, and so not need the unbox. But the box would remain as it is difficult to remove unless it is known to be dead and whatever transformation makes it dead explicitly cleans it up.A couple of ways we could approach this:
BOX
is just an expression "wrapper" in the spirit of JIT: some ideas on high-level representation of runtime operations in IR #9056. So we could allow the inliner to giveBOX(y)
the same treatment asy
and duplicate it within the inlinee body (essentially, generalize the logic inimpInlineFetchArg
that begins withelse if (argInfo.argIsLclVar && !argCanBeModified)
to also apply toBOX(y)
). If we added suitable "reference counting" to boxes to track the duplicates then optimizing away the last use of the box could trigger the box cleanup. We have this today but the reference count is implicit and always = 1 since we don't duplicate the boxed values.If all this kicked in, the code for Main above would collapse to simply returning a constant.
category:cq
theme:importer
skill-level:expert
cost:medium
The text was updated successfully, but these errors were encountered: