-
Notifications
You must be signed in to change notification settings - Fork 2.7k
JIT: Fix value type box optimization #13016
Conversation
@erozenfeld @JosephTremoulet PTAL This is still somewhat preliminary, curious if anyone knows a better way to extract just the source side effects from an assignment. If so I am happy to modify this. Also I have a number of test cases to add. Code size impact on jit-diffs FX is a small net win, but there are some sizable regressions. I haven't looked at them all but the biggest ones come from loop cloning kicking in because we now preserve ldelems which were being lost in methods like Interestingly enough we got "lucky" before with these methods because the ldelem was followed by a ldelema with the same array and index, so any side effects lost by discarding the former were recreated by the latter. So the bug was masked. And I got "unlucky" with loop cloning on these methods because it's unable to spot opportunities from ldelema and so didn't clone these loops before. Was hoping to use GT_NULLCHECK for the residual bit of copy (which would have forestalled loop cloning) but that can't handle side effecting operands. So am just loading a scalar or byte depending on the type of the value being boxed. The pattern match tries to conservative and bail if it doesn't recognize the IR shapes. Currently it bails out on 8 cases in the jit-diff set, all ones where the copy statement assign is embedded in a comma. |
case GT_EQ: | ||
case GT_NE: | ||
case GT_GT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where's the check for GTF_UNSIGNED?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: need to account for operand order...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't get CSC to emit the ldnull first, but it could happen in IL. And we shouldn't see signed compares but to be on the safe side I'll add a bail out for that too.
// For struct types read the first byte of the | ||
// source struct; there's no need to read the | ||
// entire thing, and no place to put it. | ||
assert(copySrc->gtOper == GT_OBJ || copySrc->gtOper == GT_IND || copySrc->gtOper == GT_FIELD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you take the logic you're using for scalars and apply it to structs as well? Does it produce something incorrect, or just inefficient? And if inefficient, is it common?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that doing this on x86 will have a negative impact on register allocation, we need to generate something like mov reg, [mem]
and reg
has to be a "byteable" register.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably we are just reading and throwing away the result here, and
usually EAX or ECX will be available, so it shouldn't be much of an issue for x86.
If AsmDiffs do show some issue, then you coudl read 4 bytes when the struct is at least 4 bytes in size.
That allows ESI, EDI or EBP to be used, however typically those registers are much more likely to be used to
enregister LclVars than EAX or ECX.
Thus I doubt that there will be a case where we we lose because of this byte sized read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as treating structs like scalars goes -- because we're removing the destination of the copy entirely (since it was the box payload in the newobj which we've zapped) we either need to suppress/alter the copy to something the jit can handle as a standalone source, or need to add a suitable destination to ensure the copy ends up somewhere.
GT_OBJ can't be at the root of a tree, there is code in morph (and possibly elsewhere) that assumes it has a parent. So for that case at least we'd need to allocate a new temp to be the LHS. Not sure about GT_IND and GT_FIELD cases -- perhaps if they are at the root of a tree the jit will be able to create the right kind of temp to hold the result.
Copying the entire struct just to cause the side effects from address formation and null checking is indeed inefficient, though as @mikedn notes this evidently is what JIT64 does.
The struct case is fairly common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am going to change the struct case to read a 32 bit unsigned value, so we end up with slightly smaller encodings on x86/x64; should also mitigate any worries about byteable dests.
458B11 mov r10d, dword ptr [r9]
instead of
4D0FBE11 movsx r10, byte ptr [r9]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still will have to handle the smaller struct case (such as a three byte struct) differently. Since you are not allowed to read beyond the end of the struct (as that may be an unmapped page)
Arm64 failure |
Fixed the arm64 issue, GT_RET_VAL can't stand on its own either, as there is some magic parent-aware stuff that happens for successful inlines. Going back to just reading one byte as doing the extra analysis to figure out if we could safely read more didn't seem worth the trouble. |
I think this Tizen failure is unrelated, as the test case doesn't appear to box anything. Will retry... @dotnet-bot retest Tizen armel Cross Debug Build |
Got a different Tizen failure this time. Leg seems flaky in some other recent runs too. Will give it one more shot... @dotnet-bot retest Tizen armel Cross Debug Buil |
MIssed a letter. @dotnet-bot retest Tizen armel Cross Debug Build |
@dotnet/jit-contrib this is about as pretty as it's going to get, though it still reads too much from some struct types. For instance, given public static bool Test(X<K> a)
{
return (a.k != null);
} with 488D4108 lea rax, bword ptr [rcx+8] ;; &a.k
C4E17A104008 vmovss xmm0, dword ptr [rax+8]
C4E17B1008 vmovsd xmm1, qword ptr [rax]
C4E170C6C844 vshufps xmm1, xmm0, 68 ;; assemble the vector
B801000000 mov eax, 1
C3 ret Without this fix we lose the potential null deref entirely: ;; code from current jit
50 push rax
B801000000 mov eax, 1
4883C408 add rsp, 8
C3 ret I could look again at trying to make this part work more like a null check... Current jit-diff results:
As noted above, the size regressions in the various |
What causes these improvements?! |
Thanks for adding the regression tests. |
Moving the copy to an earlier statement apparently has some advantages when the box is part of some bigger computation, in particular if the box feeds a call with a lot of other arguments. For example (distilled from the Roslyn code): using System;
using System.Collections.Generic;
public struct Av
{
public readonly int Q;
public Av(int q)
{
Q = q;
}
}
public struct V
{
public readonly int Token;
public readonly Av A;
public V(int t, Av a)
{
Token = t;
A = a;
}
}
public class F : Dictionary<string, List<V>>
{
public void Add(string s, int t, Av a, object b)
{
List<V> values;
if (!TryGetValue(s, out values))
{
values = new List<V>();
Add(s, values);
}
values.Add(new V(t, a));
}
}
class M
{
private static readonly F s_F = new F()
{
{"abc", 33, new Av(11), new Av(12) } // calls Add and boxes the last arg
};
public static int Main()
{
List<V> v = s_F["abc"];
return v[0].Token + 67;
}
} The code for M::.cctor is much better with my changes:
Haven't yet drilled into exactly what causes the current version to generate so much code. That being said I probably need to ensure that the value being boxed can't be modified by some part of the tree that contains the box. I can probably rework the above to illustrate what can go wrong and put the appropriate spill check into the importer. |
Yeah, we need to spill. About to push an update to the code and a new test case. |
Jit-diffs with spilling, a few more methods impacted, now a tiny size increase overall.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Am going to squash locally, rebase, and force push, so one more "update". |
Boxing a value type produces a non-null result. If the result of the box is only used to feed a compare against null, the jit tries to optimize the box away entirely since the result of the comparison is known. Such idiomatic expressions arise fairly often in generics instantiated over value types. In the current implementation the box expands into two parts. The first is an upstream statement to allocate a boxed object and assign a reference to the boxed object to a local var known as the "box temp". The second is an expression tree whose value is the box temp that also contains an an encapsulated copy from the value being boxed to the payload section of the boxed object. The box node also contains a pointer back to the first statement (more on this later). In the examples being discussed here this second tree is a child of a compare node whose other child is a null pointer. When the optimization fires, the upstream allocation statement is located via the pointer in the box node and removed, and the entire compare is replaced with a constant 0 or 1 as appropriate. Unfortunately the encapsulated copy in the box subtree may include side effects that should be preserved, and so this transformation is unsafe. Note that the copy subtree as a whole will always contain side effects, since the copy is storing values into the heap, and that copy now will not happen. But the side effects that happen when producing the value to box must remain. In the initial example from #12949 the side effects in question were introduced by the jit's optimizer to capure a CSE definition. dotnet#13016 gives several other examples where the side effects are present in the initial user code. For instance the value being boxed might come from an array, in which case the encapsulated copy in the box expression tree would contain the array null check and bounds check. So removing the entire tree can alter behavior. This fix attempts to carefully preserve the important side effects by reworking how a box is imported. The copy is now moved out from under the box into a second upstream statement. The box itself is then just a trivial side-effect-free reference to the box temp. To ensure proper ordering of side effects the jit spills the evaluation stack before appending the copy statement. When the optimization fires the jit removes the upstream heap allocation as before, as well as the now-trivial compare tree. It analyzes the source side of the upstream copy. If it is side effect free, the copy is removed entirely. If not, the jit modifies the copy into a minimal load of the boxed value, and this load should reproduce the necessary side effects. The optimization is only performed when the tree shape of the copy matches expected patterns. There are some expected cases where the tree won't match, for instance if the optimization is invoked while the jit is inlining. Because this optimization runs at several points the jit can catch these cases once inlining completes. There is one case that is not handled that could be -- if the assignment part of the copy is itself a subtree of a comma. This doesn't happen often. The optimization is now also extended to handle the case where the comparision operation is `cgt.un`. This doesn't catch any new cases but causes the optimization to happen earlier, typically during importation, which should reduce jit time slightly. Generally the split of the box into two upstream statements reduces code size, especially when the box expression is incorporated into a larger tree -- for example a call. However in some cases where the value being boxed comes from an array, preserving the array bounds check now causes loop cloning to kick in and increase code size. Hence the overall size impact on the jit-diff set is essentially zero. Added a number of new test cases showing the variety of situations that must be handled and the need to spill before appending the copy statement. Fixes #12949.
Looked at desktop SPMI diffs (and ran full DDRs, which passed). Diffs are net favorable, but with some regressions. Sampled some of these and it looks like some missed CSEs of array bounds checks and index computations. Tizen failure almost certainly unrelated, but will give it one more shot. @dotnet-bot test Tizen armel Cross Debug Build |
Picks up dotnet/coreclr#13016.
Picks up dotnet/coreclr#13016.
Boxing a value type produces a non-null result. If the result of the box is only used to feed a compare against null, the jit tries to optimize the box away entirely since the result of the comparison is known. Such idiomatic expressions arise fairly often in generics instantiated over value types. In the current implementation the box expands into two parts. The first is an upstream statement to allocate a boxed object and assign a reference to the boxed object to a local var known as the "box temp". The second is an expression tree whose value is the box temp that also contains an an encapsulated copy from the value being boxed to the payload section of the boxed object. The box node also contains a pointer back to the first statement (more on this later). In the examples being discussed here this second tree is a child of a compare node whose other child is a null pointer. When the optimization fires, the upstream allocation statement is located via the pointer in the box node and removed, and the entire compare is replaced with a constant 0 or 1 as appropriate. Unfortunately the encapsulated copy in the box subtree may include side effects that should be preserved, and so this transformation is unsafe. Note that the copy subtree as a whole will always contain side effects, since the copy is storing values into the heap, and that copy now will not happen. But the side effects that happen when producing the value to box must remain. In the initial example from #12949 the side effects in question were introduced by the jit's optimizer to capure a CSE definition. #13016 gives several other examples where the side effects are present in the initial user code. For instance the value being boxed might come from an array, in which case the encapsulated copy in the box expression tree would contain the array null check and bounds check. So removing the entire tree can alter behavior. This fix attempts to carefully preserve the important side effects by reworking how a box is imported. The copy is now moved out from under the box into a second upstream statement. The box itself is then just a trivial side-effect-free reference to the box temp. To ensure proper ordering of side effects the jit spills the evaluation stack before appending the copy statement. When the optimization fires the jit removes the upstream heap allocation as before, as well as the now-trivial compare tree. It analyzes the source side of the upstream copy. If it is side effect free, the copy is removed entirely. If not, the jit modifies the copy into a minimal load of the boxed value, and this load should reproduce the necessary side effects. The optimization is only performed when the tree shape of the copy matches expected patterns. There are some expected cases where the tree won't match, for instance if the optimization is invoked while the jit is inlining. Because this optimization runs at several points the jit can catch these cases once inlining completes. There is one case that is not handled that could be -- if the assignment part of the copy is itself a subtree of a comma. This doesn't happen often. The optimization is now also extended to handle the case where the comparision operation is `cgt.un`. This doesn't catch any new cases but causes the optimization to happen earlier, typically during importation, which should reduce jit time slightly. Generally the split of the box into two upstream statements reduces code size, especially when the box expression is incorporated into a larger tree -- for example a call. However in some cases where the value being boxed comes from an array, preserving the array bounds check now causes loop cloning to kick in and increase code size. Hence the overall size impact on the jit-diff set is essentially zero. Added a number of new test cases showing the variety of situations that must be handled and the need to spill before appending the copy statement. Fixes #12949.
Boxing a value type produces a non-null result. If the result of the
box is only used to feed a compare against null, the jit tries to
optimize the box away entirely since the result of the comparison is
known. Such idiomatic expressions arise fairly often in generics
instantiated over value types.
In the current implementation the box expands into two parts: an
upstream statement to allocate heap space, and then an expression
tree containing an encapsulated copy from the value being boxed to
the payload of the new heap object. Wrapping around that is a
reference to the new object, which is the result of the box.
When the optimization fires, the upstream allocation is removed,
and the current implementation also removes the entire box expression
tree. Howver this tree can include important side effects from
the evaluation of the value being boxed that must be preserved.
For instance the value might come from an array, in which case
and the box expression tree would contain the array null check
and bounds check. So removing the entire tree can alter behavior.
This fix attempts to carefull preserve the important side
effects by moving the copy into a second statement upstream from the
box. The box itself is then just a trivial side-effect-free reference
to the box temp local.
When the optimization fires the jit removes the upstream heap allocation
as before, as well as the now-trivial box tree. It analyzes the source
side of the upstream copy. If it is side effect free the copy is removed
entirely. If not, the jit modifies the copy into a minimal load of the
boxed value, and this load should reproduce the necessary side effects.
Fixes #12949.