-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bad x64 codegen in VerifyUnifierConsistency #8606
Comments
I'll take a look. |
Looks your diagnosis is spot-on. JIT is using G_M65020_IG12:
...
4C8B4610 mov r8, gword ptr [rsi+16] ;; get _entries from this
453B7008 cmp r14d, dword ptr [r8+8] ;; bounds check walk1
0F835F010000 jae G_M65020_IG19
4D8D44F810 lea r8, bword ptr [r8+8*rdi+16] ;; get address of _entries[walk1]...??? |
And it looks like the cast from |
Also a regression from 1.1. Note array bounds check and access will generally need two different registers, because this is an array of structs, so the physical index is the logical index * 24. 1.1 has *8 in the addressing mode and *3 earlier. 2.0 seems to lose the *3 somehow, presumably that is what should have been in G_M65020_IG12:
...
4863D5 movsxd rdx, ebp
4C8D3C52 lea r15, [rdx+2*rdx]
...
4C8B4610 mov r8, gword ptr [rsi+16]
413B6808 cmp ebp, dword ptr [r8+8]
0F8375010000 jae G_M65020_IG20
4F8D44F810 lea r8, bword ptr [r8+8*r15+16] Will keep digging... seems like maybe an optimizer bug at this point. cc @dotnet/jit-contrib |
Best guess so far is that
Most of the other cases in this method check for side effects and defer; seems like we should verify the box arg is side-effect free here too. |
Because key is a value type, the assert below is trivially true. But it also generates a CSE used in the subsequent statement. Should be easy enough to create a small repro... Debug.Assert(_entries[walk1]._key != null);
int hashCode = _entries[walk1]._key.GetHashCode(); |
Somewhat simpler repro. Codegen bug is in using System;
public struct S<K>
{
public int x;
public int y;
public K val;
}
public class X<K,V>
{
public X(K k)
{
a = new S<K>[2];
a[1].val = k;
a[1].x = 3;
a[1].y = 4;
}
public void Assert(bool b)
{
if (!b) throw new Exception("bad!");
}
public int Test()
{
int r = 0;
for (int i = 0; i < a.Length; i++)
{
Assert(a[i].val != null);
r += a[i].val.GetHashCode();
}
return r;
}
S<K>[] a;
}
class B
{
public static int Main()
{
var a = new X<int, string>(11);
int z = a.Test();
return (z == 11 ? 100 : 0);
}
} |
The underlying assume-no-side-effects weakness in Not clear yet if we can expose this bug in older jits. It is showing up in 2.0 because of more aggressive CSEs. |
Adding a side effect check fixes the bad codegen, but is overly pessimistic. Likely need to rework the box code in the importer to forcibly move possible side effects out from under the box. |
Or maybe not, the optimization done in This argues for drilling deeper into the box operand during |
Simpler repro that does not require CSE. Here the call to using System;
using System.Runtime.CompilerServices;
public class X<K>
{
public X(K k1)
{
k = k1;
}
[MethodImpl(MethodImplOptions.NoInlining)]
public K Get()
{
Console.WriteLine("Get called");
count++;
return k;
}
public bool Test()
{
bool x = Get() != null;
bool y = (count == 1);
return x && y;
}
K k;
int count;
}
class B
{
public static int Main()
{
var a = new X<int>(11);
bool result = a.Test();
Console.WriteLine("Passed: {0}", result);
return result ? 100 : 0;
}
} |
One more example, this time we lose an exception-causing dereference. using System;
using System.Runtime.CompilerServices;
public class X<K>
{
public X(K k1)
{
k = k1;
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static bool Test(X<K> a)
{
return (a.k != null);
}
public K k;
}
class B
{
public static int Main()
{
X<int> a = null;
bool result = false;
try
{
X<int>.Test(a);
}
catch (Exception)
{
result = true;
}
Console.WriteLine("Passed: {0}", result);
return result ? 100 : 0;
}
} |
Still searching for the right fix. Couple more notes.
|
Hmm, when I added the GT.UN->EQ/NE transform I considered doing it right in the importer but I ended up adding it in morph because that's where most of such transforms are done. It looks like doing this in the importer might actually be a good idea, after all it just compensates an IL deficiency (lack of |
fgMorphRecognizeBoxNullable is a similar optimization although there we have a helper call instead of GT_BOX. I recently fixed it to bail out when called after morphing the args and also to be called on gt.un: |
Is it worth having a different form of box IR operator? I'm not following whether "boxTemp" is holding a copy of the struct (and the reason we introduce boxTemps is that our BOX operation needs an lvalue?), or whether "boxTemp" is holding a pointer to the box on the heap (in which case I'm a little confused what the assign is doing..), but presumably if it's the first case we could have e.g. a BoxValue operator that just takes the value and implicitly needs a temp (which could be manifest at morph/lower/wherever), and in the latter case we could have e.g. a NewBox operator that again just takes the value but carries an implied heap allocation. |
The box temp is a reference to the heap instance and is the result of the box. The general sequence for the inline box is:
The Am wondering now if we also have to consider possible side effects from the newobj, eg if this triggers class init. Will try and work up a repro. |
IR example:
|
Ok, that makes sense, but then what is the point of the |
Can't really defend the current approach, just explaining what it is. The Box points back at the earlier assign statement and this link is used in So yeah the Box flags this node as an optimization opportunity and tracks some of the context needed to pull it off. Not sure if exposing the parts and normal deadcoding could do the same. Perhaps encapsulation could reveal the parts only when we know they're needed, but that starts taking us outside the realm of a suitable fix for a 2.0 regression. |
I don't think we have the facility to remove heap allocations and stores, no.
That's what I was trying to suggest, though I'm not sure how problematic the phase ordering issues would be (can't lower the copy until it gets revealed, don't want to reveal it until after having all chances to remove the box)
Ah, I'd missed that bit of context. Yes, I agree.
That suggests we might be able to expand the pattern match -- see what the LHS of the assign is, then if the argument to the box is a comma of an assign that stores to an indir off the LHS and then returns the LHS, then only extract side-effects from the arguments to the newobj and struct-assign RHS; otherwise, extract side-effects from the box itself? |
We should be able to check the side effect flags on any node that we are considering for the optimization. If we see an unsafe flag set, such as GTF_ASG then we can decline to do the optimization. |
@briansull sure, the issue is (quoting Andy's example above) that if the optimization is removing the box at Yes it would be correct to just go conservative, but per above,
|
Indeed -- in that example (which is taken from the latest repro above) we must keep just this bit of the tree under the BOX:
Otherwise we lose the NPE that is required, since arg0 is null. |
CSE does something like this: https://github.com/dotnet/coreclr/blob/master/src/jit/optcse.cpp#L2073. Though it doesn't looks like that code could easily be reused. |
All the "must keep" side effects initially come from the exprToBox tree. Since BOX could operate with address of exprToBox, conceptually we should be producing this address its own tree that is not under the BOX and we'd have all the must keep side effects walled off outside the scope of the BOX. But later on, if we're unlucky, perhaps that new side effects (eg a CSE def) could be introduced into tree that would still sit under the BOX (where we are copying from the exprToBox to the box payload). Maybe this is rare or not actually possible, but it seems like it could happen. If it's not possible for the tree under the BOX to get additional side effects, then the split above would conceptually solve the problem and we could zap the BOX and the ASSIGN and the side effects would stay behind. To make this work though we'd still have to figure out where to put the address computation tree and whether it is viable to force exprToBox to memory so we can take its address. We could form the exprToBox address in a comma above the BOX and fix |
@erozenfeld I haven't been able to find similar issues with nullables yet, but can't rule it out either... |
Looking a bit at why this doesn't happen in 1.1 or in Jit32, I think it is from the GT_GT -> GT_NE change we put into 2.0. Also explains why 4.7 doesn't have this issue and 4.7.1 will unless we push whatever fix we come up with here... Odd then that the pattern in Adding GT_GT support to |
Signed compares? Can't be, only |
Would that be changing the semantics of
we could have
the rules for the transform would be to nuke If we want to do something in that vein but avoid putting another pointer on
and the optimization could verify that the statement the box points to has the expected form (comma of two assigns)... I wouldn't say it's an elegant suggestion, and I don't know if we're using commas of assigns elsewhere or if those might cause trouble, but it would be a way to shuffle the handshake w/o changing the semantics of the box operator (except that the meaning of the link to the statement on |
Yeah, using a 2-statement preamble occurred to me too. There is some variability in how the copy is represented, but perhaps it is easy enough to see how to preserve the necessary side effects. A second pointer on the Box is probably ok. |
Hmm, looking at what |
Incomplete fix attempt: https://github.com/AndyAyersMS/coreclr/tree/FixDeadBox. Copy is now walled off in second statement, but still not clear how to extract just the "source" side effects on the copy. |
Added some tweaks to the fix attempt. It can now get through corelib without blowing up, though the preservation of source side effects for struct copies is ad-hoc and ugly. 511 methods impacted, overall net around zero size, but some big regressions. One such case is generic ObjectEqualityComparer IndexOf on value types. Here the IL contains the following (where CSC looks like it already clones the IL in various places to specialize the "looking for null " case)
Current jit's dead box opt will toss the whole subtree under the box, which includes the ldelem. So it loses potential null and bounds checks. But the subsequent ldelema does the same checks and so masks the fact that the box opt transformation is unsound. My prototype preserves the side effects of the ldelem and so there are now two index computations. For some reason I have yet to drill into, this now triggers loop cloning. So these methods see large size increases. |
Ah, it seems loop cloning can't spot array indices in the We don't need an explicit load from the source of the box, we just need a null check on the address. However GT_NULLCHECK has limitations and expects its operand to be side effect free. So while it would make sense to use it it would require a new temp. |
To be clear, this bug exists in both JIT32 and RuyJIT that ships with .NET Framework 4.7. You just need to use |
I suspected JIT32 and older RyuJits had the same bug. Thanks for pointing out an example. Would be interesting to know if JIT64 has it too (not as likely) and how far back it goes. |
Looks like JIT64 is ok. The bug is there in JIT32 in v2.0. |
JIT64 works fine. It does the transform and keeps the relevant side effects. For mov eax,dword ptr [rcx+8]
xor eax,eax
ret
Hmm, all my machines already have .NET 4.7 so I don't know about other 4.x versions. But I tried with the .NET 2.0 x86 runtime and the bug is there too. It's probably safe to say that the bug exists since this optimization was introduced. |
It's kind of funny that if a |
My prototype just tries to read the first byte, though it would be somewhat less hacky to just copy the struct into a temporary. I had hoped to use |
Boxing a value type produces a non-null result. If the result of the box is only used to feed a compare against null, the jit tries to optimize the box away entirely since the result of the comparison is known. Such idiomatic expressions arise fairly often in generics instantiated over value types. In the current implementation the box expands into two parts: an upstream statement to allocate heap space, and then an expression tree containing an encapsulated copy from the value being boxed to the payload of the new heap object. Wrapping around that is a reference to the new object, which is the result of the box. When the optimization fires, the upstream allocation is removed, and the current implementation also removes the entire box expression tree. Howver this tree can include important side effects from the evaluation of the value being boxed that must be preserved. For instance the value might come from an array, in which case and the box expression tree would contain the array null check and bounds check. So removing the entire tree can alter behavior. This fix attempts to carefull preserve the important side effects by moving the copy into a second statement upstream from the box. The box itself is then just a trivial side-effect-free reference to the box temp local. When the optimization fires the jit removes the upstream heap allocation as before, as well as the now-trivial box tree. It analyzes the source side of the upstream copy. If it is side effect free the copy is removed entirely. If not, the jit modifies the copy into a minimal load of the boxed value, and this load should reproduce the necessary side effects. Fixes #12949.
Boxing a value type produces a non-null result. If the result of the box is only used to feed a compare against null, the jit tries to optimize the box away entirely since the result of the comparison is known. Such idiomatic expressions arise fairly often in generics instantiated over value types. In the current implementation the box expands into two parts. The first is an upstream statement to allocate a boxed object and assign a reference to the boxed object to a local var known as the "box temp". The second is an expression tree whose value is the box temp that also contains an an encapsulated copy from the value being boxed to the payload section of the boxed object. The box node also contains a pointer back to the first statement (more on this later). In the examples being discussed here this second tree is a child of a compare node whose other child is a null pointer. When the optimization fires, the upstream allocation statement is located via the pointer in the box node and removed, and the entire compare is replaced with a constant 0 or 1 as appropriate. Unfortunately the encapsulated copy in the box subtree may include side effects that should be preserved, and so this transformation is unsafe. Note that the copy subtree as a whole will always contain side effects, since the copy is storing values into the heap, and that copy now will not happen. But the side effects that happen when producing the value to box must remain. In the initial example from #12949 the side effects in question were introduced by the jit's optimizer to capure a CSE definition. dotnet#13016 gives several other examples where the side effects are present in the initial user code. For instance the value being boxed might come from an array, in which case the encapsulated copy in the box expression tree would contain the array null check and bounds check. So removing the entire tree can alter behavior. This fix attempts to carefully preserve the important side effects by reworking how a box is imported. The copy is now moved out from under the box into a second upstream statement. The box itself is then just a trivial side-effect-free reference to the box temp. To ensure proper ordering of side effects the jit spills the evaluation stack before appending the copy statement. When the optimization fires the jit removes the upstream heap allocation as before, as well as the now-trivial compare tree. It analyzes the source side of the upstream copy. If it is side effect free, the copy is removed entirely. If not, the jit modifies the copy into a minimal load of the boxed value, and this load should reproduce the necessary side effects. The optimization is only performed when the tree shape of the copy matches expected patterns. There are some expected cases where the tree won't match, for instance if the optimization is invoked while the jit is inlining. Because this optimization runs at several points the jit can catch these cases once inlining completes. There is one case that is not handled that could be -- if the assignment part of the copy is itself a subtree of a comma. This doesn't happen often. The optimization is now also extended to handle the case where the comparision operation is `cgt.un`. This doesn't catch any new cases but causes the optimization to happen earlier, typically during importation, which should reduce jit time slightly. Generally the split of the box into two upstream statements reduces code size, especially when the box expression is incorporated into a larger tree -- for example a call. However in some cases where the value being boxed comes from an array, preserving the array bounds check now causes loop cloning to kick in and increase code size. Hence the overall size impact on the jit-diff set is essentially zero. Added a number of new test cases showing the variety of situations that must be handled and the need to spill before appending the copy statement. Fixes #12949.
Boxing a value type produces a non-null result. If the result of the box is only used to feed a compare against null, the jit tries to optimize the box away entirely since the result of the comparison is known. Such idiomatic expressions arise fairly often in generics instantiated over value types. In the current implementation the box expands into two parts. The first is an upstream statement to allocate a boxed object and assign a reference to the boxed object to a local var known as the "box temp". The second is an expression tree whose value is the box temp that also contains an an encapsulated copy from the value being boxed to the payload section of the boxed object. The box node also contains a pointer back to the first statement (more on this later). In the examples being discussed here this second tree is a child of a compare node whose other child is a null pointer. When the optimization fires, the upstream allocation statement is located via the pointer in the box node and removed, and the entire compare is replaced with a constant 0 or 1 as appropriate. Unfortunately the encapsulated copy in the box subtree may include side effects that should be preserved, and so this transformation is unsafe. Note that the copy subtree as a whole will always contain side effects, since the copy is storing values into the heap, and that copy now will not happen. But the side effects that happen when producing the value to box must remain. In the initial example from #12949 the side effects in question were introduced by the jit's optimizer to capure a CSE definition. #13016 gives several other examples where the side effects are present in the initial user code. For instance the value being boxed might come from an array, in which case the encapsulated copy in the box expression tree would contain the array null check and bounds check. So removing the entire tree can alter behavior. This fix attempts to carefully preserve the important side effects by reworking how a box is imported. The copy is now moved out from under the box into a second upstream statement. The box itself is then just a trivial side-effect-free reference to the box temp. To ensure proper ordering of side effects the jit spills the evaluation stack before appending the copy statement. When the optimization fires the jit removes the upstream heap allocation as before, as well as the now-trivial compare tree. It analyzes the source side of the upstream copy. If it is side effect free, the copy is removed entirely. If not, the jit modifies the copy into a minimal load of the boxed value, and this load should reproduce the necessary side effects. The optimization is only performed when the tree shape of the copy matches expected patterns. There are some expected cases where the tree won't match, for instance if the optimization is invoked while the jit is inlining. Because this optimization runs at several points the jit can catch these cases once inlining completes. There is one case that is not handled that could be -- if the assignment part of the copy is itself a subtree of a comma. This doesn't happen often. The optimization is now also extended to handle the case where the comparision operation is `cgt.un`. This doesn't catch any new cases but causes the optimization to happen earlier, typically during importation, which should reduce jit time slightly. Generally the split of the box into two upstream statements reduces code size, especially when the box expression is incorporated into a larger tree -- for example a call. However in some cases where the value being boxed comes from an array, preserving the array bounds check now causes loop cloning to kick in and increase code size. Hence the overall size impact on the jit-diff set is essentially zero. Added a number of new test cases showing the variety of situations that must be handled and the need to spill before appending the copy statement. Fixes #12949.
Boxing a value type produces a non-null result. If the result of the box is only used to feed a compare against null, the jit tries to optimize the box away entirely since the result of the comparison is known. Such idiomatic expressions arise fairly often in generics instantiated over value types. In the current implementation the box expands into two parts. The first is an upstream statement to allocate a boxed object and assign a reference to the boxed object to a local var known as the "box temp". The second is an expression tree whose value is the box temp that also contains an an encapsulated copy from the value being boxed to the payload section of the boxed object. The box node also contains a pointer back to the first statement (more on this later). In the examples being discussed here this second tree is a child of a compare node whose other child is a null pointer. When the optimization fires, the upstream allocation statement is located via the pointer in the box node and removed, and the entire compare is replaced with a constant 0 or 1 as appropriate. Unfortunately the encapsulated copy in the box subtree may include side effects that should be preserved, and so this transformation is unsafe. Note that the copy subtree as a whole will always contain side effects, since the copy is storing values into the heap, and that copy now will not happen. But the side effects that happen when producing the value to box must remain. In the initial example from #12949 the side effects in question were introduced by the jit's optimizer to capure a CSE definition. #13016 gives several other examples where the side effects are present in the initial user code. For instance the value being boxed might come from an array, in which case the encapsulated copy in the box expression tree would contain the array null check and bounds check. So removing the entire tree can alter behavior. This fix attempts to carefully preserve the important side effects by reworking how a box is imported. The copy is now moved out from under the box into a second upstream statement. The box itself is then just a trivial side-effect-free reference to the box temp. To ensure proper ordering of side effects the jit spills the evaluation stack before appending the copy statement. When the optimization fires the jit removes the upstream heap allocation as before, as well as the now-trivial compare tree. It analyzes the source side of the upstream copy. If it is side effect free, the copy is removed entirely. If not, the jit modifies the copy into a minimal load of the boxed value, and this load should reproduce the necessary side effects. The optimization is only performed when the tree shape of the copy matches expected patterns. There are some expected cases where the tree won't match, for instance if the optimization is invoked while the jit is inlining. Because this optimization runs at several points the jit can catch these cases once inlining completes. There is one case that is not handled that could be -- if the assignment part of the copy is itself a subtree of a comma. This doesn't happen often. The optimization is now also extended to handle the case where the comparision operation is `cgt.un`. This doesn't catch any new cases but causes the optimization to happen earlier, typically during importation, which should reduce jit time slightly. Generally the split of the box into two upstream statements reduces code size, especially when the box expression is incorporated into a larger tree -- for example a call. However in some cases where the value being boxed comes from an array, preserving the array bounds check now causes loop cloning to kick in and increase code size. Hence the overall size impact on the jit-diff set is essentially zero. Added a number of new test cases showing the variety of situations that must be handled and the need to spill before appending the copy statement. Fixes #12949.
@sergiy-k and I hit this in CoreRT, but this has a repro in CoreCLR too.
Compile the attached Program.txt
as
csc /noconfig /nostdlib /r:System.Private.CoreLib.dll Program.cs /define:DEBUG /O
(note: we define the DEBUG symbol and enable optimizations) and run the resulting executable with CoreRun. The code will hit an assert. Now drop the/O
and the code won't hit the assert.@sergiy-k found some suspicious use of
rdi
register inVerifyUnifierConsistency
. SettingCOMPlus_JitDisasm=VerifyUnifierConsistency
and inspecting the use ofrdi
in the disassembly might be a good starting point.The text was updated successfully, but these errors were encountered: