Bad x64 codegen in VerifyUnifierConsistency #8606

MichalStrehovsky · 2017-07-20T01:22:14Z

@sergiy-k and I hit this in CoreRT, but this has a repro in CoreCLR too.

Compile the attached Program.txt
as csc /noconfig /nostdlib /r:System.Private.CoreLib.dll Program.cs /define:DEBUG /O (note: we define the DEBUG symbol and enable optimizations) and run the resulting executable with CoreRun. The code will hit an assert. Now drop the /O and the code won't hit the assert.

@sergiy-k found some suspicious use of rdi register in VerifyUnifierConsistency. Setting COMPlus_JitDisasm=VerifyUnifierConsistency and inspecting the use of rdi in the disassembly might be a good starting point.

The text was updated successfully, but these errors were encountered:

AndyAyersMS · 2017-07-20T15:07:26Z

I'll take a look.

AndyAyersMS · 2017-07-20T15:21:09Z

Looks your diagnosis is spot-on. JIT is using r14d for walk1 but when it indexes into _entries it mysteriously switches over to using rdi:

G_M65020_IG12:
...
4C8B4610         mov      r8, gword ptr [rsi+16]      ;; get _entries from this
453B7008         cmp      r14d, dword ptr [r8+8]      ;; bounds check walk1
0F835F010000     jae      G_M65020_IG19
4D8D44F810       lea      r8, bword ptr [r8+8*rdi+16] ;; get address of _entries[walk1]...???

mikedn · 2017-07-20T15:23:31Z

it mysteriously switches over to using rdi

And it looks like the cast from int to long is missing too.

AndyAyersMS · 2017-07-20T15:42:32Z

Also a regression from 1.1. Note array bounds check and access will generally need two different registers, because this is an array of structs, so the physical index is the logical index * 24. 1.1 has *8 in the addressing mode and *3 earlier. 2.0 seems to lose the *3 somehow, presumably that is what should have been in rdi.

G_M65020_IG12:
 ...
 4863D5               movsxd   rdx, ebp
 4C8D3C52             lea      r15, [rdx+2*rdx]
 ...
 4C8B4610             mov      r8, gword ptr [rsi+16]
 413B6808             cmp      ebp, dword ptr [r8+8]
 0F8375010000         jae      G_M65020_IG20
 4F8D44F810           lea      r8, bword ptr [r8+8*r15+16]

Will keep digging... seems like maybe an optimizer bug at this point.

cc @dotnet/jit-contrib

AndyAyersMS · 2017-07-20T16:54:25Z

Best guess so far is that gtFoldExprSpecial sees a box(..) == null and deadcodes it, but hidden in the ... is a side effect to initialize the *3 cse (V37 here) that gets lost in the process:

Bashing [000121] to NOP as part of dead box operation
N040 (  1,  1) [000131] ------------             /--*  CNS_INT   ref    null $VN.Null
N041 ( 55, 53) [000132] NA-XGO------             *  NE        int    $26e
N039 ( 50, 51) [000130] -A-XGO------             \--*  BOX       ref    $5d0
N037 (  1,  1) [000128] ------------                |  /--*  LCL_VAR   ref    V08 tmp3         u:3 (last use) $2d5
N038 ( 44, 47) [000129] -A-XGO------                \--*  COMMA     ref    $2d5
N035 ( 38, 41) [000116] *A-XG-------                   |  /--*  IND       int    <l:$26d, c:$609>
N033 (  1,  1) [000571] ------------                   |  |  |  /--*  CNS_INT   long   8 field offset Fseq[_key] $1c3
N034 ( 36, 39) [000572] -A-XG--N----                   |  |  \--*  ADD       byref  $215
N030 ( 23, 23) [000592] -A--G-------                   |  |     |     /--*  ADDR      byref  $82
N029 ( 12, 12) [000114] aA--G--N----                   |  |     |     |  \--*  IND       struct <l:$680, c:$6c0>
N027 (  1,  1) [000584] ------------                   |  |     |     |     |  /--*  CNS_INT   long   16 Fseq[#FirstElem] $1c2
N028 ( 11, 11) [000585] -A-----N----                   |  |     |     |     \--*  ADD       byref  $82
N024 (  1,  1) [000581] -------N----                   |  |     |     |        |     /--*  CNS_INT   long   3 $1c8
N025 (  9,  9) [000582] -A-----N----                   |  |     |     |        |  /--*  LSH       long   $545
N022 (  1,  1) [000899] ------------                   |  |     |     |        |  |  |  /--*  LCL_VAR   long   V37 cse2          $544
N023 (  8,  8) [000900] -A----------                   |  |     |     |        |  |  \--*  COMMA     long   $544
N018 (  1,  1) [000590] ------------                   |  |     |     |        |  |     |     /--*  CNS_INT   long   3 $1c8
N019 (  7,  7) [000591] ------------                   |  |     |     |        |  |     |  /--*  MUL       long   $544
N017 (  2,  3) [000580] ------------                   |  |     |     |        |  |     |  |  \--*  CAST      long <- int $543
N016 (  1,  1) [000577] i-----------                   |  |     |     |        |  |     |  |     \--*  LCL_VAR   int    V03 loc2         u:4 $484
N021 (  7,  7) [000898] -A------R---                   |  |     |     |        |  |     \--*  ASG       long   $VN.Void
N020 (  1,  1) [000897] D------N----                   |  |     |     |        |  |        \--*  LCL_VAR   long   V37 cse2          $544
N026 ( 10, 10) [000583] -A-----N----                   |  |     |     |        \--*  ADD       byref  <l:$210, c:$20f>
N015 (  1,  1) [000576] ------------                   |  |     |     |           \--*  LCL_VAR   ref    V25 tmp20        u:3 (last use) <l:$2d7, c:$5cd>
N031 ( 31, 34) [000586] -A-XG-------                   |  |     |  /--*  COMMA     byref  <l:$214, c:$213>
N014 (  8, 11) [000579] ---X--------                   |  |     |  |  \--*  ARR_BOUNDS_CHECK_Rng void   <l:$2dd, c:$2dc>
N011 (  1,  1) [000113] ------------                   |  |     |  |     +--*  LCL_VAR   int    V03 loc2         u:4 $484
N013 (  3,  3) [000578] ---X--------                   |  |     |  |     \--*  ARR_LENGTH int    <l:$448, c:$447>
N012 (  1,  1) [000575] ------------                   |  |     |  |        \--*  LCL_VAR   ref    V25 tmp20        u:3 <l:$2d7, c:$5cd>
N032 ( 35, 38) [000587] -A-XG-------                   |  |     \--*  COMMA     byref  <l:$214, c:$213>
N008 (  4,  4) [000112] ---XG-------                   |  |        |  /--*  IND       ref    <l:$2d7, c:$5cd>
N006 (  1,  1) [000588] ------------                   |  |        |  |  |  /--*  CNS_INT   long   16 field offset Fseq[_entries] $1c2
N007 (  2,  2) [000589] -------N----                   |  |        |  |  \--*  ADD       byref  $206
N005 (  1,  1) [000111] ------------                   |  |        |  |     \--*  LCL_VAR   ref    V00 this         u:2 $100
N010 (  4,  4) [000574] -A-XG---R---                   |  |        \--*  ASG       ref    <l:$2d7, c:$5cd>
N009 (  1,  1) [000573] D------N----                   |  |           \--*  LCL_VAR   ref    V25 tmp20        d:3 <l:$2d7, c:$5cd>
N036 ( 43, 46) [000127] -A-XGO------                   \--*  ASG       int    $VN.Void
N004 (  4,  4) [000126] -----O-N----                      \--*  IND       int    <l:$26c, c:$609>
N002 (  1,  1) [000124] ------------                         |  /--*  CNS_INT   long   8 $1c3
N003 (  2,  2) [000125] -------N----                         \--*  ADD       byref  $20e
N001 (  1,  1) [000123] ------------                            \--*  LCL_VAR   ref    V08 tmp3         u:3 $2d5

Most of the other cases in this method check for side effects and defer; seems like we should verify the box arg is side-effect free here too.

AndyAyersMS · 2017-07-20T17:16:10Z

Because key is a value type, the assert below is trivially true. But it also generates a CSE used in the subsequent statement. Should be easy enough to create a small repro...

                    Debug.Assert(_entries[walk1]._key != null);
                    int hashCode = _entries[walk1]._key.GetHashCode();

AndyAyersMS · 2017-07-20T17:39:30Z

Somewhat simpler repro. Codegen bug is in Test as above.

using System;

public struct S<K>
{
    public int x;
    public int y;
    public K val;
}

public class X<K,V> 
{
    public X(K k)
    {
        a = new S<K>[2];
        a[1].val = k;
        a[1].x = 3;
        a[1].y = 4;
    }

    public void Assert(bool b)
    {
        if (!b) throw new Exception("bad!");
    }

    public int Test()
    {
        int r = 0;
        for (int i = 0; i < a.Length; i++)
        {
            Assert(a[i].val != null);
            r += a[i].val.GetHashCode();
        }
        return r;
    }

    S<K>[] a;
}

class B
{
    public static int Main()
    {
        var a = new X<int, string>(11);
        int z = a.Test();
        return (z == 11 ? 100 : 0);
    }
}

AndyAyersMS · 2017-07-20T17:46:19Z

The underlying assume-no-side-effects weakness in gtFoldExprSpecial predates 2.0 -- indeed it is there in the jit32 sources as well.

Not clear yet if we can expose this bug in older jits. It is showing up in 2.0 because of more aggressive CSEs.

AndyAyersMS · 2017-07-20T18:48:04Z

Adding a side effect check fixes the bad codegen, but is overly pessimistic. Likely need to rework the box code in the importer to forcibly move possible side effects out from under the box.

AndyAyersMS · 2017-07-20T19:29:36Z

Or maybe not, the optimization done in gtFoldExprSpecial effectively dead-codes a struct copy because it's folded in under the box. If we split it out it may become problematic.

This argues for drilling deeper into the box operand during gtFoldExprSpecial to try sort which of the side effects are "expected" and can be ignored and which others aren't... not thrilled about this but it is probably the most surgical fix.

AndyAyersMS · 2017-07-20T23:04:12Z

Simpler repro that does not require CSE. Here the call to Get is dropped.

using System;
using System.Runtime.CompilerServices;

public class X<K> 
{
    public X(K k1)
    {
        k = k1;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public K Get()
    {
        Console.WriteLine("Get called");
        count++;
        return k;
    }

    public bool Test()
    {
        bool x = Get() != null;
        bool y = (count == 1);
        return x && y;
    }

    K k;
    int count;
}

class B
{
    public static int Main()
    {
        var a = new X<int>(11);
        bool result = a.Test();
        Console.WriteLine("Passed: {0}", result);
        return result ? 100 : 0;
    }
}

AndyAyersMS · 2017-07-21T03:04:49Z

One more example, this time we lose an exception-causing dereference.

using System;
using System.Runtime.CompilerServices;

public class X<K> 
{
    public X(K k1)
    {
        k = k1;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static bool Test(X<K> a)
    {
        return (a.k != null);
    }

    public K k;
}

class B
{
    public static int Main()
    {
        X<int> a = null;
        bool result = false;
        try 
        {
            X<int>.Test(a);
        }
        catch (Exception)
        {
            result = true;
        }
        Console.WriteLine("Passed: {0}", result);
        return result ? 100 : 0;
    }
}

AndyAyersMS · 2017-07-21T16:23:19Z

Still searching for the right fix. Couple more notes.

The initial trees produced by the box are roughly
Stmt1: newobj into a newly live local "boxTemp"
Stmt2: ... tree of BOX(COMMA(boxTemp, ASG(exprToBox,boxTemp)
If BOX of a value type appears under a EQ/NE compare to null we optimize it in gtFoldExprSpecial. This can be called at various times and is even invoked in the importer.
The optimization replaces (EQ/NE, null (BOX ...)) with (0/1) as appropriate. Thus all side effects under the BOX are lost. It also nulls out Stmt1.
Most often the only side effect under the box is a modification of the "boxTemp" which dies at the compare so we are ok with losing those side effects.
But if the evaluation of exprToBox contains other side effects we may lose them too, and this is the bug we're trying to fix. Examples above show cases where these effects include embedded assignments for CSEs, calls, and exceptions.
Interestingly CSC now may use cgt.un for the compare of box vs null which breaks the pattern in gtFoldExprSpecial and prevents this optimization from happening early, in the importer -- later assertion prop changes the comparison operator to EQ/NE and it fires during optimization. So we should broaden the predicate types we look for in gtFoldExprSpecial to try and catch more cases of this earlier. Since this can nuke relatively large trees it is worth doing as early as possible.
Whether invoked early or late we run the risk of losing side effects. There is potentially some advantage to catching this early as the shape of the tree under the box is more constrained so presumably the necessary side effects could be more readily extracted.
Generally we would expect almost all cases could be found early since this is largely a type based optimization -- however if the null required const-prop or similar perhaps we would miss some if we only did this early.
One could imagine splitting off any side effects from exprToBox into yet another statement and (if anything is split) storing the result in another temp. Since the cases where exprToBox has side effects are rare this seems like a viable approach. However...
The BOX operator IR only appears for simple inline box cases; this has two sub cases, one where exprToBox is a simple scalar type and the other where it is a struct. Both cases are vulnerable to this. bug. So if we do the split as in 9, in the struct case we run the risk of creating a dead struct copy if we end up optimizing the box away, and an unnecessary struct copy if we don't optimize (say the BOX is not under a compare). I haven't gotten far enough along to see if we can optimize this away. This case may be rare enough that we'd be willing to pay this cost to avoid the bug.
Ideally we'd look ahead during importation and see that the BOX was feeding a compare and only then decide to split off the effects, but that's likely not viable.
But if we only did this optimization early, we might be able to look back when we did the optimization into the BOX tree and effectively split off the side effects (since as noted above the tree shapes are limited).
An attempt to do this splitting within gtFoldExprSpecial foundered because JTRUE is not well prepared to handle COMMA, so there's no easy way in an expression context to hold onto the effects.
This optimization is important and we need to inhibit as few cases as possible. There are many examples in corelib (presumably all correctly optimized?) so getting that down to minimal diffs is a good indicator of progress.

mikedn · 2017-07-21T16:45:30Z

Interestingly CSC now may use cgt.un for the compare of box vs null which breaks the pattern in gtFoldExprSpecial and prevents this optimization from happening early, in the importer -- later assertion prop changes the comparison operator to EQ/NE and it fires during optimization. So we should broaden the predicate types we look for in gtFoldExprSpecial to try and catch more cases of this earlier. Since this can nuke relatively large trees it is worth doing as early as possible.

Hmm, when I added the GT.UN->EQ/NE transform I considered doing it right in the importer but I ended up adding it in morph because that's where most of such transforms are done. It looks like doing this in the importer might actually be a good idea, after all it just compensates an IL deficiency (lack of cne) rather than being an actual optimization.

erozenfeld · 2017-07-21T16:50:27Z

fgMorphRecognizeBoxNullable is a similar optimization although there we have a helper call instead of GT_BOX. I recently fixed it to bail out when called after morphing the args and also to be called on gt.un:
dotnet/coreclr#12395. May be worth double checking that we can't possibly remove side effects there.

JosephTremoulet · 2017-07-21T16:57:13Z

Is it worth having a different form of box IR operator? I'm not following whether "boxTemp" is holding a copy of the struct (and the reason we introduce boxTemps is that our BOX operation needs an lvalue?), or whether "boxTemp" is holding a pointer to the box on the heap (in which case I'm a little confused what the assign is doing..), but presumably if it's the first case we could have e.g. a BoxValue operator that just takes the value and implicitly needs a temp (which could be manifest at morph/lower/wherever), and in the latter case we could have e.g. a NewBox operator that again just takes the value but carries an implied heap allocation.

AndyAyersMS · 2017-07-21T17:16:22Z

The box temp is a reference to the heap instance and is the result of the box. The general sequence for the inline box is:

boxtemp = newobj
copy contents from exprToBox to payload portion of boxtemp
result is boxtemp

The newobj is in a preceding statement, the copy is in comma under the box.

Am wondering now if we also have to consider possible side effects from the newobj, eg if this triggers class init. Will try and work up a repro.

AndyAyersMS · 2017-07-21T17:28:53Z

IR example:

[000008] ------------        *  STMT      void  (IL 0x000...  ???)
[000005] --C-G-------        |  /--*  CALL help ref    HELPER.CORINFO_HELP_NEWSFAST
[000003] ------------ arg0   |  |  \--*  CNS_INT(h) long   0x7ffbd8d991c0 class
[000007] -AC-G-------        \--*  ASG       ref   
[000006] D------N----           \--*  LCL_VAR   ref    V01 tmp0        ;; tmp0 == boxTemp   

[000020] ------------        *  STMT      void  (IL   ???...  ???)
[000019] -A-XG-------        \--*  RETURN    int   
[000017] ------------           |  /--*  CNS_INT   ref    null
[000018] NA-XG----U--           \--*  GT        int   
[000016] -A-XG-------              \--*  BOX       ref   
[000014] ------------                 |  /--*  LCL_VAR   ref    V01 tmp0            ;; box result
[000015] -A-XG-------                 \--*  COMMA     ref   
[000002] ---XG-------                    |  /--*  FIELD     int    k
[000001] ------------                    |  |  \--*  LCL_VAR   ref    V00 arg0      ;; exprToBox
[000013] -A-XG-------                    \--*  ASG       int                        ;; copy
[000012] -------N----                       \--*  IND       int   
[000010] ------------                          |  /--*  CNS_INT   long   8
[000011] ------------                          \--*  ADD       byref 
[000009] ------------                             \--*  LCL_VAR   ref    V01 tmp0

JosephTremoulet · 2017-07-21T17:30:56Z

boxtemp = newobj
copy contents from exprToBox to payload portion of boxtemp
result is boxtemp

Ok, that makes sense, but then what is the point of the GT_BOX if we already have the allocation and the copy separately? Indeed, rationalize treats BOX like identity. So does it just serve as a flag to optimizations like this one? Seems a bit hokey to have an operator that means "I'm probably next to a copy and assignment that are dead if I am". Should we consider making the allocation and the copy part of the semantics of the GT_BOX itself? We'd need to manifest them, maybe at rationalize (except we may need to give morph a shot expanding the struct copy...), or maybe at morph (if that's late enough to catch the cases this is trying to optimize)...

AndyAyersMS · 2017-07-21T18:30:32Z

Can't really defend the current approach, just explaining what it is. The Box points back at the earlier assign statement and this link is used ingtFoldExprSpecial to null out the assignment if the box is optimized away.

So yeah the Box flags this node as an optimization opportunity and tracks some of the context needed to pull it off. Not sure if exposing the parts and normal deadcoding could do the same. Perhaps encapsulation could reveal the parts only when we know they're needed, but that starts taking us outside the realm of a suitable fix for a 2.0 regression.

JosephTremoulet · 2017-07-21T19:18:14Z

Not sure if exposing the parts and normal deadcoding could do the same

I don't think we have the facility to remove heap allocations and stores, no.

Perhaps encapsulation could reveal the parts only when we know they're needed

That's what I was trying to suggest, though I'm not sure how problematic the phase ordering issues would be (can't lower the copy until it gets revealed, don't want to reveal it until after having all chances to remove the box)

but that starts taking us outside the realm of a suitable fix for a 2.0 regression.

Ah, I'd missed that bit of context. Yes, I agree.

Box points back at the earlier assign statement

That suggests we might be able to expand the pattern match -- see what the LHS of the assign is, then if the argument to the box is a comma of an assign that stores to an indir off the LHS and then returns the LHS, then only extract side-effects from the arguments to the newobj and struct-assign RHS; otherwise, extract side-effects from the box itself?

briansull · 2017-07-21T19:54:46Z

We should be able to check the side effect flags on any node that we are considering for the optimization. If we see an unsafe flag set, such as GTF_ASG then we can decline to do the optimization.

JosephTremoulet · 2017-07-21T20:09:47Z

@briansull sure, the issue is (quoting Andy's example above) that if the optimization is removing the box at [000016] (and replacing the GT at [000018] with "true"), then we don't want to decline -- we want to remove [000013] even though it has GTF_ASG set (and likewise remove [000007]/[000005] despite their side-effect flags), but need to preserve side-effects in their sub-trees.

Yes it would be correct to just go conservative, but per above,

This optimization is important and we need to inhibit as few cases as possible. There are many examples in corelib

AndyAyersMS · 2017-07-21T20:16:26Z

Indeed -- in that example (which is taken from the latest repro above) we must keep just this bit of the tree under the BOX:

[000002] ---XG-------                    |  /--*  FIELD     int    k
[000001] ------------                    |  |  \--*  LCL_VAR   ref    V00 arg0      ;; exprToBox

Otherwise we lose the NPE that is required, since arg0 is null.

mikedn · 2017-07-21T20:49:34Z

but need to preserve side-effects in their sub-trees

CSE does something like this: https://github.com/dotnet/coreclr/blob/master/src/jit/optcse.cpp#L2073. Though it doesn't looks like that code could easily be reused.

AndyAyersMS · 2017-07-21T21:01:56Z

All the "must keep" side effects initially come from the exprToBox tree. Since BOX could operate with address of exprToBox, conceptually we should be producing this address its own tree that is not under the BOX and we'd have all the must keep side effects walled off outside the scope of the BOX.

But later on, if we're unlucky, perhaps that new side effects (eg a CSE def) could be introduced into tree that would still sit under the BOX (where we are copying from the exprToBox to the box payload). Maybe this is rare or not actually possible, but it seems like it could happen.

If it's not possible for the tree under the BOX to get additional side effects, then the split above would conceptually solve the problem and we could zap the BOX and the ASSIGN and the side effects would stay behind.

To make this work though we'd still have to figure out where to put the address computation tree and whether it is viable to force exprToBox to memory so we can take its address.

We could form the exprToBox address in a comma above the BOX and fix gtFoldExprSimple to tunnel via gtEffectiveVal but that would reproduce the (JTRUE, COMMA(..., 0/1)) that was problematic in an earlier prototype. So perhaps better to put it a prior statement and pass to the BOX via a temp? This would be simple for the scalar case since that's close to what happens now. Not sure about the struct case just yet.

AndyAyersMS · 2017-07-21T21:15:46Z

@erozenfeld I haven't been able to find similar issues with nullables yet, but can't rule it out either...

AndyAyersMS · 2017-07-21T21:59:21Z

Looking a bit at why this doesn't happen in 1.1 or in Jit32, I think it is from the GT_GT -> GT_NE change we put into 2.0. Also explains why 4.7 doesn't have this issue and 4.7.1 will unless we push whatever fix we come up with here...

Odd then that the pattern in gtExprFoldSpecial only fires for GT_EQ and GT_NE cases. Seems like either something else; maybe older CSCs used signed compares?

Adding GT_GT support to gtExprFoldSpecial helps dead box a few more cases (19 methods impacted in the default jit-diff set).

mikedn · 2017-07-21T22:33:42Z

Odd then that the pattern in gtExprFoldSpecial only fires for GT_EQ and GT_NE cases. Seems like either something else; maybe older CSCs used signed compares?

Signed compares? Can't be, only cgt.un, not cgt, can be used on object references.

JosephTremoulet · 2017-07-21T23:07:36Z

BOX could operate with address of exprToBox...

Would that be changing the semantics of GT_BOX? Another way we could shape the IR differently in the importer would be to hoist out the assign that's under the comma as well, so instead of

assign_stmt:  Assign x, (call newobj())
box_stmt:  box (comma (assign *x, exprToBox), x)

we could have

assign_stmt1: Assign x, (call newobj())
assign_stmt2: Assign *x, exprToBox
box_stmt:  box x

the rules for the transform would be to nuke assign_stmt1, and from assign_stmt2 to preserve the side-effects of exprToBox.

If we want to do something in that vein but avoid putting another pointer on GenTreeBox, we could possibly do something like:

assign_stmt: comma (Assign x, (call newobj())), (Assign *x, exprToBox)
box_stmt:  box x

and the optimization could verify that the statement the box points to has the expected form (comma of two assigns)...

I wouldn't say it's an elegant suggestion, and I don't know if we're using commas of assigns elsewhere or if those might cause trouble, but it would be a way to shuffle the handshake w/o changing the semantics of the box operator (except that the meaning of the link to the statement on GenTreeBox would change, but this optimization appears to be the only consumer of that)

AndyAyersMS · 2017-07-21T23:27:41Z

Yeah, using a 2-statement preamble occurred to me too. There is some variability in how the copy is represented, but perhaps it is easy enough to see how to preserve the necessary side effects. A second pointer on the Box is probably ok.

AndyAyersMS · 2017-07-21T23:42:37Z

Hmm, looking at what gtCloneExpr already does for BOX makes me wonder about the viability of having it pointing at other statements. Seems like there is potential for real trouble here. I suppose adding a second statement shouldn't get us any deeper into trouble than we already are.

AndyAyersMS · 2017-07-22T16:23:02Z

Incomplete fix attempt: https://github.com/AndyAyersMS/coreclr/tree/FixDeadBox. Copy is now walled off in second statement, but still not clear how to extract just the "source" side effects on the copy.

AndyAyersMS · 2017-07-23T17:06:02Z

Added some tweaks to the fix attempt. It can now get through corelib without blowing up, though the preservation of source side effects for struct copies is ad-hoc and ugly.

511 methods impacted, overall net around zero size, but some big regressions. One such case is generic ObjectEqualityComparer IndexOf on value types. Here the IL contains the following (where CSC looks like it already clones the IL in various places to specialize the "looking for null " case)

IL_002f  03                ldarg.1     
IL_0030  08                ldloc.2     
IL_0031  a3 a5 00 00 1b    ldelem       0x1B0000A5    ;; fetch a[i]
IL_0036  8c a5 00 00 1b    box          0x1B0000A5    ;; result is non-null for value types
IL_003b  2c 19             brfalse.s    25 (IL_0056)  ;; will never branch for value types
IL_003d  03                ldarg.1     
IL_003e  08                ldloc.2     
IL_003f  fe 1e             readonly.   
IL_0041  8f a5 00 00 1b    ldelema      0x1B0000A5    ;;  fetch &a[i]
IL_0046  04                ldarg.2     
IL_0047  fe 16 a5 00 00 1b constrained. 0x1B0000A5
IL_004d  6f 80 07 00 0a    callvirt     0xA000780
IL_0052  2c 02             brfalse.s    2 (IL_0056)

Current jit's dead box opt will toss the whole subtree under the box, which includes the ldelem. So it loses potential null and bounds checks. But the subsequent ldelema does the same checks and so masks the fact that the box opt transformation is unsound.

My prototype preserves the side effects of the ldelem and so there are now two index computations. For some reason I have yet to drill into, this now triggers loop cloning. So these methods see large size increases.

AndyAyersMS · 2017-07-23T17:41:50Z

Ah, it seems loop cloning can't spot array indices in the &a[i] case. So the current jit doesn't spot that by cloning it could remove the range check (probably a good thing...). But now the preserved side effects from the box create something like a[i] as well, which loop cloning spots and decides to optimize.

We don't need an explicit load from the source of the box, we just need a null check on the address. However GT_NULLCHECK has limitations and expects its operand to be side effect free. So while it would make sense to use it it would require a new temp.

mikedn · 2017-07-23T19:12:35Z

Looking a bit at why this doesn't happen in 1.1 or in Jit32, I think it is from the GT_GT -> GT_NE change we put into 2.0. Also explains why 4.7 doesn't have this issue and 4.7.1 will unless we push whatever fix we come up with here...

To be clear, this bug exists in both JIT32 and RuyJIT that ships with .NET Framework 4.7. You just need to use == null instead of != null. The GT_GT -> GT_NE just enables this broken transform to also run for != null.

AndyAyersMS · 2017-07-24T14:38:25Z

I suspected JIT32 and older RyuJits had the same bug. Thanks for pointing out an example. Would be interesting to know if JIT64 has it too (not as likely) and how far back it goes.

AndyAyersMS · 2017-07-24T14:50:27Z

Looks like JIT64 is ok. The bug is there in JIT32 in v2.0.

mikedn · 2017-07-24T14:50:48Z

Would be interesting to know if JIT64 has it too (not as likely)

JIT64 works fine. It does the transform and keeps the relevant side effects. For return (a.k == null); it generates:

 mov         eax,dword ptr [rcx+8] 
 xor         eax,eax 
 ret

how far back it goes.

Hmm, all my machines already have .NET 4.7 so I don't know about other 4.x versions. But I tried with the .NET 2.0 x86 runtime and the bug is there too. It's probably safe to say that the bug exists since this optimization was introduced.

mikedn · 2017-07-24T14:58:43Z

It's kind of funny that if a struct is used instead of int JIT64 keeps a copy of a.k to a temporary. Loading a single byte from a.k would have been enough to preserve the side effect.

AndyAyersMS · 2017-07-24T15:06:04Z

My prototype just tries to read the first byte, though it would be somewhat less hacky to just copy the struct into a temporary.

I had hoped to use GT_NULLCHECK but it has limitations.

Boxing a value type produces a non-null result. If the result of the box is only used to feed a compare against null, the jit tries to optimize the box away entirely since the result of the comparison is known. Such idiomatic expressions arise fairly often in generics instantiated over value types. In the current implementation the box expands into two parts: an upstream statement to allocate heap space, and then an expression tree containing an encapsulated copy from the value being boxed to the payload of the new heap object. Wrapping around that is a reference to the new object, which is the result of the box. When the optimization fires, the upstream allocation is removed, and the current implementation also removes the entire box expression tree. Howver this tree can include important side effects from the evaluation of the value being boxed that must be preserved. For instance the value might come from an array, in which case and the box expression tree would contain the array null check and bounds check. So removing the entire tree can alter behavior. This fix attempts to carefull preserve the important side effects by moving the copy into a second statement upstream from the box. The box itself is then just a trivial side-effect-free reference to the box temp local. When the optimization fires the jit removes the upstream heap allocation as before, as well as the now-trivial box tree. It analyzes the source side of the upstream copy. If it is side effect free the copy is removed entirely. If not, the jit modifies the copy into a minimal load of the boxed value, and this load should reproduce the necessary side effects. Fixes #12949.

Boxing a value type produces a non-null result. If the result of the box is only used to feed a compare against null, the jit tries to optimize the box away entirely since the result of the comparison is known. Such idiomatic expressions arise fairly often in generics instantiated over value types. In the current implementation the box expands into two parts. The first is an upstream statement to allocate a boxed object and assign a reference to the boxed object to a local var known as the "box temp". The second is an expression tree whose value is the box temp that also contains an an encapsulated copy from the value being boxed to the payload section of the boxed object. The box node also contains a pointer back to the first statement (more on this later). In the examples being discussed here this second tree is a child of a compare node whose other child is a null pointer. When the optimization fires, the upstream allocation statement is located via the pointer in the box node and removed, and the entire compare is replaced with a constant 0 or 1 as appropriate. Unfortunately the encapsulated copy in the box subtree may include side effects that should be preserved, and so this transformation is unsafe. Note that the copy subtree as a whole will always contain side effects, since the copy is storing values into the heap, and that copy now will not happen. But the side effects that happen when producing the value to box must remain. In the initial example from #12949 the side effects in question were introduced by the jit's optimizer to capure a CSE definition. dotnet#13016 gives several other examples where the side effects are present in the initial user code. For instance the value being boxed might come from an array, in which case the encapsulated copy in the box expression tree would contain the array null check and bounds check. So removing the entire tree can alter behavior. This fix attempts to carefully preserve the important side effects by reworking how a box is imported. The copy is now moved out from under the box into a second upstream statement. The box itself is then just a trivial side-effect-free reference to the box temp. To ensure proper ordering of side effects the jit spills the evaluation stack before appending the copy statement. When the optimization fires the jit removes the upstream heap allocation as before, as well as the now-trivial compare tree. It analyzes the source side of the upstream copy. If it is side effect free, the copy is removed entirely. If not, the jit modifies the copy into a minimal load of the boxed value, and this load should reproduce the necessary side effects. The optimization is only performed when the tree shape of the copy matches expected patterns. There are some expected cases where the tree won't match, for instance if the optimization is invoked while the jit is inlining. Because this optimization runs at several points the jit can catch these cases once inlining completes. There is one case that is not handled that could be -- if the assignment part of the copy is itself a subtree of a comma. This doesn't happen often. The optimization is now also extended to handle the case where the comparision operation is `cgt.un`. This doesn't catch any new cases but causes the optimization to happen earlier, typically during importation, which should reduce jit time slightly. Generally the split of the box into two upstream statements reduces code size, especially when the box expression is incorporated into a larger tree -- for example a call. However in some cases where the value being boxed comes from an array, preserving the array bounds check now causes loop cloning to kick in and increase code size. Hence the overall size impact on the jit-diff set is essentially zero. Added a number of new test cases showing the variety of situations that must be handled and the need to spill before appending the copy statement. Fixes #12949.

Boxing a value type produces a non-null result. If the result of the box is only used to feed a compare against null, the jit tries to optimize the box away entirely since the result of the comparison is known. Such idiomatic expressions arise fairly often in generics instantiated over value types. In the current implementation the box expands into two parts. The first is an upstream statement to allocate a boxed object and assign a reference to the boxed object to a local var known as the "box temp". The second is an expression tree whose value is the box temp that also contains an an encapsulated copy from the value being boxed to the payload section of the boxed object. The box node also contains a pointer back to the first statement (more on this later). In the examples being discussed here this second tree is a child of a compare node whose other child is a null pointer. When the optimization fires, the upstream allocation statement is located via the pointer in the box node and removed, and the entire compare is replaced with a constant 0 or 1 as appropriate. Unfortunately the encapsulated copy in the box subtree may include side effects that should be preserved, and so this transformation is unsafe. Note that the copy subtree as a whole will always contain side effects, since the copy is storing values into the heap, and that copy now will not happen. But the side effects that happen when producing the value to box must remain. In the initial example from #12949 the side effects in question were introduced by the jit's optimizer to capure a CSE definition. #13016 gives several other examples where the side effects are present in the initial user code. For instance the value being boxed might come from an array, in which case the encapsulated copy in the box expression tree would contain the array null check and bounds check. So removing the entire tree can alter behavior. This fix attempts to carefully preserve the important side effects by reworking how a box is imported. The copy is now moved out from under the box into a second upstream statement. The box itself is then just a trivial side-effect-free reference to the box temp. To ensure proper ordering of side effects the jit spills the evaluation stack before appending the copy statement. When the optimization fires the jit removes the upstream heap allocation as before, as well as the now-trivial compare tree. It analyzes the source side of the upstream copy. If it is side effect free, the copy is removed entirely. If not, the jit modifies the copy into a minimal load of the boxed value, and this load should reproduce the necessary side effects. The optimization is only performed when the tree shape of the copy matches expected patterns. There are some expected cases where the tree won't match, for instance if the optimization is invoked while the jit is inlining. Because this optimization runs at several points the jit can catch these cases once inlining completes. There is one case that is not handled that could be -- if the assignment part of the copy is itself a subtree of a comma. This doesn't happen often. The optimization is now also extended to handle the case where the comparision operation is `cgt.un`. This doesn't catch any new cases but causes the optimization to happen earlier, typically during importation, which should reduce jit time slightly. Generally the split of the box into two upstream statements reduces code size, especially when the box expression is incorporated into a larger tree -- for example a call. However in some cases where the value being boxed comes from an array, preserving the array bounds check now causes loop cloning to kick in and increase code size. Hence the overall size impact on the jit-diff set is essentially zero. Added a number of new test cases showing the variety of situations that must be handled and the need to spill before appending the copy statement. Fixes #12949.

MichalStrehovsky changed the title ~~Bad x64 codegen in VerifyUnifierConsistency (CoreRT)~~ Bad x64 codegen in VerifyUnifierConsistency Jul 20, 2017

RussKeldorph assigned AndyAyersMS Jul 20, 2017

AndyAyersMS closed this as completed in dotnet/coreclr#13016 Jul 31, 2017

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

ghost locked as resolved and limited conversation to collaborators Dec 21, 2020

Bad x64 codegen in VerifyUnifierConsistency #8606

Bad x64 codegen in VerifyUnifierConsistency #8606

Comments

MichalStrehovsky commented Jul 20, 2017

AndyAyersMS commented Jul 20, 2017

AndyAyersMS commented Jul 20, 2017

mikedn commented Jul 20, 2017

AndyAyersMS commented Jul 20, 2017

AndyAyersMS commented Jul 20, 2017

AndyAyersMS commented Jul 20, 2017

AndyAyersMS commented Jul 20, 2017

AndyAyersMS commented Jul 20, 2017

AndyAyersMS commented Jul 20, 2017

AndyAyersMS commented Jul 20, 2017

AndyAyersMS commented Jul 20, 2017

AndyAyersMS commented Jul 21, 2017

AndyAyersMS commented Jul 21, 2017

mikedn commented Jul 21, 2017

erozenfeld commented Jul 21, 2017

JosephTremoulet commented Jul 21, 2017

AndyAyersMS commented Jul 21, 2017

AndyAyersMS commented Jul 21, 2017

JosephTremoulet commented Jul 21, 2017

AndyAyersMS commented Jul 21, 2017

JosephTremoulet commented Jul 21, 2017

briansull commented Jul 21, 2017

JosephTremoulet commented Jul 21, 2017 • edited Loading

AndyAyersMS commented Jul 21, 2017

mikedn commented Jul 21, 2017

AndyAyersMS commented Jul 21, 2017

AndyAyersMS commented Jul 21, 2017

AndyAyersMS commented Jul 21, 2017

mikedn commented Jul 21, 2017

JosephTremoulet commented Jul 21, 2017

AndyAyersMS commented Jul 21, 2017

AndyAyersMS commented Jul 21, 2017

AndyAyersMS commented Jul 22, 2017

AndyAyersMS commented Jul 23, 2017

AndyAyersMS commented Jul 23, 2017

mikedn commented Jul 23, 2017

AndyAyersMS commented Jul 24, 2017

AndyAyersMS commented Jul 24, 2017

mikedn commented Jul 24, 2017

mikedn commented Jul 24, 2017

AndyAyersMS commented Jul 24, 2017

JosephTremoulet commented Jul 21, 2017 •

edited

Loading