Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2.067 branch: Enormous compile time increase after dtor rework #1010

Closed
dnadlinger opened this issue Jul 29, 2015 · 45 comments
Closed

2.067 branch: Enormous compile time increase after dtor rework #1010

dnadlinger opened this issue Jul 29, 2015 · 45 comments

Comments

@dnadlinger
Copy link
Member

For example, the std.uni debug mode unit tests take several times longer to compile now (probably more than 10x). 70% of the time is spent in llvm::MachineModuleInfo::getOrCreateLandingPadInfo(llvm::MachineBasicBlock*).

@dnadlinger
Copy link
Member Author

We seem to generate almost 300k landing pads (for dtors) in a single function. I find it rather impressive that the compilation finishes at all.

@dnadlinger
Copy link
Member Author

The function in question is std.uni.__unittestL2982_14.

@dnadlinger
Copy link
Member Author

Small sample: (line numbers in a separate file)

postinvoke649:                                    ; preds = %temporariesLandingPad646
  call void @_d_eh_resume_unwind(i8* %125), !dbg !5629 ; [debug line = 6:5]
  unreachable, !dbg !5629                         ; [debug line = 6:5]

eh.collision641:                                  ; preds = %temporariesLandingPad646
  %landing_pad642 = landingpad { i8*, i32 } personality i32 (i32, i32, i64, i8*, i8*)* @_d_eh_personality
          cleanup, !dbg !5627                     ; [#uses = 1 type = { i8*, i32 }] [debug line = 15:5]
  %124 = extractvalue { i8*, i32 } %landing_pad642, 0, !dbg !5627 ; [#uses = 1 type = i8*] [debug line = 15:5]
  invoke void @_d_eh_handle_collision(i8* %124, i8* %135)
          to label %postinvoke644 unwind label %temporariesLandingPad643, !dbg !5627 ; [debug line = 15:5]

postinvoke644:                                    ; preds = %eh.collision641
  unreachable, !dbg !5627                         ; [debug line = 15:5]

postinvoke647:                                    ; preds = %postinvoke640
  call void @_d_eh_resume_unwind(i8* %135), !dbg !5629 ; [debug line = 6:5]
  unreachable, !dbg !5629                         ; [debug line = 6:5]

temporariesLandingPad646:                         ; preds = %postinvoke640
  %landing_pad648 = landingpad { i8*, i32 } personality i32 (i32, i32, i64, i8*, i8*)* @_d_eh_personality
          cleanup, !dbg !5629                     ; [#uses = 2 type = { i8*, i32 }] [debug line = 6:5]
  %125 = extractvalue { i8*, i32 } %landing_pad648, 0, !dbg !5629 ; [#uses = 1 type = i8*] [debug line = 6:5]
  %126 = extractvalue { i8*, i32 } %landing_pad648, 1, !dbg !5629 ; [#uses = 0 type = i32] [debug line = 6:5]
  invoke void @_D3std3uni38__T13InversionListTS3std3uni8GcPolicyZ13InversionList11__fieldDtorMFNaNbNiNeZv(%"std.uni.InversionList!(GcPolicy).InversionList"* %__tmpfordtor1983)
          to label %postinvoke649 unwind label %eh.collision641, !dbg !5629 ; [debug line = 6:5] [display name = std.uni.InversionList!(GcPolicy).InversionList.~this]

temporariesLandingPad643:                         ; preds = %eh.collision641
  %landing_pad645 = landingpad { i8*, i32 } personality i32 (i32, i32, i64, i8*, i8*)* @_d_eh_personality
          cleanup, !dbg !5627                     ; [#uses = 2 type = { i8*, i32 }] [debug line = 15:5]
  %127 = extractvalue { i8*, i32 } %landing_pad645, 0, !dbg !5627 ; [#uses = 1 type = i8*] [debug line = 15:5]
  %128 = extractvalue { i8*, i32 } %landing_pad645, 1, !dbg !5627 ; [#uses = 0 type = i32] [debug line = 15:5]
  call void @_D3std3uni38__T13InversionListTS3std3uni8GcPolicyZ13InversionList11__fieldDtorMFNaNbNiNeZv(%"std.uni.InversionList!(GcPolicy).InversionList"* %__tmpfordtor1983), !dbg !5627 ; [debug line = 15:5] [display name = std.uni.InversionList!(GcPolicy).InversionList.~this]
  call void @_d_eh_resume_unwind(i8* %127), !dbg !5627 ; [debug line = 15:5]
  unreachable, !dbg !5627                         ; [debug line = 15:5]

eh.collision622:                                  ; preds = %temporariesLandingPad637
  %landing_pad623 = landingpad { i8*, i32 } personality i32 (i32, i32, i64, i8*, i8*)* @_d_eh_personality
          cleanup, !dbg !5625                     ; [#uses = 1 type = { i8*, i32 }] [debug line = 18:5]
  %129 = extractvalue { i8*, i32 } %landing_pad623, 0, !dbg !5625 ; [#uses = 1 type = i8*] [debug line = 18:5]
  invoke void @_d_eh_handle_collision(i8* %129, i8* %155)
          to label %postinvoke625 unwind label %temporariesLandingPad624, !dbg !5625 ; [debug line = 18:5]

postinvoke625:                                    ; preds = %eh.collision622
  unreachable, !dbg !5631                         ; [debug line = 6:5]

postinvoke638:                                    ; preds = %postinvoke621
  invoke void @_D3std3uni38__T13InversionListTS3std3uni8GcPolicyZ13InversionList11__fieldDtorMFNaNbNiNeZv(%"std.uni.InversionList!(GcPolicy).InversionList"* %set)
          to label %postinvoke656 unwind label %temporariesLandingPad655, !dbg !5633 ; [debug line = 6:5] [display name = std.uni.InversionList!(GcPolicy).InversionList.~this]

postinvoke658:                                    ; preds = %temporariesLandingPad655
  call void @_d_eh_resume_unwind(i8* %131), !dbg !5633 ; [debug line = 6:5]
  unreachable, !dbg !5633                         ; [debug line = 6:5]

eh.collision650:                                  ; preds = %temporariesLandingPad655
  %landing_pad651 = landingpad { i8*, i32 } personality i32 (i32, i32, i64, i8*, i8*)* @_d_eh_personality
          cleanup, !dbg !5629                     ; [#uses = 1 type = { i8*, i32 }] [debug line = 6:5]
  %130 = extractvalue { i8*, i32 } %landing_pad651, 0, !dbg !5629 ; [#uses = 1 type = i8*] [debug line = 6:5]
  invoke void @_d_eh_handle_collision(i8* %130, i8* %155)
          to label %postinvoke653 unwind label %temporariesLandingPad652, !dbg !5629 ; [debug line = 6:5]

postinvoke653:                                    ; preds = %eh.collision650
  unreachable, !dbg !5629    

@dnadlinger
Copy link
Member Author

@kinke: Could you have a quick look at this? Not quite minimal test case:

void main() {
    import std.uni;
    import std.typecons : tuple;
    import std.algorithm : equal;

    auto set = CodepointSet('A', 'D'+1, 'a', 'd'+1);
    set = unicode.ASCII;
    set = CodepointSet('a', 'z'+1, 'а', 'я'+1);
    foreach(v; 'a'..'z'+1)
        assert(set[v]);
    // Cyrillic lowercase interval
    foreach(v; 'а'..'я'+1)
        assert(set[v]);
    //specific order is not required, intervals may interesect
    auto set2 = CodepointSet('а', 'я'+1, 'a', 'd', 'b', 'z'+1);
    assert(set2.byInterval.equal(set.byInterval));

    auto gothic = unicode.Gothic;
    // Gothic letter ahsa
    assert(gothic['\U00010330']);
    // no ascii in Gothic obviously
    assert(!gothic['$']);

    CodepointSet emptySet;
    assert(emptySet.length == 0);
    assert(emptySet.empty);

    set = unicode.ASCII;
    // union with the inverse gets all of code points in the Unicode
    assert((set | set.inverted).length == 0x110000);
    // no intersection with inverse
    assert((set & set.inverted).empty);
}

@kinke
Copy link
Member

kinke commented Jul 30, 2015

Unfortunately, I don't have time these days, there's a bachelor party this weekend. What surely leads to loads of landing pads is the fact that each temporary is destructed safely, i.e., for a potentially throwing dtor, there's a corresponding landing pad destructing all older temporaries (safely again) - see #914 (comment).

@dnadlinger
Copy link
Member Author

Okay. Guess I need to figure out why we generate thousands of landing pads for a single function then and didn't before.

@dnadlinger
Copy link
Member Author

In other words, what I guess I'm asking is what the difference to the old code is. Didn't we handle throwing constructors correctly before?

@kinke
Copy link
Member

kinke commented Jul 30, 2015

See #914 (comment). Previously, toElemDtor(e) would only destruct all (nested) temporaries after invoking toElem(e). A landing pad would only be created if the root of e was an AssertExp or a CallExp (i.e., missing all potential throws in its subexpressions), and that one would destruct all temporaries, regardless of whether they had actually been constructed in the first place (throw before constructing the temporary).
Additionally, D2.067 requires that temporaries in call/constructor arguments are properly destructed if another argument expression (evaluated after the temporary) yields an exception. So the front-end now features this new NewExp::argprefix and rewrites a CallExp as (argprefix, CallExp), expecting the back-end to take care of proper destruction.
I'm sure relaxing the destruction of the temporaries to an 'unsafe' one, i.e., not recursively spawning new temporariesLandingPads inside another temporariesLandingPad, will cut the number of landing pads down tremendously. Currently, inside a temporariesLandingPad, so after there was a throw already, we spawn a new landing pad for each potentially throwing call, including each temporary's dtor - and this obviously recursively. In case a single dtor threw, all other temporaries will have been successfully destructed before passing the dtor exception up, leading to an exception chain then. Don't know how DMD handles this stuff - probably not like this. ;)

@dnadlinger
Copy link
Member Author

If I remember correctly, the way DMD does it is by emitting boolean local "gates" to track which arguments have been constructed yet, and then having a single landing pad to destruct all fully constructed objects. I need to look into the source to verify this, though.

@dnadlinger
Copy link
Member Author

Hmm, I wonder whether they actually also do throwing dtors that way (looping back to the same landing pad). Seems a bit unlikely, given that the argprefix code is rather simple.

@kinke
Copy link
Member

kinke commented Aug 9, 2015

Afaik the argprefix code uses a single gate (bool) for all dtor expressions, but that's because of implicit moving of the temporaries for the function call, i.e., eliding postblit and dtor if there was no throw. But it doesn't keep track of what has been constructed yet - makes sense, as that needs to be done for any expression tree anyway, not just the special argprefix one.
While checking the MSVC cmake flags, I came across [https://msdn.microsoft.com/en-us/library/1deeycx5.aspx /EH], which allows to control this destruction logic for MS C++. In particular, you may set the assumption that extern(C) functions never throw. Iirc we almost always emit a invoke instruction instead of a direct call, as D nothrow isn't enough for a LLVM nothrow. This surely accounts for countless landing pads, as every trivial dtor I've tested (numDtor++ etc.) was considered potentially throwing.

@dnadlinger
Copy link
Member Author

DMD's implementation seems hopelessly broken anyway, unless I'm missing something:

import core.stdc.stdio;

struct ThrowingCtor {
    const char* name;

    this(const char* name) {
        printf("Constructing %s\n", name);
        this.name = name;
        throw new Exception("somebody set up us the bomb");
    }

    ~this() {
        printf("Destroying %s\n", name);
    }
}

struct ThrowingDtor {
    const char* name;

    this(const char* name) {
        printf("Constructing %s\n", name);
        this.name = name;
    }

    ~this() {
        printf("Destroying %s\n", name);
        throw new Exception("for great justice");
    }
}

void foo(ThrowingDtor a, ThrowingDtor b, ThrowingCtor c) {}

void main() {
    auto z = ThrowingDtor("z");
    foo(ThrowingDtor("a"), ThrowingDtor("b"), ThrowingCtor("c"));
}

prints

Constructing z
Constructing a
Constructing b
Constructing c
Destroying z
[email protected](9): somebody set up us the bomb
----------------
5   test                                0x000000010d3a9ad9 _Dmain + 141
6   test                                0x000000010d3beabc D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZ9__lambda1MFZv + 40
7   test                                0x000000010d3bea01 void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) + 45
8   test                                0x000000010d3bea61 void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).runAll() + 45
9   test                                0x000000010d3bea01 void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) + 45
10  test                                0x000000010d3be974 _d_run_main + 504
11  test                                0x000000010d3a9b68 main + 20
12  libdyld.dylib                       0x00007fff916c65c9 start + 1
13  ???                                 0x0000000000000001 0x0 + 1
[email protected](27): for great justice
----------------
5   test                                0x000000010d3a9b43 _Dmain + 247
6   test                                0x000000010d3beabc D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZ9__lambda1MFZv + 40
7   test                                0x000000010d3bea01 void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) + 45
8   test                                0x000000010d3bea61 void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).runAll() + 45
9   test                                0x000000010d3bea01 void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) + 45
10  test                                0x000000010d3be974 _d_run_main + 504
11  test                                0x000000010d3a9b68 main + 20
12  libdyld.dylib                       0x00007fff916c65c9 start + 1
13  ???                                 0x0000000000000001 0x0 + 1

i.e. a and b are never destructed.

@dnadlinger
Copy link
Member Author

We, in addition to that, seem not to implement exception chaining correctly in this example (the "for great justice" is printed instead of being chained to the first one).

@kinke
Copy link
Member

kinke commented Aug 9, 2015

Confirmed on Win64 with DMD 2.067.1:

Constructing z
Constructing a
Constructing b
Constructing c
Destroying z
object.Exception@\ldc\dtor.d(9): somebody set up us the bomb
----------------
0x00402063
0x0040225C
0x00402A3E
0x00402A13
0x0040292B
0x00402303
0x7608850D in BaseThreadInitThunk
0x7756BF39 in RtlInitializeExceptionChain
0x7756BF0C in RtlInitializeExceptionChain
object.Exception@\ldc\dtor.d(27): for great justice
----------------
0x004020F6
0x004022D0
0x00402A3E
0x00402A13
0x0040292B
0x00402303
0x7608850D in BaseThreadInitThunk
0x7756BF39 in RtlInitializeExceptionChain
0x7756BF0C in RtlInitializeExceptionChain

Even better, when not throwing in ThrowingDtor::~this():

Constructing z
Constructing a
Constructing b
Constructing c
Destroying z
object.Exception@\ldc\dtor.d(9): somebody set up us the bomb
----------------
0x00402063
0x00402228
0x00402A0A
0x004029DF
0x004028F7
0x004022CF
0x7608850D in BaseThreadInitThunk
0x7756BF39 in RtlInitializeExceptionChain
0x7756BF0C in RtlInitializeExceptionChain

a and b still not destructed, for a regular throw when building the args...

@dnadlinger
Copy link
Member Author

Simpler example that does not throw in destructors (which should probably work due to exception chaining, but I couldn't find any confirmation in the spec for):

import core.stdc.stdio;

struct Foo {
    const char* name;

    this(const char* name, bool fail) {
        printf("Constructing %s\n", name);
        this.name = name;

        if (fail) {
            throw new Exception("somebody set up us the bomb");
        }
    }

    ~this() {
        printf("Destroying %s\n", name);
    }
}

void foo(Foo a, Foo b) {}

void main() {
    foo(Foo("a", false), Foo("b", true));
}

@kinke
Copy link
Member

kinke commented Aug 10, 2015

Oh wow, this one's cool:

import core.stdc.stdio;
import core.stdc.string;

struct Foo {
    const char* name;
    bool throwInDtor;

    this(const char* name, bool throwInDtor) {
        printf("Constructing %s\n", name);
        this.name = name;
        this.throwInDtor = throwInDtor;
    }

    ~this() {
        printf("Destroying %s\n", name);
        if (throwInDtor)
            throw new Exception("dtor " ~ name[0]);
    }
}

Foo make(const char* name, bool doThrow) {
    if (doThrow)
        throw new Exception("make()");
    return Foo(name, false);
}

void foo(Foo a, Foo b, Foo c, Foo d) { throw new Exception("foo()!"); }

void main() {
    foo(Foo("a", true), make("b", true), Foo("c", false), make("d", true));
}

DMD 2.067.1, Win32:

1) foo(Foo("a", true), make("b", false), Foo("c", true), make("d", false))
Constructing a
Constructing b
Constructing c
Constructing d
Destroying a
Destroying b
Destroying c
Destroying d
object.Exception@..\dtor.d(27): foo()!
object.Exception@..\dtor.d(17): dtor a
object.Exception@..\dtor.d(17): dtor c

2) foo(Foo("a", true), make("b", false), Foo("c", true), make("d", true))
Constructing a
Constructing b
Constructing c
object.Exception@..\dtor.d(23): make()

3) foo(Foo("a", false), make("b", false), Foo("c", false), make("d", true))
Constructing a
Constructing b
Constructing c
object.Exception@..\dtor.d(23): make()

4) foo(Foo("a", false), make("b", true), Foo("c", false), make("d", true))
Constructing a
Destroying a
object.Exception@..\dtor.d(23): make()

5) foo(Foo("a", true), make("b", true), Foo("c", false), make("d", true))
Constructing a
Destroying a
Destroying a
...
<repetition until crash>

Initially I thought DMD would just support regular function calls and could be missing ctor calls, but there's way more to it.
Cutting-edge LDC Win64 yields the same as DMD in case 1 (apparently correct, although I would have expected a reversed order when destructing - but that's just a normal exception inside the callee, so the temporary args have been implicitly moved into the function params which are destructed by the callee).
LDC handles cases 2 & 3 like DMD (all 3 temporaries not destructed at all, even in case 3 with no throwing dtors!).
No difference between DMD and LDC for case 4 either; it doesn't need exception chaining and is actually correct. :P
Case 5 kind of works for LDC, but as David said, it doesn't chain the original exception (I thought that was due to basic Win64 EH support in LDC):

Constructing a
Destroying a
object.Exception@..\dtor.d(17): dtor a

I suspect cases 2 & 3 to be a front-end bug when generating argprefix. Case 5 is definitely a DMD bug which may have been fixed in the latest DMD 2.068 RC or beta.

@kinke
Copy link
Member

kinke commented Aug 10, 2015

Just tested DMD 2.068-rc1 - same results as DMD 2.067.1. :(

/edit: https://issues.dlang.org/show_bug.cgi?id=14902

@dnadlinger
Copy link
Member Author

Seems like DMD 2.068.0 on OS X gets 4) wrong; it does not destruct a.

Edit: Sorry, forgot to mention link for my report here: https://issues.dlang.org/show_bug.cgi?id=14903

@dnadlinger
Copy link
Member Author

On the plus side, it does not enter an infinite loop on 5.

@dnadlinger
Copy link
Member Author

Maybe I'm missing something obvious, but I think I just realized something regarding the complexity issue: Instead of the current version with its quadratic (or possibly even worse, didn't think much about it) growth, can't we just generate a linear number of nested EH contexts, i.e. the equivalent of a try-finally block per local variable with a dtor?

@kinke
Copy link
Member

kinke commented Aug 11, 2015

possibly even worse

Most likely worse! :D
Yep, that'd be the original idea, i.e., putting the dtor in a finally-block of a nested try{} scope per temporary. I've had 0 experience with LLVM landing pads when tackling this, so I had no idea (and still don't) how to translate D's more abstract try { auto x = Struct(); ... } finally { x.~this(); } concept to LLVM with its invoke instruction, landing pads and EH intrinsics/runtime functions - except for the current, most likely exponential approach. Afaik there's no try-finally front-end expression suited for a rewrite; a try is a statement.
Another viable solution would be the additional live-flag per temporary, related to the front-end's argprefix codegen. All flags would be declared and initialized at the beginning of the expression tree. Each would be set right after successful construction of the corresponding temporary. We'd only need a single landing pad with those flags as guards (new dtor expression for each temporary: isLive && (~this(), isLive = false)). Resetting the flag to false is important, as the dtor is invoked normally at the end of the expression tree if there was no throw. If another temporary's dtor throws later on, the already destructed temporaries must not be destructed again in the landing pad!

/edit: Well no, the single landing pad wouldn't allow for exception chaining, we'd leave it after the first throwing dtor [I remember now]! That would require the exponential expansion again, in the landing pad, to make sure all dtors are invoked and all exceptions are chained - unless one constructs the exception chain manually, in that case the complexity would only be linear...

@dnadlinger
Copy link
Member Author

Hmm, I knew I missed something. Even for nested try-finally blocks the number of basic blocks is currently quadratic. I get the feeling that this std.uni unit test behaves much worse than that, though.

@kinke
Copy link
Member

kinke commented Aug 11, 2015

I think the total number of dtor calls in a temporariesLandingPad (incl. all its nested postinvoke-temporariesLandingPad scopes) really is exponential in the number of temporaries with potentially throwing dtors N: 2^N - 1
Eeach dtor may or may not throw; we need to cover each throw in a dedicated landing pad, in addition to the regular postinvoke block. Only exception: the last dtor, which may freely throw as there's nothing left to be destructed. So it's a binary tree of depth N-1.
And such a binary tree of variable depth is used for each potentially throwing nested call in the expression tree - and once at the end of the expression tree.

@dnadlinger
Copy link
Member Author

Yep. What DMD seems to do is to branch to the next landing pad at the end of each block (in case no exception is thrown), which looks like it could help things a lot. To be able to implement this, we might have to rework exception chaining support such that we do not need the special collision landing pad "branches". I need to think about this some more, though. (Plus, I think we'd have to split up the landing pad into the header instruction and the actual code content. This would be a comparatively minor tweak, though.)

@kinke
Copy link
Member

kinke commented Aug 11, 2015

The single landing pad with manual exception chaining would be tempting, e.g., for an expression tree auto temp1 = Struct(), auto temp2 = Struct(), auto temp3 = Struct(), auto x = throwingFunction(), nonThrowingFoo(temp1, temp2, temp3, x)

bool isAlive1, isAlive2, isAlive3;
auto temp1 = Struct(), isAlive1 = true;
auto temp2 = Struct(), isAlive2 = true;
auto temp3 = Struct(), isAlive3 = true;
invoke throwingFunction

postInvoke:
    call nonThrowingFoo
    // possibly more invokes, all landing in the same temporariesLandingPad
    // end of expression tree: destruct temporaries normally
    initialize exception chain with no exception
    goto destructTemporaries

temporariesLandingPad:
    extract exception
    initialize exception chain with exception
    goto destructTemporaries

destructTemporaries:
    if (!isAlive3) goto postInvoke3
    else invoke temp3.~this()

    landingPad3:
        extract exception
        add to chain
        goto postInvoke3

    postInvoke3:
        if (!isAlive2) goto postInvoke2
        else invoke temp2.~this()

        landingPad2:
            extract exception
            add to chain
            goto postInvoke2

        postInvoke2:
            if (!isAlive1) goto postInvoke1
            else invoke temp1.~this()

            landingPad1:
                extract exception
                add to chain
                goto postInvoke1

            postInvoke1:
                check for exception (chain) and rethrow

No idea whether that would be feasible, I have no idea about exception collision/chaining.

@dnadlinger
Copy link
Member Author

Even not considering exception chaining for the moment, wouldn't we rather have proper landing pads set up for the various "construction stages" that branch off to somewhere in the list of destruction BBs instead of keeping "isAlive" booleans around?

@dnadlinger
Copy link
Member Author

If we fix this properly, we can finally also get rid of having to be able to emit code more than once for finally handlers.

@kinke
Copy link
Member

kinke commented Aug 11, 2015

Wouldn't we rather have proper landing pads set up for the various "construction stages" that branch off to somewhere in the list of destruction BBs instead of keeping "isAlive" booleans around?

Hmm, that's more or less what it amounts to anyway and would obviously be preferrable, but eliding the bools would mean that if the 4th temporary is to be destructed, that all 3 below it (declared before it) need to be alive too. So we'll probably need to be careful about not destructing any temporary earlier (no more nested toElemDtor() scopes), otherwise the current temporaries stack size doesn't uniquely map to the most recent temporary (=> destruction BB to begin with). I.e., the temporaries stack should constantly grow in declaration order.

@dnadlinger
Copy link
Member Author

Is there a valid situation where temporaries would not be destructed in reverse order of construction? It seems to me that all such occurrences would likely be bugs anyway.

@kinke
Copy link
Member

kinke commented Aug 11, 2015

Nope, it's just got to do with unique mapping:

construct temp1
  construct temp2 (nested toElemDtor, e.g. argprefix)
  destruct temp2 (eager destruction)
  construct temp3
  destruct temp3
destruct temp1

Now the problem is that at the end of temp3's destruction BB, we'd need to jump to temp1's destruction BB, not to temp2's, as that one is already taken care of. Well actually now that I've typed this down, I'm thinking that that is actually not a problem, it's just some additional complexity when determining the branch targets, as the parent temporary remains the same throughout the lifetime of a temporary (and always outlives it).

@dnadlinger
Copy link
Member Author

I have an idea how to avoid the separate collision handling by moving that code back into the personality routine. However, another issue is that we still generate a separate finally basic block each time we leave a try (or scope with dtors), for example with multiple returns.

@kinke
Copy link
Member

kinke commented Aug 12, 2015

I have an idea how to avoid the separate collision handling by moving that code back into the personality routine.

Sounds great.

However, another issue is that we still generate a separate finally basic block each time we leave a try (or scope with dtors), for example with multiple returns.

Where's the code responsible for this? The codegen for a TryFinallyStatement seems fine - a single landing pad is set up for the whole try-block if there's a finally body. So is it the front-end already generating the whole expansion of try-finallys?
Edit: okay found it, so DtoEnclosingHandlers() it is.

The .ll for this really seems excessive: ;)

struct S {
    int a;
    ~this() {}
}

int main(string[] args) {
    int foo;
    try {
        S s1;
        if (args.length <= 1) return 1;

        S s2;
        if (args.length <= 2) return 2;

        try {
            S s3;
            if (args.length <= 3) return 3;

            S s4;
        }
        finally {
            foo = 333;
        }
    }
    finally {
        if (foo == 0)
            foo = 666;
    }

    return foo;
}

dnadlinger added a commit to ldc-developers/druntime that referenced this issue Aug 13, 2015
…e collision landing pads in user code

Also fixes a chaining bug noticed in ldc-developers/ldc#1010.

Support for HP_LIBUNWIND is removed; it wasn't functional before
either. In case somebody wants to port LDC to a platform without
an unwinder, they can use the Apple unwinder implementation
available as part of the LLVM project (repository at
http://llvm.org/git/libunwind).
@dnadlinger
Copy link
Member Author

With my exception chaining fixes, we handle 4) and 5) from your test case set correctly.

@dnadlinger
Copy link
Member Author

See #1019.

Regarding the DtoEnclosingHandlers blow-up issue, I think one possible solution that allows us to only codegen each finally block contents once is to have a local boolean that keeps track of whether an exception has been thrown in this scope. At the end of the cleanup chain we check it and continue unwinding or return normally as appropriate. For multiple returns of types that return an actual LLVM value (i.e. not through an implicit pointer argument), we'd need to have a temporary alloca set up correctly. This is also what Clang does, and should not be an issue w.r.t. (N)RVO because those cases use pointer arguments anyway.

Are you working on this?

@dnadlinger
Copy link
Member Author

I just posted a preliminary patch for the frontend argprefix issue to Bugzilla. It fixes all the test cases for us, but I don't have time to properly test it right now.

dnadlinger added a commit to ldc-developers/druntime that referenced this issue Aug 13, 2015
…e collision landing pads in user code

Also fixes a chaining bug noticed in ldc-developers/ldc#1010.

Support for HP_LIBUNWIND is removed; it wasn't functional before
either. In case somebody wants to port LDC to a platform without
an unwinder, they can use the Apple unwinder implementation
available as part of the LLVM project (repository at
http://llvm.org/git/libunwind).
@kinke
Copy link
Member

kinke commented Aug 13, 2015

Impressive progress! I'm lacking time too atm..

@dnadlinger
Copy link
Member Author

Unfortunately, I lack the time to do the actual rewrite of the finally block generation (i.e. fixing the performance regression) anytime soon. Now that exception chaining is out of the picture, it should be fairly simple, though. Having multiple exit paths from a scope might complicate the initial analysis of the problem some, but I'm fairly confident that the solution will be fairly straightforward once you have the right perspective.

It might also help to look at what Clang does. It seems to use local integer variables to disambiguate cases where there are multiple exit paths from a "finally" block. However, without having thought about the issue in any detail, it seems that our implementation might end up quite a bit simpler because the rules for goto, etc. are much more restricted in D.

@dnadlinger
Copy link
Member Author

I'm knee-deep into what amounts to a complete rewrite of all our control flow scoping-related codegen. Hope I can finish it today; otherwise I'll have to post an incomplete version for somebody else to continue.

@kinke
Copy link
Member

kinke commented Aug 16, 2015

Heheh, good news & good luck! 👍

@dnadlinger
Copy link
Member Author

Progress! For the example in #1010 (comment) we now generate the following IR (EH disabled temporarily):

define i32 @_Dmain({ i64, { i64, i8* }* } %unnamed) #0 {
  %args = alloca { i64, { i64, i8* }* }, align 8  ; [#uses = 4 type = { i64, { i64, i8* }* }*]
  %foo = alloca i32, align 4                      ; [#uses = 6 type = i32*]
  %s1 = alloca %test.S, align 4                   ; [#uses = 2 type = %test.S*]
  %return.slot = alloca i32                       ; [#uses = 5 type = i32*]
  %s2 = alloca %test.S, align 4                   ; [#uses = 2 type = %test.S*]
  %s3 = alloca %test.S, align 4                   ; [#uses = 2 type = %test.S*]
  %s4 = alloca %test.S, align 4                   ; [#uses = 2 type = %test.S*]
  %branchsel.finally9 = alloca i32                ; [#uses = 3 type = i32*]
  %branchsel.finally8 = alloca i32                ; [#uses = 3 type = i32*]
  %branchsel.finally4 = alloca i32                ; [#uses = 4 type = i32*]
  %branchsel.finally1 = alloca i32                ; [#uses = 5 type = i32*]
  %branchsel.finally = alloca i32                 ; [#uses = 5 type = i32*]
  store { i64, { i64, i8* }* } %unnamed, { i64, { i64, i8* }* }* %args
  store i32 0, i32* %foo
  %1 = bitcast %test.S* %s1 to i8*                ; [#uses = 1 type = i8*]
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 4, i32 1, i1 false)
  %2 = getelementptr { i64, { i64, i8* }* }* %args, i32 0, i32 0 ; [#uses = 1 type = i64*]
  %.len = load i64* %2                            ; [#uses = 1 type = i64]
  %3 = icmp ule i64 %.len, 1                      ; [#uses = 1 type = i1]
  br i1 %3, label %if2, label %endif3

finally:                                          ; preds = %try.success15, %finally1
  %4 = load i32* %foo                             ; [#uses = 1 type = i32]
  %5 = icmp eq i32 %4, 0                          ; [#uses = 1 type = i1]
  br i1 %5, label %if, label %endif

if:                                               ; preds = %finally
  store i32 666, i32* %foo
  br label %endif

endif:                                            ; preds = %if, %finally
  %6 = load i32* %branchsel.finally               ; [#uses = 1 type = i32]
  switch i32 %6, label %return [
    i32 1, label %try.success16
  ]

finally1:                                         ; preds = %try.success14, %finally4, %if2
  call void @_D4test1S6__dtorMFZv(%test.S* %s1)
  %7 = load i32* %branchsel.finally1              ; [#uses = 1 type = i32]
  switch i32 %7, label %finally [
    i32 1, label %try.success15
  ]

if2:                                              ; preds = %0
  store i32 1, i32* %return.slot
  store i32 0, i32* %branchsel.finally1
  store i32 0, i32* %branchsel.finally
  br label %finally1

endif3:                                           ; preds = %0
  %8 = bitcast %test.S* %s2 to i8*                ; [#uses = 1 type = i8*]
  call void @llvm.memset.p0i8.i64(i8* %8, i8 0, i64 4, i32 1, i1 false)
  %9 = getelementptr { i64, { i64, i8* }* }* %args, i32 0, i32 0 ; [#uses = 1 type = i64*]
  %.len5 = load i64* %9                           ; [#uses = 1 type = i64]
  %10 = icmp ule i64 %.len5, 2                    ; [#uses = 1 type = i1]
  br i1 %10, label %if6, label %endif7

return:                                           ; preds = %endif
  %11 = load i32* %return.slot                    ; [#uses = 1 type = i32]
  ret i32 %11

finally4:                                         ; preds = %try.success13, %finally8, %if6
  call void @_D4test1S6__dtorMFZv(%test.S* %s2)
  %12 = load i32* %branchsel.finally4             ; [#uses = 1 type = i32]
  switch i32 %12, label %finally1 [
    i32 1, label %try.success14
  ]

if6:                                              ; preds = %endif3
  store i32 2, i32* %return.slot
  store i32 0, i32* %branchsel.finally4
  store i32 0, i32* %branchsel.finally1
  store i32 0, i32* %branchsel.finally
  br label %finally4

endif7:                                           ; preds = %endif3
  %13 = bitcast %test.S* %s3 to i8*               ; [#uses = 1 type = i8*]
  call void @llvm.memset.p0i8.i64(i8* %13, i8 0, i64 4, i32 1, i1 false)
  %14 = getelementptr { i64, { i64, i8* }* }* %args, i32 0, i32 0 ; [#uses = 1 type = i64*]
  %.len10 = load i64* %14                         ; [#uses = 1 type = i64]
  %15 = icmp ule i64 %.len10, 3                   ; [#uses = 1 type = i1]
  br i1 %15, label %if11, label %endif12

finally8:                                         ; preds = %try.success, %finally9
  store i32 333, i32* %foo
  %16 = load i32* %branchsel.finally8             ; [#uses = 1 type = i32]
  switch i32 %16, label %finally4 [
    i32 1, label %try.success13
  ]

finally9:                                         ; preds = %endif12, %if11
  call void @_D4test1S6__dtorMFZv(%test.S* %s3)
  %17 = load i32* %branchsel.finally9             ; [#uses = 1 type = i32]
  switch i32 %17, label %finally8 [
    i32 1, label %try.success
  ]

if11:                                             ; preds = %endif7
  store i32 3, i32* %return.slot
  store i32 0, i32* %branchsel.finally9
  store i32 0, i32* %branchsel.finally8
  store i32 0, i32* %branchsel.finally4
  store i32 0, i32* %branchsel.finally1
  store i32 0, i32* %branchsel.finally
  br label %finally9

endif12:                                          ; preds = %endif7
  %18 = bitcast %test.S* %s4 to i8*               ; [#uses = 1 type = i8*]
  call void @llvm.memset.p0i8.i64(i8* %18, i8 0, i64 4, i32 1, i1 false)
  call void @_D4test1S6__dtorMFZv(%test.S* %s4)
  store i32 1, i32* %branchsel.finally9
  br label %finally9

try.success:                                      ; preds = %finally9
  store i32 1, i32* %branchsel.finally8
  br label %finally8

try.success13:                                    ; preds = %finally8
  store i32 1, i32* %branchsel.finally4
  br label %finally4

try.success14:                                    ; preds = %finally4
  store i32 1, i32* %branchsel.finally1
  br label %finally1

try.success15:                                    ; preds = %finally1
  store i32 1, i32* %branchsel.finally
  br label %finally

try.success16:                                    ; preds = %endif
  %19 = load i32* %foo                            ; [#uses = 0 type = i32]
  %20 = load i32* %foo                            ; [#uses = 0 type = i32]
  %21 = load i32* %return.slot                    ; [#uses = 1 type = i32]
  ret i32 %21
}

Unfortunately, it does not optimize as nicely as I would have hoped yet. See below.

@dnadlinger
Copy link
Member Author

Ah, the poor optimization results are probably because the code contains a number of strange loops in untaken branches.

@dnadlinger
Copy link
Member Author

Fixed the bug and edited the above IR. Now optimizes nicely to

define i32 @_Dmain({ i64, { i64, i8* }* } %unnamed) #0 {
  %unnamed.fca.0.extract = extractvalue { i64, { i64, i8* }* } %unnamed, 0 ; [#uses = 2 type = i64]
  %1 = icmp ult i64 %unnamed.fca.0.extract, 2     ; [#uses = 1 type = i1]
  br i1 %1, label %finally1, label %endif3

finally1:                                         ; preds = %0
  ret i32 1

endif3:                                           ; preds = %0
  %2 = icmp ult i64 %unnamed.fca.0.extract, 3     ; [#uses = 1 type = i1]
  %.26 = select i1 %2, i32 2, i32 3               ; [#uses = 1 type = i32]
  ret i32 %.26
}

On x86, this now gives (-O3)

test`_Dmain:
test[0x100000bf0] <+0>:  cmp    rdi, 0x1
test[0x100000bf4] <+4>:  ja     0x100000bfc               ; <+12>
test[0x100000bf6] <+6>:  mov    eax, 0x1
test[0x100000bfb] <+11>: ret
test[0x100000bfc] <+12>: cmp    rdi, 0x2
test[0x100000c00] <+16>: seta   al
test[0x100000c03] <+19>: movzx  eax, al
test[0x100000c06] <+22>: or     eax, 0x2
test[0x100000c09] <+25>: ret

while DMD produces (-O -release -inline)

test`_Dmain:
test[0x100000cf4] <+0>:   push   rbp
test[0x100000cf5] <+1>:   mov    rbp, rsp
test[0x100000cf8] <+4>:   sub    rsp, 0x70
test[0x100000cfc] <+8>:   push   rbx
test[0x100000cfd] <+9>:   push   r12
test[0x100000cff] <+11>:  push   r13
test[0x100000d01] <+13>:  push   r14
test[0x100000d03] <+15>:  push   r15
test[0x100000d05] <+17>:  mov    rbx, rdi
test[0x100000d08] <+20>:  mov    rcx, rsi
test[0x100000d0b] <+23>:  xor    eax, eax
test[0x100000d0d] <+25>:  mov    dword ptr [rbp - 0x8], eax
test[0x100000d10] <+28>:  mov    dword ptr [rbp - 0x18], eax
test[0x100000d13] <+31>:  cmp    rbx, 0x1
test[0x100000d17] <+35>:  ja     0x100000d56               ; <+98>
test[0x100000d19] <+37>:  mov    eax, 0x1
test[0x100000d1e] <+42>:  mov    qword ptr [rbp - 0x28], rax
test[0x100000d22] <+46>:  sub    rsp, 0x8
test[0x100000d26] <+50>:  call   0x100000e8b               ; <+407>
test[0x100000d2b] <+55>:  add    rsp, 0x8
test[0x100000d2f] <+59>:  mov    rax, qword ptr [rbp - 0x28]
test[0x100000d33] <+63>:  mov    qword ptr [rbp - 0x20], rax
test[0x100000d37] <+67>:  sub    rsp, 0x8
test[0x100000d3b] <+71>:  call   0x100000e9f               ; <+427>
test[0x100000d40] <+76>:  add    rsp, 0x8
test[0x100000d44] <+80>:  mov    rax, qword ptr [rbp - 0x20]
test[0x100000d48] <+84>:  pop    r15
test[0x100000d4a] <+86>:  pop    r14
test[0x100000d4c] <+88>:  pop    r13
test[0x100000d4e] <+90>:  pop    r12
test[0x100000d50] <+92>:  pop    rbx
test[0x100000d51] <+93>:  mov    rsp, rbp
test[0x100000d54] <+96>:  pop    rbp
test[0x100000d55] <+97>:  ret
test[0x100000d56] <+98>:  mov    dword ptr [rbp - 0x14], eax
test[0x100000d59] <+101>: cmp    rbx, 0x2
test[0x100000d5d] <+105>: ja     0x100000db1               ; <+189>
test[0x100000d5f] <+107>: mov    eax, 0x2
test[0x100000d64] <+112>: mov    qword ptr [rbp - 0x40], rax
test[0x100000d68] <+116>: sub    rsp, 0x8
test[0x100000d6c] <+120>: call   0x100000e77               ; <+387>
test[0x100000d71] <+125>: add    rsp, 0x8
test[0x100000d75] <+129>: mov    rax, qword ptr [rbp - 0x40]
test[0x100000d79] <+133>: mov    qword ptr [rbp - 0x38], rax
test[0x100000d7d] <+137>: sub    rsp, 0x8
test[0x100000d81] <+141>: call   0x100000e8b               ; <+407>
test[0x100000d86] <+146>: add    rsp, 0x8
test[0x100000d8a] <+150>: mov    rax, qword ptr [rbp - 0x38]
test[0x100000d8e] <+154>: mov    qword ptr [rbp - 0x30], rax
test[0x100000d92] <+158>: sub    rsp, 0x8
test[0x100000d96] <+162>: call   0x100000e9f               ; <+427>
test[0x100000d9b] <+167>: add    rsp, 0x8
test[0x100000d9f] <+171>: mov    rax, qword ptr [rbp - 0x30]
test[0x100000da3] <+175>: pop    r15
test[0x100000da5] <+177>: pop    r14
test[0x100000da7] <+179>: pop    r13
test[0x100000da9] <+181>: pop    r12
test[0x100000dab] <+183>: pop    rbx
test[0x100000dac] <+184>: mov    rsp, rbp
test[0x100000daf] <+187>: pop    rbp
test[0x100000db0] <+188>: ret
test[0x100000db1] <+189>: mov    dword ptr [rbp - 0x10], eax
test[0x100000db4] <+192>: cmp    rbx, 0x3
test[0x100000db8] <+196>: ja     0x100000e36               ; <+322>
test[0x100000dba] <+198>: mov    eax, 0x3
test[0x100000dbf] <+203>: mov    qword ptr [rbp - 0x68], rax
test[0x100000dc3] <+207>: sub    rsp, 0x8
test[0x100000dc7] <+211>: call   0x100000e4c               ; <+344>
test[0x100000dcc] <+216>: add    rsp, 0x8
test[0x100000dd0] <+220>: mov    rax, qword ptr [rbp - 0x68]
test[0x100000dd4] <+224>: mov    qword ptr [rbp - 0x60], rax
test[0x100000dd8] <+228>: sub    rsp, 0x8
test[0x100000ddc] <+232>: call   0x100000e60               ; <+364>
test[0x100000de1] <+237>: add    rsp, 0x8
test[0x100000de5] <+241>: mov    rax, qword ptr [rbp - 0x60]
test[0x100000de9] <+245>: mov    qword ptr [rbp - 0x58], rax
test[0x100000ded] <+249>: sub    rsp, 0x8
test[0x100000df1] <+253>: call   0x100000e77               ; <+387>
test[0x100000df6] <+258>: add    rsp, 0x8
test[0x100000dfa] <+262>: mov    rax, qword ptr [rbp - 0x58]
test[0x100000dfe] <+266>: mov    qword ptr [rbp - 0x50], rax
test[0x100000e02] <+270>: sub    rsp, 0x8
test[0x100000e06] <+274>: call   0x100000e8b               ; <+407>
test[0x100000e0b] <+279>: add    rsp, 0x8
test[0x100000e0f] <+283>: mov    rax, qword ptr [rbp - 0x50]
test[0x100000e13] <+287>: mov    qword ptr [rbp - 0x48], rax
test[0x100000e17] <+291>: sub    rsp, 0x8
test[0x100000e1b] <+295>: call   0x100000e9f               ; <+427>
test[0x100000e20] <+300>: add    rsp, 0x8
test[0x100000e24] <+304>: mov    rax, qword ptr [rbp - 0x48]
test[0x100000e28] <+308>: pop    r15
test[0x100000e2a] <+310>: pop    r14
test[0x100000e2c] <+312>: pop    r13
test[0x100000e2e] <+314>: pop    r12
test[0x100000e30] <+316>: pop    rbx
test[0x100000e31] <+317>: mov    rsp, rbp
test[0x100000e34] <+320>: pop    rbp
test[0x100000e35] <+321>: ret
test[0x100000e36] <+322>: mov    dword ptr [rbp - 0xc], eax
test[0x100000e39] <+325>: lea    rcx, qword ptr [rbp - 0xc]
test[0x100000e3d] <+329>: sub    rsp, 0x8
test[0x100000e41] <+333>: call   0x100000e4c               ; <+344>
test[0x100000e46] <+338>: add    rsp, 0x8
test[0x100000e4a] <+342>: jmp    0x100000e51               ; <+349>
test[0x100000e4c] <+344>: lea    rcx, qword ptr [rbp - 0x10]
test[0x100000e50] <+348>: ret
test[0x100000e51] <+349>: sub    rsp, 0x8
test[0x100000e55] <+353>: call   0x100000e60               ; <+364>
test[0x100000e5a] <+358>: add    rsp, 0x8
test[0x100000e5e] <+362>: jmp    0x100000e68               ; <+372>
test[0x100000e60] <+364>: mov    dword ptr [rbp - 0x8], 0x14d
test[0x100000e67] <+371>: ret
test[0x100000e68] <+372>: sub    rsp, 0x8
test[0x100000e6c] <+376>: call   0x100000e77               ; <+387>
test[0x100000e71] <+381>: add    rsp, 0x8
test[0x100000e75] <+385>: jmp    0x100000e7c               ; <+392>
test[0x100000e77] <+387>: lea    rdx, qword ptr [rbp - 0x14]
test[0x100000e7b] <+391>: ret
test[0x100000e7c] <+392>: sub    rsp, 0x8
test[0x100000e80] <+396>: call   0x100000e8b               ; <+407>
test[0x100000e85] <+401>: add    rsp, 0x8
test[0x100000e89] <+405>: jmp    0x100000e90               ; <+412>
test[0x100000e8b] <+407>: lea    rbx, qword ptr [rbp - 0x18]
test[0x100000e8f] <+411>: ret
test[0x100000e90] <+412>: sub    rsp, 0x8
test[0x100000e94] <+416>: call   0x100000e9f               ; <+427>
test[0x100000e99] <+421>: add    rsp, 0x8
test[0x100000e9d] <+425>: jmp    0x100000ead               ; <+441>
test[0x100000e9f] <+427>: cmp    dword ptr [rbp - 0x8], 0x0
test[0x100000ea3] <+431>: jne    0x100000eac               ; <+440>
test[0x100000ea5] <+433>: mov    dword ptr [rbp - 0x8], 0x29a
test[0x100000eac] <+440>: ret
test[0x100000ead] <+441>: mov    eax, dword ptr [rbp - 0x8]
test[0x100000eb0] <+444>: pop    r15
test[0x100000eb2] <+446>: pop    r14
test[0x100000eb4] <+448>: pop    r13
test[0x100000eb6] <+450>: pop    r12
test[0x100000eb8] <+452>: pop    rbx
test[0x100000eb9] <+453>: mov    rsp, rbp
test[0x100000ebc] <+456>: pop    rbp
test[0x100000ebd] <+457>: ret

@dnadlinger
Copy link
Member Author

See #1030. It would be awesome if somebody could have a look at the remaining issues (the change is not as huge as it looks; one commit simplifies IRScope which leads to a lot of lines in the diff).

@kinke
Copy link
Member

kinke commented Aug 19, 2015

Awesome! Will check it out when I find some time...

@dnadlinger
Copy link
Member Author

The main problem should be fixed now, by making sure that every piece of catch/finally code (including implicit things like dtor calls) is only emitted once.

Please open separate follow-up issues for any specific remaining issues. One thing we might want to do in particular in the future is to fine-tune the cleanup emission such that e.g. trivial dtor calls get inlined directly into the blocks before the branch (mainly to make debug builds faster, LLVM optimizes things quite well in release mode).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants