Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce labeled continue syntax inside a switch expression #8220

Closed
andrewrk opened this issue Mar 12, 2021 · 32 comments
Closed

introduce labeled continue syntax inside a switch expression #8220

andrewrk opened this issue Mar 12, 2021 · 32 comments
Labels
accepted This proposal is planned. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Milestone

Comments

@andrewrk
Copy link
Member

andrewrk commented Mar 12, 2021

Background

The goto keyword was removed in #630. This remains the right call because all the control flow in Zig can be expressed in a better way: continue to goto backwards, and break to goto forwards.

However, in C, there is another concept, called "computed goto". This is described in #5950 and briefly discussed in #2162. This concept is not currently possible in Zig. It is possible to model the desired semantics with existing control flow features quite simply, but it is not possible to obtain the desired machine code, even in optimized builds.

Problem Statement

For example (godbolt link):

const Inst = extern struct {
    tag: Tag,

    const Tag = extern enum {
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
        array_type,
        array_type_sentinel,
        indexable_ptr_len,
    };
};

export fn entry(inst_list_ptr: [*]const Inst, inst_list_len: usize, map: [*]u32) void {
    const inst_list = inst_list_ptr[0..inst_list_len];
    for (inst_list) |inst, i| {
        map[i] = switch (inst.tag) {
            .add => analyze_add(inst),
            .addwrap => analyze_addwrap(inst),
            .alloc => analyze_alloc(inst),
            .alloc_mut => analyze_alloc_mut(inst),
            .alloc_inferred => analyze_alloc_inferred(inst),
            .alloc_inferred_mut => analyze_alloc_inferred_mut(inst),
            .anyframe_type => analyze_anyframe_type(inst),
            .array_cat => analyze_array_cat(inst),
            .array_mul => analyze_array_mul(inst),
            .array_type => analyze_array_type(inst),
            .array_type_sentinel => analyze_array_type_sentinel(inst),
            .indexable_ptr_len => analyze_indexable_ptr_len(inst),
        };
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;
extern fn analyze_array_type(inst: Inst) u32;
extern fn analyze_array_type_sentinel(inst: Inst) u32;
extern fn analyze_indexable_ptr_len(inst: Inst) u32;

In the generated machine code, each prong ends up jumping back to the loop condition, before getting re-dispatched to the next prong:

.LBB0_3:
        xor     edi, edi
        call    analyze_add
        jmp     .LBB0_15
.LBB0_4:
        mov     edi, 1
        call    analyze_addwrap
        jmp     .LBB0_15

The reason this machine code is not what we desire is described in this paper in the section "Direct Threading" and "The Context Problem":

Mispredicted branches pose a serious challenge to modern processors because they threaten to starve the processor of instructions. The problem is that before the destination of the branch is known the execution of the pipeline may run dry. To perform at full speed, modern CPUs need to keep their pipelines full by correctly predicting branch targets.

This problem is even worse for direct call threading and switch dispatch. For these techniques there is only one dispatch branch and so all dispatches share the same BTB entry. Direct call threading will mispredict all dispatches except when the same virtual instruction body is dispatched multiple times consecutively.

They explain it in a nice, intuitive way here:

Another perspective is that the destination of the indirect dispatch branch is unpredictable because its destination is not correlated with the hardware pc. Instead, its destination is correlated to the vPC. We refer to this lack of correlation between the hardware pc and vPC as the context problem. We choose the term context following its use in context sensitive inlining [#!Grove_Chambers_2002!#] because in both cases the context of shared code (in their case methods, in our case virtual instruction bodies) is important to consider.

So the problem statement here is that we want to be able to write zig code that outputs machine code that matches this Direct Threading pattern. In one sense, it is an optimization problem, since we can model the same semantics with other language constructs and other machine code. But in another sense, it is more fundamental than an optimization problem, because Zig is a language that wants to generate optimal machine code, meaning it is possible to write Zig code that generates machine code equivalent or better to what you could write by hand.

In short summary, we want to be able to express zig code where each switch prong jumps directly to the next prong, instead of all switch prongs sharing the same indirect jump, in order to benefit the branch predictor.

Research Dump

Can LLVM Do the Optimization?

In this example (godbolt link), I changed the loop to while(true) and manually inlined the continue expression into each switch prong, with a continue. It does not get much simpler than this; we are practically begging LLVM to do the optimization.

const Inst = extern struct {
    tag: Tag,

    const Tag = extern enum {
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
    };
};

export fn entry(inst_list_ptr: [*]const Inst, inst_list_len: usize, map: [*]u32) void {
    const inst_list = inst_list_ptr[0..inst_list_len];
    var i: usize = 0;
    while (true) {
        const inst = inst_list[i];
        switch (inst.tag) {
            .add => {
                map[i] = analyze_add(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .addwrap => {
                map[i] = analyze_addwrap(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .alloc => {
                map[i] = analyze_alloc(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .alloc_mut => {
                map[i] = analyze_alloc_mut(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .alloc_inferred => {
                map[i] = analyze_alloc_inferred(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .alloc_inferred_mut => {
                map[i] = analyze_alloc_inferred_mut(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .anyframe_type => {
                map[i] = analyze_anyframe_type(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .array_cat => {
                map[i] = analyze_array_cat(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
            .array_mul => {
                map[i] = analyze_array_mul(inst);
                i += 1;
                if (i < inst_list_len) continue;
            },
        }
        break;
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;

Snippet of assembly:

.LBB0_3:
        mov     dword ptr [r14 + 4*rbx], eax
        inc     rbx
        cmp     rbx, r15
        jae     .LBB0_4
.LBB0_1:
        mov     eax, dword ptr [r12 + 4*rbx]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_2:
        xor     edi, edi
        call    analyze_add
        jmp     .LBB0_3
.LBB0_6:
        mov     edi, 2
        call    analyze_alloc
        jmp     .LBB0_3
.LBB0_5:
        mov     edi, 1
        call    analyze_addwrap
        jmp     .LBB0_3
.LBB0_7:
        mov     edi, 3
        call    analyze_alloc_mut
        jmp     .LBB0_3

Here, LLVM actually figured out the continue expression was duplicated N times, and un-inlined it, putting the code back how it was! So crafty.

EDIT: New Discovery

It does not get much simpler than this

Wrong!

After typing up this whole proposal, I realized that I did not try that optimization with using an "end" tag in the above code. Here is the case, modified (godbolt link):

const Inst = extern struct {
    tag: Tag,

    const Tag = extern enum {
        end,
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
    };
};

export fn entry(inst_list: [*]const Inst, map: [*]u32) void {
    var i: usize = 0;
    while (true) {
        const inst = inst_list[i];
        switch (inst.tag) {
            .end => return,
            .add => {
                map[i] = analyze_add(inst);
                i += 1;
                continue;
            },
            .addwrap => {
                map[i] = analyze_addwrap(inst);
                i += 1;
                continue;
            },
            .alloc => {
                map[i] = analyze_alloc(inst);
                i += 1;
                continue;
            },
            .alloc_mut => {
                map[i] = analyze_alloc_mut(inst);
                i += 1;
                continue;
            },
            .alloc_inferred => {
                map[i] = analyze_alloc_inferred(inst);
                i += 1;
                continue;
            },
            .alloc_inferred_mut => {
                map[i] = analyze_alloc_inferred_mut(inst);
                i += 1;
                continue;
            },
            .anyframe_type => {
                map[i] = analyze_anyframe_type(inst);
                i += 1;
                continue;
            },
            .array_cat => {
                map[i] = analyze_array_cat(inst);
                i += 1;
                continue;
            },
            .array_mul => {
                map[i] = analyze_array_mul(inst);
                i += 1;
                continue;
            },
        }
        break;
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;

Two example prongs from the machine code:

.LBB0_2:
        mov     edi, 1
        call    analyze_add
        mov     dword ptr [r14 + rbx], eax
        add     rbx, 4
        mov     eax, dword ptr [r15 + rbx]
        jmp     qword ptr [8*rax + .LJTI0_0]
.LBB0_3:
        mov     edi, 2
        call    analyze_addwrap
        mov     dword ptr [r14 + rbx], eax
        add     rbx, 4
        mov     eax, dword ptr [r15 + rbx]
        jmp     qword ptr [8*rax + .LJTI0_0]

It's perfect! This is exactly what we wanted.

This compromises the entire proposal. I will still post the proposal but this new discovery makes it seem unnecessary, since, in fact, we are hereby observing #2162 already implemented and working inside LLVM.

Real Actual Use Case

Here's one in the self-hosted compiler:

switch (old_inst.tag) {

This switch is inside a loop over ZIR instructions. In optimized builds, we noticed non-trivial amount of time spent in the overhead of this dispatch, when analyzing a recursive comptime fibonacci function call.

This pattern also exists in:

  • The tokenizer
  • The parser
  • astgen
  • sema (this is the linked one above)
  • codegen
  • translate-c
  • zig fmt

(pretty much in every stage of the pipeline)

Other Possible Solution: Tail Calls

Tail calls solve this problem. Each switch prong would return foo() (tail call) and foo() at the end of its business would inline call a function which would do the switch and then tail call the next prong.

This is reasonable in the sense that it is doable right now; however there are some problems:

  • As far as I understand, tail calls don't work on some architectures.
    • (what are these? does anybody know?)
  • I'm also concerned about trying to debug when doing dispatch with tail calls.
  • It forces you to organize your logic into functions. That's another jump that
    maybe you did not want in your hot path.

Proposal

I propose to add continue :label expression syntax, and the ability to label switch expressions. Here is an example:

const Inst = extern struct {
    tag: Tag,

    const Tag = extern enum {
        end,
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
    };
};

export fn entry(inst_list: [*]const Inst, map: [*]u32) void {
    var i: usize = 0;
    sw: switch (inst_list[i].tag) {
        .end => return,
        .add => {
            map[i] = analyze_add(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .addwrap => {
            map[i] = analyze_addwrap(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .alloc => {
            map[i] = analyze_alloc(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .alloc_mut => {
            map[i] = analyze_alloc_mut(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .alloc_inferred => {
            map[i] = analyze_alloc_inferred(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .alloc_inferred_mut => {
            map[i] = analyze_alloc_inferred_mut(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .anyframe_type => {
            map[i] = analyze_anyframe_type(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .array_cat => {
            map[i] = analyze_array_cat(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
        .array_mul => {
            map[i] = analyze_array_mul(inst_list[i]);
            i += 1;
            continue :sw inst_list[i].tag;
        },
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;

The new labeled continue syntax is syntactically unambiguous at a glance that it jumps to a switch expression, because it is the only form where continue accepts an operand. More details:

  • labeled continue with an operand on a loop would be a compile error
  • labeled break with a switch would be OK.

How to Lower this to LLVM

Note: I wrote this section before the EDIT: New Discovery section.

One idea I had was to put the switchbr instruction inside each prong. I did some LLVM IR surgery to try out this idea (godbolt link):

SwitchProngAdd:                                     ; preds = %WhileBody
  %9 = load i64, i64* %i, align 8
  %10 = load i32*, i32** %map, align 8
  %11 = getelementptr inbounds i32, i32* %10, i64 %9
  %12 = bitcast %Inst* %inst to i32*
  %13 = load i32, i32* %12, align 4
  %14 = call i32 @analyze_add(i32 %13)
  store i32 %14, i32* %11, align 4
  %15 = load i64, i64* %i, align 8
  %16 = add nuw i64 %15, 1
  store i64 %16, i64* %i, align 8
  %17 = load i64, i64* %i, align 8
  %18 = load %Inst*, %Inst** %inst_list, align 8
  %19 = getelementptr inbounds %Inst, %Inst* %18, i64 %17
  %20 = getelementptr inbounds %Inst, %Inst* %19, i32 0, i32 0
  %a20 = load i32, i32* %20, align 4
  switch i32 %a20, label %SwitchElse18 [
    i32 0, label %SwitchProngEnd
    i32 1, label %SwitchProngAdd
    i32 2, label %SwitchProngAddWrap
    i32 3, label %SwitchProngAlloc
    i32 4, label %SwitchProngAllocMut
    i32 5, label %SwitchProngAllocInferred
    i32 6, label %SwitchProngAllocInferredMut
    i32 7, label %SwitchProngAnyframeType
    i32 8, label %SwitchProngArrayCat
    i32 9, label %SwitchProngArrayMul
  ]

SwitchProngAddWrap:                                     ; preds = %WhileBody
  %21 = load i64, i64* %i, align 8
  %22 = load i32*, i32** %map, align 8
  %23 = getelementptr inbounds i32, i32* %22, i64 %21
  %24 = bitcast %Inst* %inst to i32*
  %25 = load i32, i32* %24, align 4
  %26 = call i32 @analyze_addwrap(i32 %25)
  store i32 %26, i32* %23, align 4
  %27 = load i64, i64* %i, align 8
  %28 = add nuw i64 %27, 1
  store i64 %28, i64* %i, align 8
  %29 = load i64, i64* %i, align 8
  %30 = load %Inst*, %Inst** %inst_list, align 8
  %31 = getelementptr inbounds %Inst, %Inst* %30, i64 %29
  %32 = getelementptr inbounds %Inst, %Inst* %31, i32 0, i32 0
  %a32 = load i32, i32* %32, align 4
  switch i32 %a32, label %SwitchElse18 [
    i32 0, label %SwitchProngEnd
    i32 1, label %SwitchProngAdd
    i32 2, label %SwitchProngAddWrap
    i32 3, label %SwitchProngAlloc
    i32 4, label %SwitchProngAllocMut
    i32 5, label %SwitchProngAllocInferred
    i32 6, label %SwitchProngAllocInferredMut
    i32 7, label %SwitchProngAnyframeType
    i32 8, label %SwitchProngArrayCat
    i32 9, label %SwitchProngArrayMul
  ]

The machine code for the prongs looks like this:

<snip>
.LBB0_8:                                # %SwitchProngAnyframeType
        mov     edi, r12d
        call    analyze_anyframe_type
        mov     dword ptr [r14 + 4*rbx], eax
        mov     eax, dword ptr [r15 + 4*rbx + 4]
        add     rbx, 1
        jmp     qword ptr [8*rax + .LJTI0_7]
.LBB0_9:                                # %SwitchProngArrayCat
        mov     edi, r12d
        call    analyze_array_cat
        mov     dword ptr [r14 + 4*rbx], eax
        mov     eax, dword ptr [r15 + 4*rbx + 4]
        add     rbx, 1
        jmp     qword ptr [8*rax + .LJTI0_8]
.LBB0_10:                               # %SwitchProngArrayMul
        mov     edi, r12d
        call    analyze_array_mul
        mov     dword ptr [r14 + 4*rbx], eax
        mov     eax, dword ptr [r15 + 4*rbx + 4]
        add     rbx, 1
        jmp     qword ptr [8*rax + .LJTI0_9]
<snip>

Pretty nice. This is exactly what we want - there is an indirect jump in each prong directly to the next prong. But the problem is that even though we should have the same jump table 9 times, LLVM duplicates the jump table 9 times:

.LJTI0_0:
        .quad   .LBB0_1
        .quad   .LBB0_2
        .quad   .LBB0_3
        .quad   .LBB0_4
        .quad   .LBB0_5
        .quad   .LBB0_6
        .quad   .LBB0_7
        .quad   .LBB0_8
        .quad   .LBB0_9
        .quad   .LBB0_10
.LJTI0_1:
        .quad   .LBB0_1
        .quad   .LBB0_2
        .quad   .LBB0_3
        .quad   .LBB0_4
        .quad   .LBB0_5
        .quad   .LBB0_6
        .quad   .LBB0_7
        .quad   .LBB0_8
        .quad   .LBB0_9
        .quad   .LBB0_10
<snip>

The duplicated jump tables are problematic, because in reality there could reasonably be about 150-200 instruction tags, which makes the jump table 600-800 bytes. This is fine; for example my L1 cache size is 256 KiB. But I wouldn't want to multiply that jump table by 200! It would be 156 KiB just for the jump tables alone. That would wreak havoc on the cache.

Unless this improves upstream, the best strategy to lower this language feature will be for Zig to manually create the jump table itself instead of relying on LLVM to do it, using LLVM's ability to take the address of basic blocks and put them into an array. This will essentially generate the same code that you would get in Clang if you used computed goto in the traditional way.

How to Lower this in Self-Hosted Backends

We have lots of options here. It would be quite straightforward, since we have full control over AIR, as well as the backend code generation.

OK But Is The Perf Actually Good?

I don't know. I think realistically in order to benchmark this and find out if the machine code performs better we have to implement it first.

@andrewrk andrewrk added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Mar 12, 2021
@andrewrk andrewrk added this to the 0.8.0 milestone Mar 12, 2021
@andrewrk
Copy link
Member Author

EDIT: New Discovery

And here it is in idiomatic zig form, no duplicated code or anything:

const Inst = extern struct {
    tag: Tag,

    const Tag = extern enum {
        end,
        add,
        addwrap,
        alloc,
        alloc_mut,
        alloc_inferred,
        alloc_inferred_mut,
        anyframe_type,
        array_cat,
        array_mul,
    };
};

export fn entry(inst_list: [*]const Inst, map: [*]u32) void {
    var i: usize = 0;
    while (true) : (i += 1) {
        const inst = inst_list[i];
        map[i] = switch (inst.tag) {
            .end => return,
            .add => analyze_add(inst),
            .addwrap => analyze_addwrap(inst),
            .alloc => analyze_alloc(inst),
            .alloc_mut => analyze_alloc_mut(inst),
            .alloc_inferred => analyze_alloc_inferred(inst),
            .alloc_inferred_mut => analyze_alloc_inferred_mut(inst),
            .anyframe_type => analyze_anyframe_type(inst),
            .array_cat => analyze_array_cat(inst),
            .array_mul => analyze_array_mul(inst),
        };
    }
}

extern fn analyze_add(inst: Inst) u32;
extern fn analyze_addwrap(inst: Inst) u32;
extern fn analyze_alloc(inst: Inst) u32;
extern fn analyze_alloc_mut(inst: Inst) u32;
extern fn analyze_alloc_inferred(inst: Inst) u32;
extern fn analyze_alloc_inferred_mut(inst: Inst) u32;
extern fn analyze_anyframe_type(inst: Inst) u32;
extern fn analyze_array_cat(inst: Inst) u32;
extern fn analyze_array_mul(inst: Inst) u32;

It generates the ideal machine code.

@sighoya
Copy link

sighoya commented Mar 13, 2021

@andrewrk

Did you consider indirectbr?

@Techcable
Copy link
Contributor

Techcable commented Mar 15, 2021

Interestingly enough, C# has this feature with its goto case statement.

Although the use cases are limited to certian situations. In those cases it can make a big difference. Interpreters and VMs are definitely a target audience for Zig, and low level code has plenty of other state machines where these sorts of dispatch tables might need this optimization.

As someone expressed in #2162, sometimes computed goto can be slower than a regular switch statement in a loop. It would be odd to have performance pitfalls around these sorts of switch statements. I think Zig usually likes to make these performance choices explicit.

On the other hand, this is generally considered a must-have optimization for fast interpreter implementations. In CPython it gives about a 15-20% speedup.

@sighoya

Did you consider indirectbr?

That is exactly how clang implements it :) In fact, this feature is the reason indirectbr was added to LLVM in the first place

@andrewrk
Copy link
Member Author

So I want to use this in semantic analysis of Zig self-hosted compiler. Here is:

LLVM does not figure out how to generate the machine code that I want, and I suspect it would be a perf improvement. My plan is to implement this in order to benchmark it, and then we can make a decision, armed with some data.

@andrewrk
Copy link
Member Author

Check out https://dino-lang.github.io/learn/dino.pdf, section 3.1 "Byte Code Dispatch"

@Mouvedia
Copy link

Mouvedia commented Mar 26, 2021

As prior art I found PowerShell which uses break—instead of continue in this proposal—to jump to the switch's label.

@haberman
Copy link

An argument for tail calls

Based on my experience with https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html, I would strongly recommend offering guaranteed tail calls as one tool available to interpreter writers (I'm not arguing against other options).

Getting the replicated decode/dispatch (as described above) is important, but it's only one part of the story. We also have to worry about the code quality for the operation itself -- the code that actually does the work.

Optimizers struggle to generate good code when the entire interpreter is in one big function. In particular, they struggle to keep the most important state in registers. Mike Pall talked about this problem here: http://lua-users.org/lists/lua-l/2011-02/msg00742.html

If each operation is in a separate function, with the most important state propagated in arguments (passed in registers on modern architectures), and each operation tail calling the next, we can get much better code out of the compiler. With this pattern, there is no switch() at all, just a call into a table of function pointers.

Addressing open questions on Tail Calls

To answer the questions above:

As far as I understand, tail calls don't work on some architectures.
(what are these? does anybody know?)

For LLVM, musttail currently appears to be totally unsupported on powerpc (32 and 64). On ARM32 it runs into problems in some cases. There is a bit more information about this here: https://gcc.gnu.org/pipermail/gcc/2021-April/235903.html

I don't know if these problems are fundamental or if they just need proper bug fixes.

I'm also concerned about trying to debug when doing dispatch with tail calls.

If you mean stack traces, I don't think tail calls leave any less of a stack trace than a traditional switch()-based dispatch. In both cases the stack tells you where you are, but not where you've been.

It forces you to organize your logic into functions. That's another jump that
maybe you did not want in your hot path.

I think efficiency argues for tail calls, not against them. The tail call pattern does not create any unnecessary jumps. I'll elaborate a bit on this point.

Tail calls generate code comparable to hand-written assembly

Take this example from the article of implementing the "ADDVN" operation from LuaJIT, which aims to match this hand-written assembly code.

The C compiler output almost exactly matches the hand-written assembly. It does not contain any jumps except the jump to the next operation, which is inherently necessary.

When you are implementing byte code dispatch, you have a fundamental choice of whether to use replicated dispatch or shared dispatch. There are tradeoffs here, and the LuaJIT source discusses these tradeoffs a bit).

Tail calls can support both options very naturally. All we have to do is flip the dispatch() function between noinline and always_inline: here is an example. For both patterns, the resulting code is comparable to hand-written assembly: it does not contain any extraneous or unnecessary jumps.

Caveats

I should mention one significant downside to this pattern: callee-save registers are not available to these functions without spilling them to the stack first. This means, in effect, that if we want the fastest code we can only use about half the registers. We also pay a steep price when making any non-tail call.

Both of these problems can be solved if you also offer some specialized calling conventions (these are mostly supported in LLVM already). I can go into this more if you are interested. With the right calling conventions, I believe this pattern can generate code that even the most talented assembly language programmers would have a hard time beating.

@ifreund
Copy link
Member

ifreund commented Apr 30, 2021

@haberman note that zig already allows forcing tail calls with the @call() builtin: https://ziglang.org/documentation/master/#call

@haberman
Copy link

@ifreund that is great news, thanks for the info. :)

@andrewrk
Copy link
Member Author

andrewrk commented Sep 15, 2021

I'm inclined to accept this proposal. Regardless of performance improvements, this provides a useful, intuitive tool for control flow. Another observation is that when the operand of continue :sw operand is comptime-known, this would be lowered as a direct jump to another switch prong, effectively providing the desired control flow requested in #5950.

This even would allow someone to implement Duff's Device in zig:

fn send(to: *volatile u32, from_start: [*]u32, count: u32) void {
    var from = from_start;
    var n = (count + 7) / 8;
    sw: switch (count % 8) {
        0 => {
            to.* = from[0];
            from += 1;
            continue :sw 7;
        },
        7 => {
            to.* = from[0];
            from += 1;
            continue :sw 6;
        },
        6 => {
            to.* = from[0];
            from += 1;
            continue :sw 5;
        },
        5 => {
            to.* = from[0];
            from += 1;
            continue :sw 4;
        },
        4 => {
            to.* = from[0];
            from += 1;
            continue :sw 3;
        },
        3 => {
            to.* = from[0];
            from += 1;
            continue :sw 2;
        },
        2 => {
            to.* = from[0];
            from += 1;
            continue :sw 1;
        },
        1 => {
            to.* = from[0];
            from += 1;
            n -= 1;
            if (n > 0) continue :sw 0;
        },
    }
}

Reference example in C:

send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}

With #7224 it could even be shortened to this:

fn send(to: *volatile u32, from_start: [*]u32, count: u32) void {
    var from = from_start;
    var n = (count + 7) / 8;
    sw: switch (count % 8) {
        inline 0...7 => |remainder| {
            to.* = from[0];
            from += 1;
            if (remainder == 1) {
                n -= 1;
                if (n <= 0) break :sw;
            }
            continue :sw @as(u3, remainder) -% 1;
        },
    }
}

What's interesting about this is that it lowers to the same machine code as the C version, even in debug mode.

@Techcable
Copy link
Contributor

Techcable commented Jun 2, 2022

I am working on this.

I already have modified parsing, AST, and formatting to support this (both stage1 & stage2).

Example interpreter

The following parses sucessfully: https://gist.github.com/Techcable/dbcd7f7c3a708aa71d86031a1105db9c

In particular, the core eval loop has no syntax errors:

fn eval(ctx: *InsnCtx) Value {
    // decode first oparg
    ctx.oparg = ctx.ip[0].oparg;
    while (true) {
        evalLoop: switch (ctx.ip[0].opcode) {
            .LOAD_IMM => {
                load_imm(ctx).verify_continue();
                continue :evalLoop ctx.ip[0].opcode;
            },
            .POP => {
                pop(ctx).verify_continue();
                continue :evalLoop ctx.ip[0].opcode;
            },
            .ADD, .SUB, .MUL => {
                arithop(ctx).verify_continue();
                continue :evalLoop ctx.ip[0].opcode;
            },
            .PRINT => {
                print(ctx).verify_continue();
                continue :evalLoop ctx.ip[0].opcode;
            },
            .RETURN => {
                const done = @"return"(ctx);
                if (done.return_value) |value| {
                    return value;
                } else {
                    unreachable;
                }
            },
        }
    }
}

Of course running any of this through semantic analysis (stage1) gives an error "label not found: 'evalLoop'"

But this is progress!

@hollmmax
Copy link
Contributor

I think I understand the motivation behind this change, but I'd like to raise a concern I have with it.
With fall-through disallowed, switch-case is a linear control flow construct. A switch is entered, one of its prongs gets executed, the switch is left.
Allowing jumping to random prongs from anywhere within the switch increases its control flow capabilities significantly. The switch is no longer a simple construct. If I see a labeled switch, I can't be sure it's not a loop until I read all its prongs. A switch inside a loop is much clearer about its control flow.
I would be happier if instead the original motivation was implemented as a guaranteed optimization or forcible through some attributes. Duff's device would IMO be better served with explicit fall-through, although the order of cases would then become semantically relevant.
Maybe I should just setup my highlighter to detect label: switch with continue :label _ inside a different color...

mlugg added a commit to mlugg/zig that referenced this issue Sep 1, 2024
This commit introduces a new AIR instruction, `repeat`, which causes
control flow to move back to the start of a given AIR loop. `loop`
instructions will no longer automatically perform this operation after
control flow reaches the end of the body.

The motivation for making this change now was really just consistency
with the upcoming implementation of ziglang#8220: it wouldn't make sense to
have this feature work significantly differently. However, there were
already some TODOs kicking around which wanted this feature. It's useful
for two key reasons:

* It allows loops over AIR instruction bodies to loop precisely until
  they reach a `noreturn` instruction. This allows for tail calling a
  few things, and avoiding a range check on each iteration of a hot
  path, plus gives a nice assertion that validates AIR structure a
  little. This is a very minor benefit, which this commit does apply to
  the LLVM and C backends.

* It should allow for more compact ZIR and AIR to be emitted by having
  AstGen emit `repeat` instructions more often rather than having
  `continue` statements `break` to a `block` which is *followed* by a
  `repeat`. This is done in status quo because `repeat` instructions
  only ever cause the direct parent block to repeat. Now that AIR is
  more flexible, this flexibility can be pretty trivially extended to
  ZIR, and we can then emit better ZIR. This commit does not implement
  this.

Support for this feature is currently regressed on all self-hosted
native backends, including x86_64. This support will be added where
necessary before this branch is merged.
@andrewrk
Copy link
Member Author

andrewrk commented Sep 5, 2024

Landed in 3929cac

@andrewrk andrewrk closed this as completed Sep 5, 2024
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 7, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 7, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 7, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 9, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 9, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 9, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 9, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 9, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 9, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 9, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
LiterallyVoid added a commit to LiterallyVoid/zig that referenced this issue Sep 11, 2024
This feature was proposed in ziglang#8220, and implemented in ziglang#21257.
LiterallyVoid added a commit to LiterallyVoid/zig that referenced this issue Sep 11, 2024
This feature was proposed in ziglang#8220, and implemented in ziglang#21257.
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 11, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 11, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
matklad added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 12, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
andrewrk added a commit that referenced this issue Sep 13, 2024
Add langref docs for labeled switch

This feature was proposed in #8220, and implemented in #21257.

Co-authored-by: Andrew Kelley <[email protected]>
DivergentClouds pushed a commit to DivergentClouds/zig that referenced this issue Sep 24, 2024
Add langref docs for labeled switch

This feature was proposed in ziglang#8220, and implemented in ziglang#21257.

Co-authored-by: Andrew Kelley <[email protected]>
richerfu pushed a commit to richerfu/zig that referenced this issue Oct 28, 2024
This commit modifies the representation of the AIR `switch_br`
instruction to represent ranges in cases. Previously, Sema emitted
different AIR in the case of a range, where the `else` branch of the
`switch_br` contained a simple `cond_br` for each such case which did a
simple range check (`x > a and x < b`). Not only does this add
complexity to Sema, which we would like to minimize, but it also gets in
the way of the implementation of ziglang#8220. That proposal turns certain
`switch` statements into a looping construct, and for optimization
purposes, we want to lower this to AIR fairly directly (i.e. without
involving a `loop` instruction). That means we would ideally like a
single instruction to represent the entire `switch` statement, so that
we can dispatch back to it with a different operand as in ziglang#8220. This is
not really possible to do correctly under the status quo system.

This commit implements lowering of this new `switch_br` usage in the
LLVM and C backends. The C backend just turns any case containing ranges
entirely into conditionals, as before. The LLVM backend is a little
smarter, and puts scalar items into the `switch` instruction, only using
conditionals for the range cases (which direct to the same bb). All
remaining self-hosted backends are temporarily regressed in the presence
of switch range cases. This functionality will be restored for at least
the x86_64 backend before merge.
richerfu pushed a commit to richerfu/zig that referenced this issue Oct 28, 2024
This commit introduces a new AIR instruction, `repeat`, which causes
control flow to move back to the start of a given AIR loop. `loop`
instructions will no longer automatically perform this operation after
control flow reaches the end of the body.

The motivation for making this change now was really just consistency
with the upcoming implementation of ziglang#8220: it wouldn't make sense to
have this feature work significantly differently. However, there were
already some TODOs kicking around which wanted this feature. It's useful
for two key reasons:

* It allows loops over AIR instruction bodies to loop precisely until
  they reach a `noreturn` instruction. This allows for tail calling a
  few things, and avoiding a range check on each iteration of a hot
  path, plus gives a nice assertion that validates AIR structure a
  little. This is a very minor benefit, which this commit does apply to
  the LLVM and C backends.

* It should allow for more compact ZIR and AIR to be emitted by having
  AstGen emit `repeat` instructions more often rather than having
  `continue` statements `break` to a `block` which is *followed* by a
  `repeat`. This is done in status quo because `repeat` instructions
  only ever cause the direct parent block to repeat. Now that AIR is
  more flexible, this flexibility can be pretty trivially extended to
  ZIR, and we can then emit better ZIR. This commit does not implement
  this.

Support for this feature is currently regressed on all self-hosted
native backends, including x86_64. This support will be added where
necessary before this branch is merged.
richerfu pushed a commit to richerfu/zig that referenced this issue Oct 28, 2024
Add langref docs for labeled switch

This feature was proposed in ziglang#8220, and implemented in ziglang#21257.

Co-authored-by: Andrew Kelley <[email protected]>
kprotty pushed a commit to tigerbeetle/tigerbeetle that referenced this issue Nov 12, 2024
Rewrite commit_dispatch to more closely resemble straight line  code.

 # Motivation

While I like the general idea of commit_dispatch, it feels like our
current implementation isn't quite as obvious as it needs to be:

* Each stage is responsible for calling the next stage (you pass an
  argument to .commit_dispatch), which makes it hard to know in which
  order things actually happen. They sort-of happen in the order of the
  switch prongs in the dispatch, but it's not immediately clear if we
  are not being clever and skipping some of the stages sometimes.
* I always have trouble finding the place where we _actually_ execute
  things. The entry point is `setup_client_reply_callback`, which is the
  least likely springboard for the actual execution!
* There's some confusion about "sometimes asynchronous" stages, up to
  the point where we get unexpected-reentracy _despite_ having an
  explicit comment in the source code about _not_ doing that.

 # Implementation

The new approach is an attempt to do the most direct thing possible. If
programming were actually easy, we'd write our commit path as, roughly

```
loop {
  prefetch()
  execute()
  compact()
}
```

The problem is that programming is, in fact, not easy, and the principle
difficulty here is that most of parts of the loop are asynchronous. Were
we using a language with async-await, we would have written:

```
loop {
  prefetch().await
  execute().await
  compact().await
}
```

We don't have async-await, but we can express the same logic manually.
`prefetch` being async means that `prefetch` might not actually complete
immediately, and instead just schedule some background work. Which means
that we need to "pause" our commit work and return to it later.

Pausing is easily achieved by early returning:

```
loop {
  if (prefetch() == .pending) return
  if (execute() == .pending) return
  if (compact() == .pending) return
}
```

Resumption is trickier, as we should be able to restart the loop from
the middle. If we were writing in C, we could have done something like
this:

```
switch(state) {
  loop {
case prefetch:
    if (prefetch() == .pending) return
case execute:
    if (execute() == .pending) return
case compact:
    if (compact() == .pending) return
  }
}
```

Luckily, Zig doesn't allow this kind of convoluted code (though I remain
intrigued by the possibilities opened by the recently implemented
ziglang/zig#8220).

But we can simulate it easily enough by guarding each prong of the
switch with an if:

```
loop {
  if (state == .init) {
    state == .prefetch
    if (prefetch() == .pending) return
  }
  if (state == .prefetch) {
    state == .execute
    if (execute() == .pending) return
  }
  if (state == .execute) {
    state == .compact
    if (compact() == .pending) return
  }
}
```

Which is exactly what are we doing here!

Specifically:

* `commit_dispatch` is the main commit loop. It is an asynchronous loop
  committing everything we can commit.
* `commit_dispatch_enter` enters the loop from the outside.
* if the loop schedules an asynchronous operation, it's callback should
  invoke `commit_dispatch_resume` without arguments, which will pick up
  the loop where we've left.
* All the "parts" of the `commit_dispatch` loop are right after it in
  the source code order and are named `commit_stage`. All callbacks are
  named `commit_stage_callback`

One interesting side-effect here is that the new structure will be able
to accommodate our "optimistic prefetch" idea, where we _assume_ that
everything is in cache and start commit _without_ prefetch, bailing out
if we have a missing entry. The new loop works even if the entire commit
path turns out to be synchronous!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted This proposal is planned. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Projects
None yet
Development

Successfully merging a pull request may close this issue.