Primer on low-level stack and control-flow primitives #105
I think stack walking is a pretty interesting primitive, and would generally be useful to add to WebAssembly. There's a lot of functionality described here, so we'd want to think about which features would need to be in a first version. It seems that the basic functionality does overlap with the current exception-handling proposal, so I'm curious whether this primitive is intended to replace or complement what is currently specified in the exception-handling repo. I also wonder whether the initial primitives (…) would make sense as a first batch. Also, I didn't see much mention of how these primitives interact with host frames. Can you go into that a little bit?
I've talked to Ross about this separately. I think this idea is really promising, and could enable garbage collectors written in WebAssembly to perform reasonably well. I worked on a similar idea in the context of education several years ago: http://blog.brownplt.org/2013/02/19/teaching-gc.html In this project, the goal was to allow students to write garbage collection algorithms in Scheme (now Racket), for Scheme programs, without the need to understand low-level details of the machine and compiler. So, we came up with a high-level API for walking the stack and storing values in linear memory. CC @shriram
Definitely. I like to work out a detailed roadmap and then carve out waypoints. Certainly I would not expect first-class stacks to be part of a first version. Stack-allocated closures I could see going either way, though I suspect they'd be pushed to a second stage. The "initial" primitives you mention seem to me like the likely first batch.
@arjunguha actually first had the idea for these primitives months ago in order to support custom garbage collection, such as through my (…). But those are all examples listed above. Another common application of stack marks is for implementing dynamically scoped variables (e.g. for LaTeX, bash, and Emacs Lisp). You can also use stack-allocated closures (or simulate stack-allocated closures with stack walking) to implement C#'s (…).
There are actually a lot of interesting opportunities here. I can imagine a (…).
There definitely is overlap. Except for the allocation and tag-inspection of (…).
Stack walking only observes (aside from side channels like timing) the specified stack marks. This means it skips over stack marks it does not understand and stack frames (like host frames) without the specified stack marks. Of course, the host is the one implementing stack walking, so it can have its own behind-the-scenes handlers for when focus goes up and down the stack, and likewise for when a stack is unwound.
Just wanted to chime in and give my two cents on what aspects of this are of interest to at least one language implementation. For context, I'm working on an ahead-of-time compiler for Erlang (and really BEAM languages in general that are able to compile to Erlang AST or Core representations). Wasm is really what motivated the project originally (and still is), but our current implementation targets x86_64 while we figure out the best approach to adding support for Wasm. There are a few features whose common implementation techniques (on other architectures) are not easily lowered to Wasm:

For my own purposes, non-local control flow and green threads are of primary interest, and are the areas that present the most difficulty in lowering to Wasm versus other architectures. For exceptions, it's acceptable in my case to use C++-style exceptions, even though that punishes Erlang programs a bit, since exceptions are used quite frequently for control flow (Erlang distinguishes between its three main "types" of exceptions: errors, exits, and throws; throws are used almost exclusively for control flow).

On the other hand, creating green threads, suspending/yielding them, and scheduling are of critical importance. Because the scheduler is part of the runtime system, changes in how green threads are represented mean changing the scheduler, so it's extremely desirable to keep differences in representation as non-invasive as possible. The current way we lower green threads on non-Wasm architectures is by creating independent stacks, and suspending/resuming them using assembly at various pre-determined yield points in the compiler-generated code.

Ideally then, Wasm would have primitives to create stacks, destroy them, and swap between them (or more precisely, suspend execution on one stack and resume on another). This would include treating the "original" stack as just another stack (sans destroy), since the scheduler is really just another green thread, albeit a privileged one. Within the context of a single stack, either being able to use stack stepping/unwinding primitives, or being able to save/restore a point in the stack, would provide a straightforward way of implementing non-local control flow.

Put together, this set of primitives would only require relatively minor adaptations in both the compiler backend and runtime system to support Wasm as a new target, and I suspect that applies to just about any language that has a similar set of requirements.

The only other problem is understanding how these features interact with debuggers and other such tooling. On x86_64, sticking to conventions and making use of CFI directives can at least make sure compiled programs play nicely with a debugger; but it's not clear to me how that translates to Wasm.
Thanks @bitwalker for engaging and for the detailed description of how this might help with Erlang (a particularly useful example since it wasn't one I had in mind when putting this together)! Would you mind educating us a bit more on Erlang's exceptions and either how they differ from C++ exceptions or how they would be aided by these primitives? Admittedly I know little about Erlang exceptions.
@RossTate Sure! C++ exceptions, or to be more specific, the zero-cost exception handling model specified by the Itanium ABI, add no runtime overhead when not used, but are relatively expensive to raise. That expense comes from unwinding the stack frame by frame, executing search and cleanup code in each frame with a landing pad. The original Wasm exception handling proposal mostly follows this model, though if I recall correctly, it doesn't explicitly support the two-phase approach described by the Itanium ABI, as it doesn't do the cleanup phase, but otherwise is more or less the same. Erlang has three kinds of "errors":
As implemented in the BEAM virtual machine, all three use the same basic mechanism: the VM searches up the stack for the nearest catch pointer (which is a pointer to the instruction where control will resume); if one is found, it jumps to the instruction indicated by the pointer and resumes execution. If no catch pointer is found and the start of the process's stack is reached, the throw is converted to a `nocatch` error.

In short, Erlang exceptions don't need to unwind and potentially execute code in every frame along the way; instead control is transferred directly to the instruction indicated by the catch pointer, only needing to search the stack for that catch pointer if it isn't already on hand. The result is really cheap exceptions. Using them for control flow comes with very little overhead.

Where things get tricky with Erlang, though, is in the implementation of processes (green threads). Specifically, transferring control from yield points in the generated code to the scheduler, which then switches to another process and resumes control where that process left off (or, in the case of a newly spawned process, its entry point). As it stands today, I'm not even sure how to do this in a way that doesn't require a completely different code generation approach and a separate scheduler implementation. For example, on the codegen side, one could convert from direct style to continuation-passing style, split functions into many small ones at yield points and pass the next function to call as the continuation, then "yield" by unwinding all the way back to the scheduler, and resume execution again by calling the continuation function. While viable, it is entirely different to how codegen works on other platforms, and differs significantly in how generated code integrates with the scheduler.
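Since the catch-pointer scheme may be unfamiliar, here is a minimal Python sketch of the control-transfer semantics described above. The names `catch_`, `throw_`, and `_Throw` are my own; note that Python still unwinds frame by frame under the hood, so this models only the observable behavior (control jumps straight to the nearest catch), not the cost:

```python
class _Throw(Exception):
    """Carrier for a thrown value; the real BEAM jumps directly to a catch pointer."""
    def __init__(self, value):
        self.value = value

catch_depth = 0  # how many catch pointers are currently live on the "stack"

def catch_(body):
    """Erlang-style catch: run body; a throw_ anywhere below resumes here."""
    global catch_depth
    catch_depth += 1
    try:
        return body()
    except _Throw as t:
        return ("caught", t.value)
    finally:
        catch_depth -= 1

def throw_(value):
    """Transfer control to the nearest catch pointer; no per-frame handler code runs."""
    if catch_depth == 0:
        raise RuntimeError("nocatch")  # an uncaught throw becomes an error
    raise _Throw(value)

def deep(n):
    # many intermediate frames; none of them run any cleanup or search code
    return deep(n - 1) if n else throw_("done")
```

For example, `catch_(lambda: deep(10))` returns `("caught", "done")`, with control transferred straight past the ten intermediate frames.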
That's a pretty significant hurdle in my case, but I think it applies generally to any language implementation facing similar constraints, and which aims to support Wasm as a target, but isn't the only target. I think it may be easiest to illustrate the fundamental primitives that I think solve the above problems by expressing them as functions (NOTE: these are not concrete proposals, just sketches):

```rust
// An opaque value that represents a continuation handle; can be passed around, stored, etc.
extern type Cont;

// An opaque value that represents a handle to a stack; can be passed around, stored, etc.
extern type Stack;

// An entry function takes no arguments and never returns.
type Entry = fn() -> !;

/// Captures the current continuation and returns a handle to it.
///
/// Can be called with `call_cc`, but only once - subsequent attempts to
/// call the continuation would trap.
///
/// Continuations are associated with the stack they are captured on.
fn capture_cc() -> Cont;

/// Transfers control to the captured continuation; this "consumes" the continuation.
///
/// Trying to call a continuation twice will trap.
/// Trying to call a continuation on a stack other than the one it was captured on will trap.
fn call_cc(cont: Cont) -> !;

/// Creates a new stack, with a given size, and entry point.
///
/// The entry point is a function pointer which will be used as the continuation
/// when control is transferred to the stack for the first time.
///
/// Returns a handle to the newly created stack.
fn create_stack(size: usize, entry: Entry) -> Stack;

/// Destroys a stack created via `create_stack`; traps if the given stack is the current stack.
fn destroy_stack(stack: Stack);

/// Captures the current continuation to the current stack,
/// swaps the current stack with the new stack, and resumes
/// execution on the new stack using the continuation stored on the new stack.
fn swap_stack(new_stack: Stack) -> !;
```

The main purpose of the (…). The other three primitives (`create_stack`, `destroy_stack`, `swap_stack`) are designed to play well with continuations, in such a way that they are safe to use and avoid exposing too much low-level detail, while (I believe) still allowing you to implement all of the interesting things you'd want first-class stacks for. It may make sense for these to be more granular, but I think keeping the interface smaller provides more flexibility to host implementations without damaging their usefulness. I think there is value in some of the stack walking/etc. primitives discussed already in this thread, so I'm not rehashing them here.

Anyway, sorry this turned into a huge brain dump! I want to avoid making arguments based too much on the needs of my compiler, but hopefully knowing the specific problems I'm facing provides some useful information for the discussion :)
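To make the intended use concrete, here is a small Python simulation of how a green-thread scheduler would sit on top of these primitives (my own sketch, not part of the proposal): Python generators stand in for independent stacks, `yield` plays the role of `swap_stack` back to the scheduler, and generator exhaustion plays the role of the entry function finishing:

```python
from collections import deque

def run(entries):
    """Round-robin scheduler: the scheduler itself is just the 'privileged' stack."""
    ready = deque(entry() for entry in entries)  # create_stack for each entry
    while ready:
        stack = ready.popleft()
        try:
            next(stack)            # swap_stack: resume execution on this stack
            ready.append(stack)    # it yielded back to us; reschedule it
        except StopIteration:
            pass                   # entry finished: destroy_stack

log = []

def worker(name, steps):
    def entry():
        for i in range(steps):
            log.append((name, i))
            yield                  # a compiler-inserted yield point
    return entry

run([worker("a", 2), worker("b", 3)])
```

Running this interleaves the two workers round-robin: `log` ends up as `[("a", 0), ("b", 0), ("a", 1), ("b", 1), ("b", 2)]`.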
Thanks! That was useful. Definitely don't apologize for the brain dump; it's the sort of thing we want a peek into. Now I want to speak to just a couple points that particularly stood out to me. One is that, for Erlang, it is very useful to (…). Another point I should make is that the primitives above do not support (…).
Yeah my take is that you definitely don't want to assume very much at all about the exception model - i.e., don't assume C++-style exceptions, or require that a given language build their exceptions on top of them.
Yeah, definitely, my expectation would be that the primitives here are not operating on the host stack, but the Wasm stack. I would also expect that by default, a stack is created that your program runs on, so it isn't necessary to do anything by default, but as soon as you create a second stack, you can switch between them (this implies a (…)).

Another thing I think may be of interest: LLVM has an interesting set of primitives that I think could be worth comparing/contrasting against, and my understanding is that they are intended specifically to model most kinds of structured exceptions. Check out the link, but to summarize here:
Here's an example from the docs for how C++-style exceptions would use these primitives:

```cpp
// Source code we're lowering
struct Cleanup {
  Cleanup();
  ~Cleanup();
  int m;
};

void may_throw();

int f() noexcept {
  try {
    Cleanup obj;
    may_throw();
  } catch (int e) {
    may_throw();
    return e;
  }
  return 0;
}
```

```llvm
define i32 @f() nounwind personality i32 (...)* @__cxx_personality {
entry:
  %obj = alloca %struct.Cleanup, align 4
  %e = alloca i32, align 4
  %call = invoke %struct.Cleanup* @"Cleanup"(%struct.Cleanup* nonnull %obj)
          to label %invoke.cont unwind label %lpad.catch

invoke.cont:                          ; preds = %entry
  invoke void @"may_throw"()
          to label %invoke.cont.2 unwind label %lpad.cleanup

invoke.cont.2:                        ; preds = %invoke.cont
  call void @"~Cleanup"(%struct.Cleanup* nonnull %obj) nounwind
  br label %return

return:                               ; preds = %invoke.cont.3, %invoke.cont.2
  %retval.0 = phi i32 [ 0, %invoke.cont.2 ], [ %3, %invoke.cont.3 ]
  ret i32 %retval.0

lpad.cleanup:                         ; preds = %invoke.cont
  %0 = cleanuppad within none []
  call void @"~Cleanup"(%struct.Cleanup* nonnull %obj) nounwind
  cleanupret from %0 unwind label %lpad.catch

lpad.catch:                           ; preds = %lpad.cleanup, %entry
  %1 = catchswitch within none [label %catch.body] unwind label %lpad.terminate

catch.body:                           ; preds = %lpad.catch
  %catch = catchpad within %1 [%rtti.TypeDescriptor2* @"??_R0H@8", i32 0, i32* %e]
  invoke void @"may_throw"()
          to label %invoke.cont.3 unwind label %lpad.terminate

invoke.cont.3:                        ; preds = %catch.body
  %3 = load i32, i32* %e, align 4
  catchret from %catch to label %return

lpad.terminate:                       ; preds = %catch.body, %lpad.catch
  %2 = cleanuppad within none []
  call void @abort() nounwind
  unreachable
}
```

I cleaned up a few of the names to make it clearer to read, but the key thing it illustrates is how a compiler can lower a normal (…)
During all this conversation, I've been reading up on the various stack and control primitives out there. These are the low-level primitives that exceptions, stack tracing, gc-root marking, generators, resumable exceptions, and continuation delimiting can be and often are implemented with. They have been found to often work well across module boundaries and even across language boundaries (or at least to facilitate cross-boundary coordination). Given their low-level nature, they don't require introducing any new value types. So I thought I'd review them here for y'all and attempt to illustrate how they'd work in and contribute to WebAssembly.
Simple Stack Marking
Let's start with helping languages manage their own memory. One key part of garbage collection is finding all the roots, i.e. all the garbage-collectable references currently on the stack. Yes, you could do this by maintaining your own list, but this list gets changed a lot, and it's only needed relatively infrequently (only when the garbage collector runs). So that's not particularly efficient. What you really want is a way to walk the current stack and mark all the references on it on demand. (Here I'm assuming that code occasionally calls `gc_collect`, explicitly handing over control, rather than the gc firing at arbitrary times.)

We'll do this using a stack mark:

```
mark gcroot : [] -> [i32]
```

This declares a new kind of stack mark called `gcroot` that provides an `i32` (with no input needed), which conceptually is a root on the stack. This declaration would be up in the events section, not at the instruction level.

Next we'll actually mark the stack:
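The original listing did not survive extraction; a plausible reconstruction of `foo` from the surrounding description (the name `gc_alloc` is a placeholder) is:

```
func foo {
  ...
  newref = gc_alloc(...);            // allocate gc-managed memory
  with-value-mark (gcroot newref) {
    ...                              // gc_collect here sees newref as a gcroot
  }
}
```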
Here we allocate some garbage-collected memory and assign the address to `newref`. Now this address is sitting on `foo`'s stack frame, and so we need to make sure its referenced memory isn't cleaned up until `foo` returns. The `with-value-mark (gcroot newref) {...}` construct places a `gcroot` mark with value `newref` on the stack (often using code regions to indicate which instructions are marked, imposing little to no run-time overhead) and keeps it there until the `...` in the body is exited. In particular, any call to `gc_collect` while that `...` body is still executing will see the value of `newref` on the stack as a `gcroot` mark. Which leaves the question: how does `gc_collect` observe that mark?

Stack Walking
Now let's dive into `gc_collect`, or really its helper function `collect_roots`:
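The body of `collect_roots` was lost in extraction; based on the description that follows, it presumably looks something like this sketch:

```
func collect_roots {
  walk-stack {
    loop {
      next-mark {
        gcroot { add_reachable(get-mark); }  // visit each gcroot up the stack
        none { return; }                     // top of stack reached: done
      }
    }
  }
}
```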
Here we have three new constructs: `walk-stack`, `next-mark`, and `get-mark`. The construct `next-mark` is only usable within `walk-stack`, and the construct `get-mark` is only usable within `next-mark`.

What `walk-stack` does is store a pointer into the current stack, starting with the present point in the stack, indicating where we currently are in the stack walk. `next-mark` then walks up the stack looking for marks. Here it has just one mark type specified, `gcroot`, but in general there could be many mark types specified. Once `next-mark` finds a mark of one of the specified types, it updates the pointer into the stack, and then it enters the body corresponding to that tag. Within that body, whenever `get-mark` is executed it returns the payload of that mark onto the stack. If at some point `next-mark` is executed and unable to find a matching mark, it executes the `none` body.

So `collect_roots` effectively iterates through all the `gcroot` marks on the stack and calls `add_reachable` on each of the respective `i32` addresses, returning when the top of the stack is reached.

Advanced Stack Marking
Now this is a little inefficient. Often a function references many roots at a time, and this will process them one by one. So let's optimize this example, first by understanding how `with-value-mark` is actually a shorthand.

Remember that the declaration `mark gcroot : [] -> [i32]` looks a lot like an input-output operation, not just a value. That's because it is. `with-value-mark` is just shorthand for "whenever someone does `get-mark` on this mark, provide this value".

So with that in mind, let's expand `foo` above:
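The expanded listing is also missing; applying the described desugaring, `foo` presumably becomes something like:

```
func foo {
  ...
  newref = gc_alloc(...);
  mark gcroot { push newref; }  // a block of type [] -> [i32], run by get-mark
  ...                           // marked region
}
```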
Here we see that the body of `mark gcroot { push newref; }` is just a block of type `[] -> [i32]`, matching the corresponding declaration. Whenever `get-mark` is called, it hands the current pointer into the stack to this block so that the block can access its stack frame while executing on top of the overall stack (rather than creating a new stack). Once this block terminates, its frame is popped off the stack. So this gives us a way to temporarily execute blocks from stack frames up the stack.

So let's take advantage of that and redesign our mark to be `gcaddroots : [] -> []`, and consider a new function `foo2` with multiple roots:
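A sketch of what `foo2` might look like, reconstructed from the description (the locals `ref1` and `ref2` are placeholder names):

```
func foo2 {
  ...
  ref1 = gc_alloc(...);
  ref2 = gc_alloc(...);
  mark gcaddroots {        // type [] -> []
    add_reachable(ref1);   // register both roots in a single visit
    add_reachable(ref2);
  }
  ...                      // marked region
}
```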
Here `gcaddroots` directly takes care of adding both references to the reachable set. This in turn makes `collect_roots` simpler:

So with lower-level stack primitives and control-flow primitives, we can enable much better support for manual memory management. Hopefully it's clear how these same primitives could enable one to implement `get_current_stack_trace()` and to even place stack-trace marks that indicate how WebAssembly code corresponds to source code, so that the result of `get_current_stack_trace()` is something that a non-wasm expert could actually use to debug their source code.

Stack Unwinding
Now for the second set of stack primitives used to implement stack unwinding. Stack unwinding is the process of cleaning up state as one pops frames off the stack. This cleanup takes the form of destructors in C++ and `finally` clauses in many other languages. The most familiar instigators of cleanup are function returns and thrown exceptions. As we'll see, exceptions are really many primitives put together. I'll use C# because it makes it easiest to see how to break exceptions down into these primitives.

Escaping

Consider the following C# function:
It uses the unqualified `catch` clause to catch both C++ and C# exceptions (at the cost of not being able to inspect the exception).

Here's how we would compile this to lower primitives (ignoring C++'s nested rethrow for now):
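The lowering itself is missing from the capture; based on the description below, its shape is presumably something like this (the mark name `csharp_handler` is my guess):

```
func bar {
  escape $target {
    mark csharp_handler {    // found by csharp_throw/cpp_throw walking the stack
      escape-to $target;     // unwind everything between here and $target
    }
    ...                      // the body of the C# try
  } hatch {
    ...                      // the body of the unqualified catch
  }
}
```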
Here we introduce two new primitives: `escape-hatch` and `escape-to`. The body of `escape $target` executes normally unless/until an `escape-to $target` is reached, at which point all of the stack between those two points is unwound (starting from `escape-to`) and then redirected to the `hatch` clause. This in particular means the `hatch` clause is never executed if `escape-to $target` is never executed. It also means that the wasm program has control over when the stack is unwound, and at the same time guarantees nothing is ever executed on an invalid stack.

If we look at the implementations of `cpp_throw`/`csharp_throw`, we see they work by walking up the stack until they find a handler. Note that `cpp_throw`/`csharp_throw` do not cause the stack to be unwound. Instead, as you can see in the lowering of `bar`, the handler is responsible for this, which is important for implementing more flexible exceptions (discussed next). If the handler doesn't do this, then `cpp_throw`/`csharp_throw` continue on to the next handler until they reach the top of the stack. Notice that `next-mark` doesn't have a `none` case here, which means `cpp_throw`/`csharp_throw` trap if they reach the top of the stack. Conveniently, all of the stack is still intact, so a debugger can kick in to let the programmer inspect the intact state of the program and the cause of the exception, or the host can collect the stack-trace marks to report the trace of the exception.

Handing Off State
Okay, before we get into filtered exceptions, let's first illustrate how to hand off state to the `escape-hatch`:

lowers to
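Both listings are missing here; from the description below, the lowered form presumably resembles this sketch (again, `csharp_handler` and `exn` are my own names):

```
func bar2 {
  escape $target {
    mark csharp_handler {
      escape-to $target (exn);  // hand the exception value off to the hatch
    }
    ...                         // the body of the try
  } hatch (exn) {
    ...                         // the catch (Exception e) body, with e = exn
  }
}
```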
Here we see that `escape-to` can hand off values to the `hatch`. We also see a case of a handler not escaping in `cpp_handler`. There's a flag in C# (nowadays set to true by default) that makes `catch (Exception e)` even catch (wrapped) C++ exceptions. If that flag is set, then the `cpp_handler` wraps the exception and forwards it to the `hatch`. If not, then the `cpp_handler` does nothing, so that `throw_cpp` continues on to a "real" `cpp_handler`.

Filtering Exceptions
Because we have given the wasm program more control, without any more changes we can support C#'s filtered exceptions. Consider the following example:

The `when` clause indicates that the exception should only be caught when that condition evaluates to true. This clause can be stateful and so is specified to be evaluated before any `finally` clauses or destructors. [Here](https://thomaslevesque.com/2015/06/21/exception-filters-in-c-6/)'s a blog post on why this is useful. (Hint: better debugging support is one reason.) As a consequence, we lower this program to the following:

Notice that no unwinding is done if the condition fails. We also didn't have to change `csharp_throw` to support any of this; `csharp_throw` just walks up the stack firing off potential exception handlers until one unwinds the stack (just like what a standard VM's implementation of exceptions does).

Stack Conventions
I say that we unwind the stack, but what does that mean? As a low-level primitive, it just means move the stack pointer and reclaim the stack. Technically that is it. But conventionally stack unwinding also means executing `finally` clauses and C++ destructors. Rather than baking in such a convention, let us modify `escape`/`hatch` to directly support it:

We have added an `unwind` clause to `bar4`. This forces escapes to `$target` to walk the stack as it is unwound, now frame by frame rather than all at once. Whenever the given mark is encountered, in this case `unwinder`, then the corresponding `unwind` clause is executed (while the stack frame for the mark is still intact). If the `escape-to` clause provides a value of type `t` to pass to `hatch`, then the `unwind` clause technically has type `[t] -> [t]`, giving it the chance to update this value, which can be useful in particular for collecting stack-trace marks while the stack is unwound. In the case of `bar4`, we don't care about updating the exception `e`, so we just use `get-tag` to execute the mark. In this way, the escape in `bar4` effectively calls all `unwinder` marks as it unwinds the stack.

Another lesser-known convention exists in languages with more complex control flow. Consider how a stack walk essentially maintains the stack but shifts the control focus of the program up the stack. Sometimes it is useful for function calls on the stack to be able to track this focus. For example, a C++ program maintains its own stack in addition to the wasm stack, and so it might be helpful to also track what point within that separate stack corresponds to the current focal point of the stack walk.
Let's modify our garbage-collection helper `collect_roots` to permit such a convention:

This modification invokes `focus-out` marks on the way out so that they can observe that the stack walk has moved out of that part of the stack. After the walk reaches the top of the stack, it then walks back and invokes `focus-in` marks to conceptually revert any changes that were made by the `focus-out` marks.

Especially with these conventions, this pattern of simply invoking marks as one passes them is extremely common. So let's use `stack-walk [next-pass*] [prev-pass*]` and `escape [unwind-pass*]` as shorthand for simply invoking `get-mark` whenever a `get-next`/`get-prev`/`unwind` passes a mark in the respective lists. This shorthand lets us abbreviate `collect_roots` to just the following:

Nested Rethrow
At this point we have all the core functionality needed to implement a variety of features. I need to get to bed, so I'll edit this with some more non-exception examples later, but first let me illustrate how this supports C++'s nested rethrow. Actually, I'll show it supports Python's nested rethrow because, unlike C++, Python actually has a stack-tracing convention that's worth demonstrating.
To see this convention, if you run the following program:
then you will get the following stack trace:

Notice that both `failure` and `refailure` are in the trace. This illustrates that Python builds the stack trace as it goes. Note: if you add prints, you'll also see that the stack for `failure` is unwound before `refailure` executes.

Here is how we can compile this (assuming for simplicity that all Python exceptions are `ValueError`s and all we care about is the trace):

There are three things to note here. First, `raise_with_trace` also observes the `code_line` marks in order to collect the stack trace as it walks up the stack to find a handler (which then is handed the stack trace up to that point). Second, in the lowering of `fail` the call to `refailure` is executed within a `python_except` mark. Third, the `reraise` function implementing Python's rethrow walks up the stack looking for such a `python_except` mark that provides the exception that was most recently caught. Altogether we get an implementation of Python's nested-reraise construct with its own semantics for stack tracing (that differs from both Java's and C#'s), and we didn't have to add anything new for it. Of course, the same technique can be done/adapted for C++'s nested-reraise construct.

Generators
A number of languages feature generators. There are a few variations on this concept, so I'll focus on C# and Python generators. These have a `foreach (x in list) {...}` construct that executes its body for each element "in" the list, and a generator-method construct in which `yield value` can be used to conveniently yield the elements "in" the list. Although `foreach` is often implemented by translating to `for (IEnumerator enum = list.GetEnumerator(); enum.MoveNext(); ) { x = enum.Current; ...; }` and by converting a generator-method into a state machine that dynamically allocates and updates the state to simulate control flow, the two features can be matched together to make for a much more efficient implementation of this common pattern. For simplicity, I'll assume every `IEnumerable` has a `Generate()` method that directly executes the `yield` statements rather than transforming them into a state machine.

Let's start by lowering the following C#-ish code:

to the following:

Next let's lower the body of some generating code:

to the following:
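The listings were lost in extraction; from the description below, the generating side presumably walks the stack for the nearest `foreach` mark each time it yields, roughly:

```
func GenerateInts(n, m) {
  ...
  // for each value x this generator wants to yield:
  walk-stack {
    next-mark {
      foreach { get-mark(x); }  // run the foreach body with the yielded value
    }
  }
  ...
}
```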
So the `foreach` sets up a `foreach` mark on the stack within which it calls `list.Generate()`, which let's say eventually calls to `GenerateInts(5, 4)`. Then `GenerateInts` walks up the stack to find that (first) `foreach` mark and executes its body, using `get-mark` to execute the `foreach` body with each `yield`ed value. This pattern effectively implements the standard stack-sharing implementation of generators, again without needing to introduce any more primitives (without even references).

Stack-Allocated Closures
Let's review for a sec. We have this mechanism for walking the stack, which gets us a pointer into the stack. We have this mechanism for getting a mark, which gets a code pointer. The two together effectively make a stack-allocated closure, and `get-mark` is just calling that closure. (Technically, this is slightly different from a stack-allocated closure, but the analogy is close enough for our purposes.) Because the closure is allocated on the stack, rather than on the heap, we have to make sure our reference to it does not outlive the stack frame it sits upon. This is why this pair is not a first-class value and why `get-mark` is restricted to being used within a `stack-walk` and `next-mark`.

So stack walking and marks give us a way to go up and get stack-allocated closures, but it's also useful to be able to send stack-allocated closures (i.e. a code pointer and into-stack pointer pair) down. I won't go into how this can be used to improve performance of higher-level languages, since that's still unpublished, but I will demonstrate how this can be used to optimize generators further and to help implement continuations.

First, let's extend functions with higher-order parameters so that one can write declarations like `(higher-param $sac (i32 i32) (i32 i32))` after parameter declarations like `(param $p i32)` and before `(result ...)`. These higher-order parameters are stack-allocated closures and consequently are invocable, let's say via `higher-local.invoke $sac`.
Using this, let's rewrite the lowering of `GenerateInts` to use stack-allocated closures:
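This listing is also missing; using the higher-order parameter syntax just introduced, it presumably looks something like:

```
func GenerateInts(n, m) (higher-param $foreach (i32) ()) {
  ...
  // yield x by invoking the closure directly; no stack walk needed
  higher-local.invoke $foreach (x);
  ...
}
```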
Notice that `GenerateInts` no longer has to walk the stack, which was the slowest part of its implementation before. It just invokes its higher-order parameter `foreach`. Note, though, that it does not treat `foreach` as a value (i.e. there's no `higher-order.get`), which guarantees the lifetime of `foreach` will not exceed the lifetime of the call to `GenerateInts`.

Next let's rewrite the lowering of `baz` to instead assume `IEnumerable.Generate` takes a stack-allocated closure:

Here we declare a new higher-order closure using `higher-order.let` (it's unsound to have a `higher-local.set`) and hand it off to `list.Generate`; no need for a stack mark anymore. This doesn't involve any dynamic allocation despite the fact that `foreach` closes over the local parameter `header`, because that local parameter is sitting on the stack frame and the lifetime of `foreach` is guaranteed to last only as long as the stack frame. In other words, the stack frame is its closure environment.

So we can even more efficiently implement generators with higher-order parameters and stack-allocated closures, but we can also combine first-class stacks with higher-order parameters to implement one-shot continuations.
First-Class Stacks
So far the mental model has been that there is one stack for the program, but it turns out that everything described works even if we have multiple stacks calling into (or really returning to) each other. Such functionality lets us implement all sorts of interesting things, like lightweight threads or even continuations.
Stack Type
The stack type is `stack (param ti*) (result to*)`, where `ti*` is the inputs the stack is expecting and `to*` is the outputs the stack will produce when/if it eventually returns. These stacks are just big state machines. We can run such a state machine to completion to get values returned, but it turns out to be much more useful for the state machine to pause on occasion, and this pausing is much more safely done voluntarily by the state machine than forcefully by some external entity.

Stack Allocation
To allocate a stack, one uses `stack.alloc $f : [t*] -> [(stack (param ti*) (result to*))]`, where `$f` is a `func (param ti* t*) (higher-param () (ti*)) (result to*)`. This creates a stack whose initial stack frame is the function `$f` along with some of its inputs and a special higher-order input. The higher-param with no inputs is to be used by `$f` to pause the newly allocated stack. `stack.alloc` initializes the portion of the stack frame corresponding to that higher-param to be (the pair of) the host-level code pointer for pausing a stack and the pointer to the freshly created stack. Since a non-executing stack is always awaiting `ti*` inputs, this "pause" will return back to `$f` the inputs it was given.

One thing to note is that the `t*` could include mutable references, which `$f` can use to externally expose changes to the stack's internal state as it desires. So while these stacks might seem like a big black box, whoever created the stack has the ability to open up as much of that black box as it wants. Of course, this all assumes we have a way to make the stack change.

Stack Stepping
After we allocate a stack, it's just sitting there. To make it actually do something, we use `stack.step $l : [(stack (param ti*) (result to*)) ti*] -> []`, where `$l` is a block from `to*`. This provides the stack with the `ti*` inputs and causes it to progress until either the "pause" higher-param is called from within the unknown enclosed `$f`, in which case `stack.step` returns normally, or the unknown enclosed `$f` returns, in which case control is transferred to `$l` with the returned values on the stack.

(Note that, for simplicity, I'm putting aside the complication that every stack needs to be guarded by a lock so that two threads cannot step the same stack at the same time. Later on I'll show how to adjust the instructions to better accommodate that constraint.)
Putting these together, we can implement lightweight cooperative worker threads:
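The original code block is not shown above; the following is a rough pseudocode sketch of the worker side, reconstructed from the description that follows. The `mark` syntax and everything not named in the prose are assumptions, and none of these instructions are standard WebAssembly.

```wasm
;; Hypothetical sketch — proposed instructions, not standard WebAssembly.
(func $new_worker (param $thread_state i32)
                  (result (stack (param) (result i32)))
  ;; bundle the thread state with worker_body into a fresh stack
  (stack.alloc $worker_body (local.get $thread_state)))

(func $worker_body (param $thread_state i32)
                   (higher-param $pause () ())  ;; installed by stack.alloc
                   (result i32)
  (mark $thread-locals : [] -> [i32]  ;; expose this thread's state to walks
    (local.get $thread_state))
  (mark $yielder : [] -> []           ;; yield() inside do_work walks up here...
    (higher-local.invoke $pause))     ;; ...which pauses this worker's stack
  (call $do_work))                    ;; returned value completes the stack
```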
Here `main` maintains a queue of workers, with each worker being a stack. It repeatedly pops a stack off the queue, makes it take a step, requeues the stack if it doesn't finish, and aggregates the result of its work with the ongoing result if it completes. Before it forces the stack to take a step, it sets up a `spawner` stack mark to catch any stack walks done by a call to `spawn` within the worker stack, adding any provided workers to the queue.

The function `new_worker`'s job is solely to create the stack, which it does by bundling the given `$thread_state` with the function `worker_body`, whose body is really where all the interesting stuff happens. Looking inside `worker_body`, we see that it sets up two stack marks. The `thread-locals` stack mark simply provides the state of the thread. The `yielder` stack mark invokes the provided `pause` higher-order parameter. This means that calls to `yield()` within `do_work` will walk up the stack and execute this mark, thereby running `pause`, which `stack.alloc` has set up to cause the stack to pause. Note that, if the language compiling to wasm wants to support lightweight threads with very efficient yielding, i.e. not having to do a stack walk to find the "pause", then these primitives would also support another implementation strategy in which all compiled functions take a `pause` higher-order parameter and pass "pause" down to whomever they call. That is, we are not forcing a particular implementation strategy upon the language implementer; we are giving them the low-level primitives to choose their own.

Hopefully this illustrates how these low-level primitives combine to support high-level patterns like algebraic effects. If the "event" of the algebraic effect has an input, the analog of `worker_body` would make its `yielder` mark update the `$thread_state` with that input. But we can also do a lot more than what algebraic effects support!

Stack Extension
In order to allocate a `stack (param ti*) (result to*)`, we allocated a partial stack frame for a function that pauses and otherwise converts `ti*` into `to*`. Once we have such a stack, if it's not executing then we know that its current stack frame is awaiting some `ti*`s. So we can extend the stack with another stack frame that will return `ti*`s. In order to preserve the type of the stack, this stack frame must itself be awaiting some `ti*`s. This suggests a `stack.extend` primitive, where `$f` is a `func (param ti* t*) (higher-param () (ti*)) (result ti*)`.

This turns out to be surprisingly useful, especially for doing stack inspection. For example, suppose we wanted to combine our lightweight threads with our `collect_roots` implementation for program-managed garbage collection. The problem is that `main` has a lot of stacks representing worker threads that refer to root addresses that need to be collected. We can use stack extension to address this.
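The revised code is not shown above, but based on the walkthrough that follows, the extension frame might look roughly like this. The `mark` and `stack-wall` syntax is assumed, and none of these instructions are standard WebAssembly.

```wasm
;; Hypothetical sketch — proposed instructions, not standard WebAssembly.
;; For each worker stack $s in stack_list, main's gcaddroots mark would
;; roughly do: (stack.extend $s $collect_worker_roots), then step $s once.
(func $collect_worker_roots
  (higher-param $pause () ())      ;; these worker stacks await no inputs
  (stack-wall                      ;; walks stop here instead of rewalking main
    (call $collect_roots))         ;; walk this worker's stack, running its
                                   ;; gcaddroots marks along the way
  (higher-local.invoke $pause))    ;; pause, returning control to main
```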
main
sets up itsgcaddroots
mark to go through each of the stacks instack_list
and collect their roots. It does so by adding a stack frame forcollect_worker_roots
onto each stack, stepping the stack to cause it to callcollect_roots
, which then walks that stack to rungcaddroots
along it, and then to pause, thereby returning control back tomain
. Thestack-wall
construct is just an optimization that makes stack walks think they have hit the top of the stack at that point so that each of these calls tocollect_roots
within the thread stacks don't each rewalkmain
's stack.stack-wall
is also useful for security purposes as it prevents callees from observing the marks on the caller's stack and prevents even observing the time it takes to walk the caller's stack.Notice that the above example does a
stack.extend
and then astack.step
. There is actually one primitive that combines these two together and is in fact slightly stronger and lower-level than the combination:where
$f
is afunc (param t*) (higher-param () (ti*)) (result ti*)
and$l
is a block fromto*
. As it sounds, this extends the stack frame with$f
and steps the stack in one go. Notice that, unlike withstack.extend
,$f
does not taketi*
as inputs, and that, unlikestack.step
, there need not be anyti*
on the stack. Sostack.extend
essentially adds a stack frame that first pauses to get more inputs and then calls the provided function, whereasstack.step
adds a stack frame with the provided inputs for the function that simply returns those inputs.What is particularly useful about
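Put as pseudocode, the two earlier primitives can then be seen as special uses of `stack.extend_and_step`. This encoding is my reading of the description above, not a spec; `ti*` and `t*` are metavariables for concrete types, and none of this is standard WebAssembly.

```wasm
;; Hypothetical encodings — not standard WebAssembly.
;; stack.extend $f  ≈  extend_and_step with a frame that pauses first:
(func $extend_encoding (param t*) (higher-param $pause () (ti*)) (result ti*)
  (higher-local.invoke $pause)  ;; pause; resuming supplies the awaited ti*
  (call $f))                    ;; then run $f on those inputs (and the t*)

;; stack.step with ti* on the stack  ≈  extend_and_step with a frame that
;; simply returns the provided inputs:
(func $step_encoding (param ti*) (higher-param () (ti*)) (result ti*)
  (local.get ...))              ;; return the captured ti* values unchanged
```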
stack.extend_and_step
, besides being more true to what actually happens at the low level, is that it gives you a way to step the stack without providing the expected inputs for the stack, as in the following variant where the worker threads each expect ani32
.Stack Running
In both stack allocation and extension, we were careful to make sure no local stack state could be caught in the first-class stack. The reason is that, even if we were to immediately step the first-class stack, it could pause and outlive the local stack frame. But, although pausing is the key feature of first-class stacks, sometimes one reaches a point where one simply wants to run a first-class stack to completion, disabling pausing. This effectively fuses (the lifetime of) the first-class stack with (that of) the local stack, and as such we can safely permit local state to be infused into the first-class stack.
We achieve this functionality with the following primitive:
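The code block for the primitive is not shown above. Judging from the description that follows, and from the block-style `stack.acquire($s := expr) {...}` syntax introduced later, it might look something like the following; the concrete syntax here is purely a guess, and none of it is standard WebAssembly.

```wasm
;; Hypothetical syntax sketch — not standard WebAssembly.
(stack.run $s (result to*)
  (...)             ;; main body: runs mounted on the first-class stack, so
                    ;; local state may now safely flow into that stack
  (pausing (...)))  ;; executed whenever the first-class stack would pause
;; the whole construct yields the to* results of the completed stack
```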
This conceptually mounts the given first-class stack on the local stack, extends the first-class stack with the frame for the main body of `stack.run`, and then resumes the first-class stack, executing the `pausing` clause whenever the first-class stack would have paused, eventually returning the `result` values of the first-class stack. The trick with the `pausing` clauses is most straightforwardly implemented by allocating space within each stack for a stack-allocated closure that is initially null and that the stack-allocated closure originally passed to `pause` first checks, deferring to it if the value is no longer null.

We can use this in our thread-workers example to forcibly abort threads. That is, suppose it is possible for the `main` loop to recognize early that the `$result` has already been determined, e.g. because it's computing a big conjunction and some worker has resulted in false, making it unnecessary to continue executing the remaining threads. Those threads might still be holding onto resources, though, so it is important to clean them up. We can do so by adding the following after the `main` loop:

Stack Acquiring
Earlier I deferred the issue that first-class stacks need to have associated locks, to prevent someone from, say, trying to extend a stack with a frame while another program is stepping the stack. We could have every stack instruction acquire and release this lock, but that seems inefficient and doesn't address the fact that one thread might want to guarantee that a number of operations happen in direct succession. So instead we add the construct `stack.acquire($s := expr) {...}`. This pops off the result of `expr`, which should be a `stack`, acquires the lock on that stack, binds a local-variable-of-sorts `$s` to that stack, executes the `...` body within which `$s` is in scope, and then releases the lock on the stack. Its input and output types are the same as those of the body `...`, plus an additional `stack` input.

Now `$s` is not really a local variable; it simply names a way to refer to the `stack.acquire`. So the second thing we do is modify all of the above instructions to refer to `$s` rather than take a `stack` as input. That way these instructions do not need to acquire/release the stack's lock, because they know that the stack is already acquired.
There is a lot going on here. First,
stack_append
acquires the lock ons
, the stack to be appended to. Then it extendss
with the stack frame forstack_append_helper
, capturingsapp
as the argument to the correspond parameter, which it then steps into, effectively starting the execution ofstack_append_helper
. This simply acquires the lock onsapp
, and then runssapp
but immediatle pauses execution to get the values to provide tosapp
, restoring control tostack_append
. The acquire ons
then completes, releasing the lock, andstack_append
returns. So by the end of this, no one is holding the lock on the stacks
, but the stack frame added ontos
is holding the lock tosapp
. Furthermore, the next time someone tries to steps
, the value will be returned tosapp
instead, while will either run to completion or, due to thepausing
clause, causes
to pause wheneversapp
would have paused.Altogether, semantically speaking after
stack_append
completes it is as if the entirety ofsapp
has been mounted ontos
and locked into place, combining the two into one. Furthermore, one design choice I made in the above primitives is that this is completely equivalent to permanently acquiring the lock onsapp
and then directly copying each of the stack frames onsapp
ontos
one by one. In other words, unless you were the one to create a stack boundary, you cannot observe a stack boundary.(Side note: you can use the technique above to also compose stacks of types
[ti*] -> [t*]
and[t*] -> [to*]
into a stack of type[ti*] -> [to*]
, just like you can do with closures.)Stack Duplication
Lastly, for stack duplication we mainly just add the instruction `stack.duplicate $s : [] -> [(stack (param ti*) (result to*))]`, where `$s` references a `stack.acquire` of a `stack (param ti*) (result to*)`. This primarily just duplicates the stack via memcpy. However, there is one complication.

The stack being duplicated can be in the midst of a `step` or `run` of another stack, in which case it has acquired the lock on that stack. We cannot have two stacks holding onto the same lock, so we have to duplicate that stack as well. In other words, while a stack is stepping or running another stack, the two are temporarily conceptually one stack and so must both be duplicated.

As a convention, after duplicating a stack, one could walk the duplicate and run each `mark duplicator : [] -> []` on it. These `duplicator` marks could update the stack frame to duplicate relevant resources (or trap if impossible) so that the duplicate is reasonably independent of the original. For example, the stack could be in the middle of a `stack.acquire` on the stack held in some local variable `$x`, and the `duplicator` may want to update `$x` to reference the duplicated stack. For this reason, we also add `stack.get $s`, which returns the stack currently acquired by the referenced `stack.acquire`—which might not be the stack value that was first passed to the `stack.acquire`, because a `stack.duplicate` might have happened. It is an odd corner case, but from what I can tell this works out and provides the right primitives for supporting multi-shot continuations.

Wrapping Up
So what do y'all think of these primitives? Obviously this is a lot, but we don't have to add it all at once. In particular, the ones about first-class stacks don't even make sense to add until after gc, since stacks will need to be memory-managed. But I thought y'all would appreciate seeing a roadmap for this functionality and how it all fits together.