Author(s): Dan Scales, Keith Randall, and Austin Clements (with input from many others, including Russ Cox and Cherry Zhang)
Last updated: 2019-09-23
Discussion at https://golang.org/issue/34481
General defer performance discussion at https://golang.org/issue/14939.
As of Go 1.13, most defer
operations take about 35ns (reduced from about 50ns
in Go 1.12).
In contrast, a direct call takes about 6ns.
This gap incentivizes engineers to eliminate defer
operations from hot code
paths, which takes away time from more productive tasks, leads to less
maintainable code (e.g., if a panic
is later introduced, the "optimization" is
no longer correct), and discourages people from using a language feature when it
would otherwise be an appropriate solution to a problem.
We propose a way to make most defer
calls no more expensive than open-coding the
call, hence eliminating the incentives to shy away from using this language
feature to its fullest extent.
Go 1.13 implements the defer
statement by calling into the runtime to push a
"defer object" onto the defer chain.
Then, in any function that contains defer
statements, the compiler inserts a
call to runtime.deferreturn
at every function exit point to unwind that
function's defers.
Both of these cause overhead: the defer object must be populated with function
call information (function pointer, arguments, closure information, etc.) when
it is added to the chain, and deferreturn
must find the right defers to
unwind, copy out the call information, and invoke the deferred calls.
Furthermore, this inhibits compiler optimizations like inlining, since the defer
functions are called from the runtime using the defer information.
When a function panics, the runtime runs the deferred calls on the defer chain
until one of these calls recover
or it exhausts the chain, resulting in a
fatal panic.
The stack itself is not unwound unless a deferred call recovers.
This has the important property that examining the stack from a deferred call
run during a panic will include the panicking frame, even if the defer was
pushed by an ancestor of the panicking frame.
In general, this defer chain is necessary since a function can defer an
unbounded or dynamic number of calls that must all run when it returns.
For example, a defer
statement can appear in a loop or an if
block.
This also means that, in general, the defer objects must be heap-allocated,
though the runtime uses an allocation pool to reduce the cost of the allocation.
This is notably different from exception handling in C++ or Java, where the
applicable set of except
or finally
blocks can be determined statically at
every program counter in a function.
In these languages, the non-exception case is typically inlined and exception
handling is driven by a side table giving the locations of the except
and
finally
blocks that apply to each PC.
However, while Go's defer
mechanism permits unbounded calls, the vast majority
of functions that use defer
invoke each defer
statement at most once, and do
not invoke defer
in a loop.
Go 1.13 adds an optimization to stack-allocate
defer objects in this case, but they must still be pushed and popped from the
defer chain.
This applies to 363 out of the 370 static defer sites in the cmd/go
binary and
speeds up this case by 30% relative to heap-allocated defer objects.
This proposal combines this insight with the insights used in C++ and Java to
make the non-panic case of most defer
operations no more expensive than the
manually open-coded case, while retaining correct panic
handling.
While the common case of defer handling is simple enough, it can interact in often non-obvious ways with things like recursive panics, recover, and stack traces. Here we attempt to enumerate the requirements that any new defer implementation should likely satisfy, in addition to those in the language specification for Defer statements and Handling panics.
-
Executing a
defer
statement logically pushes a deferred function call onto a per-goroutine stack. Deferred calls are always executed starting from the top of this stack (hence in the reverse order of the execution ofdefer
statements). Furthermore, each execution of adefer
statement corresponds to exactly one deferred call (except in the case of program termination, where a deferred function may not be called at all). -
Defer calls are executed in one of two ways. Whenever a function call returns normally, the runtime starts popping and executing all existing defer calls for that stack frame only (in reverse order of original execution). Separately, whenever a panic (or a call to Goexit) occurs, the runtime starts popping and executing all existing defer calls for the entire defer stack. The execution of any defer call may be interrupted by a panic within the execution of the defer call.
-
A program may have multiple outstanding panics, since a recursive (second) panic may occur during any of the defer calls being executed during the processing of the first panic. A previous panic is “aborted” if the processing of defers by the new panic reaches the frame where the previous panic was processing defers when the new panic happened. When a defer call returns that did a successful
recover
that applies to a panic, the stack is immediately unwound to the frame which contains the defer that did the recover call, and any remaining defers in that frame are executed. Normal execution continues in the preceding frame (but note that normal execution may actually be continuing a defer call for an outer panic). Any panic that has not been recovered or aborted must still appear on the caller stack. Note that the first panic may never continue its defer processing, if the second panic actually successfully runs all defer calls, but the original panic must appear on the stack during all the processing by the second panic. -
When a defer call is executed because a function is returning normally (whether there are any outstanding panics or not), the call site of a deferred call must appear to be the function that invoked
defer
to push that function on the defer stack, at the line where that function is returning. A consequence of this is that, if the runtime is executing deferred calls in panic mode and a deferred call recovers, it must unwind the stack immediately after that deferred call returns and before executing another deferred call. -
When a defer call is executed because of an explicit panic, the call stack of a deferred function must include
runtime.gopanic
and the frame that panicked (and its callers) immediately below the deferred function call. As mentioned, the call stack must also include any outstanding previous panics. If a defer call is executed because of a run-time panic, the same condition applies, except thatruntime.gopanic
does not necessarily need to be on the stack. (In the current gc-go implementation,runtime.gopanic
does appear on the stack even for run-time panics.)
We propose optimizing deferred calls in functions where every defer
is
executed at most once (specifically, a defer
may be on a conditional path, but
is never in a loop in the control-flow graph).
In this optimization, the compiler assigns a bit for every defer
site to
indicate whether that defer had been reached or not.
The defer
statement itself simply sets the corresponding bit and stores all
necessary arguments in specific stack slots.
Then, at every exit point of the function, the compiler open-codes each deferred
call, protected by (and clearing) each corresponding bit.
For example, the following:
defer f1(a)
if cond {
defer f2(b)
}
body...
would compile to
deferBits |= 1<<0
tmpF1 = f1
tmpA = a
if cond {
deferBits |= 1<<1
tmpF2 = f2
tmpB = b
}
body...
exit:
if deferBits & 1<<1 != 0 {
deferBits &^= 1<<1
tmpF2(tmpB)
}
if deferBits & 1<<0 != 0 {
deferBits &^= 1<<0
tmpF1(tmpA)
}
In order to ensure that the value of deferBits
and all the tmp variables are
available in case of a panic, these variables must be allocated explicit stack
slots and the stores to deferBits and the tmp variables (tmpF1
, tmpA
, etc.)
must write the values into these stack slots.
In addition, the updates to deferBits
in the defer exit code must explicitly
store the deferBits
value to the corresponding stack slot.
This will ensure that panic processing can determine exactly which defers have
been executed so far.
However, the defer exit code can still be optimized significantly in many cases.
We can refer directly to the deferBits
and tmpA ‘values’ (in the SSA sense),
and these accesses can therefore be optimized in terms of using existing values
in registers, propagating constants, etc.
Also, if the defers were called unconditionally, then constant propagation may
in some cases to eliminate the checks on deferBits
(because the value of
deferBits
is known statically at the exit point).
If there are multiple exits (returns) from the function, we can either duplicate the defer exit code at each exit, or we can have one copy of the defer exit code that is shared among all (or most) of the exits. Note that any sharing of defer-exit code code may lead to less specific line numbers (which don’t indicate the exact exit location) if the user happens to look at the call stack while in a call made by the defer exit code.
For any function that contains a defer which could be executed more than once (e.g. occurs in a loop), we will fall back to the current way of handling defers. That is, we will create a defer object at each defer statement and push it on to the defer chain. At function exit, we will call deferreturn to execute an active defer objects for that function. We may similarly revert to the defer chain implementation if there are too many defers or too many function exits. Our goal is to optimize the common cases where current defers overheads show up, which is typically in small functions with only a small number of defers statements.
Because no actual defer records have been created, panic processing is quite
different and somewhat more complex in the open-coded approach.
When generating the code for a function, the compiler also emits an extra set of
FUNCDATA
information that records information about each of the open-coded
defers.
For each open-coded defer, the compiler emits FUNCDATA
that specifies the
exact stack locations that store the function pointer and each of the arguments.
It also emits the location of the stack slot containing deferBits
.
Since stack frames can get arbitrarily large, the compiler uses a varint
encoding for the stack slot offsets.
In addition, for all functions with open-coded defers, the compiler adds a small
segment of code that does a call to runtime.deferreturn
and then returns.
This code segment is not reachable by the main code of the function, but is used
to unwind the stack properly when a panic is successfully recovered.
To handle a panic
, the runtime conceptually walks the defer chain in parallel
with the stack in order to interleave execution of pushed defers with defers in
open-coded frames.
When the runtime encounters an open-coded frame F
executing function f
, it
executes the following steps.
-
The runtime reads the funcdata for function
f
that contains the open-defer information. -
Using the information about the location in frame
F
of the stack slot fordeferBits
, the runtime loads the current value ofdeferBits
for this frame. The runtime processes each of the active defers, as specified by the value ofdeferBits
, in reverse order. -
For each active defer, the runtime loads the function pointer for the defer call from the appropriate stack slot. It also builds up an argument frame by copying each of the defer arguments from its specified stack slot to the appropriate location in the argument frame. It then updates
deferBits
in its stack slot after masking off the bit for the current defer. Then it uses the function pointer and argument frame to call the deferred function. -
If the defer call returns normally without doing a recovery, then the runtime continues executing active defer calls for frame F until all active defer calls have finished.
-
If any defer call returns normally but has done a successful recover, then the runtime stops processing defers in the current frame. There may or may not be any remaining defers to process. The runtime then arranges to jump to the
deferreturn
code segment and unwind the stack to frameF
, by simultaneously setting the PC to the address of thedeferreturn
segment and setting the SP to the appropriate value for frameF
. Thedeferreturn
code segment then calls back into the runtime. The runtime can now process any remaining active defers from frameF
. But for these defers, the stack has been appropriately unwound and the defer appears to be called directly from functionf
. When all defers for the frame have finished, the deferreturn finishes and the code segment returns from frame F to continue execution.
If a deferred call in step 3 itself panics, the runtime starts its normal panic processing again. For any frame with open-coded defers that has already run some defers, the deferBits value at the specified stack slot will always accurately reflect the remaining defers that need to be run.
The processing for runtime.Goexit
is similar.
The main difference is that there is no panic happening, so there is no need to
check for or do special handling for recovery in runtime.Goexit
.
A panic could happen while running defer functions for runtime.Goexit
, but
that will be handled in runtime.gopanic
.
One other approach that we extensively considered (and prototyped) also has inlined defer code for the normal case, but actual executes the defer exit code directly even in the panic case. Executing the defer exit code in the panic case requires duplication of stack frame F and some complex runtime code to start execution of the defer exit code using this new duplicated frame and to regain control when the defer exit code complete. The required runtime code for this approach is much more architecture-dependent and seems to be much more complex (and possibly fragile).
This proposal does not change any user-facing APIs, and hence satisfies the compatibility guidelines.
An implementation has been mostly done. The change is here. Comments on the design or implementation are very welcome.
Some miscellaneous implementation details:
-
We need to restrict the number of defers in a function to the size of the deferBits bitmask. To minimize code size, we currently make deferBits to be 8 bits, and don’t do open-coded defers if there are more than 8 defers in a function. If there are more than 8 defers in a function, we revert to the standard defer chain implementation.
-
The deferBits variable and defer arguments variables (such as
tmpA
) must be declared (viaOpVarDef
) in the entry block, since the unconditional defer exit code at the bottom of the function will access them, so these variables are live throughout the entire function. (And, of course, they can be accessed by panic processing at any point within the function that might cause a panic.) For any defer argument stack slots that are pointers (or contain pointers), we must initialize those stack slots to zero in the entry block. The initialization is required for garbage collection, which doesn’t know which of these defer arguments are active (i.e. which of the defer sites have been reached, but the corresponding defer call has not yet happened) -
Because the
deferreturn
code segment is disconnected from the rest of the function, it would not normally indicate that any stack slots are live. However, we want the liveness information at thedeferreturn
call to indicate that all of the stack slots associated with defers (which may include pointers to variables accessed by closures) and all of the return values are live. We must explicitly set the liveness for thedeferreturn
call to be the same as the liveness at the first defer call on the defer exit path.