Syscallbuf on AArch64
This document presents the logical steps that led to the initial implementation of syscallbuf on AArch64. The code below reflects the implementation of syscallbuf as designed and merged as of 2022/07/01 and may not be in sync with the current implementation.
-
Need for runtime stub
On AArch64, all instructions have the same length. This means that we can replace the syscall instruction `svc 0` with any other single instruction we want. However, since an instruction is only 32 bits long, there is no way to encode the jump target address in the instruction, so it's impossible for us to jump directly to the syscall hook in `librrpreload.so`. Thankfully, this is an issue that affects x86_64 as well due to the lack of a 64-bit immediate in the branch instruction (as well as the stack switching code, which will be discussed later). The same trick used there works for AArch64 as well. We simply need to allocate a stub at runtime close enough to the syscall site, and then we can have as many instructions and immediates as we want (within reason) in the stub to encode the jump to the syscall hook.
-
Register requirement
On AArch64, the glibc syscall wrapper assumes that all visible state of the process other than `x0` (used for the syscall return value) remains unchanged. This includes the low 128 bits of the vector registers (it doesn't include the higher bits of the SVE registers, but we can't rely on SVE being present yet, and even if it is, the implementation is not required to have higher bits, e.g. `neoverse-n2`). This effectively means that we can't use any registers before we save some of them to memory.
In principle, the processor state flags (`nzcv`) also need to be saved. However, not saving them seems to be OK so far. If they really need to be saved, we might need to use some branches to detect the flags up front, since we need to use a branch before we can save register values (see below).
-
Stack requirement
In order to support things like Go, which does aggressive tricks with the stack, we want to avoid using the native stack in the syscallbuf code. This severely limits how we can save the registers to memory. On AArch64, all instructions that write to memory must use one of the 31 general purpose registers or the stack pointer as the base pointer to compute the address. PC-relative addressing also exists but is limited to
- address computation, which stores the result to a register, overwriting its previous value
- branches, which are virtually of no use for writing to memory
- prefetch and load instructions (but not stores...).
It's worth noting that this isn't an issue on 32-bit ARM since PC can be used as a general purpose register in both loads and stores. If we cannot assume that SP points to valid memory, and we don't know the value of any other register (`xzr` cannot be the base register used for address computation), there is no way we can store to a known address without trashing a register.
Fortunately, even though we don't know the exact values of any of the other registers, nor their offsets to a known memory location, we do know the likely range of one register, or at least the range of that register that we care about, i.e. the syscall number register `x8`. If the value of `x8` is larger than some pre-determined value (~1000 should be good enough since that's the `RR_CALL_BASE`), we can either issue a real syscall directly or just return `-ENOSYS`. If the value of `x8` is within a small range, we can map it in a reversible way to a range of addresses that we know is valid, and use that to store any number of other registers to memory.
For this to work we need to do the comparison and the `x8` to address mapping without using any registers. The comparison can be done easily using either `cmp` or `tst`. AFAICT, there is no easy way to do the comparison without setting the processor flags (a chain of `tbnz` works but is not ideal). The most straightforward mapping is to map `x8` to an address in the thread local page at `0x70010000`. Unfortunately, this offset cannot be encoded in an add instruction. We can, however, use `movk` since we know the high bits of `x8` are zero and don't need to be saved. To summarize, the minimum stub prologue we should use is roughly

        cmp x8, 1024
        b.hi .Lnosys
        movk x8, 0x7001, lsl 16 // sets bits 16-31, so x8 = 0x70010000 + syscall number
        stp xm, xn, [x8, #offset]
        // we can use xm and xn after this point
    .Lnosys:
        mov x0, -ENOSYS // or `svc 0` if we want to support that.
        b syscall_return_address
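The reversibility of the `x8` mapping can be checked with a small model. This is only a sketch in Python rather than the actual stub: `movk`, `map_syscall_number` and `restore_syscall_number` are hypothetical helper names, and the bound of 1024 and the page address `0x70010000` are taken from the text above.

```python
MASK64 = (1 << 64) - 1
THREAD_LOCALS_PAGE = 0x70010000  # thread local page address from the text
MAX_SYSCALL = 1024               # bound used by the `cmp x8, 1024` check


def movk(reg: int, imm16: int, shift: int) -> int:
    """Model AArch64 MOVK: replace one 16-bit field of reg, keep the rest."""
    assert 0 <= imm16 <= 0xFFFF and shift in (0, 16, 32, 48)
    return ((reg & ~(0xFFFF << shift)) | (imm16 << shift)) & MASK64


def map_syscall_number(x8: int):
    """Return the scratch base address for x8, or None for the .Lnosys path."""
    if x8 > MAX_SYSCALL:  # b.hi is an unsigned "higher" comparison
        return None
    return movk(x8, THREAD_LOCALS_PAGE >> 16, 16)


def restore_syscall_number(addr: int) -> int:
    """Undo the mapping by clearing bits 16-31 again."""
    return movk(addr, 0, 16)


addr = map_syscall_number(0xDC)  # 0xdc is the clone syscall number
print(hex(addr), restore_syscall_number(addr))
```

Because the low 16 bits are left untouched, no second register is needed to remember the syscall number while `x8` doubles as the base address.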
-
Exiting from the stub
Before we return to the original code, we need to set all registers back to their original values (minus the ones changed by the syscall). However, we need to use at least one register to store the address for the branch back, and since we don't have any instructions after the syscall, we cannot have any instructions in the original code to restore the register values, so this has to be done in the runtime stub instead. We could in principle use PC-relative addressing to load/restore the original value of this register, but it is easier to use a register to point to an address in the thread local area instead, so that it can more easily use the current logic for thread local storage and avoid potential conflicts between threads. The restoring logic can simply be `ldp xm, xn, [xm]`, and we can actually reuse part of the same area we've allocated for the initial stash for this purpose. This area will also be used for restoring registers from `clone` (see below), and it will need to be cloned (just the first two elements) for a cloned task.
-
`clone` handling
The `clone` syscall may do something to the stack that is very difficult for us to handle in the C code. Therefore, we should have a branch to check for it and do a raw traced syscall directly from the syscall hook assembly code (x86 avoids this issue by failing to match and patch the syscall site). We can do this with a simple `cmp x8, 0xdc` and a branch when entering the syscall hook code. We have to be careful never to touch `sp` in this branch, since the clone syscall may be doing things with it that we don't want to deal with. The branch may need to happen after we've restored `x8` and saved the registers in the location that the stub will restore them from.
-
Passing information from stub to the syscall hook.
After we enter the syscall hook, we need to know how to make the syscall and return to where we came from. Since we need to return to the stub, this means we need to know the address of the stub. This is most easily done by using a `blr` instruction from the stub when jumping to the syscall hook. This overwrites the `x30` register, so that's one of the registers we need to save (to memory) in the stub. In order to help with unwinding, and since the stub won't have unwind info, we should pass the address of the real call site to the syscall hook as well. We can do this by simply storing the address in the stub and having the syscall hook code load it from there. The syscall hook code would just need to know how to find it based on the `x30` value after we enter from the stub code. To make the control flow look more like normal code, we'll store the return address with an offset, past the end of all the instructions in the stub.
-
Full stub code
        cmp x8, 1024
        b.hi .Lnosys
        movk x8, preload_thread_locals
        stp x15, x30, [x8, stub_scratch_2 - preload_thread_locals]
        movz x30, #:abs_g3:_syscall_hook_trampoline
        movk x30, #:abs_g2_nc:_syscall_hook_trampoline
        movk x30, #:abs_g1_nc:_syscall_hook_trampoline
        movk x30, #:abs_g0_nc:_syscall_hook_trampoline // Might be shorter depending on the address
        blr x30
        // we return from syscall hook to here
        ldp x15, x30, [x15]
    .Lreturn:
        b syscall_return_address
    .Lnosys:
        mov x0, -ENOSYS // or `svc 0` if we want to support that.
        b .Lreturn
        .long <syscall return address>
We save `x15` in addition to `x30` since it's much easier if we have two registers to play with. Since `x15` is a scratch register, hopefully it matters less if the debugger can't restore it when unwinding. (We'll have valid unwind info for it so this shouldn't matter much.)
Update: it turns out that at least the RR tests do rely on an invalid syscall still triggering an event (which it won't if we simply return ENOSYS from userspace), so we need to do a syscall from within the stub, with a check in the patching code to avoid patching this syscall again.
-
IP range checking that involves the runtime stub
AFAICT, there is currently one place where RR checks whether the code is in the runtime jump stub. With the need to make a syscall from it (see above), we also need to add another one. These are:
-
Check to make sure if we can deliver a signal
The code also makes the assumption that if we are in the stub (more like syscallbuf) code, we'll exit through `_syscallbuf_final_exit_instruction`, so that we can catch it by setting a breakpoint there. Therefore, we have to skip the first two instructions (the `x8` range check) and the last four instructions (the return stub and the fallback syscall handling) in the check. We could maybe change the `_syscallbuf_final_exit_instruction` handling to look for the addresses in the stubs instead, but that's a bit unnecessary...
Since we are returning through the stub epilogue and the code there uses the thread local memory, we need to avoid delivering a signal before the stub finishes using the thread local memory. Otherwise, the user signal handler might clobber it, causing us to restore the registers to the wrong content. The only instruction we need to be careful of here is the `ldp` on the return path. It is past the normal syscall hook exit breakpoint, so we have to deal with it slightly differently. For now we can simply check and add a breakpoint on the exit of the stub and manually do the return from the stub when we hit that breakpoint.
-
Check to make sure if we should patch the syscall.
This should include the full stub range.
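The two range checks described above differ only in which part of the stub they cover. The following is a rough Python sketch of that difference, not rr's actual code: the function names are hypothetical, and the stub length is passed in as a parameter because the text notes it can vary with the trampoline address.

```python
INSN_SIZE = 4       # every AArch64 instruction is 4 bytes
PROLOGUE_INSNS = 2  # the x8 range check (cmp + b.hi)
EPILOGUE_INSNS = 4  # return stub and fallback syscall handling


def in_stub_for_patch_check(ip: int, stub_start: int, stub_insns: int) -> bool:
    """Patching check: any instruction inside the stub counts."""
    return stub_start <= ip < stub_start + stub_insns * INSN_SIZE


def in_stub_for_signal_check(ip: int, stub_start: int, stub_insns: int) -> bool:
    """Signal check: skip the first two and the last four instructions."""
    lo = stub_start + PROLOGUE_INSNS * INSN_SIZE
    hi = stub_start + (stub_insns - EPILOGUE_INSNS) * INSN_SIZE
    return lo <= ip < hi


# Assumed example: a 13-instruction stub placed at 0x1000.
print(in_stub_for_patch_check(0x1000, 0x1000, 13))   # first instruction
print(in_stub_for_signal_check(0x1000, 0x1000, 13))  # excluded from signal check
```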
-
-
Unpatching
Since we use the stub for the return as well, and it seems that some caller might be in the hook when we call unpatch, we need to make sure not to overwrite anything that's used by the return path. It seems easiest to just overwrite the first two instructions with `svc 0` and `b syscall_return_address`. Since the branch instruction encodes a relative address, this won't be the same instruction as the original branch, and we need to make sure the address of the stub is within the right range for the unpatch jump to be encoded. (In practice though, the fact that we can jump here and jump back already guarantees that mathematically.)
-
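To make the range constraint on the unpatch branch concrete, here is a small Python sketch that builds the two replacement words. The helper names and the example addresses are made up for illustration; the encodings used are the standard A64 ones for `svc 0` and for an unconditional `b` with a signed 26-bit word offset (about +/-128 MiB).

```python
SVC_0 = 0xD4000001     # encoding of `svc 0`
B_OPCODE = 0x14000000  # unconditional `b`, imm26 in the low 26 bits
B_RANGE = 1 << 27      # signed 26-bit word offset, i.e. +/-128 MiB


def encode_b(src: int, dst: int) -> int:
    """Encode `b dst` placed at address src; raise if dst is out of range."""
    offset = dst - src
    if offset % 4 != 0:
        raise ValueError("branch offset must be a multiple of 4")
    if not -B_RANGE <= offset < B_RANGE:
        raise ValueError("target is outside the +/-128 MiB range of `b`")
    return B_OPCODE | ((offset >> 2) & 0x03FFFFFF)


def unpatched_stub_head(stub_addr: int, syscall_return_addr: int) -> bytes:
    """First two stub words after unpatching: `svc 0`, then a branch back."""
    branch = encode_b(stub_addr + 4, syscall_return_addr)
    return SVC_0.to_bytes(4, "little") + branch.to_bytes(4, "little")


# Hypothetical addresses: stub at 0x2000, syscall site returns to 0x1004.
print(unpatched_stub_head(0x2000, 0x1004).hex())
```

The branch is encoded relative to the second word of the stub, which is why it differs from the branch at `.Lreturn` even though both go to the same address.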
Syscall hook prologue
Once we enter the syscall hook, we need to change how the registers are saved and restore `x8`. Changing the register saving address to a fixed one means that we don't need to waste a register remembering that address anymore. We also want to restore `x8` since we'll be done using this trick and we want to enter the syscall with the right syscall number. This would also prepare us in case we got here with a `clone` call and we want to bail out early.

        bti c // BTI compatible
        mov x15, preload_thread_locals
        // Stash away x30 so that we can have two registers to use again
        // we can't use stub_scratch_2 since we might overwrite the data there
        str x30, [x15, stub_scratch_1 - preload_thread_locals]
        // Move the saving area to the start of scratch_2
        // Do it in the forward order since we know x8 >= x15
        ldr x30, [x8, stub_scratch_2 - preload_thread_locals]
        str x30, [x15, stub_scratch_2 - preload_thread_locals]
        ldr x30, [x8, stub_scratch_2 - preload_thread_locals + 8]
        str x30, [x15, stub_scratch_2 - preload_thread_locals + 8]
        // Restore x8
        movk x8, 0, LSL 16
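The "forward order" comment in the prologue matters because the source and destination of the two-word move can overlap when the syscall number is small. A minimal Python model of this, assuming a flat byte buffer with the fixed save area at offset 0 and the `x8`-based stash at offset `syscall_no` (both names are illustrative, not from the implementation):

```python
def copy_words(mem: bytearray, src: int, dst: int, forward: bool) -> None:
    """Copy two 8-byte words from src to dst, one word at a time."""
    order = (0, 8) if forward else (8, 0)
    for off in order:
        mem[dst + off:dst + off + 8] = mem[src + off:src + off + 8]


def demo(syscall_no: int, forward: bool) -> bytes:
    """Stash 16 bytes at offset syscall_no, then move them to offset 0."""
    mem = bytearray(2048)
    mem[syscall_no:syscall_no + 16] = bytes(range(1, 17))  # saved x15, x30
    copy_words(mem, src=syscall_no, dst=0, forward=forward)
    return bytes(mem[:16])


expected = bytes(range(1, 17))
print(demo(8, forward=True) == expected)   # forward copy survives the overlap
print(demo(8, forward=False) == expected)  # backward copy clobbers the first word
```

With a source address that is never below the destination, copying the lower word first guarantees each word is read before it can be overwritten.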
By the end of the prologue, all registers are back to their original values, except for `x15` and `x30`, which have their old values in `stub_scratch_2`. The stub address is saved in `stub_scratch_1`.
-
Clone handling
Most of the requirements have been laid out already:
- Do not touch `sp` (before or after the syscall)
- Return through `_syscallbuf_final_exit_instruction` (which is just a `ret`)
- Store the return address in `stub_scratch_1` (for signal handling)
- Bonus points for keeping the unwind info valid the whole way through

        cmp x8, 0xdc // SYS_clone
        b.eq .Lclone

    .Lclone:
        // Must not touch sp in this branch.
        // Use x15 to remember the return address since we are only copying
        // the first two elements of stub_scratch_2 for the child.
        ldr x15, [x15, stub_scratch_1 - preload_thread_locals]
        mov x30, 0x70000000 // RR_PAGE_SYSCALL_TRACED
        blr x30
        // stub_scratch_2 content is maintained by rr
        // we need to put the syscall return address in stub_scratch_1
        movz x30, #:abs_g1:stub_scratch_2 // assume 32bit address
        movk x30, #:abs_g0_nc:stub_scratch_2
        str x15, [x30, 16] // stash away stub address
        ldr x15, [x15] // syscall return address
        str x15, [x30, stub_scratch_1 - stub_scratch_2]
        mov x15, x30
        ldr x30, [x15, 16]
        add x30, x30, 8 // actual return address
        b _syscallbuf_final_exit_instruction
-
Stack switching
Once we know that we are not dealing with clone, we can switch to the new stack and save everything to the new one
        ldr w30, [x15, alt_stack_nesting_level - preload_thread_locals]
        cmp w30, 0
        add w30, w30, 1
        str w30, [x15, alt_stack_nesting_level - preload_thread_locals]
        b.ne .Lnest
        ldr x30, [x15, syscallbuf_stub_alt_stack - preload_thread_locals]
        sub x30, x30, 48
        b .Lstackset
    .Lnest:
        sub x30, sp, 48
    .Lstackset:
        // Now x30 points to the new stack with 48 bytes of space allocated
        // Move sp into a normal register. Otherwise we can't store it
        mov x15, sp
        // Save sp to new stack.
        str x15, [x30, 16]
        mov sp, x30
        // sp is switched, x15 and x30 are free to use
        // [stub_scratch_1] holds the stub address
        // Now we need to construct the stack frame, with everything
        // in the scratch area copied over so that we can nest again.
        mov x15, preload_thread_locals
        // load runtime stub address
        ldr x30, [x15, stub_scratch_1 - preload_thread_locals]
        // save stub return address
        str x30, [sp]
        // load syscall return address
        ldr x30, [x30, -8]
        str x30, [sp, 8]
        ldr x30, [x15, stub_scratch_2 - preload_thread_locals]
        str x30, [sp, 24]
        ldr x30, [x15, stub_scratch_2 - preload_thread_locals + 8]
        str x30, [sp, 32]
        // stackframe layout
        // 32: original x30
        // 24: original x15
        // 16: original sp
        // 8: return address to syscall
        // 0: return address to stub
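The stack selection and frame layout above can be summarized in a few lines of Python. This is only a model for reading the assembly, under the assumption that the alternate stack pointer refers to the top of that stack; the `ThreadLocals` class and `switch_stack` function are hypothetical stand-ins, while the field names, the 48-byte frame size and the slot offsets come from the assembly and its layout comment.

```python
FRAME_SIZE = 48
# Frame offsets from the "stackframe layout" comment in the assembly.
OFF_STUB_RET, OFF_SYSCALL_RET, OFF_SP, OFF_X15, OFF_X30 = 0, 8, 16, 24, 32


class ThreadLocals:
    """Hypothetical stand-in for the thread local fields used here."""

    def __init__(self, alt_stack_top: int):
        self.alt_stack_nesting_level = 0
        self.syscallbuf_stub_alt_stack = alt_stack_top


def switch_stack(tl, sp, stub_ret, syscall_ret, x15, x30):
    """Return (new_sp, frame), where frame maps slot offset to saved value."""
    nested = tl.alt_stack_nesting_level != 0
    tl.alt_stack_nesting_level += 1
    # The first entry starts at the top of the alternate stack. A nested
    # entry is already running on it, so it allocates below the current sp.
    base = sp if nested else tl.syscallbuf_stub_alt_stack
    new_sp = base - FRAME_SIZE
    frame = {OFF_STUB_RET: stub_ret, OFF_SYSCALL_RET: syscall_ret,
             OFF_SP: sp, OFF_X15: x15, OFF_X30: x30}
    return new_sp, frame


tl = ThreadLocals(alt_stack_top=0x8000)
sp1, frame1 = switch_stack(tl, 0x7FFF0000, 0x2024, 0x1004, 15, 30)
sp2, frame2 = switch_stack(tl, sp1, 0x3024, 0x1104, 15, 30)
print(hex(sp1), hex(sp2), tl.alt_stack_nesting_level)
```

Saving the original `sp` in the frame is what lets a nested entry unwind back to the outer frame without touching the application stack.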
-
Syscall hook epilogue
The `_syscall_hook_trampoline` restores all the registers to their previous values (again, minus the register for the syscall return value), so we just need to restore the registers we've overwritten by the end of the stack switch, i.e. `x15`, `x30` and `sp`. `x15` and `x30` will be restored when we get back to the stub, so we don't need to restore them here, but we do need to copy their values to `stub_scratch_2` again so that the stub can restore them (without a valid stack, that is still the only memory we can use to restore things; at least this time we don't need to hunt for a register to store the address). We also need to store the return address to `stub_scratch_1` since that will help rr with setting the breakpoint.

        movz x15, #:abs_g1:stub_scratch_2 // assume 32bit address
        movk x15, #:abs_g0_nc:stub_scratch_2
        ldr x30, [sp, 24] // x15
        str x30, [x15]
        ldr x30, [sp, 32] // x30
        str x30, [x15, 8]
        ldr x30, [sp, 8] // syscall return address
        // tell rr breakpoint handling where we are going
        str x30, [x15, stub_scratch_1 - stub_scratch_2]
        ldr x30, [sp] // stub return address
        ldr x15, [sp, 16] // sp
        mov sp, x15
        movz x15, #:abs_g1:stub_scratch_2 // assume 32bit address
        movk x15, #:abs_g0_nc:stub_scratch_2
    _syscallbuf_final_exit_instruction:
        ret
The manual unwind info for the syscall hook is left as an exercise to the reader (see the actual implementation for the answer).