-
Notifications
You must be signed in to change notification settings - Fork 60
Capability Revocation
When an application calls free()
, the contents of the region are best
viewed as toxic junk of which we need to dispose before we can cycle it back
to the application. There are two basic aspects to this:
- before we can slate memory for reuse, there must be no capabilities to it outside the allocator.
- when we give memory back to the application, there must be no tags within it.
Our promise is intended to be quite weak:
- No live capability aliasing; if the application still has a tagged
capability to memory, that memory has not yet been reused even if it has
been
free()
'd. Contrapositively, memory that has been (slated for) reuse will have had all stale authorizing caps' tags cleared.
The basic flow is this:
-
On
free()
, memory is put into a quarantine. -
At some point thereafter, the status of some/all quarantined allocations is translated into a bitmap for the kernel's consumption. This bitmap is currently chosen to have one bit per 16 bytes of memory, thereby defining the granularity of revoke-ability.
See Staging Revocation of an Allocation for the mechanism.
For correctness, entire allocations have to go in to the bitmap (see What is Revocation, Exactly?).
This step may be deferred arbitrarily, and there are perhaps good reasons to do so that we'll talk about in a minute (see Coalescing in Quarantine and Segmented Quarantines).
-
At yet some later point, the allocator requests that the kernel revoke access according to the bitmap generated earlier. This step itself may happen in passes that we'll talk about in another minute (see Multi-pass Revocation).
-
Any allocations that were translated to bitmaps prior to the previous step are now guaranteed to have no inward authorizing capabilities other than those held by the allocator or internal to the kernel (but not caps passed by the application to the kernel!). These regions can be made available for reuse.
-
Prior to handing out a reused region to the application, we must ensure that it has no tags. We should perhaps be kind and in fact zero the whole thing.
Despite the seeming simplicity, there're a number of options for an implementation!
- When revoking, it would be good to have as few capabilities, and
especially pages with capabilities, within dead regions of memory as
possible, so it may make sense to zero upon
free()
, even though the application has access and could unzero it (by mistake or maliciously). Otherwise, the revoker will be visiting pages and cachelines that are certain to be zeroed anyway by the time addresses are reissued to the application. - Despite that we only promise to catch use-after-reallocation, we might
want to catch UAF, too. While we could
munmap()
, we're hoping to have a dedicatedMPROT_QUARANTINE
implementation that doesn't need to be mated with ammap()
ormprotect()
later. This would additionally let us avoid zeroing entire pages onfree()
and also post-revocation. - Revocation unto itself is expensive: it's at least one visit to every capability-bearing page in the system, and possibly many, depending on the concurrency choices made. Thus, we should attempt to do it as little as possible (and so, quarantine) as well as maximize the benefit of revocations done (see Segmented Quarantines).
- But even requesting revocation can be expensive. It may make sense to
defer generating the bitmaps rather than setting them on every
free()
; this is another exciting use ofMPROT_QUARANTINE
.
The revoker is a very simple thing, in rough sketch:
Given: the contents of memory &c and a bitmap of quarantine status For each capability anywhere the application can access, If that capability's *base* is marked as quarantined, Replace it with its de-tagged, 0-permission form.
Specifically, "anywhere the application can access" means:
- User-visible pages of memory
- Processor register files
- Kernel callback sources, specifically
aio
andkqueue
The base of the capability is chosen for two reasons:
- It cannot be out of bounds, unlike the cursor.
- If the allocator properly sets CHERI bounds, the base must, in fact, be within the original allocation. This depends on the fact that two adjacent capabilities cannot be fused.
We leave base, offset, and length of the capability alone, so as to be compatible with some (probably actually undefined behavior) C idioms.
And thus we see the need to set all bits corresponding to the "shadow" of each quarantined allocation: the application might have derived a capability to a sub-object such that the resulting base is not the base of the allocation. Dually, the top may not be the top of the allocation. Despite both possibilities, the cursor may be pointing not only out of bounds but into another allocation.
At present, we believe that managing the need to clear tags prior to reuse is best left to userland. While the kernel will absolutely test any capability found in dead memory (as well as all the live ones), forcing it to ask about the quarantined status of the memory containing every capability as well as the memory to which each capability points sounds more expensive than asking the allocators to manage this aspect.
There are a few pieces to the "translated into a bitmap" step above. First,
we're going to need access to the bitmap for a given piece of the virtual
address space. To ensure that only the allocator can make (or clear)
staging, this access is only given if software can demonstrate that it has a
capability bearing VMMAP
permission to the region in question, as
returned by, for example, mmap()
.
Note
We expect that the allocator clears
VMMAP
permission on all objects it gives out; this was already true of the in-tree jemalloc prior to any design for revocation, and we have applied it to the allocators under study here.
So armed with such a capability, called "arena", a system call
void * __capability arena_shadow; int res = caprevoke_shadow(CAPREVOKE_SHADOW_NOVMMAP, arena, &arena_shadow);
will return a suitable capability to the bitmap. Internally to the revoker,
there are in fact other bitmaps (one, for sealing types, accessible via
CAPREVOKE_SHADOW_OTYPE
, and the other, internal, at page granularity,
deliberately not exposed to userland, for the use of MPROT_QUARANTINE
).
Rather than go through all the tedium of finding the correct bit(s) for an
allocation, this logic has been library-fied in libcheri_caprevoke
.
Before detailing the operation thereof, we must make one final note.
CHERIfied allocators (almost?) universally must be able to re-derive a
VMMAP
-bearing capability to an allocation given only their globals and
the non-VMMAP
-bearing capability held by the application. For somewhat
tricky reasons (see Concurrent Revocation Safety of Bitmap Marking),
libcheri_caprevoke's
method to set bits takes both the application
capability and this re-derived capability, upon which the spatial bounds
must be appropriately set for the allocation in question, as the library
sets bits from the base to top of this capability. So armed, the call
int res = caprev_shadow_nomap_set(arena_shadow, rederived_cap, app_cap);
will set the correct bits. Internally, it strives to write 64-bit words
(see Are Wide Bitmap Stores Always Safe?) and uses up to two LL/SC loops
to ensure that partial updates of words shared with other allocations are
not clobbered. The caller must always check the result; a non-zero value
indicates an attempted double-free()
and should cause free()
to
return in a mostly-idempotent way (statistical counters are fine to adjust;
changes to the heap are almost surely not fine). (See Concurrent
Revocation Safety of Bitmap Marking for more details.)
After revocation, the same library can be used to clear the bits prior to reissue:
caprev_shadow_nomap_clear(arena_shadow, rederived_cap);
A feature still in the works is to have the kernel manage staging entire pages internally. The kernel knows about revocation and so should be able to do the right thing here. Once this is implemented, this section will be updated.
A 4096-byte page of 1-bit-per-16-byte flags covers 512 KiB of the address
space. We believe that this, or a near power-of-two multiple, is a good
threshold for distinguishing between "large" and "small" allocations when
trading off between caprev_shadow_nomap_set
and MPROT_QUARANTINE
.
The exact cross-over point has yet to be measured.
A little about the mechanism is probably in order. We intend that the
allocator can, rather than calling caprev_shadow_nomap_set
(or the
_raw
variant) for large regions, instead call
mprotect(MPROT_QUARANTINE)
on the region in question. This will remove
all pages within the indicated region from the process's address space and
(atomically) convert it to the equivalent of a MAP_GUARD
region:
impossible to back with pages and not reusable as address space for later
(unhinted) mmap()
calls. The kernel, internally, has a coarse-grained
revocation bitmap, at one bit per page, that the revoker also tests during
revocation scans; these pages will have their shadow set.
After revocation (so that the coarse bitmap is accurate), but before returning to userland, the kernel can quickly iterate all such regions and clear their bitmaps and revert the address space regions to anonymous memory.
Staging revocation requests early is advantageous: the next revocation
allows them to be putback into circulation. However, small allocations incur
overheads during staging: only a few bits get changed and the full cost of
atomics is borne. Large allocations amortize atomics across potentially
multiple full-word writes and, if sufficiently large, may (eventually) be
better off being handled by MADV_QUARANTINE
.
It is therefore a policy and allocator mechanism question: should free()
regions be coalesced in quarantine prior to staging? If so, under what
policy? Tentatively, we suggest that a dual-queue design may be the right
one:
- on free, small allocations that do not coalesce with existing unstaged allocations are placed at the end of the first LRU queue; large allocations that do not coalesce are placed at the end of the second LRU queue.
- whenever two or three allocations coalesce in the first queue, if the result is still small, they are replaced with their merge at the end of the first queue. If the result is large, they are removed and their merge is added to the end of the second queue.
- whenever an allocation coalesces with one or two objects in the second queue, they are removed and the merger is put on the end of the second queue.
When the quarantine revocation policy demands that memory be revoked for reuse, the second queue (of large objects) is drained in order (i.e., least-recently-coalesced large allocations are staged first) until the policy believes that quarantine will be below threshold. If that fails to satiate the policy, the first queue is drained in order.
We assume that whatever interlocking protects the quarantine coalescing
metadata structures can be extended to cover staging to the bitmap. That
is, only one thread will attempt to stage a coalesced collection of
allocations to the bitmap. Allocators should avail themselves of the "raw"
interface in libcheri_caprevoke
. To eliminate concerns of
representability of coalesced regions as capabilities, this raw interface
acts directly on raw vaddr_t
cursors into the heap; heap_start
points at the first byte of the region and heap_end
the last:
caprev_shadow_nomap_set_raw(arena_shadow, heap_start, heap_end); caprev_shadow_nomap_clear_raw(arena_shadow, heap_start, heap_end);
caprev_shadow_nomap_set_raw
returns void
as it cannot fail (except
to crash if improperly invoked).
These raw bitmap-accessing functions do not use atomics internally, as we presume that allocators obtain their backing stores in units of whole pages, and a 4K page has a 32-byte shadow in the bitmap, which is more than sufficient to ensure that each 64-bit word in the bitmap is associated with only one allocator.
Once, at allocator startup, a revocation-aware allocator should request a capability to some kernel-maintained counters:
volatile const struct caprevoke_info *cri; int res = caprevoke_shadow(CAPREVOKE_SHADOW_INFO_STRUCT, NULL, (void **)&cri); assert(res == 0);
Having filled out the bitmap "form" of requests for revocation, revocation itself may be performed by the following sequence:
/* Ensure all cores see the ones in bitmap */ atomic_thread_fence(memory_order_acq_rel); /* And ensure that this read definitely happens after the release */ caprevoke_epoch start_epoch = cri->epoch_enqueue; while(!caprevoke_epoch_clears(cri->epoch_dequeue, start_epoch)) { struct caprevoke_stats crst; caprevoke(CAPREVOKE_LAST_PASS, start_epoch, &crst); } /* Zero the bitmap requests for our quarantined allocations */ /* Ensure all cores see the zeros; it's fine for our loads to move earlier */ atomic_thread_fence(memory_order_release);
caprevoke()
is a synchronous call: it will not return until revocation
is finished. Upon return, crst
contains useful statistics about the
work done (if the kernel is compiled with CHERI_CAPREVOKE_STATS
). The
names of the fields within struct caprevoke_info
may be mysterious,
but will hopefully become clear in the next section.
If it is possible that multiple threads are engaged in revocation, we need to defend against the possibility that one was doing so while we were busy filling out the bitmap, which is why we needed the memory fence: we had to ensure that our stores were visible to all CPU cores.
In almost all cases, the first caprevoke
is sufficient and the while
loop does not run its body more than once. However, if a revocation is now
in progress or, as we will say later, "an epoch is open", the loop will walk
the full caprevoke
state machine to the desired point, where we can be
sure that everything we have just written to the bitmap form has been
revoked.
Suppose we have several allocators in a program; snmalloc is a good example
as it has a dedicated allocator per-thread, but one could just have several
truly different allocators encapsulating everything from mmap()
to managing
their objects' revocation. In such a case, we would like to share the
work done by revocation: because revocation necessarily is a global (well,
per-address-space) operation, if allocator A request revocation, any
requests staged to the bitmap by allocator B will also be revoked! The
challenge is for B to know that this has taken place, asynchronously.
Towards this end, revocation divides time into "epochs." There are two, related questions one might ask about epochs:
- by which epoch had staging a revocation certainly finished?
- given the current epoch, which staged revocations are certainly done?
In the earlier story, epochs did not matter so much: certainly all staging
had been done prior to the invocation of caprevoke()
and by the end of the
call, all revocation had been done. Epoch counters are a kind of lazy
synchronization with other threads.
To take advantage of epochs, we divide our quarantine into segments, filled
sequentially. If we have just filled a filling
segment of our
quarantine, we may, rather than revoking now, merely label the segment
structure with the current epoch and add it to the quarantine:
atomic_thread_fence(memory_order_acq_rel); filling->epoch_enq = cri->epoch_enqueue; /* append filling as the youngest quarantine node */
Because time has, presumably, elapsed whilest this segment was being filled, it makes seense to probe at the oldest quarantine segment(s) and assess their revocation status:
while(caprevoke_clears(cri->epoch_dequeue, quarantine->oldest->epoch_enq)) { done = quarantine->oldest; quarantine->oldest = quarantine->oldest->next; /* de-quarantine all allocations in done; free done itself */ } /* * Ensure that all cores see the zeros written to the bitmap by * dequarantining. This may be fused into existing allocator * serialization, if present. */ atomic_thread_fence(memory_order_release);
Pay careful attention to the use of epoch_enqueue
to label the
now-filled segment and epoch_dequeue
to do the comparison for removing
segments. You are welcome to treat caprevoke_epoch_clears()
as a black
box (and perhaps should; but see So What Are Epochs, Anyway? if you're
curious.)
At this point, the allocator should re-evaluate its quarntine policy: it is
possible that the while
loop did not run and so the quarantine has
merely grown by the addition of filling
. If it finds that it is over
quota, then it should revoke the oldest segment it holds:
caprevoke_epoch start_epoch = quarantine->oldest->epoch_enq; while(!caprevoke_epoch_clears(cri->epoch_dequeue, start_epoch)) { struct caprevoke_stats crst; caprevoke(CAPREVOKE_LAST_PASS, start_epoch, &crst); }
If invoked, this code fragment will, by its exit, ensure that some number, but not necessarily all, of segments of the quarantine are ready to be dequeued. The dequeing loop above will certainly then run its body at least once.
We now have a policy question! It is advantageous to frequently know the label we should use for the contents of our quarantine, as that lets us maximize the work we can piggy-back off other revocations. It's also advantageous to not ask too often, as there is overhead in the queue of segments.
Upon marking a region of the address space as MADV_QUARANTINE
'd, the
kernel should obtain the (inerlocked!) epoch counter at the end of its
operation (i.e., just before returning to userland) and mark the region with
this value.
When the revoker encounters a MADV_QUARANTINE
'd region whose epoch is
sufficiently far in the past that this pass clears it, it may, after
finishing this pass and before advancing the epoch clock, then reset the
region to being ordinary, anonymous address space.
The rough sketch of revocation makes sense as an atomic action, but that requires that we stop the world for the duration of a scan through all of memory, a rude proposal at best. We therefore divide revocation into a series of passes:
- an "opening" pass, done by a single thread without the world stopped. This visits all capability-bearing pages.
- zero or more "middle" passes, which visit "recently capability-dirty" pages (i.e., pages to which capabilities have been stored since they were visited by the opening pass).
- a "closing" or "last" pass, which runs with the world stopped, but, like the middle passes, visits only the "recently capability-dirty" pages (but also visits some kernel state while the world is stopped).
If no flags are specified, caprevoke()
does an opening or middle pass.
CAPREVOKE_LAST_PASS
indicates our desire to (open and) close revocation.
If caprevoke()
must open, as well as close, it will do the opening pass
without the world stopped unless CAPREVOKE_LAST_NO_EARLY
is also given.
If CAPREVOKE_LAST_PASS
encounters an open epoch, it will first do a
"middle" pass without the world stopped, unless CAPREVOKE_LAST_NO_EARLY
is also given. This CAPREVOKE_LAST_NO_EARLY
flag is almost surely only
useful for experimentation.
It is a policy question as to when an allocator should open and close a multi-pass revocation or run a middle pass. We believe the default policy, of opening and immediately closing, is a sensible one, but experimentation is merited. In all cases, the loop exhibited above suffices to gate segments' being de-quarantined.
We can, if we like, replace the earlier example of queueing of a full
segment into quarantine with just about any operation we like. The flags
CAPREVOKE_NO_WAIT_OK | CAPREVOKE_IGNORE_START | CAPREVOKE_LAST_NO_EARLY
will be essentially equivalent to the shared-state access above: no
revocation will take place. We can also "hurry along" an open revocation,
but not open a new one, with CAPREVOKE_NO_WAIT_OK | CAPREVOKE_ONLY_IF_OPEN
| CAPREVOKE_LAST_PASS
, which will close an open revocation unless there is
already a revoker doing work towards that end (though it may just be doing a
"middle" pass).
Epochs are divided in twain to capture the semantics of multi-pass
revocation: we refer to "open" epochs and "closed" ones. Their ordering is
as you might expect, closed becomes open becomes the next closed becomes the
next open, &c. caprevoke_epoch_clears()
encapsulates the logic
necessary to make sure that staging in an open epoch waits for the closing
pass after the next opening pass.
Resetting MADV_QUARANTINE
regions must happen in a second pass over the
address space after the "closing" pass: the bitmap must remain intact for
the duration of the "closing" pass. Fortunately, this is merely iterating
the map entries, and so should be quick; if necessary, the map entries can
be specificially chained together for faster iteration.
Epochs, like time in general in multi-threaded systems, are complicated
things. An epoch identifier refers to a point in time, or, more properly,
in the partial order of memory transactions, prior to which some thing(s)
certainly had taken place. The intended use of struct caprevoke_info
epoch_enqueue
and epoch_dequeue
are:
-
epoch_enqueue
is the latest epoch guaranteed to be before all reads of the current or next revocation. -
epoch_dequeue
is the earliest epoch guaranteed to be after all past revocations.
The sole comparator we care about is caprevoke_epoch_clears(now, then)
.
This is true if the epoch value now
releases all staged revocation
requests completed prior to then
. This function encapsulates the
following reasoning:
a staged revocation request is certainly done if revocation has both begun and then ended after its staging.
(Importantly, it does not suffice for one revocation to end and another to begin.) The implementation details are interesting, but this document is already too long.
All of this is encapsulated in libcheri_caprevoke
and can be taken as
black box.
There are several worrying possibilities that might arise when we are staging allocations to the kernel's bitmap:
- Multiple allocations' shadows may occur within the same byte or word.
- This is a double-
free()
and the allocation is already staged. - This is a
free()
using a capability whose allocation is staged for revocation. - A revocation pass might happen while we are in the middle of filling out the bitmap for a given structure. If we have filled in the bottom bit, then from the kernel's perspective, any capability we hold to the object whose base has not advanced is worthy of revocation.
caprev_shadow_nomap_set
takes both a re-derived capability to the
allocation to be staged, which should bear VMMAP
permission so as to not
be subject to revocation, and the capability provided by the application to
free()
. This re-derived capability is used to determine which bits to
set, and so addresses the last point, regardless of how the revoker
ultimately mangle capabilities it finds worthy of revocation.
The first point is addressed by using LL/SC to implement atomic ORs for the first and last 64-bit word occupied by the shadow of the allocation being staged. All intervening words are assumed to be property of this allocation and are simply written to with all bits set; this will be true of an allocator without placement bugs that generate overlapping objects.
The second point, of double-``free()``s is addressed by extending the LL/SC for the first word test, between LL and SC, that none of the bits about to be set are already set. If any are, the call signals a failure.
The third point is the most tricky. Just because the capability passed to
free()
has been staged does not mean that the revocation has already
happened at the time of free()
and so there is an inherent risk of a
TOCTTOU race. Moreover, the revocation of this allocation and clearing of
its shadow in the bitmap may take place between free()
and the check
inside the LL/SC mentioned above. In order to properly interlock with the
revoker, the LL/SC for the first word is again extended to verify the tag
of the application-sourced capability between the LL and SC instructions,
thereby closing the TOCTTOU window and ensuring that double-``free()``s
which straddle epochs remain idempotent.
Finally, because caprevoke()
as we have used it here always performs
interlocking against other caprevoke()
calls, we can be certain that our
stores to the bitmap are visible to other threads and CPUs engaged in
revocation in time for the test in caprevoke_epoch_clears()
to be
correct. Closing ("last") passes additionally single-thread the world
during their operation and so ensure that all threads on all cores see the
tag-clearing as having taken place.
The libcheri_caprevoke
helpers always write 64-bit words to the kernel's
bitmap. A 64-bit word corresponds to an aligned, 1024-byte segment of the
address space. As all mmap()
results are at least 4096-byte aligned and
at least 4096 bytes long, it is impossible that an aligned 64-bit bitmap
word cross a permission boundary. While, in principle, one could request
bitmap access capabilities for sub-page regions, we assume that our
allocators do not.
The revoker tries very hard to not give callers sharp edges. However, for
experimentation's sake, we expose some interesting behaviours. In
particular, there are a series of CAPREVOKE_JUST_*
flags that will cause
caprevoke()
to forego its usual operation, including interlocking with
other callers of caprevoke()
. While most of these are useful only to
test various aspects of the implementation, one has proven tempting to use
with general revocation and deserves special attention.