Capability Revocation
Hello fellow CheriFree-rs,

I'm sorry that I have not yet produced actual documentation for the caprevoke()
family of APIs. In lieu of the real deal, but hopefully more useful than the
test cases, here are some notes that might help to get everyone on the same
(virtual) page. Also, hopefully, some of this forms the basis for what goes in
the paper.
When an application calls free(), the contents of the region are best viewed
as toxic junk of which we need to dispose before we can cycle it back to the
application. There are two basic aspects to this:
- before we can slate memory for reuse, there must be no capabilities to it outside the allocator.
- when we give memory back to the application, there must be no tags within it.
Our promise is intended to be quite weak:
- No live capability aliasing; if the application still has a tagged
  capability to memory, that memory has not yet been reused even if it has
  been free()'d. Contrapositively, memory that has been (slated for) reuse
  will have had all stale authorizing caps' tags cleared.
The basic flow is this (a rough sketch in C follows the list):
- On free(), memory is put into a quarantine.
- At some point thereafter, the status of some/all quarantined allocations is
  translated into a bitmap for the kernel's consumption. This bitmap is
  currently chosen to have one bit per 16 bytes of memory, thereby defining
  the granularity of revoke-ability. See Staging Revocation of an Allocation
  for the mechanism. For correctness, entire allocations have to go into the
  bitmap (see What is Revocation, Exactly?). This step may be deferred
  arbitrarily, and there are perhaps good reasons to do so that we'll talk
  about in a minute (see Coalescing in Quarantine and Segmented Quarantines).
- At yet some later point, the allocator requests that the kernel revoke
  access according to the bitmap generated earlier. This step itself may
  happen in passes that we'll talk about in another minute (see Multi-pass
  Revocation).
- Any allocations that were translated to bitmaps prior to the previous step
  are now guaranteed to have no inward authorizing capabilities other than
  those held by the allocator or internal to the kernel (but not caps passed
  by the application to the kernel!). These regions can be made available for
  reuse.
- Prior to handing out a reused region to the application, we must ensure
  that it has no tags. We should perhaps be kind and in fact zero the whole
  thing.
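Concretely, here is a minimal, hedged sketch of that lifecycle from the
allocator's side, compressing all five steps into one function for
illustration. Only the caprevoke()/libcheri_caprevoke calls are the real
interface; the header name, arena_shadow, and rederive() are assumptions
explained later in these notes, and a real allocator would defer and batch
steps 2-5 rather than doing them on every free().

    #include <cheri/caprevoke.h>   /* header name assumed; adjust to your tree */

    void
    toy_free(void * __capability app_cap)
    {
        /* Step 1: quarantine.  Step 2: stage the allocation's shadow bits,
         * using a re-derived, VMMAP-bearing capability (see below). */
        void * __capability alloc_cap = rederive(app_cap);
        (void)caprev_shadow_nomap_set(arena_shadow, alloc_cap, app_cap);

        /* Step 3: ask the kernel to revoke everything staged so far. */
        struct caprevoke_stats crst;
        caprevoke(CAPREVOKE_LAST_PASS | CAPREVOKE_IGNORE_START, 0, &crst);

        /* Step 4: no stale capabilities to the region survive outside the
         * allocator.  Step 5: clear the shadow, zero the region, and return
         * it to the allocator's free lists (elided). */
        caprev_shadow_nomap_clear(arena_shadow, alloc_cap);
    }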
Despite the seeming simplicity, there are a number of options for an implementation!
- When revoking, it would be good to have as few capabilities, and especially
  pages with capabilities, within dead regions of memory as possible, so it
  may make sense to zero upon free(), even though the application has access
  and could unzero it (by mistake or maliciously). Otherwise, the revoker
  will be visiting pages and cachelines that are certain to be zeroed anyway
  by the time addresses are reissued to the application.
- Although we only promise to catch use-after-reallocation, we might want to
  catch UAF, too. While we could munmap(), we're hoping to have a dedicated
  MPROT_QUARANTINE implementation that doesn't need to be mated with a
  mmap() or mprotect() later. This would additionally let us avoid zeroing
  entire pages on free() and also post-revocation.
- Revocation unto itself is expensive: it's at least one visit to every
  capability-bearing page in the system, and possibly many, depending on the
  concurrency choices made. Thus, we should attempt to do it as little as
  possible (and so, quarantine) as well as maximize the benefit of
  revocations done (see Segmented Quarantines).
- But even requesting revocation can be expensive. It may make sense to defer
  generating the bitmaps rather than setting them on every free(); this is
  another exciting use of MPROT_QUARANTINE.
The revoker is a very simple thing, in rough sketch:
    Given: the contents of memory &c and a bitmap of quarantine status
    For each capability anywhere the application can access,
        If that capability's *base* is marked as quarantined,
            Replace it with its de-tagged, 0-permission form.
Specifically, "anywhere the application can access" means:
- User-visible pages of memory
- Processor register files
- Kernel callback sources, specifically aio and kqueue
The base of the capability is chosen for two reasons:
- It cannot be out of bounds, unlike the cursor.
- If the allocator properly sets CHERI bounds, the base must, in fact, be within the original allocation. This depends on the fact that two adjacent capabilities cannot be fused.
We leave base, offset, and length of the capability alone, so as to be compatible with some (probably actually undefined behavior) C idioms.
And thus we see the need to set all bits corresponding to the "shadow" of each quarantined allocation: the application might have derived a capability to a sub-object such that the resulting base is not the base of the allocation. Dually, the top may not be the top of the allocation. Despite both possibilities, the cursor may be pointing not only out of bounds but into another allocation.
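For a concrete (hypothetical) example of how a capability's base can sit
strictly inside an allocation, using the CHERI compiler builtin for narrowing
bounds and assuming a pure-capability program:

    char *p = malloc(64);
    /* The application narrows to a sub-object: q's base is p + 16, not p. */
    char *q = __builtin_cheri_bounds_set(p + 16, 16);

    free(p);
    /* If the allocator set only the shadow bit for p's own base, the revoker
     * would never clear q, whose base lies 16 bytes into the allocation.
     * Marking the shadow for all of [p, p + 64) catches q as well. */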
At present, we believe that managing the need to clear tags prior to reuse is best left to userland. While the kernel will absolutely test any capability found in dead memory (as well as all the live ones), forcing it to ask about the quarantined status of the memory containing every capability as well as the memory to which each capability points sounds more expensive than asking the allocators to manage this aspect.
There are a few pieces to the "translated into a bitmap" step above. First,
we're going to need access to the bitmap for a given piece of the virtual
address space. To ensure that only the allocator can make (or clear) staging,
this access is only given if software can demonstrate that it has a
capability bearing VMMAP permission to the region in question, as returned
by, for example, mmap().
Note
We expect that the allocator clears VMMAP permission on all objects it gives
out; this was already true of the in-tree jemalloc prior to any design for
revocation, and we have applied it to the allocators under study here.
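As an illustration of that convention, a hedged sketch of stripping the
permission before an allocation is returned; the permission macro name
(CHERI_PERM_CHERIABI_VMMAP) is an assumption, so check your tree's cheri
headers for the current spelling:

    /* Sketch only: drop the VMMAP-class permission so the application's
     * capability cannot be used to obtain shadow-bitmap access. */
    static inline void * __capability
    drop_vmmap_perm(void * __capability p)
    {
        return __builtin_cheri_perms_and(p,
            ~(size_t)CHERI_PERM_CHERIABI_VMMAP);
    }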
So armed with such a capability, called "arena", a system call
    void * __capability arena_shadow;
    int res = caprevoke_shadow(CAPREVOKE_SHADOW_NOVMMAP, arena, &arena_shadow);
will return a suitable capability to the bitmap. Internally to the revoker,
there are in fact other bitmaps (one, for sealing types, accessible via
CAPREVOKE_SHADOW_OTYPE, and the other, internal, at page granularity,
deliberately not exposed to userland, for the use of MPROT_QUARANTINE).
Rather than go through all the tedium of finding the correct bit(s) for an
allocation, this logic has been library-fied in libcheri_caprevoke.
Before detailing the operation thereof, we must make one final note.
CHERIfied allocators (almost?) universally must be able to re-derive a
VMMAP-bearing capability to an allocation given only their globals and the
non-VMMAP-bearing capability held by the application. For somewhat tricky
reasons (see Concurrent Revocation Safety of Bitmap Marking),
libcheri_caprevoke's method to set bits takes both the application capability
and this re-derived capability, upon which the spatial bounds must be
appropriately set for the allocation in question, as the library sets bits
from the base to the top of this capability. So armed, the call
    int res = caprev_shadow_nomap_set(arena_shadow, rederived_cap, app_cap);
will set the correct bits. Internally, it strives to write 64-bit words (see
Are Wide Bitmap Stores Always Safe?) and uses up to two LL/SC loops to ensure
that partial updates of words shared with other allocations are not
clobbered. The caller must always check the result; a non-zero value
indicates an attempted double-free() and should cause free() to return in a
mostly-idempotent way (statistical counters are fine to adjust; changes to
the heap are almost surely not fine). (See Concurrent Revocation Safety of
Bitmap Marking for more details.)
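For instance, a free() implementation's handling of that result might look
something like the following sketch; the quarantine bookkeeping call is
invented:

    int res = caprev_shadow_nomap_set(arena_shadow, rederived_cap, app_cap);
    if (res != 0) {
        /* Attempted double-free: the shadow bits, or the application
         * capability's tag, say this allocation is already staged.  Bump a
         * statistics counter if desired, but leave the heap untouched. */
        return;
    }
    /* Otherwise, record the allocation for reuse once a covering revocation
     * has completed (hypothetical bookkeeping). */
    quarantine_record(rederived_cap, app_cap);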
After revocation, the same library can be used to clear the bits prior to reissue:
    caprev_shadow_nomap_clear(arena_shadow, rederived_cap);
A feature still in the works is to have the kernel manage staging entire pages internally. The kernel knows about revocation and so should be able to do the right thing here. Once this is implemented, this section will be updated.
A 4096-byte page of 1-bit-per-16-byte flags covers 512 KiB of the address
space. We believe that this, or a near power-of-two multiple, is a good
threshold for distinguishing between "large" and "small" allocations when
trading off between caprev_shadow_nomap_set and MPROT_QUARANTINE. The exact
cross-over point has yet to be measured.
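A hedged sketch of that trade-off, with LARGE_QUARANTINE_THRESHOLD standing
in for the unmeasured cross-over point, addr/len/rederived_cap/app_cap
assumed to describe the freed region, and MPROT_QUARANTINE assumed to be
passed as mprotect()'s protection argument:

    #define LARGE_QUARANTINE_THRESHOLD (512 * 1024)  /* hypothetical tuning knob */

    if (len >= LARGE_QUARANTINE_THRESHOLD) {
        /* Large: hand whole pages to the kernel's page-granularity shadow. */
        mprotect(addr, len, MPROT_QUARANTINE);
    } else {
        /* Small: set the 16-byte-granularity shadow bits from userland. */
        (void)caprev_shadow_nomap_set(arena_shadow, rederived_cap, app_cap);
    }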
A little about the mechanism is probably in order. We intend that the
allocator can, rather than calling caprev_shadow_nomap_set (or the _raw
variant) for large regions, instead call mprotect(MPROT_QUARANTINE) on the
region in question. This will remove all pages within the indicated region
from the process's address space and (atomically) convert it to the
equivalent of a MAP_GUARD region: impossible to back with pages and not
reusable as address space for later (unhinted) mmap() calls. The kernel,
internally, has a coarse-grained revocation bitmap, at one bit per page, that
the revoker also tests during revocation scans; these pages will have their
shadow set.
After revocation (so that the coarse bitmap is accurate), but before returning to userland, the kernel can quickly iterate all such regions and clear their bitmaps and revert the address space regions to anonymous memory.
Staging revocation requests early is advantageous: the next revocation allows
them to be put back into circulation. However, small allocations incur
overheads during staging: only a few bits get changed and the full cost of
atomics is borne. Large allocations amortize atomics across potentially
multiple full-word writes and, if sufficiently large, may (eventually) be
better off being handled by MADV_QUARANTINE.
It is therefore a policy and allocator mechanism question: should free()'d
regions be coalesced in quarantine prior to staging? If so, under what
policy? Tentatively, we suggest that a dual-queue design may be the right
one:
- on free, small allocations that do not coalesce with existing unstaged allocations are placed at the end of the first LRU queue; large allocations that do not coalesce are placed at the end of the second LRU queue.
- whenever two or three allocations coalesce in the first queue, if the result is still small, they are replaced with their merge at the end of the first queue. If the result is large, they are removed and their merge is added to the end of the second queue.
- whenever an allocation coalesces with one or two objects in the second queue, they are removed and the merger is put on the end of the second queue.
When the quarantine revocation policy demands that memory be revoked for reuse, the second queue (of large objects) is drained in order (i.e., least-recently-coalesced large allocations are staged first) until the policy believes that quarantine will be below threshold. If that fails to satiate the policy, the first queue is drained in order.
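A hedged sketch of the bookkeeping that dual-queue policy implies; struct
qext, the queue heads, the SMALL_LIMIT constant, and the
stage_extent()/quarantine_below_threshold() helpers are all invented for
illustration, and coalescing detection itself is elided:

    #include <sys/queue.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define SMALL_LIMIT (512 * 1024)          /* hypothetical small/large boundary */

    struct qext {                             /* one (possibly coalesced) dead extent */
        uintptr_t start, end;
        TAILQ_ENTRY(qext) link;
    };
    static TAILQ_HEAD(, qext) small_q = TAILQ_HEAD_INITIALIZER(small_q);
    static TAILQ_HEAD(, qext) large_q = TAILQ_HEAD_INITIALIZER(large_q);

    /* Called whenever an extent enters quarantine or grows by coalescing:
     * small results live at the tail of small_q, large ones at the tail of
     * large_q, so each queue stays ordered by time of last coalesce. */
    static void
    quarantine_enqueue(struct qext *e)
    {
        if (e->end - e->start >= SMALL_LIMIT)
            TAILQ_INSERT_TAIL(&large_q, e, link);
        else
            TAILQ_INSERT_TAIL(&small_q, e, link);
    }

    /* When the policy wants quarantine shrunk: stage large extents first
     * (oldest first), then small ones, until the policy is satisfied. */
    static void
    quarantine_stage_some(void)
    {
        struct qext *e;

        while (!quarantine_below_threshold() &&
            (e = TAILQ_FIRST(&large_q)) != NULL) {
            TAILQ_REMOVE(&large_q, e, link);
            stage_extent(e);                  /* e.g. caprev_shadow_nomap_set_raw */
        }
        while (!quarantine_below_threshold() &&
            (e = TAILQ_FIRST(&small_q)) != NULL) {
            TAILQ_REMOVE(&small_q, e, link);
            stage_extent(e);
        }
    }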
We assume that whatever interlocking protects the quarantine coalescing
metadata structures can be extended to cover staging to the bitmap. That is,
only one thread will attempt to stage a coalesced collection of allocations
to the bitmap. Allocators should avail themselves of the "raw" interface in
libcheri_caprevoke. To eliminate concerns of representability of coalesced
regions as capabilities, this raw interface acts directly on raw vaddr_t
cursors into the heap; heap_start points at the first byte of the region and
heap_end the last:
    caprev_shadow_nomap_set_raw(arena_shadow, heap_start, heap_end);
    caprev_shadow_nomap_clear_raw(arena_shadow, heap_start, heap_end);
caprev_shadow_nomap_set_raw returns void as it cannot fail (except to crash
if improperly invoked).
These raw bitmap-accessing functions do not use atomics internally, as we presume that allocators obtain their backing stores in units of whole pages, and a 4K page has a 32-byte shadow in the bitmap, which is more than sufficient to ensure that each 64-bit word in the bitmap is associated with only one allocator.
Having filled out the bitmap "form" of requests for revocation, in the
simplest setting, where only one form of caprevoke() is used, revocation
itself may be performed by just calling
    struct caprevoke_stats crst;
    caprevoke(CAPREVOKE_LAST_PASS | CAPREVOKE_IGNORE_START, 0, &crst);
to do all the work.
Warning
Importantly, this is only certain to suffice if no other forms of caprevoke()
calls are used (i.e., no other variants of the flag word are given anywhere
in the program)! As will hopefully become clear later, we are relying on
there being no revocation in progress when we make this call here.
caprevoke() is a synchronous call: it will not return until revocation is
finished. Upon return, crst contains useful statistics about the work done
and some "epoch" counters, to which we now turn our attention.
If it is possible that multiple threads are engaged in revocation, we need to defend against the possibility that one was doing so while we were busy filling out the bitmap. Without getting into the details of why, it is almost surely better to use this code fragment instead of the above in anything but test programs:
    struct caprevoke_stats crst;
    caprevoke_epoch_t start_epoch;

    caprevoke(CAPREVOKE_LAST_PASS | CAPREVOKE_IGNORE_START | CAPREVOKE_NO_WAIT_OK,
        0, &crst);
    start_epoch = crst.epoch_init;   /* label for the staging we just completed */

    while (__builtin_expect(
        !caprevoke_epoch_clears(crst.epoch_fini, start_epoch), 0)) {
        caprevoke(CAPREVOKE_LAST_PASS, start_epoch, &crst);
    }
In almost all cases, the first caprevoke is sufficient and the while loop
does not run its body. However, if a revocation is now in progress or, as we
will say later, "an epoch is open", the loop will walk the full caprevoke
state machine to the desired point, where we can be sure that everything we
have just written to the bitmap form has been revoked.
Suppose we have several allocators in a program; snmalloc is a good example
as it has a dedicated allocator per-thread, but one could just have several
truly different allocators encapsulating everything from mmap() to managing
their objects' revocation. In such a case, we would like to share the work
done by revocation: because revocation necessarily is a global (well,
per-address-space) operation, if allocator A requests revocation, any
requests staged to the bitmap by allocator B will also be revoked! The
challenge is for B to know that this has taken place, asynchronously.
Towards this end, revocation divides time into "epochs." There are two related questions one might ask about epochs:
- by which epoch had staging a revocation certainly finished?
- given the current epoch, which staged revocations are certainly done?
In the earlier story, epochs did not matter so much: certainly all staging
had been done prior to the invocation of caprevoke() and, by the end of the
call, all revocation had been done. Epoch counters are a kind of lazy
synchronization with other threads.
To take advantage of epochs, we divide our quarantine into segments, filled sequentially. If there are no existing segments of our quarantine and we have just filled one, we may call
    struct caprevoke_stats crst;
    caprevoke(CAPREVOKE_LAST_PASS | CAPREVOKE_IGNORE_START, 0, &crst);
as before to revoke everything in this segment and proceed as before. However, we can also invoke
    struct caprevoke_stats crst;
    caprevoke(CAPREVOKE_NO_WAIT_OK | CAPREVOKE_IGNORE_START | CAPREVOKE_LAST_NO_EARLY,
        0, &crst);
(the precise meanings of the flags will be explained later; for now, just
roll with it). This does not do any revocation; instead, it merely reports
the epochs in crst. We can now label the filled segment of our quarantine as
having been fully staged by the reported crst.epoch_init. On the other hand,
suppose that there are existing segments in the quarantine, of which
earliest_filled is the oldest, and earliest_filled->epoch is the label it has
been given; suppose further that we have just filled filling. We can then run
    struct caprevoke_stats crst;
    caprevoke(CAPREVOKE_LAST_PASS, earliest_filled->epoch, &crst);
    filling->epoch = crst.epoch_init;
to (try to) revoke the earliest_filled segment and label the just-filled one.
(Again, we are making some assumptions here about other caprevoke() calls; so
long as everyone uses this or the above epoch-fetching forms, this will,
indeed, revoke all allocations within the earliest_filled segment.) We should
then add filling to the queue and then pull segments that have certainly been
revoked out of the queue, using the function caprevoke_epoch_clears():
    filling->next = NULL;
    last_filled->next = filling;    /* last_filled might be earliest_filled */
    while ((earliest_filled != NULL) &&
        caprevoke_epoch_clears(crst.epoch_fini, earliest_filled->epoch)) {
        done = earliest_filled;
        earliest_filled = earliest_filled->next;
        /* de-quarantine all allocations in done; free done itself */
    }
Pay careful attention to the use of epoch_init to label the now-filled
segment and epoch_fini to do the comparison for removing segments. You are
welcome to treat caprevoke_epoch_clears() as a black box (and perhaps should;
but see So What Are Epochs, Anyway? if you're curious).
Because we did not set the CAPREVOKE_IGNORE_START flag and passed in the
label of the earliest filled quarantine segment, this caprevoke() call may
not have had to do any revocation at all before it could return to us with
all allocations quarantined in earliest_filled certainly done. If it did no
work, then the subsequent loop will release earliest_filled and perhaps no
other segments of the quarantine. If, on the other hand, it was necessary to
do revocation, the loop will release all segments, earliest_filled through
filling.
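Pulling those pieces together, here is a hedged sketch of a "push one
segment, reap what's done" routine; struct qseg and the two list-head
variables are invented, and the allocations inside filling are assumed to
have been staged to the bitmap before this is called:

    struct qseg {
        caprevoke_epoch_t epoch;      /* label: crst.epoch_init at push time */
        struct qseg *next;
        /* ... the allocations quarantined in this segment ... */
    };
    static struct qseg *earliest_filled, *last_filled;

    static void
    quarantine_push_and_reap(struct qseg *filling)
    {
        struct caprevoke_stats crst;

        if (earliest_filled == NULL) {
            /* Nothing outstanding: just fetch the epoch counters. */
            caprevoke(CAPREVOKE_NO_WAIT_OK | CAPREVOKE_IGNORE_START |
                CAPREVOKE_LAST_NO_EARLY, 0, &crst);
        } else {
            /* Try to revoke the oldest segment while labeling the newest. */
            caprevoke(CAPREVOKE_LAST_PASS, earliest_filled->epoch, &crst);
        }
        filling->epoch = crst.epoch_init;

        /* Append the just-filled segment to the queue. */
        filling->next = NULL;
        if (last_filled != NULL)
            last_filled->next = filling;
        else
            earliest_filled = filling;
        last_filled = filling;

        /* Reap every segment whose label crst.epoch_fini certainly clears. */
        while (earliest_filled != NULL &&
            caprevoke_epoch_clears(crst.epoch_fini, earliest_filled->epoch)) {
            struct qseg *done = earliest_filled;
            earliest_filled = earliest_filled->next;
            if (earliest_filled == NULL)
                last_filled = NULL;
            /* ... de-quarantine all allocations in done; free done itself ... */
            (void)done;
        }
    }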
We now have a policy question! It is advantageous to frequently know the label we should use for the contents of our quarantine, as that lets us maximize the work we can piggy-back off other revocations. It's also advantageous to not ask too often, as asking involves a system call and interlocking within the revocation subsystem.
Upon marking a region of the address space as MADV_QUARANTINE'd, the kernel
should obtain the (interlocked!) epoch counter at the end of its operation
(i.e., just before returning to userland) and mark the region with this
value.
When the revoker encounters a MADV_QUARANTINE'd region whose epoch is
sufficiently far in the past that this pass clears it, it may, after
finishing this pass and before advancing the epoch clock, then reset the
region to being ordinary, anonymous address space.
The rough sketch of revocation makes sense as an atomic action, but that requires that we stop the world for the duration of a scan through all of memory, a rude proposal at best. We therefore divide revocation into a series of passes:
- an "opening" pass, done by a single thread without the world stopped. This visits all capability-bearing pages.
- zero or more "middle" passes, which visit "recently capability-dirty" pages (i.e., pages to which capabilities have been stored since they were visited by the opening pass).
- a "closing" or "last" pass, which runs with the world stopped, but, like the middle passes, visits only the "recently capability-dirty" pages (but also visits some kernel state while the world is stopped).
If no flags are specified, caprevoke() does an opening or middle pass.
CAPREVOKE_LAST_PASS indicates our desire to (open and) close revocation. If
caprevoke() must open, as well as close, it will do the opening pass without
the world stopped unless CAPREVOKE_LAST_NO_EARLY is also given.
If CAPREVOKE_LAST_PASS encounters an open epoch, it will first do a "middle"
pass without the world stopped, unless CAPREVOKE_LAST_NO_EARLY is also given.
This CAPREVOKE_LAST_NO_EARLY flag is almost surely only useful for
experimentation.
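To make the pass structure concrete, here is a hedged sketch of driving an
explicitly multi-pass revocation by hand, reusing the segmented-quarantine
variables from above; the comments about what runs between calls are the
interesting part:

    struct caprevoke_stats crst;

    /* Opening (or middle) pass: no flags, world not stopped; every
     * capability-bearing page is visited. */
    caprevoke(0, earliest_filled->epoch, &crst);

    /* ... let the application run; later passes only revisit pages that
     * have become capability-dirty since the pass above ... */

    /* Optional middle pass, still without stopping the world. */
    caprevoke(0, earliest_filled->epoch, &crst);

    /* Closing pass: the world is stopped, but only the recently-dirty
     * pages (plus some kernel state) are visited.  The retry loop shown
     * a little further on is what actually gates completion. */
    caprevoke(CAPREVOKE_LAST_PASS, earliest_filled->epoch, &crst);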
It is a policy question as to when an allocator should open and close a multi-pass revocation or run a middle pass. We believe the default policy, of opening and immediately closing, is a sensible one, but experimentation is merited. In all cases, the loop exhibited above suffices to gate segments' being de-quarantined.
When it is necessary to ensure that a segment is revoked, one may have to
repeatedly invoke caprevoke():
    struct caprevoke_stats crst;
    do {
        caprevoke(CAPREVOKE_LAST_PASS, earliest_filled->epoch, &crst);
    } while (!caprevoke_epoch_clears(crst.epoch_fini, earliest_filled->epoch));
We can, if we like, replace the earlier example of queueing a full segment
into quarantine with just about any operation we like. An "opportunistic"
approach is to just ask for the time, as we did above, with
CAPREVOKE_NO_WAIT_OK | CAPREVOKE_IGNORE_START | CAPREVOKE_LAST_NO_EARLY. We
can also "hurry along" an open revocation, but not open a new one, with
CAPREVOKE_NO_WAIT_OK | CAPREVOKE_ONLY_IF_OPEN | CAPREVOKE_LAST_PASS, which
will close an open revocation unless there is already a revoker doing work
towards that end (though it may just be doing a "middle" pass).
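In code, those two variants look like this (nothing new here; the flag
combinations come straight from the prose above):

    struct caprevoke_stats crst;

    /* Opportunistic: report the epoch counters, do no revocation work. */
    caprevoke(CAPREVOKE_NO_WAIT_OK | CAPREVOKE_IGNORE_START |
        CAPREVOKE_LAST_NO_EARLY, 0, &crst);

    /* Hurry along: help close an already-open revocation, but never open one. */
    caprevoke(CAPREVOKE_NO_WAIT_OK | CAPREVOKE_ONLY_IF_OPEN |
        CAPREVOKE_LAST_PASS, 0, &crst);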
Epochs are divided in twain to capture the semantics of multi-pass
revocation: we refer to "open" epochs and "closed" ones. Their ordering is as
you might expect: closed becomes open becomes the next closed becomes the
next open, &c. caprevoke_epoch_clears() encapsulates the logic necessary to
make sure that staging in an open epoch waits for the closing pass after the
next opening pass.
There is some relatively tricky interlocking logic in the kernel to ensure
that userland gets useful answers. If, in particular, an epoch is opening or
closing at the time of entry to a caprevoke() call, epoch_init will be
reported assuming that these operations had already taken place. epoch_fini
may thus be less than epoch_init if the caprevoke() call does not fully
interlock with any existing revokers, as it might not when a sufficient epoch
is specified (i.e., CAPREVOKE_IGNORE_START is not set) or
CAPREVOKE_NO_WAIT_OK, whose purpose is exactly this, is given.
Resetting MADV_QUARANTINE regions must happen in a second pass over the
address space after the "closing" pass: the bitmap must remain intact for the
duration of the "closing" pass. Fortunately, this is merely iterating the map
entries, and so should be quick; if necessary, the map entries can be
specifically chained together for faster iteration.
Epochs, like time in general in multi-threaded systems, are complicated
things. An epoch identifier refers to a point in time, or, more properly, in
the partial order of memory transactions, prior to which some thing(s)
certainly had taken place. The intended uses of struct caprevoke_stats'
epoch_init and epoch_fini are:
- epoch_init is the observed, synchronized time when the call began,
- epoch_fini is the same clock but at the end of the call.
That's all well and good, but more importantly:
- epoch_fini tells us which epoch's beginning all program-order observations
  after the call certainly happen after.
- epoch_init tells us which epoch's beginning we are allowed to treat as
  happening after all the program events that certainly happened prior to the
  call. It is not necessarily actually true that the epoch began after these
  events.
The sole comparator we care about is caprevoke_epoch_clears(now, then). This
is true if the epoch value now releases all staged revocation requests
completed prior to then. This function encapsulates the following reasoning:
a staged revocation request is certainly done if revocation has both begun and then ended after its staging.
(Importantly, it does not suffice for one revocation to end and another to begin.) The implementation details are interesting, but this document is already too long.
All of this is encapsulated in libcheri_caprevoke and can be taken as a black
box.
There are several worrying possibilities that might arise when we are staging
allocations to the kernel's bitmap:
- Multiple allocations' shadows may occur within the same byte or word.
- This is a double-free() and the allocation is already staged.
- This is a free() using a capability whose allocation is staged for
  revocation.
- A revocation pass might happen while we are in the middle of filling out
  the bitmap for a given structure. If we have filled in the bottom bit, then
  from the kernel's perspective, any capability we hold to the object whose
  base has not advanced is worthy of revocation.
caprev_shadow_nomap_set takes both a re-derived capability to the allocation
to be staged, which should bear VMMAP permission so as to not be subject to
revocation, and the capability provided by the application to free(). This
re-derived capability is used to determine which bits to set, and so
addresses the last point, regardless of how the revoker ultimately mangles
capabilities it finds worthy of revocation.
The first point is addressed by using LL/SC to implement atomic ORs for the first and last 64-bit word occupied by the shadow of the allocation being staged. All intervening words are assumed to be property of this allocation and are simply written to with all bits set; this will be true of an allocator without placement bugs that generate overlapping objects.
The second point, that of double-free()s, is addressed by extending the LL/SC
for the first word to test, between the LL and SC, that none of the bits
about to be set are already set. If any are, the call signals a failure.
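An invented sketch of how such a first/middle/last-word update can be
structured; this is not libcheri_caprevoke's code, C11 atomics stand in for
the LL/SC loops, and in particular the tag re-check described next (the third
point) genuinely needs LL/SC, since it must happen between the load and the
store:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define GRAN  16                   /* bytes of memory per shadow bit */
    #define WBITS 64                   /* bits per bitmap word */

    /* base/top are byte offsets from the start of the region the shadow
     * covers; returns false on an attempted double-free. */
    static bool
    shadow_set_sketch(_Atomic uint64_t *shadow, uint64_t base, uint64_t top)
    {
        uint64_t fbit = base / GRAN, lbit = (top - 1) / GRAN;
        uint64_t fword = fbit / WBITS, lword = lbit / WBITS;
        uint64_t fmask = ~0ULL << (fbit % WBITS);
        uint64_t lmask = ~0ULL >> (WBITS - 1 - (lbit % WBITS));

        if (fword == lword)
            fmask &= lmask;

        /* First word: atomic OR that fails if any target bit is already set. */
        uint64_t old = atomic_load(&shadow[fword]);
        do {
            if (old & fmask)
                return false;          /* double-free: bits already staged */
        } while (!atomic_compare_exchange_weak(&shadow[fword], &old,
            old | fmask));

        if (fword == lword)
            return true;

        /* Middle words belong wholly to this allocation: plain full-word stores. */
        for (uint64_t w = fword + 1; w < lword; w++)
            shadow[w] = ~0ULL;

        /* Last word: atomic OR so neighbours' bits are not clobbered. */
        atomic_fetch_or(&shadow[lword], lmask);
        return true;
    }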
The third point is the most tricky. Just because the capability passed to
free() has been staged does not mean that the revocation has already happened
at the time of free(), and so there is an inherent risk of a TOCTTOU race.
Moreover, the revocation of this allocation and the clearing of its shadow in
the bitmap may take place between free() and the check inside the LL/SC
mentioned above. In order to properly interlock with the revoker, the LL/SC
for the first word is again extended to verify the tag of the
application-sourced capability between the LL and SC instructions, thereby
closing the TOCTTOU window and ensuring that double-free()s which straddle
epochs remain idempotent.
Finally, because caprevoke() as we have used it here always performs
interlocking against other caprevoke() calls, we can be certain that our
stores to the bitmap are visible to other threads and CPUs engaged in
revocation in time for the test in caprevoke_epoch_clears() to be correct.
Closing ("last") passes additionally single-thread the world during their
operation and so ensure that all threads on all cores see the tag-clearing as
having taken place.
The libcheri_caprevoke helpers always write 64-bit words to the kernel's
bitmap. A 64-bit word corresponds to an aligned, 1024-byte segment of the
address space. As all mmap() results are at least 4096-byte aligned and at
least 4096 bytes long, it is impossible for an aligned 64-bit bitmap word to
cross a permission boundary. While, in principle, one could request bitmap
access capabilities for sub-page regions, we assume that our allocators do
not.
The revoker tries very hard to not give callers sharp edges. However, for
experimentation's sake, we expose some interesting behaviours. In particular,
there is a series of CAPREVOKE_JUST_* flags that will cause caprevoke() to
forego its usual operation, including interlocking with other callers of
caprevoke(). While most of these are useful only to test various aspects of
the implementation, one has proven tempting to use with general revocation
and deserves special attention.
CAPREVOKE_JUST_THE_TIME performs a non-interlocked read of the epoch counter
value and returns this as epoch_fini. This value is as useful as any other
reported epoch_fini: certainly the beginning of the denoted epoch is in the
observable past. There are no cases in which the revoker implementation
advances the globally-visible clock only to revert it later.
However, this value must not be treated as an epoch_init value and used to
label segments of quarantine. The flip-side of the revoker not advancing the
clock prematurely is that it does not advance the clock until the end of an
opening or closing pass of a revocation, and so the value reported by
CAPREVOKE_JUST_THE_TIME may be lagged. The epoch_init value reported without
any CAPREVOKE_JUST_* flags takes this into consideration by interlocking to
observe whether a pass is now in progress. We are not, presently, aware of
any situation that materially benefits from CAPREVOKE_JUST_THE_TIME save,
perhaps, diagnostic printout.