Capability Revocation
Hello fellow CheriFree-rs,

I'm sorry that I have not yet produced actual documentation for the
caprevoke() family of APIs. In lieu of the real deal, but hopefully more
useful than the test cases, here are some notes that might help to get
everyone on the same (virtual) page. Hopefully some of this also forms the
basis for what goes in the paper.
When an application calls free(), the contents of the region are best
viewed as toxic junk of which we need to dispose before we can cycle it
back to the application. There are two basic aspects to this:
- before we can slate memory for reuse, there must be no capabilities to it outside the allocator.
- when we give memory back to the application, there must be no tags within it.
Our promise is intended to be quite weak:

- No live capability aliasing: if the application still has a tagged
  capability to memory, that memory has not yet been reused, even if it
  has been free()'d. Contrapositively, memory that has been (slated for)
  reuse will have had all stale authorizing caps' tags cleared.
The basic flow is this:

- On free(), memory is put into a quarantine.
- At some point thereafter, the status of some/all quarantined allocations
  is translated into a bitmap for the kernel's consumption. This bitmap is
  currently chosen to have one bit per 16 bytes of memory, thereby
  defining the granularity of revoke-ability.
  See Staging Revocation of an Allocation for the mechanism.
  For correctness, entire allocations have to go into the bitmap (see What
  is Revocation, Exactly?).
  This step may be deferred arbitrarily, and there are perhaps good
  reasons to do so that we'll talk about in a minute (see Coalescing in
  Quarantine and Segmented Quarantines).
- At yet some later point, the allocator requests that the kernel revoke
  access according to the bitmap generated earlier. This step itself may
  happen in passes that we'll talk about in another minute (see Multi-pass
  Revocation).
- Any allocations that were translated to bitmaps prior to the previous
  step are now guaranteed to have no inward authorizing capabilities other
  than those held by the allocator or internal to the kernel (but not caps
  passed by the application to the kernel!). These regions can be made
  available for reuse.
- Prior to handing out a reused region to the application, we must ensure
  that it has no tags. We should perhaps be kind and in fact zero the
  whole thing.
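Concretely, the 1-bit-per-16-bytes granularity fixes how an allocation's address range maps onto the bitmap. A minimal sketch, assuming a flat bitmap indexed from address zero; the function name and layout here are illustrative, not the libcheri_caprevoke API:

```c
#include <stdint.h>

#define SHADOW_GRANULE 16 /* one shadow bit per 16 bytes of memory */

/* First and last shadow bit indices covering [base, base + len).
 * Marking anything less than this whole span would leave part of the
 * allocation unprotected, hence "entire allocations have to go into
 * the bitmap". */
static void shadow_span(uint64_t base, uint64_t len,
                        uint64_t *first_bit, uint64_t *last_bit)
{
    *first_bit = base / SHADOW_GRANULE;
    *last_bit = (base + len - 1) / SHADOW_GRANULE;
}
```

For example, a 48-byte allocation at 0x1000 occupies three consecutive shadow bits.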
Despite the seeming simplicity, there are a number of options for an
implementation!

- When revoking, it would be good to have as few capabilities, and
  especially pages with capabilities, within dead regions of memory as
  possible, so it may make sense to zero upon free(), even though the
  application has access and could unzero it (by mistake or maliciously).
  Otherwise, the revoker will be visiting pages and cachelines that are
  certain to be zeroed anyway by the time addresses are reissued to the
  application.
- Despite that we only promise to catch use-after-reallocation, we might
  want to catch UAF, too. While we could munmap(), we're hoping to have a
  dedicated MPROT_QUARANTINE implementation that doesn't need to be mated
  with a mmap() or mprotect() later. This would additionally let us avoid
  zeroing entire pages on free() and also post-revocation.
- Revocation unto itself is expensive: it's at least one visit to every
  capability-bearing page in the system, and possibly many, depending on
  the concurrency choices made. Thus, we should attempt to do it as little
  as possible (and so, quarantine) as well as maximize the benefit of
  revocations done (see Segmented Quarantines).
- But even requesting revocation can be expensive. It may make sense to
  defer generating the bitmaps rather than setting them on every free();
  this is another exciting use of MPROT_QUARANTINE.
The revoker is a very simple thing, in rough sketch:

```
Given: the contents of memory &c and a bitmap of quarantine status
For each capability anywhere the application can access,
  If that capability's *base* is marked as quarantined,
    Replace it with its de-tagged, 0-permission form.
```
Specifically, "anywhere the application can access" means:
- User-visible pages of memory
- Processor register files
- Kernel callback sources, specifically aio and kqueue
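Under some strong simplifications, the sweep above can be modeled in plain C, treating capabilities as structs in an array and the shadow as a flat byte array (both, of course, fictions):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SHADOW_GRANULE 16 /* one shadow bit per 16 bytes */

/* Toy stand-in for a capability: in real CHERI the tag lives
 * out-of-band in hardware and is not addressable like this. */
struct toycap {
    uint64_t base;
    bool tag;
    uint32_t perms;
};

static bool shadow_test(const uint8_t *shadow, uint64_t addr)
{
    uint64_t bit = addr / SHADOW_GRANULE;
    return (shadow[bit / 8] >> (bit % 8)) & 1;
}

/* For each reachable capability, if its *base* is marked quarantined,
 * replace it with its de-tagged, 0-permission form; bounds and cursor
 * are left intact. */
static void revoke_sweep(struct toycap *caps, size_t n,
                         const uint8_t *shadow)
{
    for (size_t i = 0; i < n; i++) {
        if (caps[i].tag && shadow_test(shadow, caps[i].base)) {
            caps[i].tag = false;
            caps[i].perms = 0;
        }
    }
}
```

A real revoker walks page tables, register files, and kernel callback state rather than an array; the point here is only the base-is-marked test and the de-tag, zero-permission replacement.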
The base of the capability is chosen for two reasons:
- It cannot be out of bounds, unlike the cursor.
- If the allocator properly sets CHERI bounds, the base must, in fact, be within the original allocation. This depends on the fact that two adjacent capabilities cannot be fused.
We leave base, offset, and length of the capability alone, so as to be compatible with some (probably actually undefined behavior) C idioms.
And thus we see the need to set all bits corresponding to the "shadow" of each quarantined allocation: the application might have derived a capability to a sub-object such that the resulting base is not the base of the allocation. Dually, the top may not be the top of the allocation. Despite both possibilities, the cursor may be pointing not only out of bounds but into another allocation.
At present, we believe that managing the need to clear tags prior to reuse is best left to userland. While the kernel will absolutely test any capability found in dead memory (as well as all the live ones), forcing it to ask about the quarantined status of the memory containing every capability as well as the memory to which each capability points sounds more expensive than asking the allocators to manage this aspect.
There are a few pieces to the "translated into a bitmap" step above. First,
we're going to need access to the bitmap for a given piece of the virtual
address space. To ensure that only the allocator can make (or clear)
staging, this access is only given if software can demonstrate that it has
a capability bearing VMMAP permission to the region in question, as
returned by, for example, mmap().
Note
We expect that the allocator clears VMMAP permission on all objects it
gives out; this was already true of the in-tree jemalloc prior to any
design for revocation, and we have applied it to the allocators under
study here.
So armed with such a capability, called "arena", a system call

```c
void * __capability arena_shadow;
int res = caprevoke_shadow(CAPREVOKE_SHADOW_NOVMMAP, arena, &arena_shadow);
```
will return a suitable capability to the bitmap. Internally to the
revoker, there are in fact other bitmaps (one, for sealing types,
accessible via CAPREVOKE_SHADOW_OTYPE, and the other, internal, at page
granularity, deliberately not exposed to userland, for the use of
MPROT_QUARANTINE).

Rather than go through all the tedium of finding the correct bit(s) for an
allocation, this logic has been library-fied in libcheri_caprevoke.
Before detailing the operation thereof, we must make one final note.
CHERIfied allocators (almost?) universally must be able to re-derive a
VMMAP-bearing capability to an allocation given only their globals and the
non-VMMAP-bearing capability held by the application. For somewhat tricky
reasons (see Concurrent Revocation Safety of Bitmap Marking),
libcheri_caprevoke's method to set bits takes both the application
capability and this re-derived capability, upon which the spatial bounds
must be appropriately set for the allocation in question, as the library
sets bits from the base to the top of this capability. So armed, the call
```c
int res = caprev_shadow_nomap_set(arena_shadow, rederived_cap, app_cap);
```
will set the correct bits. Internally, it strives to write 64-bit words
(see Are Wide Bitmap Stores Always Safe?) and uses up to two LL/SC loops
to ensure that partial updates of words shared with other allocations are
not clobbered. The caller must always check the result; a non-zero value
indicates an attempted double-free() and should cause free() to return in
a mostly-idempotent way (statistical counters are fine to adjust; changes
to the heap are almost surely not fine). (See Concurrent Revocation Safety
of Bitmap Marking for more details.)
After revocation, the same library can be used to clear the bits prior to
reissue:

```c
caprev_shadow_nomap_clear(arena_shadow, rederived_cap);
```
A feature still in the works is to have the kernel manage staging entire pages internally. The kernel knows about revocation and so should be able to do the right thing here. Once this is implemented, this section will be updated.
A 4096-byte page of 1-bit-per-16-byte flags covers 512 KiB of the address
space. We believe that this, or a near power-of-two multiple, is a good
threshold for distinguishing between "large" and "small" allocations when
trading off between caprev_shadow_nomap_set and MPROT_QUARANTINE. The
exact cross-over point has yet to be measured.
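The 512 KiB figure follows directly from the granularity: a 4096-byte page of the bitmap holds 32768 bits, each standing for 16 bytes of address space. As arithmetic:

```c
#include <stdint.h>

/* Bytes of address space covered by one 4096-byte page of shadow
 * bits, at one bit per 16 bytes. */
static uint64_t shadow_page_coverage(void)
{
    uint64_t bits_per_page = 4096 * 8; /* 32768 shadow bits */
    return bits_per_page * 16;         /* 524288 bytes = 512 KiB */
}
```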
A little about the mechanism is probably in order. We intend that the
allocator can, rather than calling caprev_shadow_nomap_set (or the _raw
variant) for large regions, instead call mprotect(MPROT_QUARANTINE) on the
region in question. This will remove all pages within the indicated region
from the process's address space and (atomically) convert it to the
equivalent of a MAP_GUARD region: impossible to back with pages and not
reusable as address space for later (unhinted) mmap() calls. The kernel,
internally, has a coarse-grained revocation bitmap, at one bit per page,
that the revoker also tests during revocation scans; these pages will have
their shadow set.
After revocation (so that the coarse bitmap is accurate), but before returning to userland, the kernel can quickly iterate all such regions and clear their bitmaps and revert the address space regions to anonymous memory.
Staging revocation requests early is advantageous: the next revocation
allows them to be put back into circulation. However, small allocations
incur overheads during staging: only a few bits get changed and the full
cost of atomics is borne. Large allocations amortize atomics across
potentially multiple full-word writes and, if sufficiently large, may
(eventually) be better off handled by MADV_QUARANTINE.
It is therefore a policy and allocator mechanism question: should free()'d
regions be coalesced in quarantine prior to staging? If so, under what
policy? Tentatively, we suggest that a dual-queue design may be the right
one:
- on free, small allocations that do not coalesce with existing unstaged allocations are placed at the end of the first LRU queue; large allocations that do not coalesce are placed at the end of the second LRU queue.
- whenever two or three allocations coalesce in the first queue, if the result is still small, they are replaced with their merge at the end of the first queue. If the result is large, they are removed and their merge is added to the end of the second queue.
- whenever an allocation coalesces with one or two objects in the second queue, they are removed and the merger is put on the end of the second queue.
When the quarantine revocation policy demands that memory be revoked for reuse, the second queue (of large objects) is drained in order (i.e., least-recently-coalesced large allocations are staged first) until the policy believes that quarantine will be below threshold. If that fails to satiate the policy, the first queue is drained in order.
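The dual-queue drain policy can be sketched as follows. This is a toy model under several stated assumptions: coalescing itself is elided, the 512 KiB small/large threshold is the unmeasured guess from earlier, and all names are hypothetical rather than any real allocator's API:

```c
#include <stddef.h>
#include <stdlib.h>

#define LARGE_THRESHOLD (512 * 1024) /* hypothetical, unmeasured crossover */

struct qent {
    size_t base, len;   /* quarantined region */
    struct qent *next;
};

struct queue {
    struct qent *head;
    struct qent **tail;
};

static void q_init(struct queue *q) { q->head = NULL; q->tail = &q->head; }

static void q_push(struct queue *q, size_t base, size_t len)
{
    struct qent *e = malloc(sizeof(*e));
    e->base = base; e->len = len; e->next = NULL;
    *q->tail = e;
    q->tail = &e->next;
}

/* On free(): small regions enter the first queue, large the second.
 * Coalescing (merging adjacent entries and promoting a merge that
 * crosses the threshold) is elided from this sketch. */
static void quarantine_insert(struct queue *small, struct queue *large,
                              size_t base, size_t len)
{
    q_push(len >= LARGE_THRESHOLD ? large : small, base, len);
}

/* Drain oldest-first: the large queue empties before the small queue
 * is touched, stopping once quarantined bytes drop to `target`.
 * Returns the bytes still quarantined; a real allocator would stage
 * each popped region's shadow bits rather than just dropping it. */
static size_t drain(struct queue *large, struct queue *small,
                    size_t target, size_t total)
{
    struct queue *order[2] = { large, small };
    for (int i = 0; i < 2; i++) {
        while (order[i]->head != NULL && total > target) {
            struct qent *e = order[i]->head;
            order[i]->head = e->next;
            if (order[i]->head == NULL)
                order[i]->tail = &order[i]->head;
            total -= e->len; /* stage [e->base, e->base + e->len) */
            free(e);
        }
    }
    return total;
}
```

The drain order matches the text: least-recently-coalesced large allocations are staged first, and the small queue is consulted only if that fails to satisfy the policy.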
We assume that whatever interlocking protects the quarantine coalescing
metadata structures can be extended to cover staging to the bitmap. That
is, only one thread will attempt to stage a coalesced collection of
allocations to the bitmap. Allocators should avail themselves of the "raw"
interface in libcheri_caprevoke. To eliminate concerns about the
representability of coalesced regions as capabilities, this raw interface
acts directly on raw vaddr_t cursors into the heap; heap_start points at
the first byte of the region and heap_end at the last:
```c
caprev_shadow_nomap_set_raw(arena_shadow, heap_start, heap_end);
caprev_shadow_nomap_clear_raw(arena_shadow, heap_start, heap_end);
```
caprev_shadow_nomap_set_raw returns void as it cannot fail (except to
crash if improperly invoked).
These raw bitmap-accessing functions do not use atomics internally, as we presume that allocators obtain their backing stores in units of whole pages, and a 4K page has a 32-byte shadow in the bitmap, which is more than sufficient to ensure that each 64-bit word in the bitmap is associated with only one allocator.
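The no-atomics claim rests on simple alignment arithmetic: a 4 KiB page's shadow is 4096 / 16 = 256 bits = 32 bytes, i.e. four whole 64-bit words, and a page-aligned address maps to a 32-byte-aligned shadow offset, so page-granular ownership of the heap implies whole-word ownership of the shadow. A sketch of the arithmetic (helper names are illustrative):

```c
#include <stdint.h>

/* Shadow bytes needed for `len` bytes of heap (1 bit per 16 bytes). */
static uint64_t shadow_bytes_for(uint64_t len)  { return len / 16 / 8; }

/* Byte offset into the shadow for heap address `addr`. */
static uint64_t shadow_byte_off(uint64_t addr)  { return addr / 16 / 8; }
```

So two allocators whose arenas are page-disjoint can never share a 64-bit shadow word.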
Having filled out the bitmap "form" of requests for revocation, in the
simplest setting, where only one form of caprevoke() is used, revocation
itself may be performed by just calling

```c
struct caprevoke_stats crst;
caprevoke(CAPREVOKE_LAST_PASS | CAPREVOKE_IGNORE_START, 0, &crst);
```
to do all the work.
Warning

Importantly, this is only certain to suffice if no other forms of
caprevoke() calls are used (i.e., no other variants of the flag word are
given)! As will hopefully become clear later, we are relying on there
being no revocation in progress when we make this call here.

caprevoke() is a synchronous call: it will not return until revocation is
finished. Upon return, crst contains useful statistics about the work done
and some "epoch" counters, to which we now turn our attention.
Suppose we have several allocators in a program; snmalloc is a good
example, as it has a dedicated allocator per thread, but one could just
have several truly different allocators encapsulating everything from
mmap() to managing their objects' revocation. In such a case, we would
like to share the work done by revocation: because revocation is
necessarily a global (well, per-address-space) operation, if allocator A
requests revocation, any requests staged to the bitmap by allocator B will
also be revoked! The challenge is for B to know that this has taken place,
asynchronously.
Towards this end, revocation divides time into "epochs." There are two
related questions one might ask about epochs:
- by which epoch had staging a revocation certainly finished?
- given the current epoch, which staged revocations are certainly done?
In the earlier story, epochs did not matter so much: certainly all staging
had been done prior to the invocation of caprevoke(), and by the end of
the call, all revocation had been done. Epoch counters are a kind of lazy
synchronization with other threads.
To take advantage of epochs, we divide our quarantine into segments, filled sequentially. If there are no existing segments of our quarantine and we have just filled one, we may call
```c
struct caprevoke_stats crst;
caprevoke(CAPREVOKE_LAST_PASS | CAPREVOKE_IGNORE_START, 0, &crst);
```
as before to revoke everything in this segment and proceed as before. However, we can also invoke
```c
struct caprevoke_stats crst;
caprevoke(CAPREVOKE_NO_WAIT_OK | CAPREVOKE_IGNORE_START | CAPREVOKE_LAST_NO_EARLY,
          0, &crst);
```
(the precise meanings of the flags will be explained later; for now, just
roll with it). This does not do any revocation; instead, it merely reports
the epochs in crst. We can now label the filled segment of our quarantine
as having been fully staged by the reported crst.epoch_init. On the other
hand, suppose that there are existing segments in the quarantine, of which
earliest_filled is the oldest, and earliest_filled->epoch is the label it
has been given; suppose further that we have just filled filling. We can
then run
```c
struct caprevoke_stats crst;
caprevoke(CAPREVOKE_LAST_PASS, earliest_filled->epoch, &crst);
filling->epoch = crst.epoch_init;
```
to (try to) revoke the earliest_filled segment and label the just-filled
one. (Again, we are making some assumptions here about other caprevoke()
calls; so long as everyone uses this or the above epoch-fetching forms,
this will, indeed, revoke all allocations within the earliest_filled
segment.) We should then add filling to the queue and then pull segments
that have certainly been revoked out of the queue, using the function
caprevoke_epoch_clears():
```c
filling->next = NULL;
last_filled->next = filling; /* last_filled might be earliest_filled */
while ((earliest_filled != NULL) &&
       caprevoke_epoch_clears(crst.epoch_fini, earliest_filled->epoch)) {
    done = earliest_filled;
    earliest_filled = earliest_filled->next;
    /* de-quarantine all allocations in done; free done itself */
}
```
Pay careful attention to the use of epoch_init to label the now-filled
segment and epoch_fini to do the comparison for removing segments. You are
welcome to treat caprevoke_epoch_clears() as a black box (and perhaps
should; but see So What Are Epochs, Anyway? if you're curious).
Because we did not set the CAPREVOKE_IGNORE_START flag and passed in the
label of the earliest filled quarantine segment, this caprevoke() call may
not have had to do any revocation at all before it could return to us with
all allocations quarantined in earliest_filled certainly done. If it did
no work, then the subsequent loop will release earliest_filled and perhaps
no other segments of the quarantine. If, on the other hand, it was
necessary to do revocation, the loop will release all segments,
earliest_filled through filling.
We now have a policy question! It is advantageous to frequently know the label we should use for the contents of our quarantine, as that lets us maximize the work we can piggy-back off other revocations. It's also advantageous to not ask too often, as asking involves a system call and interlocking within the revocation subsystem.
Upon marking a region of the address space as MADV_QUARANTINE'd, the
kernel should obtain the (interlocked!) epoch counter at the end of its
operation (i.e., just before returning to userland) and mark the region
with this value.
When the revoker encounters a MADV_QUARANTINE'd region whose epoch is
sufficiently far in the past that this pass clears it, it may, after
finishing this pass and before advancing the epoch clock, then reset the
region to being ordinary, anonymous address space.
The rough sketch of revocation makes sense as an atomic action, but that requires that we stop the world for the duration of a scan through all of memory, a rude proposal at best. We therefore divide revocation into a series of passes:
- an "opening" pass, done by a single thread without the world stopped. This visits all capability-bearing pages.
- zero or more "middle" passes, which visit "recently capability-dirty" pages (i.e., pages to which capabilities have been stored since they were visited by the opening pass).
- a "closing" or "last" pass, which runs with the world stopped, but, like the middle passes, visits only the "recently capability-dirty" pages (but also visits some kernel state while the world is stopped).
If no flags are specified, caprevoke() does an opening or middle pass.
CAPREVOKE_LAST_PASS indicates our desire to (open and) close revocation.
If caprevoke() must open, as well as close, it will do the opening pass
without the world stopped unless CAPREVOKE_LAST_NO_EARLY is also given. If
CAPREVOKE_LAST_PASS encounters an open epoch, it will first do a "middle"
pass without the world stopped, unless CAPREVOKE_LAST_NO_EARLY is also
given. This CAPREVOKE_LAST_NO_EARLY flag is almost surely only useful for
experimentation.
It is a policy question as to when an allocator should open and close a multi-pass revocation or run a middle pass. We believe the default policy, of opening and immediately closing, is a sensible one, but experimentation is merited. In all cases, the loop exhibited above suffices to gate segments' being de-quarantined.
When it is necessary to ensure that a segment is revoked, one may have to
repeatedly invoke caprevoke():

```c
struct caprevoke_stats crst;
do {
    caprevoke(CAPREVOKE_LAST_PASS, earliest_filled->epoch, &crst);
} while (!caprevoke_epoch_clears(crst.epoch_fini, earliest_filled->epoch));
```
We can, if we like, replace the earlier example of queueing a full segment
into quarantine with just about any operation we like. An "opportunistic"
approach is to just ask for the time, as we did above, with
CAPREVOKE_NO_WAIT_OK | CAPREVOKE_IGNORE_START | CAPREVOKE_LAST_NO_EARLY.
We can also "hurry along" an open revocation, but not open a new one, with
CAPREVOKE_NO_WAIT_OK | CAPREVOKE_ONLY_IF_OPEN | CAPREVOKE_LAST_PASS, which
will close an open revocation unless there is already a revoker doing work
towards that end (though it may just be doing a "middle" pass).
Epochs are divided in twain to capture the semantics of multi-pass
revocation: we refer to "open" epochs and "closed" ones. Their ordering is
as you might expect: closed becomes open becomes the next closed becomes
the next open, &c. caprevoke_epoch_clears() encapsulates the logic
necessary to make sure that staging in an open epoch waits for the closing
pass after the next opening pass.
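One way to build intuition is a toy model of the comparator. Assume, purely for illustration, that the epoch counter is an integer bumped at the end of each opening and each closing pass, with even values meaning no revocation is in progress; this models the semantics described here, not the kernel's actual encoding:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t epoch_t;

/* `then` labels when staging finished; `now` is a later clock reading.
 * Staging during an open epoch (odd `then`) raced the pass already in
 * flight, so it must wait for the closing pass after the *next*
 * opening pass; hence the extra step for odd `then`. */
static bool model_epoch_clears(epoch_t now, epoch_t then)
{
    return now >= then + 2 + (then & 1);
}
```

Under this model, staging in a closed epoch clears after one full open-plus-close; staging in an open epoch needs one extra pass boundary.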
There is some relatively tricky interlocking logic in the kernel to ensure
that userland gets useful answers. If, in particular, an epoch is opening
or closing at the time of entry to a caprevoke() call, epoch_init will be
reported as though these operations had already taken place. epoch_fini
may thus be less than epoch_init if the caprevoke() call does not fully
interlock with any existing revokers, as it might not when a sufficient
epoch is specified (i.e., CAPREVOKE_IGNORE_START is not set) or
CAPREVOKE_NO_WAIT_OK, whose purpose is exactly this, is given.
Resetting MADV_QUARANTINE regions must happen in a second pass over the
address space after the "closing" pass: the bitmap must remain intact for
the duration of the "closing" pass. Fortunately, this is merely iterating
the map entries, and so should be quick; if necessary, the map entries can
be specifically chained together for faster iteration.
Epochs, like time in general in multi-threaded systems, are complicated
things. An epoch identifier refers to a point in time or, more properly, a
point in the partial order of memory transactions, prior to which some
thing(s) certainly had taken place. The intended uses of struct
caprevoke_stats's epoch_init and epoch_fini are:

- epoch_init is the observed, synchronized time when the call began,
- epoch_fini is the same clock but at the end of the call.
That's all well and good, but more importantly:

- epoch_fini tells us an epoch whose beginning all program-order
  observations after the call certainly happen after.
- epoch_init tells us an epoch whose beginning we are allowed to treat as
  happening after all the program events that certainly happened prior to
  the call. It is not necessarily actually true that the epoch began after
  these events.
The sole comparator we care about is caprevoke_epoch_clears(now, then).
This is true if the epoch value now releases all staged revocation
requests completed prior to then. This function encapsulates the following
reasoning:

    a staged revocation request is certainly done if revocation has both
    begun and then ended after its staging.

(Importantly, it does not suffice for one revocation to end and another to
begin.) The implementation details are interesting, but this document is
already too long.
All of this is encapsulated in libcheri_caprevoke and can be taken as a
black box.
There are several worrying possibilities that might arise when we are staging allocations to the kernel's bitmap:
- Multiple allocations' shadows may occur within the same byte or word.
- This is a double-free() and the allocation is already staged.
- This is a free() using a capability whose allocation is staged for
  revocation.
- A revocation pass might happen while we are in the middle of filling out
  the bitmap for a given structure. If we have filled in the bottom bit,
  then from the kernel's perspective, any capability we hold to the object
  whose base has not advanced is worthy of revocation.
caprev_shadow_nomap_set takes both a re-derived capability to the
allocation to be staged, which should bear VMMAP permission so as to not
be subject to revocation, and the capability provided by the application
to free(). This re-derived capability is used to determine which bits to
set, and so addresses the last point, regardless of how the revoker
ultimately mangles capabilities it finds worthy of revocation.
The first point is addressed by using LL/SC to implement atomic ORs for the first and last 64-bit words occupied by the shadow of the allocation being staged. All intervening words are assumed to be the property of this allocation and are simply written with all bits set; this will be true of an allocator without placement bugs that generate overlapping objects.
The second point, of double-free()s, is addressed by extending the LL/SC for the first word to test, between LL and SC, that none of the bits about to be set are already set. If any are, the call signals a failure.
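The first-word check can be modeled with C11 atomics standing in for LL/SC: a failing compare-exchange plays the role of a failing SC. Note that the real library also re-checks the application capability's tag inside this window, which is omitted from this sketch:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically OR `bits` into `word`, but fail (returning -1, as a
 * double-free indication) if any requested bit is already set.  The
 * already-set test sits inside the retry loop, mirroring the check
 * performed between the LL and SC instructions. */
static int stage_first_word(_Atomic uint64_t *word, uint64_t bits)
{
    uint64_t old = atomic_load(word);
    do {
        if (old & bits)
            return -1; /* some bit already staged: double-free */
    } while (!atomic_compare_exchange_weak(word, &old, old | bits));
    return 0;
}
```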
The third point is the most tricky. Just because the capability passed to
free() has been staged does not mean that the revocation has already
happened at the time of free(), and so there is an inherent risk of a
TOCTTOU race. Moreover, the revocation of this allocation and the clearing
of its shadow in the bitmap may take place between free() and the check
inside the LL/SC mentioned above. In order to properly interlock with the
revoker, the LL/SC for the first word is again extended to verify the tag
of the application-sourced capability between the LL and SC instructions,
thereby closing the TOCTTOU window and ensuring that double-free()s which
straddle epochs remain idempotent.
Finally, because caprevoke() as we have used it here always performs
interlocking against other caprevoke() calls, we can be certain that our
stores to the bitmap are visible to other threads and CPUs engaged in
revocation in time for the test in caprevoke_epoch_clears() to be correct.
Closing ("last") passes additionally single-thread the world during their
operation and so ensure that all threads on all cores see the tag-clearing
as having taken place.
The libcheri_caprevoke helpers always write 64-bit words to the kernel's
bitmap. A 64-bit word corresponds to an aligned, 1024-byte segment of the
address space. As all mmap() results are at least 4096-byte aligned and at
least 4096 bytes long, it is impossible for an aligned 64-bit bitmap word
to cross a permission boundary. While, in principle, one could request
bitmap access capabilities for sub-page regions, we assume that our
allocators do not.
The revoker tries very hard not to give callers sharp edges. However, for
experimentation's sake, we expose some interesting behaviours. In
particular, there is a series of CAPREVOKE_JUST_* flags that will cause
caprevoke() to forego its usual operation, including interlocking with
other callers of caprevoke(). While most of these are useful only to test
various aspects of the implementation, one has proven tempting to use with
general revocation and deserves special attention.
CAPREVOKE_JUST_THE_TIME performs a non-interlocked read of the epoch
counter value and returns it as epoch_fini. This value is as useful as any
other reported epoch_fini: certainly the beginning of the denoted epoch is
in the observable past. There are no cases in which the revoker
implementation advances the globally-visible clock only to revert it
later.

However, this value must not be treated as an epoch_init value and used to
label segments of quarantine. The flip side of the revoker not advancing
the clock prematurely is that it does not advance the clock until the end
of an opening or closing pass of a revocation, and so the value reported
by CAPREVOKE_JUST_THE_TIME may be lagged. The epoch_init value reported
without any CAPREVOKE_JUST_* flags takes this into consideration by
interlocking to observe whether a pass is now in progress.

We are not, presently, aware of any situation that materially benefits
from CAPREVOKE_JUST_THE_TIME save, perhaps, diagnostic printout.