non-deterministic GC #1872
How should the kernel keep track of which imports are still being used?

For each vat, the deterministic ("userspace") computation has a specific set of live imports: the … The liveslots layer, using its … For each imported object (…). If, on one run, the apparent window begins at delivery 4 (the kernel includes …). Now, suppose the vat is reloaded from transcript. On this second run, the …

What I'm trying to sort out is what exactly the kernel needs to track. I'm thinking that the kernel is allowed to delete the import from the c-list as soon as it sees the … On the other edge of the window, if the kernel sees a syscall with an object ID that isn't in the c-list, that definitely means the vat is making an illegal access. But I think there are three cases:
How do we keep the kernel from accidentally revealing its knowledge of GC activity (to other vats) when a vat references a previously-dropped import? If the kernel deletes the c-list entry when it receives the …

What exactly is the pathway by which the kernel might accidentally reveal its forbidden GC knowledge? If the kernel kills the vat when it references a previously-dropped object ID, that will have two effects: a message is sent to the vatAdmin (to notify the vat's parent, by resolving the …). However, once those termination messages are delivered, they may provoke externally-targeted messages, which are part of the validator consensus. So all validators must choose the same time to kill the vat, give or take some wiggle room: I think it depends upon the order of other messages currently in flight. If one validator's vatAdmin were notified one crank later than the other validators' (still within the same block), but there were no external messages happening around the same time, then an outside observer might not observe a difference in the external message ordering, and might not notice the divergence. However we can't depend upon that.

So this is making me think the kernels really do need to agree upon when the vat is dead: the vatAdmin message must be delivered on the same crank across all validators. That limits our options. The kernel's … That points to a new "live" column in the kernel object table. I'm still pondering what this means for the exporting vat. |
@FUDCo and I discussed this some more. Both the "when should a vat be killed for an illegal syscall" question and the "when should the comms vat send a …" …

One wild idea we explored was a "drop oracle". The idea would be that the kernel somehow reacts to a … Vat termination for illegal syscalls could maybe be done the same way. Effectively all nodes vote a reference (or an entire vat) off the island. Death by committee.

The downside is, of course, that we're kind of inventing our own consensus layer, and that was on our list of things we shouldn't do. Drafting off the existing voting module probably reduces the hazards, but it still raises the question of whether the incentives are correct. If one validator maliciously claims that they've seen a … |
The only viable answer is that it cannot make an illegal syscall. If vat A exports an object O under kernel key K1, the incoming … A simple fallback answer is that it uses K1. When the …

I think we can do better than that, but the fallback is implementable. For example, resolved promises will never be exported again. It's really only presences for which this could be an issue. |
That sounds like a decent approach for the exporting side. I don't think it matters too much whether we reuse the exported ID or allocate it a new one. Although we need more bookkeeping if we want to allocate a new one (the …).

My concern has mostly focused on the importing side. But I think it's safe to say that illegal syscalls can only happen if the liveslots layer is buggy. If liveslots is functioning correctly, then even a malicious userspace should not be able to trigger an illegal syscall. So it's a question of how we define the TCB. I don't want to give liveslots the authority to break the integrity of any other vat than its own, but maybe our fallback position is that we do give liveslots the authority to break the swingset as a whole (by making syscalls whose legality depends upon GC-sensitive nondeterminism), and rely upon our implementation to not exercise that authority. I don't see how that helps us with cross-swingset (comms-vat) drops, alas. |
Hm, the TCB argument is stronger than I realized: of course we're giving liveslots access to nondeterminism, because that's the only way to get GC information about what userspace is doing. So of course we're forced to rely upon liveslots not misusing that authority. Which means I think we can stop worrying about a buggy liveslots causing divergence, because we don't have any other choice. So that leaves comms-vat drops. Ugh. |
Oh, and we probably can have the comms-vat drop references from a solo machine, which might mitigate most of the problem: how frequently do our solo-side clients provide objects to the chain, rather than the other way around? |
On the export side, I think we should go with arrival-order nondeterminism. Each …

Then, before moving on to the next item on the run-queue, we pop each item from the maybe-free list and search for remaining references. These references might come from other vat c-lists (worst case it could scan them all, best case we maintain a reference count), or from the resolution/rejection data of unretired promise table entries. If any references still exist, the kernel does nothing. If there are none, the kernel reads the owner (exporting) vat of the object, deletes the object table entry, and performs a delivery of … This delivery will look up the target kernel object (…). This delivery is expected to cause liveslots to delete the … The … Once the …
The kernel should probably drain the maybe-free list before returning to the run-queue. bikesheds:
minor design questions:
larger design questions:
|
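A minimal sketch of the export-side maybe-free drain described above. All names here (`makeMaybeFreeDrain`, `dropExport`, the table row shape) are hypothetical illustrations, not the real swingset API:

```javascript
// Hypothetical sketch: the kernel queues krefs whose references may have
// gone away, then drains the queue before returning to the run-queue.
function makeMaybeFreeDrain(kernelObjects, deliverToVat) {
  const maybeFree = new Set();

  // Called whenever a reference to kref is removed (c-list deletion, etc.)
  function noteMaybeFree(kref) {
    maybeFree.add(kref);
  }

  // Called after each crank, before the next run-queue item is processed.
  function drain() {
    for (const kref of [...maybeFree]) {
      maybeFree.delete(kref);
      const entry = kernelObjects.get(kref);
      if (!entry) continue; // already deleted
      if (entry.refCount > 0) continue; // something still references it
      // No references remain: delete the entry, tell the exporting vat.
      kernelObjects.delete(kref);
      deliverToVat(entry.owner, 'dropExport', kref);
    }
  }

  return { noteMaybeFree, drain };
}
```

Draining before returning to the run-queue (as suggested above) keeps the drop deliveries in a deterministic position relative to ordinary cranks.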
Hm, if we did apply this to Promises as well, then a crank which does an implicit …
The … If the sender actually uses the Promise (perhaps they pipeline a message to the result, or they share the result promise with some other vat), then this is less clear. Pipelined messages whose ultimate result promise is dropped should also drop the intermediate results. Shared promises should not be dropped until the last subscriber has dropped. The "imports" of a promise-table entry are the subscribing vats, plus the run-queue messages that target it. The "exporter" of a promise-table entry is the deciding vat, if any, or the run-queue entry which holds that promise as a … |
My intuition is the opposite: that generating new IDs will be much simpler, because it means we can retire old ones when we think they're dead instead of having to keep track. |
Task list (obsolete, use checklist in first comment instead):
|
Hm, I just thought of a wrinkle: the …

Worst case, a vat which is told to shut down (and politely deletes all the internal references it can, to accelerate GC) might not actually see the finalizers run within that same crank. And, if nobody ever talks to that vat again, it won't ever get agency again, which means we won't have a safe place to emit the …

One slight mitigation would be to have the notifier set a flag and check that flag each time liveslots gets control (so each time an outbound message is sent). That might increase our chances of getting the …

Another approach would be to add an explicit delivery type whose only job is to check the finalizer flags and report any dead references. We might call this immediately after every crank finishes. I don't like the overhead this would represent, but it's hard for me to resist the idea of adding a perfectly-named …

It might also be possible to blur the line between vat and kernel a bit more, and have these notifications go directly to the kernel, rather than being handled entirely by liveslots. In particular, if the kernel didn't learn about them through a syscall (but rather through some other channel that travels the same kernel-worker protocol as syscalls), then it might be ok for them to occur outside the usual "only during a delivery" window. This would also remove them from the transcript entirely, which means one fewer exception to implement. |
Would the syscall interface go back to carrying only deterministically logged things? If not, what are the other exceptions? |
Is shutting down distinct from terminating? If a vat is terminating, why do we care what drops it reports before terminating. Terminating will deterministically drop everything anyway, yes? |
Yes, I think if we moved
Yeah, I didn't mean termination, I meant something smaller, in which the vat as a whole keeps running, but the code inside it knows that its job is complete, so it does as much cleanup as it can (empty all the Stores to facilitate GC) while it still has control. More like emptying out the fridge before a holiday of indeterminate length, rather than burning down the entire house. It might not be that common/useful of a case. Maybe an auction contract that we keep around for a long time but which only gets used occasionally, and which builds up a lot of references to external objects (which we'd really like to drop) when it does get run. Imagine the contract handles a dozen messages, and on the last one, it knows that this particular invocation is complete, so it drops all those references. If the finalizers don't run until after the vat loses agency, it won't have a chance to tell the kernel that it doesn't need them anymore, and we won't be able to GC upstream until the next auction happens.
Agreed. |
This reminds me of the other spill-to-disk persistence scaling problem I've been meaning to raise, that it sounds like you've already been thinking about: Just like we assume that there are a moderate number of active purses at any time, but a zillion idle purses that should not be occupying any ram, we should make the same assumption about vats. That there are a moderate number of active vats at any time, but a zillion idle vats that should not be occupying any ram. Likewise, most intervat capabilities will be from an idle vat and to an idle vat, and therefore also themselves idle. These idle caps between idle endpoints should not take ram bookkeeping space in the kernel, in comms vats, or in captp-on-ibc. How much of this is already implied by our current plans? |
Yep, the heap-snapshot and worker-process design is meant to support vats being demand-paged into RAM. We'll have the option of loading any given vat as soon as the process starts, as late as when someone sends a message to them, or some point in between. The actual policy/heuristic is an open question: the concern is that the vat might take a significant time to load (replay), and if we wait until the last moment, block processing might stall while we catch up. We might be able to peek at the escalators to make a guess about what deliveries are coming up, and use any leftover time to load the vats while the messages are still climbing. OTOH, if there are any messages at all remaining on the escalators at the end of the block, that indicates we're sufficiently overloaded that we can't keep up (we're now rationing/allocating execution time), so we might not have any spare CPU cycles to load vats. OT3H there will be multiple CPU cores, and we might not be able to use all of them for normal message processing: maybe we wind up with one (serialized) core taking things off the escalator, leaving the other cores available to page in vats which have messages on their way. The inter-vat caps will consume space in:
So I think idle caps shouldn't take up any RAM when we finish implementing virtual objects, paging out idle vats, and comms tables on disk. |
I had an idea last night to address the timing /
Benefits are:
The drawback is that it still suffers from the late-drop issue above. A vat which stops using a bunch of objects, but for which the finalization callbacks don't occur until after the end-of-crank is processed, will hold on to those references until some future delivery to that same vat. If the vat is now idle for a long time, that's a lost GC opportunity. Worse, from my experiments it appears that Node.js (at least) decided to make GC notifications ride the IO/timer queue, not the promise queue. So nothing will be added to the Set at all until after our I don't know how GC happens in our JS engines. I'm used to Python, where refcounting causes non-cyclic GC to happen immediately (too fast, in fact, the |
No refcounts in JS so GC never happens immediately by simply dropping a reference. GC might happen in a turn, but our WeakRef design delays all notifications to happen in later turns. As you suspected earlier:
The rest of this seems right. Good stuff! |
This does not surprise me. There was a move to have the spec require this, but I think that died. Not sure --- we should check the spec. If it did die, XS may gc more promptly. But how significant is the difference to us? Doesn't it just mean that things might stick around for one more crank than they should? Do we care? |
Oh, that's disappointing. My initial tests are showing that if I don't provoke Node with a
Not just "later", but it could be much much later. If the engine defers GC until it is feeling memory pressure (hours? days?), and our design depends upon the importing vat getting at least one delivery after the finalizers have run, then big+intermittent vats will be a problem: they might know the imports can be dropped, but they can't safely express it until they get some runtime, and that won't happen until someone sends them another message, which could be days or weeks later, if ever.
If the notification happened promptly (e.g. because of a refcount going to zero):
But since notifications don't appear to happen promptly, the difference is moot. They'll happen some arbitrarily long time in the future (depending upon memory pressure), and then on the delivery after that point, the vat will have a chance to … I'm feeling drawn back to … |
Notes from today's kernel meeting: The VatTP refcounting protocol is designed to correctly manage object-table drops between two asynchronously-coupled systems. Each side maintains a table of counters, indexed by object-id. When A sends a message to B, it increments the counter for every object-id included in the message. When B receives the message, it maps the object-ids to local representatives or Presences by looking at its table. If the table has an entry for the object-id, B increments the counter and uses the old Presence. If there is no such entry, B creates a new Presence and sets the counter to 1. When B learns that the Presence is no longer referenced locally (e.g. through a …

In swingset, the link between kernel and vat is synchronous. However the link between Presences becoming unreferenced and the finalizers running is asynchronous (because the finalizer runs in some later turn, perhaps much much later, since JS likes to do extremely lazy GC). We can use this same protocol. The liveslots import table will map object-id to a record containing a Presence and a mutable counter. The finalizer will reference the mutable counter:

```js
if (!slotToVal.has(vref)) {
  const counter = { count: 1 };
  const p = createPresence(vref);
  slotToVal.set(vref, { p, counter });
  registry.register(p, counter);
  return p;
} else {
  const { p, counter } = slotToVal.get(vref);
  counter.count += 1;
  return p;
}
```

The finalizer will send a DECREF(counter.count) to the kernel. The kernel's c-lists will contain a count of mentions, incremented for each appearance of that object-id in each delivery. @erights had a trick for managing the counter correctly despite a large number of mentions (potentially overflowing any reasonable-sized counter value). He used a deliberately small 16-bit counter. When the receiving side got close (perhaps …

I'm still trying to find a sensible channel for vats to send these DECREFs to the kernel, but this protocol probably makes it safer to send the messages outside the context of a given message delivery. The kernel would react to DECREF by adding the …

The trouble with DECREFs that can be sent at any moment is how a nominally-blocking vat-worker subprocess should deliver them. I was hoping these subprocesses could operate with a single pipe to the kernel process, spending all of its idle time in a blocking read, either waiting for a new delivery, or for the results of a syscall. But maybe that's not incompatible with sending DECREFs at other times: perhaps the worker subprocess does a blocking read of the kernel->worker pipe, but is allowed to write to the worker->kernel pipe any time it likes. The kernel will be doing a non-blocking read of the worker->kernel pipe anyways (since it supports multiple worker processes), so it could receive DECREFs at arbitrary times. We would need to add some synchronization on shutdown to make sure we don't drop DECREFs as the process is being torn down, though. |
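A sketch of how liveslots might wire that counter to a finalizer. The shape here is hypothetical: `syscall.decref` is an assumed channel (not a real API), and `slotToVal` is returned only so the example can be inspected. Note that, unlike the snippet above, the table must hold the Presence via a `WeakRef` rather than strongly, or the finalizer can never fire:

```javascript
// Hypothetical import tracker: one mutable counter per imported vref.
function makeImportTracker(syscall) {
  // vref -> { ref: WeakRef<Presence>, counter: { count } }
  const slotToVal = new Map();
  const registry = new FinalizationRegistry(({ vref, counter }) => {
    // The Presence was collected: report how many mentions we absorbed.
    slotToVal.delete(vref);
    syscall.decref(vref, counter.count);
  });

  function provideImport(vref, createPresence) {
    if (!slotToVal.has(vref)) {
      const counter = { count: 1 };
      const p = createPresence(vref);
      slotToVal.set(vref, { ref: new WeakRef(p), counter });
      registry.register(p, { vref, counter });
      return p;
    }
    const { ref, counter } = slotToVal.get(vref);
    counter.count += 1; // another mention of an existing Presence
    return ref.deref();
  }

  return { provideImport, slotToVal };
}
```

A fuller version would also have to handle a `deref()` that returns `undefined` in the window after collection but before the finalizer runs.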
These two authorities are not part of SES, so they must be pulled from the globals of the Start Compartment and ferried through the kernel to the vat manager factory that calls makeLiveSlots. This gives the outer layer of the vat (liveslots) access to nondeterminism. We rely upon liveslots to not share this power with the user-level ocap-style "vat code". Liveslots must never allow user-level code to observe behavior that depends upon GC activity, because that activity is not part of the specified input to the vat. refs #1872
Node.js v14 provides `WeakRef` and `FinalizationRegistry` as globals. Node.js v12 does not (there might be a command-line flag to enable it, but I think it's marked as experimental). Rather than require all users upgrade to v14, we elect to disable GC when running on v12. This change attempts to pull `WeakRef` and `FinalizationRegistry` from the global, and deliver either the real constructors or `undefined` to the liveslots code that uses it. We'll write that liveslots code to tolerate their lack. refs #1872 refs #1925
The upcoming GC functionality will require `WeakRef` and `FinalizationRegistry`. Node.js v14 provides these as globals, but v12 does not (there might be a command-line flag to enable it, but I think it's marked as experimental). Rather than require all users upgrade to v14, we elect to disable GC when running on v12. This adds a local `weakref.js` module which attempts to pull `WeakRef` and `FinalizationRegistry` from the global, and exports either the real constructors or no-op stubs. refs #1872 refs #1925
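The shim described in this message might look roughly like the following. This is a sketch under the stated assumptions, not the actual `weakref.js`; the `Fake*`/`Best*` names are illustrative. The stubs hold the target strongly and never finalize, which effectively disables GC on platforms without the real constructors:

```javascript
// Stub WeakRef: a strong reference that is never "collected".
class FakeWeakRef {
  constructor(target) {
    this.target = target;
  }
  deref() {
    return this.target;
  }
}

// Stub FinalizationRegistry: registrations are accepted but never fire.
class FakeFinalizationRegistry {
  constructor(_callback) {}
  register(_target, _heldValue, _token) {}
  unregister(_token) {
    return false;
  }
}

// Use the real globals when present (Node.js v14+), else the stubs.
const BestWeakRef = typeof WeakRef === 'undefined' ? FakeWeakRef : WeakRef;
const BestFinalizationRegistry =
  typeof FinalizationRegistry === 'undefined'
    ? FakeFinalizationRegistry
    : FinalizationRegistry;
```

Callers use `BestWeakRef`/`BestFinalizationRegistry` unconditionally and simply get no finalization callbacks on older platforms.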
These two authorities are not part of SES, so they must be pulled from the globals of the Start Compartment and ferried through the kernel to the vat manager factory that calls makeLiveSlots. This gives the outer layer of the vat (liveslots) access to nondeterminism. We rely upon liveslots to not share this power with the user-level ocap-style "vat code". Liveslots must never allow user-level code to observe behavior that depends upon GC activity, because that activity is not part of the specified input to the vat. refs #1872
In the future, when liveslots implements GC and discovers that a Presence has ceased to be referenced by the user-level vat code, it will call this function to decrement the kernel's reference count for the imported object's c-list entry. `vatDecRef` might be called at any time (although always in its own turn). The kernel will eventually add the decref information to a queue, to be processed between cranks. For now, the kernel only records the information if an option was set to enable it (for future unit tests). Most of this patch is the kernel-worker protocol wiring to allow child-process vat workers to deliver the decref back up to the kernel process. Liveslots does not use this yet. A future patch will switch it on. refs #1872
## "Formal" vs "Local" Reference Graphs

We start by acknowledging two distinct views of the reference graph. The "formal" graph is the one agreed to by consensus. It must be a deterministic function of the specified vat/kernel behavior, independent of any JS engine-specific behavior (such as garbage collection and heap snapshots). In contrast, the "local" graph is private to each member of the consensus machine: each validator may have a different local graph. The local graph is allowed to be influenced by non-deterministic actions like GC weakrefs and finalizers. If we weren't obliged to provide a stable consensus view that remains consistent under replay, the local graph would be the only one we cared about: local references are "actual" references.

The local graph will be a non-strict subset of the formal graph. If an object is referenced locally, it must also be referenced formally. But it might be referenced formally without also being referenced locally. This latter state (a "nominal" reference: formal but not local/actual) is expected to be the most common.

## Where References Come From

A swingset machine contains two tables which track kernel objects and promises respectively, each of which has a kernel-side identifier (a
## Kernel Data Structures

The kernel object and promise tables include a formal reference count. This lets us discover nodes which are formally unreachable without performing a full mark-and-sweep pass (but of course one will be necessary to prune cycles). When the formal refcount drops to zero, the object is formally unreachable, and can be deleted entirely. All validators of the consensus group will have the same formal refcount. These tables also contain a local reference count, to discover locally-unreachable nodes. Each validator may have a different local refcount. Finally, each entry has a "local in-use" flag that indicates whether this entry is locally reachable or not. If the local refcount drops to zero (but the formal refcount is nonzero), this flag is cleared, and any outbound references from the table entry are decremented from the local refcount of the targets. (This flag might be redundant; we could use …) The c-lists for each vat contain a …

## What We Agree To

Our consensus engine will operate upon a subset of the kernel state. To support fast-sync, we need to maintain consensus on enough kernel state to allow a new validator to download and verify that state and then launch vats with less effort than starting the entire chain from scratch. In particular, these new validators can avoid replaying dead vats. However, since they cannot take advantage of (engine-specific) heap snapshots, they must replay the transcript of each live vat from the vat's initial code bundle (but they can be lazy and only load vats on demand). We redact the "local refcount" and "local in-use" columns from the consensus state. To facilitate this, these columns may be stored at different keys of the HostDB. However their values are still updated in atomic transactions with the rest of the entries. We include vat transcripts in the consensus state, but these do not contain the local-only messages described below.
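An illustrative model of those table rows and counters. Field and function names here (`formalCount`, `localCount`, `localInUse`, `decrefLocal`) are assumptions for the sketch, not the real kernel schema:

```javascript
// Model of a kernel object/promise table row: a formal refcount (consensus),
// a local refcount (per-validator), and a local-in-use flag.
function makeKernelObjectTable() {
  const table = new Map(); // kref -> row
  const maybeFree = new Set(); // krefs whose local refcount just dropped

  function addExport(kref, owner) {
    table.set(kref, { owner, formalCount: 0, localCount: 0, localInUse: true });
  }

  // Every reference bumps the formal count; only live (actual) references
  // bump the local count, so formalCount >= localCount always holds.
  function incref(kref, { local }) {
    const row = table.get(kref);
    row.formalCount += 1;
    if (local) {
      row.localCount += 1;
    }
  }

  // A validator-local drop (e.g. reported by a finalizer).
  function decrefLocal(kref) {
    const row = table.get(kref);
    row.localCount -= 1;
    if (row.localCount === 0) {
      row.localInUse = false; // formally alive, locally dead
      maybeFree.add(kref);
    }
  }

  return { table, maybeFree, addExport, incref, decrefLocal };
}
```

The `maybeFree` set corresponds to the Maybe Free Set processed after deliveries; the consensus state would redact `localCount` and `localInUse` as described above.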
## Imports, Presences, WeakRefs, Finalizers, The Dead Set

"Vat Imports" are created when a delivery arrives at the vat and cites (as an argument) some … If the … (This test-for-liveness and active management of the Dead Set replaces the increment/decrement counter in the previous design.)

The "Dead Set" is a … The kernel has a special delivery type named … Each vat has a way to tell the kernel that it might have a non-empty Dead Set (a function it can invoke). It will call this when the finalizer runs. The kernel adds the caller's …

Each vref thus goes through the following state diagram:
When the kernel first informs a vat about a vref, it moves from UNKNOWN to LIVE. If the vat code then forgets about the … When the kernel invokes …

Heap snapshots should preserve these states correctly: a vref in the UNREACHABLE state when the snapshot is created will still be there when the snapshot is restored, and eventually (in its new incarnation) it will be collected and finalized.

## Kernel Delivery / Import-Reaping Cycle

When the kernel processes a delivery, it must translate the … For each vref it translates, the c-list's … (TODO: this portion is about vrefs that represent imports, not exports. The local-in-use flag for exports is probably handled differently. I need to think about how this affects the delivery sequence.)

Later, after the delivery is complete (we have committed the crank to the HostDB, including any changes to the c-list, including the local-in-use flag), the kernel processes any queued "maybe dead" events by processing the vat-IDs in the "May Have Dead" set. This is the set of vats which had finalizers run recently, which might include the vat to which the delivery was just made, but might also include arbitrary other vats (perhaps GC was triggered by memory pressure). In Node.js it appears that the finalizers don't run until the IO queue is serviced, so we might want to insert an additional …

For each vat-ID being processed, the kernel invokes the vat's … The kernel translates each vref through the c-list into a kref, then clears the c-list's local-in-use flag, but leaves the rest of the c-list entry intact. The kernel then looks up the kernel object/promise table entry for the kref and decrements the local-refcount field, to reflect the vat no longer holding a local reference to that object. The decremented …

When all the queued finalizer events have been processed, from all vats, the kernel can process the Maybe Free Set (it could also defer this for later, if desired).
Each kref in this set indicates a kernel object/promise whose local-refcount was reduced recently. We examine this refcount to see if it dropped to zero. If so, the object/promise's …

When we're done, some number of kernel objects/promises may be locally unreachable. They will probably be formally reachable: the consensus state cannot be told about GC-triggered events, so the consensus state thinks these items are still alive. However we know that they will not be invoked again, so they can be marked as unused (and unusable) locally. Note that any reference from the run-queue will prevent this: anything reachable from the run-queue must be kept alive (for real, not just formally), so that messages can be delivered. However the resolution data of a locally-unreachable kernel promise will not keep any included kernel objects locally alive.

## Kernel Export-Reaping Cycle

For each newly locally-unreachable object, the last remaining local-in-use citation will be in the c-list of the exporting vat. The kernel needs to clear this flag and perform a special …
Liveslots should attach a Finalizer to the exported Remotable, but unlike the import finalizer (which will add the vref to the Dead Set and notify the kernel that it May Have Dead), the only job of the export finalizer is to delete the WeakRef from the …

## Do Refcounts Include The Export?

(TODO: this still needs more thought) We need to learn when a vat's exported object/promise is no longer ("locally") referenced by the kernel. For this, we must decide whether the refcounts should include the exporting vat or not. One possibility is:
This might help with scheduling of the …

I'm not yet sure how to think about Promises here. For objects, there is one clear exporting vat, and some number of importing vats, and the object is "dead" once the exporting vat is the only one left. But for Promises the …

Note that references coming from the kernel object/promise tables (e.g. in resolution data) always provide a formal reference, and might also provide a local reference if (and only if) the "local-in-use" flag is set. References coming from the run-queue always provide both a formal and a local reference. This way the formal refcount will never be lower than the local refcount. Something might be kept alive by an entry in the run-queue even though no vats have it in their c-lists (perhaps the result promise of a message sent by a vat which then dies before the message is delivered).

## Dropping Formal References

While we expect most opportunities for GC to occur locally, when importing vats have their WeakRefs die and their finalizers run, we may occasionally be able to delete formal references too. This is a consensus-visible change to the reference graph, and can thus reduce the amount of information that new (fast-sync) validators must process. It also allows the kernel to delete object/promise/clist table entries for real, saving space in the database (not just RAM in the exporting vat). Clearly we would prefer to do lots of GC in the full formal reference graph, because that provides the most savings, but in practice we expect most to be merely local.

When an importing vat dies, all its c-list entries are deleted. This removes the formal references to the krefs it used to contain, which decrements the …

We may allow the exporter of an object to "revoke/terminate/cancel" it, after which any invocation (in any vat) will throw an error instead of delivering a message through the kernel to the exporting vat.
If, in addition, we do not require/allow these revoked objects to retain their identity, then the importing vat might not need to reference the kref anymore (note that this seems unlikely: we've generally assumed that dead objects retain their identity). If that ends up being the case, then the object-revocation sequence might allow the importing vat to formally drop its import, eventually allowing the exporting vat to formally drop the export.

Another possibility is the use of alternate languages inside vats. While JavaScript doesn't have a deterministic specification for GC, other languages might (or we might choose to settle on a specific engine, with deterministic GC). Vats which execute this other language could inform the kernel that they no longer need an import with a new …

Finally, the comms vat might receive notification from the remote machine that an export is no longer needed. The comms vat would use … For the exporter side, I expect we'll have a delivery named … When a vat is referencing off-machine objects via the comms vat, and that vat dies, we should definitely take advantage of the opportunity to prune the comms tables, so …

## Replay

When the kernel process is restarted, it will need to replay any active vat from its transcript (this can be deferred until someone talks to the vat, and avoided entirely for dead vats). We want to make sure that any RAM savings we managed during the original run of the vat are also achieved in the replayed version. During replay, syscalls are disconnected. The vat worker executes all deliveries without talking to the kernel: each syscall the vat makes is compared against the transcript, but not delivered to the kernel.
As this runs (or more likely during some … At this point, at the end of replay, the kernel or vat worker should call … We might see locally-unreachable vrefs during replay that we didn't see the first time around (the vref is mentioned by … There might be a vref whose local-in-use flag is clear, but which does not appear in the … We might see a …

On the export side, we need to reconcile the exports of the replayed vat against the known-locally-referenced set in its c-list. As the transcript is played, liveslots should collect a set of exported vrefs. When replay is complete, some mechanism should iterate through this set and check the vat's c-list for each: if the local-in-use flag is False, the vat should be given a …

## Incremental GC during replay

The previous algorithm applies all GC results at the end of transcript replay, which means long-lived vats will build up some huge heap and then discard nearly all of it. It would be better to perform whatever GC we can during the replay process, rather than only at the end. For exports, this is pretty easy: the vat worker's replay loop can perhaps execute 100 transcript entries, collecting exported vrefs, then pause to probe the c-list for those vrefs, delivering … For imports, the memory pressure would come from having a very large Dead Set. We could batch transcript entries as before, and after each batch we call …

## Formal GC during replay

We can ignore …
Heap Snapshots

We currently don't want to specify our chain to use a specific JS engine, which means heap snapshots cannot be part of the consensus state. (This is a pity, because they would provide a super-efficient starting point for fast-sync). Heap snapshots will be used locally, to allow any individual validator to restart faster, but their contents won't be revealed to anyone else.

We must make sure that the recorded HostDB and heap snapshot can cooperate to return us to the same local/formal GC state as the first time through. The main feature needed, which ought to fall out of any correct snapshot mechanism, is that the LIVE/COLLECTED/FINALIZED state is retained properly. If a snapshot is taken at a point when a Presence is COLLECTED, but the finalizer hasn't run, we need to know that reloading that snapshot into a new heap will eventually call the finalizer. And if the finalizer has run before the snapshot was taken, it must not run again in the new incarnation (although I think our protocol would tolerate this particular misbehavior).
GC of imported Promises

As mentioned, I'm still thinking through how this relates to Promises, rather than objects (

In general, we hope Promises to be easier to manage than objects, because 1: most Promises eventually resolve, and 2: we aren't obligated to maintain identity for resolved Promises. As a result, when a resolution is sent across the vat/kernel boundary, we also "retire the vref" (i.e. the

This gives us more opportunities for formal GC than for objects, because we're allowed to delete the

JavaScript Promises And Their References

The ECMAScript specification on the Promise constructor (and on Promise instances) is intentionally silent on the question of which objects retain strong references to which others, because overspecification would limit engine implementors' ability to add performance optimizations. In addition,

There are basically four things of interest: the

There are two references we can assume exist based upon their importance to keep Promises working as expected:
I ran some

In this diagram, the dashed lines (

It appears that the

As we build something around this, we must be prepared for the dashed edges to be missing, because engines are not required to maintain them. For example, there's no obvious reason that the

There is no obvious way to sense when

Terminology

For objects, which have a single clear exporter (and zero or more importers), we have three nouns:

For Promises, the situation is ever so slightly murkier. Promises have a single creator, but its identity isn't very interesting. Instead, message routing depends upon the current decider, which can change over time as resolution authority is passed from a vat, into the kernel, and off to some other vat, before being consumed in an act of resolution. The abstract "promise" is represented in all vats by an actual JavaScript

A vat might create a native

Vats Which Import a Promise

When a vat imports a Promise,

Later, if/when the kernel notifies the vat of the promise being resolved,

When we change

I'm still working things through, but maybe we never put

The overall structure will look something like this:

We'll have a finalizer on the

User code might receive an imported

To this end, we need the

When the promise is resolved, we can delete the

(TODO: investigate what happens during the interval between

I'm still thinking through the exporting side, I'll write that up next.
GC of exported promises

On the export side, I think we need to add the Promise to the strong

When user code creates a Promise and sends it in a message (or in the body of resolve-to-data), the act of serializing the as-yet-unseen Promise object will trigger liveslots to call

It's likely that user code will forget about the Promise immediately after sending it, only retaining access to the

When will exported Promises be dropped?

On the importing side, we must hang on to the Promise (

So, barring some advances in the JS

As a result, although we'll probably implement the ability for vats to be told that the kernel no longer (locally) cares about their promise export, we don't expect it to be used. If implemented, this would simply delete the vref from the

Instead, we expect to retire Promises if/when they are resolved. The
Task List
A naming issue: Please use "revoked objects" rather than "cancelled objects". In most systems, "cancel" generally means "stop making it", with a focus on "asynchronously stop consuming resources". Whereas "revoke" means "make it unavailable, preferably synchronously or in order". As a result of canceling, things might get revoked, but that's just one of the actions that could happen as a result of cancelation.
how exports are dropped

I think
At some point in the future, the Finalizer will fire. The Finalizer callback should do the following:
We use the same WeakRef and Finalizer setup for everything in slotToVal/valToSlot, which means both exported Remotables and imported Presences. The only difference within the vat is that Remotables are also referenced by the

(Another option is to keep a handle for the Finalizer around, and have the delete-from-everything code use it to proactively unregister the Finalizer, rather than ignoring the finalizer when its vref is already gone, but that'd be marginally more complicated)
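A minimal sketch of that shared WeakRef/Finalizer setup (not the real liveslots code; `syscall.dropImports` and `makePresence` are stand-ins for the actual vat/kernel interface):

```javascript
// slotToVal: vref -> WeakRef(Presence or Remotable)
// valToSlot: object -> vref (WeakMap, so it cannot keep objects alive)
// The FinalizationRegistry fires some time after userspace drops the
// object; the callback ignores vrefs that were already deleted or
// re-imported (the WeakRef would have been re-populated).
function makeSlotTables(syscall, makePresence) {
  const slotToVal = new Map();
  const valToSlot = new WeakMap();
  const droppedRegistry = new FinalizationRegistry(vref => {
    const wr = slotToVal.get(vref);
    if (wr && wr.deref() === undefined) {
      slotToVal.delete(vref); // valToSlot entry is already gone
      syscall.dropImports([vref]);
    }
  });

  function convertSlotToVal(vref) {
    let val = slotToVal.get(vref) && slotToVal.get(vref).deref();
    if (val === undefined) {
      val = makePresence(vref);
      slotToVal.set(vref, new WeakRef(val));
      valToSlot.set(val, vref);
      droppedRegistry.register(val, vref);
    }
    return val;
  }
  return { slotToVal, valToSlot, convertSlotToVal };
}
```

The only vat-side difference for Remotables would be an extra strong set holding them until the kernel says otherwise, so the finalizer never fires while the export is still live.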
At yesterday's meeting, @dtribble reminded me that exported Remotables must keep the same
So vats treat Remotables and Presences the same, except that Remotables go into the

If the vat code drops the Remotable before the kernel calls

If instead the vat code retains a reference to the Remotable and re-exports it (after being told to drop the export, the vat will continue to use the same
Sufficiently Deterministic GC

We don't want to overconstrain validators by requiring them to run an exact binary image (and diversity of validators is critical for security anyway, especially against supply-chain attacks, so we very much want each validator to compile their own code from verified upstream sources). But JS is sufficiently underspecified that we need some constraint to make sure that adversarial contract code cannot cause a consensus fault. E.g. we require that all validators run a JS engine whose

Given that existing constraint, we think we can rely upon GC to behave the same across all validators, especially if we slightly weaken our requirements. The idea @dtribble and @erights and I talked through today is:
By making an explicit

We might require that finalizers don't run until the explicit
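A sketch of the explicit-GC idea (assumes the engine exposes a forced-GC hook, like the `gc()` provided by `node --expose-gc`; `drainDrops` is a hypothetical liveslots hook that hands over the vrefs whose finalizers have fired):

```javascript
// Defer acting on finalizer notifications until a forced GC at a
// deterministically-chosen point (e.g. end-of-crank), so that every
// validator observes the same set of drops at the same time.
async function endOfCrank(forcedGC, drainDrops) {
  if (typeof forcedGC === 'function') {
    forcedGC(); // full collection at a consensus-agreed point
  }
  // give pending FinalizationRegistry callbacks a chance to run
  await new Promise(resolve => setImmediate(resolve));
  return drainDrops(); // the drops accumulated by this crank
}
```

The key property is that the nondeterministic part (when the engine happens to collect) is squeezed into a window nobody can observe: drops only become visible after the forced collection.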
We landed the changes for #3106 and #2615, and we're mostly not trying to pursue a non-deterministic form. I'm going to leave this open until I can file a ticket with my notes about the remaining non-deterministic handling that we need to do, which is when we take a transcript for a vat that was run under xsnap, and replay it (for debugging) in a local worker, and its GC behaves differently.
Do you expect the GC differences to be observable to unprivileged user code? Is there any way in which having GC on is observably different from having no GC, to unprivileged user code?
(this tries to capture @dtribble 's "radical GC idea" from our 14-oct-2020 kernel meeting)
Task list:

- provide `WeakRef`/`FinalizationRegistry` to liveslots, or no-op stubs when platform does not have them
- kernel `decref` function (only logs), provide vat-specific `vatDecRef` to liveslots
- change liveslots (`slotToVal`) to retain weakrefs to Presences. Finalizers call `vatDecref` with count (excluded from replay fidelity checks). Keep strong refs to non-dropped objects. Tests: run with `node --expose-gc`, have the test program use `gc()` to provoke a drop, examine `vatDecref` calls.
- add refcounts to `c.dump()`. Testing: examine `c.dump()` counts
- `m.unserialize` increments refcounts, `vatDecref` decrements. Testing: examine `c.dump()` counts
- when refcounts reach zero, delete the corresponding `c.dump()`-visible clist entries. Testing: examine `c.dump()`
- add `dispatch.dropExport` to liveslots, delete `slotToVal` and `exports` entries. Testing: not sure
- kernel delivers `dispatch.dropExport` when deleting the exporting c-list entry

What is the Problem Being Solved?
@FUDCo estimated about 23 new kernel objects being created for each cycle of a basic "exchange" benchmark. Each of these objects consumes RAM in the liveslots object tables, as well as disk in the kernel clists.
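For concreteness, the clist bookkeeping mentioned above amounts to a pair of per-vat translation tables. This is a toy model (illustrative shapes and names, not the real kernel tables):

```javascript
// Each vat has a c-list mapping its own vat-side slots to kernel refs.
// Here vat A imports ko5 as o-4, and vat B exports it as o+6.
const clists = {
  vatA: new Map([['o-4', 'ko5']]),
  vatB: new Map([['o+6', 'ko5']]),
};

// outbound: translate a slot in a vat's syscall into kernel space
function vatSlotToKernel(vatID, vatSlot) {
  const kref = clists[vatID].get(vatSlot);
  if (kref === undefined) {
    throw new Error(`vat ${vatID} has no c-list entry for ${vatSlot}`);
  }
  return kref;
}

// inbound: translate a kernel ref into the target vat's namespace
function kernelSlotToVat(vatID, kref) {
  for (const [vatSlot, k] of clists[vatID].entries()) {
    if (k === kref) return vatSlot;
  }
  throw new Error(`vat ${vatID} has no c-list entry for ${kref}`);
}
```

Every object that crosses a vat boundary consumes one of these entries in each participating vat, which is where the per-cycle disk growth comes from.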
Many of these objects will be Purses or Payments or other "core objects", which will be moved to secondary storage (disk) via #455 and some kind of `makeExternalStore`-style constructor. However there are likely to be several which aren't so core (e.g. a Notifier), and aren't so likely to migrate: it would be a significant ergonomics burden if we asked programmers to make sure every single object they created and exported was put into an external store.

The remaining objects will cause memory pressure, perhaps fewer than 23 per cycle, but still a significant problem. We need a GC solution. Vats which import an object and then cease to reference it must signal the kernel (somehow) that the entry is no longer needed. The importing vat's c-list entry should be removed. When all clients of a kernel object (c-lists and resolution data in the kernel promise table) are gone, the exporting vat's c-list entry should be removed, and the exporting vat should be signalled that the exported object can be dropped. The exporting vat should remove the entry from its liveslots tables, allowing the original object to be dropped.
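The signal chain in the previous paragraph can be sketched as follows (hypothetical shapes, not the real kernel code):

```javascript
// Each kernel object records its exporting vat and the set of vats
// still importing it. When the last importer drops, the kref is
// retired and the exporter is queued a drop notification.
function makeKernelObjectTable() {
  const objects = new Map(); // kref -> { owner, importers: Set<vatID> }
  const exporterDrops = []; // pending "you may drop this export" deliveries

  function addExport(ownerVatID, kref) {
    objects.set(kref, { owner: ownerVatID, importers: new Set() });
  }
  function addImport(vatID, kref) {
    objects.get(kref).importers.add(vatID);
  }
  // an importing vat's c-list entry went away
  function dropImport(vatID, kref) {
    const o = objects.get(kref);
    o.importers.delete(vatID);
    if (o.importers.size === 0) {
      // only the exporter remains: retire the kref, tell the exporter
      objects.delete(kref);
      exporterDrops.push({ vatID: o.owner, kref });
    }
  }
  return { addExport, addImport, dropImport, objects, exporterDrops };
}
```

A real version would also count references from resolution data in the kernel promise table, not just c-lists.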
The big problem, which has blocked us from making much progress on this front, is determinism. JavaScript now has `WeakRef`s, which let us attach a finalizer to fire when an object is dropped, but it does not make any guarantees about when this finalizer runs. It could be immediately after the last strong reference is dropped, or a minute later, or a year later. In a chain-based swingset machine (i.e. cosmic-swingset), if the GC behavior is observable, consensus among multiple validators (with perhaps different memory pressures) could not be reached.

Previous Plan (which probably wouldn't work)
The setup: suppose vat A is a client of some object in vat B. So vat A has a Presence that wraps some imported object ID (`o-4`), vat A's c-list maps `o-4` to a kernel object `ko5`, vat B's c-list maps `ko5` to an exported `o+6`, and vat B's liveslots tables map `o+6` to e.g. an exported `Purse` object. Vat A's liveslots tables manage the import by having `slotToVal` (a strong Map) map `o-4` to the Presence, and `valToSlot` (a WeakMap) map the Presence back to `o-4`, and the initial problem is 1: we aren't using a WeakRef anywhere, and 2: the strength of `slotToVal` would prevent the Presence from being collected even once the vat drops it. On the exporting side, Vat B's liveslots table has `slotToVal` mapping `o+6` to the original callable Purse object, and `valToSlot` mapping the purse back to the string `o+6`.

Previously, I was expecting a system in which liveslots uses `WeakRef` to detect Presences becoming unreferenced: we change `slotToVal` to use values which are `new WeakRef(Presence)`, and add a finalizer/notifier (`FinalizationRegistry`) to sense when userspace has dropped it. At that point, liveslots removes the `slotToVal` entry (knowing the `valToSlot` entry is already gone), at which point it would send a `syscall.drop('o-4')` to the kernel. The kernel reacts to that by removing the vat A c-list entry for `o-4`, which then decrements the `ko5` reference count. If/when all other importing vats (and any resolved promise data) cease referencing `ko5`, such that the only remaining reference is from the exporting vat (i.e. the kernel object table records vat B as the owner of `ko5`, and the vat B c-list is the only remaining reference), then the kernel deletes `ko5` from the object table and delivers a `dispatch.drop('o+6')` to vat B. When this is delivered, the kernel deletes the vat B c-list entry, delivers the DROP, and then vat and kernel agree to "never speak of this ID again", just like they do for retired promise IDs.

If it weren't for the nondeterminism of the `FinalizationRegistry`, this would probably work. But the timing of finalization calls is far too uncertain for us to rely upon in a consensus machine. Even if we require all validators to run the same version of the same engine, the behavior will depend upon memory pressure and perhaps the past history of the engine (heap fragmentation, etc). We aren't even sure we could use this in a non-consensus machine, because simply restarting the solo node (replaying the transcript) might result in different finalizer calls on the second time through, breaking consistency with the previous transcript.

What's New
@dtribble 's question was: what if GC wasn't visible to consensus? By relaxing the definition of consensus, we would allow vat execution to vary somewhat between validators: the messages they send must still be identical, but their GC decisions don't.
In this scheme, we'd still have liveslots use a finalizer to sense the client dropping their last reference. Vat A would still notify the kernel about the DROP, but it wouldn't do it with a syscall, or at least the syscall it uses would not be part of the consensus state. Our transcripts record a list of deliveries, each with a list of syscalls made during the processing of that delivery. Vat replay means re-submitting each delivery, and comparing the syscalls made by the new vat against the ones recorded by the old one. In this case, we'd exclude `syscall.drop` from this comparison, allowing divergence between the previous run and the new one.

To be precise, we should distinguish between several different places where we compare the behavior of one execution trace against another, for different reasons:
Orthogonal Persistence
Orthogonal persistence is accomplished by starting from some initial state (possibly a #511 heap snapshot) and then replaying a transcript of previous deliveries. Each delivery includes a set of syscalls that were made in the first pass, along with the results of syscalls which include results (device reads, secondary-storage reads). We don't intend to give vats the ability to behave differently on the second pass, but to detect bugs in the surrounding code earlier, we want to detect replay divergence as soon as possible, so we currently compare the syscalls and their arguments against the transcript, and raise a vat-fatal "anachrophobia error".
For the narrow purpose of replay, the transcript really only needs to retain the deliveries, and the return values of the syscalls. When we add early divergence detection to the list of requirements, it would suffice to include a hash of the syscall arguments. In both cases, we can truncate the transcript each time we take a heap snapshot. (Note that we may record the full transcript, with all data, for debugging purposes, in particular to rebuild a vat locally under a debugger, and/or with additional logging enabled. This record is excluded from consensus, is generally write-only, and is independent of the one used for orthogonal persistence.)
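A sketch of that divergence check, with `drop`-style syscalls excluded from the fidelity comparison as proposed above (shapes are illustrative; a real implementation would compare hashes rather than serialized arguments):

```javascript
// A replay is accepted if, after removing `drop` syscalls from both
// sides, the syscall sequences match exactly. Anything else is an
// "anachrophobia" divergence and should be flagged immediately.
function stripDrops(syscalls) {
  return syscalls.filter(s => s.type !== 'drop');
}

function replayMatches(transcriptSyscalls, replaySyscalls) {
  const expected = stripDrops(transcriptSyscalls);
  const actual = stripDrops(replaySyscalls);
  return (
    expected.length === actual.length &&
    expected.every((s, i) => JSON.stringify(s) === JSON.stringify(actual[i]))
  );
}
```

Filtering on both sides is what lets the new run drop earlier, later, or not at all, without tripping the early-divergence detector.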
Validator Consensus
In a high-credibility machine (i.e. one embedded in a blockchain), the behavior of the composite machine should not depend upon the behavior of any small subset of the validators. Only decisions made identically by a suitable majority (typically 2/3rds) of the validator power should be recognized by clients as authentic behavior of the "chain" as a whole.
But not all decisions need to fall into this category. Validators do not expose all of the inner workings of the shared application. In Cosmos, only data that is stored into the Cosmos state vector needs to be identical across validator nodes (and only that data can be validated by clients: anything else a client might be told will not have the same credibility as something in the state vector).
Our current rule is that the swingset kernels (and the vats they host) can "think" whatever they want, but anything they "speak" must meet consensus. The swingset-on-cosmos kernel writes outbound messages into the cosmos state vector for delivery to other systems (either through IBC or our initial custom protocol). Different validators could conceivably have vats which behave in different ways, as long as the comms-layer externally-targeted messages they create are identical. In practice, however, if the vat internals diverge, the messages are likely to diverge as well.
Fast Validator Catch-up
(This might be known as "fast sync" or "smart sync" in some environments.)
When a new validator joins the chain, it needs to acquire a set of vats which have the same internal state as the existing validators, so they can evolve in the same way, so the validator can produce blocks that the other validators will accept.
The slow-but-steady way to achieve this is to replay the entire chain from the beginning. The new node must find someone willing to feed it with a list of all blocks, starting with the genesis, and replay the transactions in each one. This causes the application state to evolve, one transaction at a time, producing new blocks. At each step, the node checks the consensus signatures and confirms that a threshold of validators has approved the block, and that the locally-generated block matches the one approved by the existing validators. A subset of these blocks will contain messages aimed at the swingset module, and processing these messages will cause the swingset kernel+vat state to evolve, following the same path (and generating the same outbound messages) as the swingset module in those earlier validators.
If the chain was mostly CPU bound the first time through, the new validator will probably be equally CPU bound. They have the advantage of not needing to wait for new blocks to be produced, but their only chance to catch up will be through the idle gaps in the earlier block history. As a result, new validators may take an extremely long time to reach the current state and become prepared to produce new blocks of their own.
The faster way to get caught up is to receive a pre-evolved state vector from some existing validator. If the data included the entire application state, and if the new node could rely upon the correctness of that data, it could simply drop the state vector into its own (empty) state store and then pretend that it had just rebooted after being fully caught up. It would not have to replay anything, and a new validator could be brought online a few moments after receiving the state vector.
Those caveats point to two conditions that must be addressed. The first is completeness. Swingset does not keep all its state in the cosmos state vector; in fact, much of it is not kept on disk at all. The majority of the swingset state is in the RAM heap of the JavaScript engine, dispersed among all the vats. The rest is in the LMDB-based "KernelDB", on disk, which holds the c-lists and vat snapshots/transcripts. The RAM state can be regenerated, however, from the vat transcripts. So, given extra startup time, we can pretend that all swingset state is kept on disk. When catching up, we can take a faster path by starting from a vat snapshot, or by lazily loading vats (prioritizing the active ones, and deferring the idle ones). We can also parallelize vat reloads in multiple CPU threads, because each vat runs independently.
The second condition is the new validator's ability to rely upon the state vector it has received. This vector must be signed by an appropriate majority of the validators, otherwise a new validator could be tricked by a minority subset, violating the credibility properties. It is not sufficient for the other validators to merely examine and approve the output messages of the new validator: this demonstrates that the current behavior is correct, but does not protect against future divergence. In fact, if the malicious state provider successfully convinces the new validator to accept a state bundle that includes an #1691 -style "sleeper agent" (which doesn't activate until much later), and the new validator is, in turn, relied upon by other new validators to get that state bundle, a patient attacker could manage to eventually convince all validators to believe a state vector that does not match the original, and obtain arbitrary control over the entire chain.
For this reason, whatever fast-sync state vector a new validator might use must be subject to the same consensus rules as the blocks used to derive that state. In practice, this means all validators must periodically produce the vector themselves, and include its hash in their blocks. For these to be compared, all validators must produce the same vector (anything that goes into this vector is "consensus-critical").
To use this for swingset, the vat snapshot, transcripts, kernel c-lists, and kernel object/promise tables must all be identical among all validators.
Description of the Design
So, the proposal from today's meeting is:
- the `slotToVal` table for imported objects will map object IDs to WeakRefs (of Presences) rather than mapping to the Presence directly
- a `FinalizationRegistry` is used to sense when the target Presence has been GCed
- when that fires, liveslots deletes the `slotToVal` mapping and makes a `syscall.drop(objectID)` call, to inform the kernel that the entry is no longer needed
- the kernel reacts to `syscall.drop` by removing the entry from the vat's c-list, and (immediately? eventually?) performing a GC sweep through the c-lists
- when only the exporting vat still references a kernel object, the kernel delivers `dispatch.drop(objectID)` to the target vat (bypassing the run-queue, so as to not give the target vat an opportunity to emit a new reference to the about-to-be-deleted export, which would need to cancel the drop), arriving in the vat as `drop(vatObjectID)`
- liveslots reacts to `dispatch.drop` by deleting the `slotToVal` table entry, which will drop liveslots' reference to the original exported object
- the `valToSlot` WeakMap may still include a (weak) reference, if other code within the vat retains it; given careful ordering of `drop` and export events, this shouldn't cause a problem, but it might be better to somehow invalidate the old object ID and allocate a new one (which would probably involve changing `valToSlot` to point at some cell instead, retaining a strong `slotToCell` mapping, and storing an `isValid` flag in the cell next to the objectID).

Consensus Consequences
For orthogonal persistence purposes, we must not include the `drop` syscalls in the replay comparison check: a replay delivery is considered successful if it makes exactly the same non-`drop` syscalls that the transcript recorded. This means filtering out any `drop` syscall before checking the transcript.

Should transcripts include the `dispatch.drop` deliveries, such that replayed vats are told to drop exports in the same way their predecessor was told? Certainly they need to be told eventually, otherwise they'll use more RAM than their predecessor did. But we can tell them earlier or later without problems. One suggestion was to have replay watch the syscalls and build up a list of exported object IDs, then compare it against the current c-list, and immediately submit a bunch of `drop` deliveries for the missing ones (at the end of replay, before any new messages are delivered). When we move to snapshot+transcript, we might track "dropped exports" in a special set, so that the current (minimized) vat state can always be reconstructed by 1: loading the heap snapshot, 2: replaying the truncated (post-snapshot) transcript, and then 3: delivering `drop` for everything in the set. We can truncate the drop set when we make a new snapshot.

For validator consensus purposes, where we only put externally-targeted messages into the Cosmos state vector, it doesn't matter that some validators have a `drop` where others do not. As long as we aren't putting hashes of kernel (c-list) state or vat heap snapshots or `drop`-bearing transcripts into that state vector, validator behavior can safely diverge along the narrow drop/not-drop axis.

For fast-sync purposes, we have a problem. Heap snapshots where GC has happened (especially where `dispatch.drop` has been delivered) will be completely different than in vats where it has not. Likewise c-lists will be missing entries where `drop` took place. These states will differ between validators. As a result, the data we use for fast-sync cannot use heap snapshots if `drop` is in use. We could work around this by only fast-syncing full from-the-beginning transcripts (which we would have to retain for this purpose, since normally we would truncate the transcript each time we took a heap snapshot).

The varying c-lists, however, are harder to deal with. The full swingset state includes the kernel object/promise tables, the c-lists for each vat, and the snapshot+transcript for each vat. Validators can't agree on the kernel state if the c-lists will depend upon whether a given vat did GC and emitted a `drop` or not.

@dtribble pointed out one workaround would be to abandon vat-at-a-time replay and instead/additionally record a totally-ordered full-kernel transcript (basically a list of every item added to the run-queue). Replaying this list would regenerate the c-lists as well as the internal vat state, which would remove the need to publish (and agree upon) the c-list contents. The cost would be:
If we went that direction, we could have a system in which "fast-sync" uses one set of data and "reboot catch-up" uses a different set. As @erights pointed out, nodes can safely rely upon their earlier history more so than upon data provided by outsiders. So the slow processing of "fast sync" might be a necessary compromise to allow GC to proceed.
Additional Considerations
Vats should not get access to the `WeakRef` or `FinalizationRegistry` constructors, so they should not be able to sense objects being GCed. No user-level vat code should run in reaction to a `dispatch.drop`: only liveslots will have control.

How should this interact with secondary-storage "virtual objects" (#455)? Ideally, when the exporting vat receives the `drop`, it should delete the relevant entries in secondary storage (which might cascade into deleting other virtual objects, somehow). However, even if the secondary-storage syscalls are not consensus-critical, the DB entries which back them are just as consensus-relevant as the vat's c-lists and transcript. This suggests fast-sync cannot include secondary storage either, since it may be missing contents that were released by GC, pushing further into the "replay a full kernel transcript" approach.

Comms vat: when GC is distributed across swingset machines, the `drop` message becomes externally visible (and thus subject to much stricter consensus rules). This is the point where we need deterministic behavior. I fear this is a "purple box"-level problem (thesis-level complexity).

Security Considerations

We have to be very careful to not allow user-level code to observe any non-determinism exposed by the GC mechanism, through a combination of:

- vats do not get `WeakRef` or `FinalizationRegistry` in their globals (SES should filter these out of non-start Compartments because of their nondeterministic nature, just like `Math.random` and `Date.now`), so they cannot sense GC directly
- liveslots handles `drop` messages and acts upon them, but must not allow that to change the behavior that vat code can observe

The consequences of a leak would be that vat code (probably deliberately written to exploit the leak) could cause consensus failure: some validators would be slashed for behaving differently than others. If the divergence is severe enough, the chain could halt entirely if too few validators were able to agree upon new blocks.
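A stand-in sketch of the endowment filtering (the real work is done by the SES/Compartment machinery; this just shows which names are withheld from vat code):

```javascript
// Globals that would let vat code sense GC activity directly.
const GC_SENSING = ['WeakRef', 'FinalizationRegistry'];

// Build a vat global object that omits the GC-sensing constructors.
// (A real implementation would additionally tame Math.random and
// Date.now rather than passing them through.)
function makeVatGlobals(baseGlobals) {
  const vatGlobals = {};
  for (const [name, value] of Object.entries(baseGlobals)) {
    if (!GC_SENSING.includes(name)) {
      vatGlobals[name] = value;
    }
  }
  return Object.freeze(vatGlobals);
}
```

With this in place, only liveslots (which lives outside the vat's Compartment) can register finalizers, keeping the nondeterminism out of userspace's reach.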
The fast-sync state must be validated correctly. If an attacker who supplies alleged state data can get a new validator to accept malicious state, they can control the output of that validator, and eventually (with enough patience and time for the bogus data to spread) that of the entire chain.