Liveslots should not meter GC-sensitive code paths #3458
In today's kernel meeting, @dtribble pointed out that our current approach (forcing GC at the end of every crank, but not having snapshot-insensitive GC timing) is insufficient. A Representative that is reachable but untouched during the early parts of a crank, then dropped (e.g. an offer that is dropped in favor of a better one), then resurrected by deserialization later in the crank, might or might not have been collected between the drop and the resurrection. We identified part of the ECMAScript spec (maybe related to tc39/proposal-weakrefs#39 (comment)) that mandates the answer.

We decided that it was safe to put virtual object deserialization (and thus kind-constructor invocation) inside the "unmetering box" because we're not yet allowing adversarial contract code, so we're ok with the risk of an infinite loop in the kind constructor causing the vat to cease progress (or stall the entire kernel). The AST-based parser idea in #3462 (comment) could also protect against this: by denying any unusual code from running during the kind constructor, we'd also effectively bound its meter usage.
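To make the drop-then-resurrect hazard concrete, here is a minimal sketch of the GC-sensitive branch inside a `convertSlotToVal`-style lookup. Names follow the concepts above (`slotToVal`, Representatives), but `makeConverter`, `makeRepresentative`, and the resurrection counter are illustrative, not the actual liveslots implementation. Whether the "resurrect" path (and its kind-constructor work) runs depends on whether organic GC has already cleared the WeakRef, which is exactly the nondeterminism that must stay out of the meter.

```javascript
// Illustrative sketch: the branch taken below depends on GC timing, so any
// metering of this code would diverge between validators with different
// GC histories.
function makeConverter(slotToVal, makeRepresentative) {
  let resurrections = 0; // work done here varies with when GC happened

  function convertSlotToVal(vref) {
    const wr = slotToVal.get(vref);
    if (wr) {
      const val = wr.deref(); // GC-sensitive: may or may not be undefined
      if (val !== undefined) {
        return val; // still alive: cheap path, no constructor runs
      }
    }
    // Collected (or never seen): re-create the Representative, which
    // invokes the kind constructor -- potentially unbounded user code.
    resurrections += 1;
    const val = makeRepresentative(vref);
    slotToVal.set(vref, new WeakRef(val));
    return val;
  }

  return { convertSlotToVal, getResurrections: () => resurrections };
}
```

Holding a strong reference to the returned value keeps subsequent lookups on the cheap path; only after a collection does the expensive path run again.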
@erights also floated the idea of effectively disabling organic GC by changing XS to make all WeakRefs behave like strong references except during forced GC. This would allow organic GC of normal objects within the vat, but not of any vref-based objects (Remotable, Presence, Representative). The cost would be proportional to the number of live vref-based objects in a vat, times the frequency with which we force GC (and thus have to re-populate the strong references).
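That idea can be approximated in userland without an engine change. The sketch below is an assumption-laden illustration (the names `strongBackstop`, `exportObject`, and `forcedGcBoundary` are invented for this example): keep a parallel strong reference for every vref-based object so organic GC can never collect it, and drop/re-populate those strong references only around a forced GC. The re-population loop is the per-forced-GC cost mentioned above.

```javascript
// Hypothetical sketch: defeat organic GC of vref-based objects by holding
// companion strong references, released only around an explicit forced GC.
const slotToVal = new Map(); // vref -> WeakRef, as in liveslots
const strongBackstop = new Map(); // vref -> object; blocks organic GC

function exportObject(vref, obj) {
  slotToVal.set(vref, new WeakRef(obj));
  strongBackstop.set(vref, obj); // object cannot be organically collected
}

function forcedGcBoundary(forceGc) {
  strongBackstop.clear(); // drop strong refs so forced GC can collect
  forceGc();
  // Re-populate: cost proportional to the number of surviving objects.
  for (const [vref, wr] of slotToVal) {
    const obj = wr.deref();
    if (obj !== undefined) {
      strongBackstop.set(vref, obj);
    } else {
      slotToVal.delete(vref); // collected by the forced GC
    }
  }
}
```

Between forced-GC boundaries, every WeakRef in `slotToVal` behaves like a strong reference, so `deref()` results cannot vary with organic GC timing.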
When a swingset kernel is part of a consensus machine, the visible state must be a deterministic function of userspace activity. Every member kernel must perform the same set of operations. However, we are not yet confident that the timing of garbage collection will remain identical between kernels that experience different patterns of snapshot+restart.

In particular, up until recently, the amount of "headroom" in the XS memory allocator was reset upon snapshot reload: the new XS engine only allocates as much RAM as the snapshot needs, whereas before the snapshot was taken, the RAM footprint could have been larger (e.g. if a large number of objects were allocated and then released), leading to more "headroom". Automatic GC is triggered by an attempt to allocate space which cannot be satisfied by this headroom, so it will happen more frequently in the post-reload engine than before the snapshot.

To accommodate differences in GC timing between kernels that are otherwise operating in consensus, this commit introduces the "unmetered box": a span of execution that does not count against the meter. We take all of the liveslots operations that might be sensitive to the engine's GC behavior and put them "in" the box. This includes any code that calls `deref()` on the WeakRefs used in `slotToVal`, because the WeakRef can change state from "live" to "dead" when GC is provoked, which is sensitive to more than just userspace behavior.

closes #3458
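The "unmetered box" pattern itself is small. Here is a minimal sketch, assuming the supervisor exposes a pair of suspend/resume hooks (the names `suspendMeter`/`resumeMeter` and `makeMeterControl` are illustrative; the real XS supervisor API may differ): the essential property is that metering resumes on every exit path, including throws.

```javascript
// Minimal sketch of the "unmetered box": run GC-sensitive work with
// metering suspended, restoring metering even if the work throws.
function makeMeterControl(suspendMeter, resumeMeter) {
  function runWithoutMetering(thunk) {
    suspendMeter();
    try {
      return thunk();
    } finally {
      resumeMeter(); // metering must resume on all exit paths
    }
  }

  // Wrap a function so every invocation happens inside the box.
  function unmetered(fn) {
    return (...args) => runWithoutMetering(() => fn(...args));
  }

  return { runWithoutMetering, unmetered };
}
```

The `unmetered` wrapper is convenient for things like finalizer callbacks, which are invoked by the engine rather than called directly.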
The `meterControl` tool will give liveslots a way to temporarily disable metering, so it can perform GC-sensitive operations without affecting the vat's metering results. Each supervisor provides its own version: all are dummy facades except for XS. refs #3458
We're trying to hedge against XS not necessarily performing "organic" (non-forced) GC at exactly the same time across all members (validators) of a consensus machine, especially when some members have recently reloaded a vat from snapshot (i.e. restarted) and others have not. We rely upon finalizer callbacks only running at explicitly-deterministic times, but we must still guard against objects becoming collected spontaneously, which will cause a WeakRef to become "dead" (`wr.deref() === undefined`) at a random point in the middle of a turn. Any code which calls `wr.deref`, or is conditionally executed/skipped according to the results, is "GC-sensitive". This includes `convertSlotToVal`, and therefore `m.unserialize`.

We cannot allow metering results to diverge between validators, because:

* it might make the difference between the crank completing successfully and the vat being terminated for a per-crank metering fault
* it will change the large-scale Meter value, which is reported to userspace
* it might cause the runPolicy to finish the block earlier on one validator than on others

all of which would cause a consensus failure.

To prevent this, we run most of the "inbound" side of liveslots without metering. This includes the first turn of all `dispatch.*` methods, which runs entirely within liveslots:

* `dispatch.deliver` performs argument deserialization in the first turn, then executes user code in the second and subsequent turns
* `dispatch.notify` does the same
* the GC deliveries (`dispatch.dropExport`, etc) only use one turn

We also disable metering when deserializing the return value from a (synchronous) device call, and when retiring a promise ID (which touches `slotToVal`). Finally, we disable metering for all turns of the post-crank GC `finish()` call. This excludes all invocations of the finalizer callbacks, as well as all the `processDeadSet` code, which is highly sensitive to the results.

closes #3458
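The first-turn-unmetered shape of `dispatch.deliver` can be sketched as follows. This is an assumed simplification, not the real liveslots dispatch: `meterControl`, `m.unserialize`, and the capdata shape are stand-ins, and the real code has far more machinery. The point is the turn boundary: GC-sensitive deserialization happens inside the box in the first turn, and user code runs metered in a later turn.

```javascript
// Illustrative sketch of the metering boundary in a delivery: first turn
// (deserialization, which may call wr.deref() via convertSlotToVal) is
// unmetered; user code runs metered after the await.
async function dispatchDeliver(meterControl, m, target, methargsCapdata) {
  // First turn: GC-sensitive argument deserialization, inside the box.
  const [method, args] = meterControl.runWithoutMetering(() =>
    m.unserialize(methargsCapdata),
  );
  // Turn boundary: metering is back on for user code.
  await Promise.resolve();
  return target[method](...args);
}
```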
What is the Problem Being Solved?
As described in #3457, we want metering to be stable and consistent across validators, while allowing "organic" GC timing to vary. To achieve this, we need liveslots to put all of its code that might interact with GC (`processDeadSet`, the `gcAndFinalize` call that allows FinalizationRegistry callbacks to run) inside an "unmetered box", by calling some disable-metering function upon entry, and a re-enable-metering function upon exit.

For this to work, we need the engine to never invoke finalizer callbacks on its own. I think XS (and `xsnap.c` in particular) gives us this property, but we need to check.

We aren't doing metering on Node.js, so it doesn't matter, but for completeness I'll mention that finalizers only appear to run when the IO queue is allowed to run, so I think they wouldn't run until we do a `setImmediate`. Unfortunately, we must initiate a `setImmediate` as the crank begins (to sense when userspace has given up agency), so it's possible that finalizers will get to run before we have a chance to enter the box.

I'm adding this to the Testnet Metering Phase milestone, although I suspect we'd be ok leaving it out, and instead making GC timing deterministic, at least for now.
Description of the Design
We'll use the callbacks provided by #3457 to wrap the following liveslots code in an enter/exit pair:

* `m.deserialize()` (or maybe just `convertSlotToVal`, where it queries `slotToVal` and the WeakRef therein) during:
  * `dispatch.deliver` argument deserialization
  * `dispatch.notify` resolution value deserialization
* `finish()`, which encloses `gcAndFinalize` (where finalizers might run) and `processDeadSet` (where the dead set is iterated)

Security Considerations
Test Plan