taking XS snapshots causes kernel-observable GC differences, terminates vats #3428
I think I have a suspect:

```js
async function startXSnap(
  name,
  handleCommand,
  metered,
  snapshotHash = undefined,
) {
  if (snapStore && snapshotHash) {
    // console.log('startXSnap from', { snapshotHash });
    return snapStore.load(snapshotHash, async snapshot => {
      const xs = doXSnap({ snapshot, name, handleCommand, ...xsnapOpts });
      await xs.evaluate('null'); // ensure that spawn is done
      return xs;
    });
  }
  // ... (non-snapshot startup path elided in this excerpt)
```

That extra `evaluate('null')` happens only when the worker is started from a snapshot, so a reloaded vat does slightly different work (and allocation) than the original did.
@phoddie, FYI: XS seems to be working as designed here. We initially suspected an XS bug, but then remembered that performing GC as part of taking a snapshot is intentional.
I've restarted my testnet three times now, with that line removed, and it seems to have addressed the problem. Phew! So we have two separate issues here:
I'm going to define this ticket as dealing with the snapshot-induced GC divergence.

Next steps for this ticket are to create a test that creates a vat, runs it long enough to trigger a snapshot, runs it longer still to trigger GC, then stops the kernel and reloads. I want to see the problem happen with the extra `evaluate('null')` removed, so we know we're exercising the snapshot-GC issue on its own.

A secondary task is to figure out how to upgrade an existing chain to include this fix without a full reset. The kernel bundle is stored in the DB, so simply changing the software release to one without the extra `evaluate('null')` will not be sufficient on its own.
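A rough outline of that reproduction test, with entirely hypothetical helper names (the real test would use the kernel's existing test harness), might look like:

```js
import test from 'ava';

// Hypothetical outline only: run a vat past its snapshot point and past an
// organic GC, then restart the kernel and compare the GC syscalls it emits.
test('reloaded vat emits the same GC syscalls', async t => {
  const kernel = await startTestKernel();            // hypothetical helper
  await runDeliveriesUntilSnapshot(kernel);          // hypothetical helper
  await runDeliveriesUntilOrganicGC(kernel);         // hypothetical helper
  const before = recordGCSyscalls(kernel);           // hypothetical helper

  await kernel.shutdown();                           // hypothetical helper
  const reloaded = await restartKernelFromSameDB();  // hypothetical helper
  await replayRecordedDeliveries(reloaded);          // hypothetical helper
  t.deepEqual(recordGCSyscalls(reloaded), before);
});
```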
As @warner noted, we could wrap the returned value's `issueCommand` and watch for it to finish. It's ugly, but it seems to work:
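The snippet that followed isn't preserved above; a minimal sketch of the wrapping idea (hypothetical, not the original patch) could be:

```js
// Monkey-patch the worker returned by snapStore.load(): run a callback once
// the first command round-trip completes, instead of paying for an extra
// `evaluate('null')` on the reload path.
function watchFirstCommand(xs, onFirstCommandDone) {
  const original = xs.issueCommand.bind(xs);
  let seen = false;
  xs.issueCommand = async bytes => {
    const result = await original(bytes);
    if (!seen) {
      seen = true;
      onFirstCommandDone(); // e.g. safe to clean up the temporary snapshot file
    }
    return result;
  };
  return xs;
}
```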
* during transcript replay, record (simulated) syscalls and delivery results in the slog
* previously we only recorded deliveries
* this captures GC syscalls, to investigate divergence bugs
* it also records metering results for the delivery, for the same reason
* during normal delivery, record syscall data even if the syscall fails
* vat-fatal syscall errors, like dropping an object that was already retired, now capture the vat's attempted syscall, even if the translation into kernelspace failed, or the kernelSyscallHandler invocation failed
* previously we lost all record of these syscall attempts, making it hard to debug the reason for vat termination
* slog entries for replayed deliveries/syscalls look just like the normal ones, but now have a `replay` boolean to distinguish the two
* replayed deliveries do not have a `crankNum`, since we deliberately do not record that in the transcript, so we don't know the value during replay
* we compute a `deliveryNum` for each replayed message from the transcript position offset, which ought to match the original. This might need some refactoring to make fewer (or more) assumptions about the shape of a StreamPosition.

refs #3428 (not as a fix, but as a critical debugging tool)
Idea: Snapshot after first delivery in each block, using crankNum
I thought of this in the context of #3430, but this idea connects snapshot time more closely to consensus, rather than less.
yesterday @mhofman wrote
I just verified that results from
@mhofman and I did some extensive analysis and brain-storming, and we think we understand the situation now.

The executive summary: regular garbage collection in XS is complete/accurate but non-compacting, whereas a snapshot store/reload cycle is effectively compacting. The new process has less extra space for objects, so "small churn" allocate/release activity will need GC more frequently in a reloaded vat (less headroom) than in the original vat (whose headroom goes up to the previous high-water mark). Therefore non-forced GC timing is not only a function of user-level activity since the last GC event, but also of the history of the available headroom.

The details: XS manages memory for most fixed-size objects in a pool of fixed-size slots. Releasing an object does not immediately return its slot to the free pool; the space is only recovered by the next GC sweep. Therefore, for a given rate of "low amplitude" churn (allocate a little, release a little), how soon the next organic GC happens depends on how much headroom the pool currently has. Saving a snapshot performs a GC just before serialization. More importantly, loading a snapshot only allocates as much slot space as the live data actually needs. So the post-reload program will perform GC more frequently than the pre-snapshot program did, even though the non-GC observable program state is identical.

This dashes our hopes of treating "organic" (non-forced) GC calls as a deterministic function of program activity: it is also a function of snapshot save (easy enough to handle) and reload (not so easy). So we must treat GC as (once again) spontaneous and non-deterministic, and write our finalizer-handling code to be defensive against this. We cannot allow GC syscalls to appear just anywhere. The basic defensive approach is to define a special delivery (`dispatch.bringOutYourDead`) as the only point at which liveslots looks at its finalizers and reports dropped objects. Our current approach, in which liveslots does a GC and emits GC syscalls at the end of every delivery, is equivalent to performing that special delivery after every crank. So there is a spectrum of choices for how often the kernel asks each vat to report its dead objects.
Any point on that spectrum would result in all validators running GC at the same places in the delivery stream.

I think this explains why our #3433 change didn't help: it removed one original-vs-reloaded behavior difference (the extra `evaluate('null')` on the reload path), but it did nothing about the headroom difference described above.
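To make that defensive approach concrete, here is an illustrative sketch (not the real liveslots code; `waitUntilQuiescent` and the `deadSet` handling are assumptions for the example):

```js
// Finalizers merely record which vrefs died; nothing becomes kernel-visible
// until the dedicated delivery below.
const deadSet = new Set();
const droppedRegistry = new FinalizationRegistry(vref => deadSet.add(vref));

function registerImport(vref, presence) {
  // called when liveslots creates a Presence for an imported object
  droppedRegistry.register(presence, vref);
}

// The only point where GC becomes kernel-observable. The kernel schedules this
// delivery deterministically, so every validator reports the same drops at the
// same place in the delivery stream, regardless of when organic GC ran.
async function bringOutYourDead() {
  gc(); // forced sweep (assumes the worker exposes gc())
  await waitUntilQuiescent(); // hypothetical: let finalizer callbacks run
  const dropped = [...deadSet].sort();
  deadSet.clear();
  return dropped; // the kernel turns these into dropImports/retireImports
}
```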
@warner writes in #3428 (comment) Jun 29:
While running integration tests on current trunk (something close to 628ef14), I observed failures due to `vat-bank` being summarily executed by the kernel for an illegal syscall. @dckc and I tracked this down part-way, and we concluded that we must address the side-effects of snapshot-write operations.

In this codebase, we're running all vats (except comms) under `xs-worker-no-gc`, which uses XS as usual but disables the forced GC sweep that usually happens towards the end of each delivery. The normal `xs-worker` forces a `gc()` at the end of the delivery and then performs `syscall.dropImports`/`syscall.retireImports` for the collected objects.

Then, after the delivery has finished, the `vatWarehouse` decides whether or not to save a heap snapshot. The current policy makes a snapshot when `deliveryNum % 200 === 2`: after the 2nd delivery (to avoid replaying the expensive large `importBundle` that contracts do on their very first delivery), and then again every 200 deliveries (to keep the transcript, and the consequent replay time, bounded to a reasonable value).
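In code form, that policy amounts to something like the following (a paraphrase, not the actual `vatWarehouse` source):

```js
// Snapshot after the 2nd delivery, and every 200th delivery thereafter.
function shouldSnapshot(deliveryNum) {
  return deliveryNum % 200 === 2;
}
```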
The `xs-worker-no-gc` variant does all of that except that it omits the forced `gc()`. As a result, the only GC that happens is "organic": triggered by normal vat activity. In my tests, given the kind of load I was applying, I saw `vat-bank` performing one GC sweep for every 26 or 27 deliveries (`deliveryNum` 23, 49, 76, 101), and they yielded 2/2/2/5 objects being dropped+retired.

In my test, I ran the kernel for 14 minutes (2943 cranks, 168 blocks), during which vat-bank received 124 deliveries. According to the slogfile, organic GC happened during deliveries 1, 2, 23 (drop+retire `o-54` and `o-60`), 49 (d+r `o-62` and `o-63`), 76 (`o-65` and `o-66`), and 101 (`o-68`, `o-69`, `o-70`, `o-72`, `o-73`).

Then I stopped the process and restarted it. The vat was reloaded from the snapshot taken at the 2nd delivery, then replayed the 122 post-snapshot deliveries. Unfortunately the slogfile does not record the syscalls made during replay, nor any metering numbers like `garbageCollectionCount`. 51 minutes later (at crankNum 4792, vat-bank delivery 128), vat-bank performed GC, and one of the syscalls it made during that delivery was treated as illegal by the kernel.

The kernel then terminated `vat-bank`, which causes all sorts of knock-on effects as the other vats connected to it see their result promises rejected. My load generator stopped making progress at that point, although the kernel did not panic. This test was using a one-node chain, so there were no other nodes to disagree with, but if it were a real chain, and the different nodes had different snapshot-timing policies, I think the divergence would have eventually spread to the comms messages, and that would have caused a consensus fault, evicting some of the validators. In addition, since vat-bank spends a lot of time communicating with the cosmos-sdk Bank module, divergence in the cosmos state would probably cause a consensus failure even sooner.

In trying to recover the details of the illegal syscall, I've identified a few diagnostic improvements we should make:
* The `slog()` call which records syscall details takes both the vat's original `VatSyscallObject` and the translated `KernelSyscallObject`, but if the translation fails, the syscall is not logged at all. This loses helpful information, and should be changed to record at least the `VatSyscallObject` upon a translation error (a sketch follows this list).
* During replay, the vat's syscalls are compared against the recorded transcript (a mismatch raises an `anachrophobia` error). But since we strip GC-related syscalls before recording the original transcript, and we ignore GC-related syscalls (for comparison purposes) during replay, we aren't recording enough information to find out about any differences.
* The `replay-transcript.js` tool, which reconstructs transcripts from a slogfile or kernelDB, does not follow this pattern, and includes the GC syscalls in the generated transcript file. This confuses the do-we-match check.
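A minimal sketch of the first improvement (hypothetical function and parameter names):

```js
// Record the vat's original syscall in the slog even when translation into a
// KernelSyscallObject throws, so vat-fatal syscalls leave evidence behind.
function translateAndLogSyscall(slogger, vatID, vso, translate) {
  try {
    const kso = translate(vso);
    slogger.syscall(vatID, kso, vso);
    return kso;
  } catch (err) {
    slogger.syscall(vatID, undefined, vso); // at least keep the VatSyscallObject
    throw err; // the kernel still treats the failed syscall as vat-fatal
  }
}
```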
While using `replay-transcript.js` to try to reconstruct the timeline of the vat, I found that GC took place at different times in the (manually) reloaded vat. The organic sweeps happened on deliveries 1 (0 drops), 23 (`o-54` and `o-60`), 48 (`o-62` and `o-63`), 75 (`o-65` and `o-66`), 100 (`o-68`, `o-69`, `o-70`, `o-72`), and 128 (`o-73`, `o-75`, `o-76`, `o-77`). Note that the specific objects being dropped vary as well: `o-73` was dropped 28 deliveries later during the replay.

If the second run of the real kernel behaved like my `replay-transcript.js` run, where the recorded c-list tables (as of the last committed state, at delivery 124) treat `o-73` as already retired, then a `syscall.dropImports(o-73)` happening at delivery 128 would not match anything in the c-list, and the kernel would treat it as an illegal syscall. This seems like the most likely cause of the vat-bank termination I observed in the second kernel run.
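For illustration, the kernel-side check that makes such a syscall fatal is roughly of this shape (simplified; hypothetical names):

```js
// A dropImports of a vref with no c-list entry (e.g. one the committed state
// already retired) cannot be translated into a kref, so the kernel treats the
// whole syscall as illegal and terminates the vat.
function translateDropImports(getCListEntry, vatID, vrefs) {
  return vrefs.map(vref => {
    const kref = getCListEntry(vatID, vref);
    if (kref === undefined) {
      throw Error(`syscall.dropImports of unknown vref ${vref}`); // vat-fatal
    }
    return kref;
  });
}
```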
XS snapshot writes are not side-effect free

XS very specifically performs a full GC sweep just before writing out a heap snapshot, probably to simplify and minimize the snapshot data (the reachability flag doesn't need to be recorded). As a result, when `vatWarehouse` decides to record a vat snapshot, it inadvertently provokes a full GC sweep. This "resets the counter" of garbage objects, which, as a side-effect, will forestall any organic GC sweeps that follow.

@dckc and I wrote a test that compares three vats, each doing an operation that creates and immediately drops a few hundred objects in a loop, and watches the metering stats to learn when organic GC sweeps happen. We observed that vats B and C ran GC at the same time, while vat A ran it earlier. This is consistent with vat B seeing a forced GC at the moment of the snapshot write (deferring the need for GC for a while), and vat C starting from the post-GC snapshotted state. And it is consistent with my `replay-transcript.js` run (which does not read or write any snapshots) experiencing GC earlier than the real kernel runs: `replay-transcript.js` is behaving like vat A, my first kernel run should behave like vat B, and my second kernel run should have behaved like vat C.

It doesn't yet explain the divergence between my first and second real kernel runs. Both should have seen the same forced GC at snapshot creation time, so I don't know why they apparently diverged later.
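For reference, the churn each test vat performed was of this general shape (illustrative only):

```js
// Create and immediately drop a few hundred small objects per operation, so
// only the timing of organic GC distinguishes vats A, B, and C.
function churn(count = 300) {
  for (let i = 0; i < count; i += 1) {
    const junk = { index: i, payload: `garbage-${i}` };
    void junk; // unreachable as soon as the iteration ends
  }
}
```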
vatWarehouse policy should be outside of consensus
escalated to #3430
Our intention with the vatWarehouse was that it ought to be able to take snapshots whenever it wants, without being visible to the consensus state. Two different validators should be able to use different policies (one taking snapshots quickly, the other more leisurely), and their vats should nevertheless behave identically. The fact that snapshot writes trigger GC interferes with this goal.
We can overcome this (and we must, since GC-before-snapshot is obviously the right thing to do) by doing more GC. If the consensus state of the chain would have done a GC between deliveries anyways, then the snapshot's GC won't have anything left to collect, and it becomes idempotent with the consensus-driven action.
Before we switched from `xs-worker` to `xs-worker-no-gc`, this is almost what we were doing. In `xs-worker`, liveslots does GC after userspace loses agency, but then proceeds to do more JS work (processing finalized vrefs, making syscalls), which builds up more garbage. The code is designed to perform more GC sweeps until they stop yielding freed userspace objects (this loop is not activated, because I don't think we need it yet, but future data structures might). When liveslots gives up control to the kernel and ends the delivery, there may still be garbage, but the quantity will be less than if userspace had been allowed to run after the liveslots `gc()`.

Then, if a snapshot is taken, the remaining garbage count drops to zero. The difference between "some" (whatever liveslots leaves around) and "zero" then determines the timing of the next organic GC operation. If enough garbage is left around, and the next delivery is busy enough, an organic sweep might happen before the end of that delivery. But it won't matter, because liveslots won't do anything with the finalizers that run until after it has done its own `gc()`.

Basically, by doing a GC at the end of each crank, and not acting upon finalizers any earlier than that, we can tolerate any extra (organic) sweeps. But if we aren't forcing that GC, we're dependent upon the organic sweeps, and that makes us vulnerable to their timing, which is influenced by the GC provoked by snapshot writes, thus making snapshot-write timing part of consensus.
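A sketch of that repeat-until-quiet pattern (hypothetical helpers; the real liveslots code differs):

```js
// Keep forcing GC sweeps until a sweep frees no more userspace objects, and
// only then end the delivery; organic sweeps that fire earlier are harmless
// because finalizers are not acted upon before this point.
async function finishCrank() {
  for (;;) {
    gc(); // forced sweep (assumes the worker exposes gc())
    await waitUntilQuiescent(); // hypothetical: allow finalizer callbacks to run
    const newlyDead = [...deadSet];
    deadSet.clear();
    if (newlyDead.length === 0) {
      break; // nothing more was freed
    }
    emitGCSyscalls(newlyDead); // syscall.dropImports / syscall.retireImports
  }
}
```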
The solutions we've thought of:

* Define a special delivery, `dispatch.bringOutYourDead`, which would return the list of dropped objects, instead of using syscalls. This approach might be more efficient (liveslots ignores the finalizers and `deadSet` until this special crank), although the comms vat still wants spontaneous GC syscalls.
* Process the dropped objects (and the `gcActions` they enqueue) immediately, during that cleanup period at the end of the block, because telling one vat that its export has been dropped might cause it to drop more objects. The worst-case cascade would be some sort of distributed linked list. We probably need to bound the amount of cleanup work we do, because the whole reason we're at end-of-block is that we've exhausted our budget for doing work (see the sketch after this list). We can make some guesses about how much time we'll need for GC, but in general I think we must be prepared to leave some amount of GC work unfinished, and pick it up again during the next block. (All `gcActions` are kept in a durable queue for exactly this reason.)
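As a sketch of that bounded end-of-block cleanup (hypothetical queue API):

```js
// Drain at most `budget` gcActions now; whatever remains stays in the durable
// queue and is picked up again during the next block.
function processSomeGCActions(gcActionQueue, deliverGCAction, budget = 50) {
  let used = 0;
  while (used < budget && gcActionQueue.length > 0) {
    const action = gcActionQueue.shift(); // e.g. { type: 'dropExports', vatID, krefs }
    deliverGCAction(action); // may enqueue further gcActions (the cascade)
    used += 1;
  }
  return gcActionQueue.length; // work carried over to the next block
}
```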
Consequences for the current chain

I think it's possible that we'll experience validators dropping out of consensus as a result of this problem, if what I observed between my first kernel run and my second cannot be explained by something else. I'm looking for an alternate explanation, because my conclusion here is that both runs should have seen the same forced `gc()` during the snapshot write.

My next step will be to improve the slogging to capture syscalls and `garbageCollectionCount` metrics during replay. If I can reproduce the vat-termination event with that logging enabled, I should be able to prove or disprove the theory that GC is happening at different times during replay. If it is different, my next step will be to make a detailed record (using `$XSNAP_TEST_RECORD`) of the low-level messages sent to the XS process, and make sure they are identical for both the original and the reloaded vat: I'd be looking for some extra counter or timestamp or default configuration option that's different between the two, which might cause the supervisor to execute some additional code which doesn't have any semantic effect but causes allocation, which might modify the organic GC timing. If all the post-snapshot messages appear identical, I'll have to dig deeper.