
taking XS snapshots causes kernel-observable GC differences, terminates vats #3428

Closed
warner opened this issue Jun 29, 2021 · 9 comments · Fixed by #3433
Labels: bug (Something isn't working), SwingSet (package: SwingSet), xsnap (the XS execution tool)


warner commented Jun 29, 2021

While running integration tests on current trunk (something close to 628ef14), I observed failures due to vat-bank being summarily executed by the kernel for an illegal syscall. @dckc and I tracked this down part-way, and we concluded that we must address the side-effects of snapshot-write operations.

In this codebase, we're running all vats (except comms) under xs-worker-no-gc, which uses XS as usual but disables the forced GC sweep that normally happens towards the end of each delivery. The normal xs-worker does the following (sketched in code after this list):

  • delivery starts
  • userspace runs until the promise queue drains
  • liveslots forces gc()
  • liveslots gives finalizers a chance to run
    • if any finalizers ran, liveslots emits syscall.dropImports/syscall.retireImports for the collected objects
  • delivery finishes
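
A minimal sketch of that sequence (names like forceGC and deadSet are illustrative, not the real liveslots API):

  // Sketch only: the real liveslots code differs in detail. forceGC stands in
  // for the engine's gc() hook, and deadSet holds vrefs whose
  // FinalizationRegistry callbacks have fired.
  async function endOfCrank(syscall, forceGC, deadSet) {
    // userspace has finished; the promise queue is drained
    forceGC(); // the forced gc() that xs-worker-no-gc omits
    await new Promise(resolve => setTimeout(resolve, 0)); // let finalizers run
    if (deadSet.size > 0) {
      const vrefs = [...deadSet].sort(); // deterministic order
      deadSet.clear();
      syscall.dropImports(vrefs);
      syscall.retireImports(vrefs);
    }
    // the delivery now finishes and control returns to the kernel
  }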

Then, after the delivery has finished, the vatWarehouse decides whether or not to save a heap snapshot. The current policy makes a snapshot when deliveryNum % 200 === 2: after the 2nd delivery (to avoid replaying the expensive large importBundle that contracts do on their very first delivery), and then again every 200 deliveries (to keep the transcript, and consequent replay time, bounded to a reasonable value).
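
Expressed as a predicate, that policy is just the following (illustrative sketch; the real vatWarehouse code is organized differently):

  // Sketch of the current snapshot policy: snapshot after the 2nd delivery,
  // then again every 200 deliveries.
  function shouldSnapshot(deliveryNum) {
    return deliveryNum % 200 === 2;
  }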

The xs-worker-no-gc variant does all of that except it omits the forced gc(). As a result, the only GC that happens is "organic": triggered by normal vat activity. In my tests, given the kind of load I was applying, I saw vat-bank performing one GC sweep every 26 or 27 deliveries (at deliveryNums 23, 49, 76, and 101), which yielded 2/2/2/5 objects being dropped+retired.

In my test, I ran the kernel for 14 minutes (2943 cranks, 168 blocks), during which vat-bank received 124 deliveries. According to the slogfile, organic GC happened during deliveries 1, 2, 23 (drop+retire o-54 and o-60), 49 (d+r o-62 and o-63), 76 (o-65 and o-66), and 101 (o-68, o-69, o-70, o-72, o-73).

Then I stopped the process and restarted it. The vat was reloaded from the snapshot taken at the 2nd delivery, then replayed the 122 post-snapshot deliveries. Unfortunately the slogfile does not record the syscalls made during replay, nor any metering numbers like garbageCollectionCount. 51 minutes later (at crankNum 4792, vat-bank delivery 128), vat-bank performed GC, and one of the syscalls it made during that delivery was treated as illegal by the kernel:

{
  "time": 1624914008.3725057,
  "type": "deliver-result",
  "crankNum": 4792,
  "vatID": "v1",
  "deliveryNum": 128,
  "dr": [
    "error",
    "syscall.dropImports failed, prepare to die: syscall translation error: prepare to die",
    {
      "meterType": "xs-meter-8",
      "compute": 89906,
      "allocate": 42074144,
      "allocateChunksCalls": 1,
      "allocateSlotsCalls": 2,
      "garbageCollectionCount": 5,
      "mapSetAddCount": 4193,
      "mapSetRemoveCount": 149,
      "maxBucketSize": 3
    }
  ]
}

The kernel then terminated vat-bank, which causes all sorts of knock-on effects as the other vats connected to it see their result promises rejected. My load generator stopped making progress at that point, although the kernel did not panic. This test was using a one-node chain, so there were no other nodes to disagree with, but if it were a real chain, and the different nodes had different snapshot-timing policies, I think the divergence would have eventually spread to the comms messages, and that would have caused a consensus fault, evicting some of the validators. In addition, since vat-bank spends a lot of time communicating with the cosmos-sdk Bank module, divergence in the cosmos state would probably cause a consensus failure even sooner.

In trying to recover the details of the illegal syscall, I've identified a few diagnostic improvements we should make:

  • The slog() call which records syscall details takes both the vat's original VatSyscallObject and the translated KernelSyscallObject, but if translation fails, the syscall is not logged at all. This loses helpful information; we should record at least the VatSyscallObject upon a translation error (see the sketch after this list).
  • We record GC-related syscalls in the slog (when they succeed), which is great. We don't record them in the transcript, because we're trying to be somewhat tolerant of variations in their behavior, which is basically ok.
  • We don't record any syscalls during replay, because they're supposed to be the same as in the original transcript (and if they diverge, we signal an anachrophobia error). But since we strip GC-related syscalls before recording the original transcript, and we ignore GC-related syscalls (for comparison purposes) during replay, we aren't recording enough information to find out about any differences.
    • We should record all syscalls during replay in the slogfile, ideally in the same format as the first time around, maybe with a flag to remind ourselves that we're in a replay and the syscalls were simulated (not actually delivered to the kernel).
    • The "do we match" syscall comparison code assumes that all GC-related syscalls were stripped from the transcript. This is generally the case for the normal kernel, but my standalone replay-transcript.js tool, which reconstructs transcripts from a slogfile or kernelDB, does not follow this pattern, and includes the GC syscalls in the generated transcript file. This confuses the do-we-match check.
  • We should record delivery results during replay, specifically the metering results.
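
A sketch of that first improvement, with hypothetical translate and slogger helpers: record the vat-side syscall before attempting translation, so a failed translation still leaves a trace.

  // Illustrative only: log the VatSyscallObject even when translation into
  // kernelspace fails, so a vat-fatal syscall still shows up in the slog.
  function handleVatSyscall(vatID, vso, { translate, slogger }) {
    let kso;
    try {
      kso = translate(vso); // VatSyscallObject -> KernelSyscallObject
    } catch (err) {
      slogger.syscall(vatID, vso, undefined); // keep the vat-side record
      throw err; // the kernel still treats the syscall as vat-fatal
    }
    slogger.syscall(vatID, vso, kso);
    return kso;
  }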

While using replay-transcript.js to try to reconstruct the timeline of the vat, I found that GC took place at different times in the (manually) reloaded vat. The organic sweeps happened on deliveries 1 (0 drops), 23 (o-54 and o-60), 48 (o-62 and o-63), 75 (o-65 and o-66), 100 (o-68 o-69 o-70 o-72), and 128 (o-73 o-75 o-76 o-77). Note that the specific objects being dropped vary as well: o-73 was dropped 28 deliveries later during the replay.

If the second run of the real kernel behaved like my replay-transcript.js run, where the recorded c-list tables (as of the last committed state, at delivery 124) treat o-73 as already retired, then a syscall.dropImports(o-73) happening at delivery 128 would not match anything in the c-list, and the kernel would treat it as an illegal syscall. This seems like the most likely cause of the vat-bank termination I observed in the second kernel run.

XS snapshot writes are not side-effect free

XS very specifically performs a full GC sweep just before writing out a heap snapshot, probably to simplify and minimize the snapshot data (the reachability flag doesn't need to be recorded). As a result, when vatWarehouse decides to record a vat snapshot, it inadvertently provokes a full GC sweep. This "resets the counter" of garbage objects which, as a side effect, forestalls the organic GC sweeps that would otherwise follow.

@dckc and I wrote a test that compares three vats, each doing an operation that creates and immediately drops a few hundred objects in a loop, and watches the metering stats to learn when organic GC sweeps happen.

  • vat A just runs the loop
  • vat B runs the loop a few times, then writes out a snapshot, then resumes running the loop
  • vat C is loaded from that snapshot, then resumes running the loop

We observed that vats B and C ran GC at the same time, while vat A ran it earlier. This is consistent with vat B seeing a forced GC at the moment of the snapshot write (deferring the need for GC for a while), and vat C starting from the post-GC snapshotted state. And it is consistent with my replay-transcript.js run (which does not read or write any snapshots) experiencing GC earlier than the real kernel runs: replay-transcript.js behaves like vat A, my first kernel run should behave like vat B, and my second kernel run should have behaved like vat C.
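
Roughly, the test does something like the following (the helper names here are hypothetical; the actual test drives xsnap workers and inspects their metering output):

  // Illustrative churn: create and immediately drop a few hundred objects per
  // delivery, then watch garbageCollectionCount in the metering results to
  // see when an organic sweep happened.
  function churn(n = 300) {
    for (let i = 0; i < n; i += 1) {
      let x = { i };
      x = null; // the slot is not reusable until the next GC sweep
    }
  }

  // hypothetical helper: deliver() runs churn in the vat and returns the
  // metering record for that delivery
  async function findOrganicGC(deliver, deliveries) {
    let lastCount = 0;
    for (let d = 0; d < deliveries; d += 1) {
      const meter = await deliver(churn);
      if (meter.garbageCollectionCount > lastCount) {
        console.log(`organic GC during delivery ${d}`);
        lastCount = meter.garbageCollectionCount;
      }
    }
  }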

It doesn't yet explain the divergence between my first and second real kernel runs. Both should have seen the same forced GC at snapshot creation time, so I don't know why they apparently diverged later.

vatWarehouse policy should be outside of consensus

escalated to #3430

Our intention with the vatWarehouse was that it should be able to take snapshots whenever it wants, without that being visible in the consensus state. Two different validators should be able to use different policies (one taking snapshots quickly, the other more leisurely), and their vats should nevertheless behave identically. The fact that snapshot writes trigger GC interferes with this goal.

We can overcome this (and we must, since GC-before-snapshot is obviously the right thing to do) by doing more GC. If the consensus state of the chain would have done a GC between deliveries anyway, then the snapshot's GC won't have anything left to collect, and it becomes a no-op relative to the consensus-driven collection.

Before we switched from xs-worker to xs-worker-no-gc, this is almost what we were doing. In xs-worker, liveslots does GC after userspace loses agency, but then proceeds to do more JS work (processing finalized vrefs, making syscalls), which builds up more garbage. The code is designed to perform more GC sweeps until they stop yielding freed userspace objects (this loop is not activated, because I don't think we need it yet, but future data structures might). When liveslots gives up control to the kernel and ends the delivery, there may still be garbage, but the quantity will be less than if userspace had been allowed to run after the liveslots gc().
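
A sketch of that "repeat until quiescent" loop (illustrative names, reusing the forceGC/deadSet conventions from the earlier sketch):

  // Keep sweeping until a GC pass stops freeing userspace objects.
  async function gcUntilQuiescent(forceGC, deadSet) {
    let previousSize;
    do {
      previousSize = deadSet.size;
      forceGC();
      await new Promise(resolve => setTimeout(resolve, 0)); // run finalizers
    } while (deadSet.size > previousSize); // loop while new drops keep appearing
  }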

Then, if a snapshot is taken, the remaining garbage count drops to zero. The difference between "some" (whatever liveslots leaves around) and "zero" then determines the timing of the next organic GC operation. If enough garbage is left around, and the next delivery is busy enough, an organic sweep might happen before the end of that delivery. But it won't matter, because liveslots won't do anything with the finalizers that run until after it's done its own gc().

Basically, by doing a GC at the end of each crank, and not acting upon finalizers any earlier than that, we can tolerate any extra (organic) sweeps. But if we aren't forcing that GC, we're dependent upon the organic sweeps, and that makes us vulnerable to their timing, which is influenced by the GC provoked by snapshot writes, thus making snapshot write timing part of consensus.

The solutions we've thought of:

  • go back to forcing a GC in liveslots at the end of each crank
  • don't do any extra GC, and include the snapshot writing schedule (vat warehouse policy) as part of consensus
  • force GC on a consensus schedule, and only allow snapshot writes to happen just after these forced GC
    • e.g. do a GC after every 10 deliveries, and then give the vat warehouse policy the freedom to snapshot every 10th or 20th or 200th delivery (see the sketch after this list)
    • or (@dtribble will probably prefer this one), do GC on a block boundary: the kernel would accumulate the vatIDs which have had deliveries, then the host application would call some special kernel API that means "this is the end of a block, it's a great time to do cleanup and stuff before we commit a lot of state". The kernel would react by making a special delivery to all the "dirty" vats, which would force a GC sweep and emit drop/retire syscalls. After doing this, the vat warehouse has the option of snapshotting any of the dirty vats.
    • the original non-deterministic GC design (#1872) called for a dispatch.bringOutYourDead, which would return the list of dropped objects instead of using syscalls. This approach might be more efficient (liveslots ignores the finalizers and deadSet until this special crank), although the comms vat still wants spontaneous GC syscalls.
    • it's not clear whether we'd want to act upon those GC syscalls (and the gcActions they enqueue) immediately, during that cleanup period at the end of the block, because telling one vat that its export has been dropped might cause it to drop more objects. The worst case cascade would be some sort of distributed linked list. We probably need to bound the amount of cleanup work we do, because the whole reason we're at end-of-block is because we've exhausted our budget for doing work. We can make some guesses about how much time we'll need for GC, but in general I think we must be prepared to leave some amount of GC work unfinished, and pick it up again during the next block. (All gcActions are kept in a durable queue for exactly this reason).
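
A sketch of the third option (force GC on a consensus schedule, snapshot only just after a forced GC), with a hypothetical GC_INTERVAL consensus parameter and a per-validator snapshot interval:

  // Illustrative only: GC_INTERVAL is part of consensus; each validator picks
  // its own snapshot interval, constrained to land on forced-GC deliveries.
  const GC_INTERVAL = 10;

  function isForcedGCDelivery(deliveryNum) {
    return deliveryNum % GC_INTERVAL === 0;
  }

  function maySnapshotNow(deliveryNum, localSnapshotInterval) {
    // localSnapshotInterval should be a multiple of GC_INTERVAL: 10, 20, 200, ...
    return (
      isForcedGCDelivery(deliveryNum) &&
      deliveryNum % localSnapshotInterval === 0
    );
  }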

Consequences for the current chain

I think it's possible that we'll experience validators dropping out of consensus as a result of this problem, if what I observed between my first and second kernel runs cannot be explained by something else. I'm looking for an alternate explanation, because my conclusion so far is that both runs should have seen the same forced gc() during the snapshot write.

My next step will be to improve the slogging to capture syscalls and garbageCollectionCount metrics during replay. If I can reproduce the vat-termination event with that logging enabled, I should be able to prove or disprove the theory that GC is happening at different times during replay. If different, my next step will be to make a detailed record (using $XSNAP_TEST_RECORD) of the low-level messages sent to the XS process, and make sure they are identical for both the original and the reloaded vat: I'd be looking for some extra counter or timestamp or default configuration option that's different between the two, which might cause the supervisor to execute some additional code, which doesn't have any semantic effect, but causes allocation, which might modify the organic GC timing. If all the post-snapshot messages appear identical, I'll have to dig deeper.

warner added the bug, SwingSet, and xsnap labels on Jun 29, 2021

warner commented Jun 29, 2021

I think I have a suspect:

  async function startXSnap(
    name,
    handleCommand,
    metered,
    snapshotHash = undefined,
  ) {
    if (snapStore && snapshotHash) {
      // console.log('startXSnap from', { snapshotHash });
      return snapStore.load(snapshotHash, async snapshot => {
        const xs = doXSnap({ snapshot, name, handleCommand, ...xsnapOpts });
        await xs.evaluate('null'); // ensure that spawn is done
        return xs;
      });
    }

That extra await xs.evaluate('null') will happen in the second pass, but not the first. If that call allocates more objects, it will shift the timing of subsequent GC operations, which might cause the divergence.


dckc commented Jun 29, 2021

@phoddie , FYI: XS seems to be working as-designed here. We initially suspected an XS bug but then remembered that GC as part of taking a snapshot was as-designed.


warner commented Jun 29, 2021

I've restarted my testnet three times now, with that line removed, and it seems to have addressed the problem. Phew!

So we have two separate issues here:

  • XS snapshot writes cause (an appropriate) GC, which limits our flexibility around consensus-visibility of snapshot scheduling
  • The swingset snapshot reload process was adding work (and therefore garbage creation) that wasn't present on the original run, changing the timing of organic GC, causing it to occur slightly earlier in the replay
    • The divergence was not detected during replay (no "anachrophobia error") because we strip GC syscalls from both the transcript and the replay comparison.
    • This was intended to tolerate slight differences in GC behavior, based on design work from #1872 (non-deterministic GC), where we'd ignore drops for objects that were no longer in the c-list. But #2724 (vref-aware GC design; was: WeakMap + garbage-collected Presences = non-determinism) taught us that we must track drop and retire as separate events, so the rules would be more complicated, and in any case I never actually implemented this "ignore" behavior.
    • I think we must acknowledge that we're dependent upon precisely deterministic GC. We can probably tolerate some variation better by forcing GC before acting upon finalizers, because XS GC is complete (it wipes the slate clean).

I'm going to define this ticket as dealing with the xs.evaluate('null') problem, and create a new one for the question of how to allow chain nodes to pick their own snapshot schedule without having the snapshot writes being visible to the consensus state.

Next steps for this ticket are to create a test that creates a vat, runs it long enough to trigger a snapshot, runs it longer still to trigger GC, then stops the kernel and reloads. I want to see the problem happen with the evaluate in place, then confirm that removing the evaluate makes it go away.

A secondary task is to figure out how to upgrade an existing chain to include this fix without a full reset. The kernel bundle is stored in the DB, so simply changing the software release to one without the evaluate won't be sufficient. We'll either need to change the software to (sometimes?) replace the saved kernel bundle key with a newly-bundled copy, or distribute a specialized tool which includes the new kernel bundle and simply does an LMDB write to replace the saved bundle in-place.


dckc commented Jun 29, 2021

I'm going to define this ticket as dealing with the xs.evaluate('null') problem

As @warner noted, we could wrap the returned value's issueCommand and watch for it to finish; the tempfile would sit around until someone talked to the vat.

It's ugly, but it seems to work:
https://github.com/Agoric/agoric-sdk/tree/snap-tmp-delay d13046d

warner added a commit that referenced this issue Jun 29, 2021
* during transcript replay, record (simulated) syscalls and delivery results
in the slog
  * previously we only recorded deliveries
  * this captures GC syscalls, to investigate divergence bugs
  * it also records metering results for the delivery, for the same reason
* during normal delivery, record syscall data even if the syscall fails
  * vat-fatal syscall errors, like dropping an object that was already
    retired, now capture the vat's attempted syscall, even if the translation
    into kernelspace failed, or the kernelSyscallHandler invocation failed
  * previously we lost all record of these syscall attempts, making it hard
    to debug the reason for vat termination
* slog entries for replayed deliveries/syscalls look just like the normal
  ones, but now have a `replay` boolean to distinguish the two
  * replayed deliveries do not have a `crankNum`, since we deliberately do
    not record that in the transcript, so we don't know the value during replay
* we compute a `deliveryNum` for each replayed message from the transcript
  position offset, which ought to match the original. This might need some
  refactoring to make fewer (or more) assumptions about the shape of a
  StreamPosition.

refs #3428 (not as a fix, but as a critical debugging tool)
dckc added a commit that referenced this issue Jun 30, 2021
dckc self-assigned this on Jun 30, 2021

dckc commented Jun 30, 2021

Idea: Snapshot after first delivery in each block, using crankNum

  • when taking a snapshot, record crankNum along with snapshotID and startPos
  • at the beginning of each block, have the host / controller tell the kernel to make a note of the current crankNum
  • in vatWarehouse.maybeSaveSnapshot, if this vat hasn't been snapshotted since the starting crankNum for this block, snapshot it.

I thought of this in the context of #3430, but this idea connects snapshot time more closely to consensus, rather than less.
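
A sketch of that policy, with hypothetical names for the recorded fields (the real vatWarehouse bookkeeping may be shaped differently):

  // Illustrative only: snapshot a vat on its first maybeSaveSnapshot() call
  // after a block boundary. blockStartCrankNum is noted by the host/controller
  // at the start of each block; lastSnapshotCrankNum is stored alongside
  // snapshotID and startPos.
  function shouldSnapshotThisBlock({ lastSnapshotCrankNum, blockStartCrankNum }) {
    return (
      lastSnapshotCrankNum === undefined ||
      lastSnapshotCrankNum < blockStartCrankNum
    );
  }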

dckc added this to the Testnet: Stress Test Phase milestone on Jun 30, 2021

dckc commented Jun 30, 2021

yesterday @mhofman wrote

... and I found 2 differences in the slog; this order and the maxBucketSize

maxBucketSize is a high-water mark we added to xsMapSet.c. It appears to depend on the fxSumEntry hashing function which, in the case of XS_REFERENCE_KIND (i.e. objects), hashes memory addresses:

  else if (XS_REFERENCE_KIND == kind) {
    address = (txU1*)&slot->value.reference;
    size = sizeof(txSlot*);
  }

I just verified that results from fxSumEntry do vary between executions of the same code.
For details, see https://gist.github.com/dckc/cbedda3bf4851297fbd88df0636f46de


warner commented Jul 1, 2021

@mhofman and I did some extensive analysis and brainstorming, and we think we understand the situation now. The executive summary: regular garbage collection in XS is complete/accurate but non-compacting; a snapshot store/reload cycle, however, is effectively compacting. The new process has less spare space for objects, so "small churn" allocate/release activity will need GC more frequently in a reloaded vat (less headroom) than in the original vat (whose headroom goes up to the previous high-water mark). Therefore non-forced GC timing is not only a function of user-level activity since the last GC event, but also of the history of the available malloced space for objects. The latter is not something we can account for effectively, so we must give up on deterministic "organic" GC timing and adopt the defensive stance described in #1872: either force GC on every crank, or (for better performance) use a pre-scheduled, periodic dispatch.bringOutYourDead() which forces GC and allows GC syscalls to be made, and defer all other GC syscalls until that chain-wide (consensus) point in time.

The details:

XS manages memory for most fixed-size objects (Object, Number, etc.) with a txSlot structure, which has enough space for some type fields, a body, and a few linked-list pointers. XS keeps a linked list of free slots, and fxNewSlot() first checks the free list. If that doesn't yield a free slot, it performs the mark-and-sweep GC by calling fxCollect(), which finishes by pushing any unused txSlot structs onto the free list (in particular, the sweep phase just walks all heap slabs and builds a brand-new free list from every slot that is not marked). Then fxNewSlot() checks the free list again. If that still fails (i.e. GC didn't release anything), it calls fxGrowSlots(), which uses malloc() to get another large slab of memory called a "heap" (typically 128KiB, configurable as xsCreation.incrementalHeapCount), divides it up into slots, and adds them all to the free list. If malloc succeeded in providing a heap, then fxNewSlot() is guaranteed to get something from the free list.
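
A toy JavaScript model of that allocation path (not the real C code; the heap and slot sizes are just the rough figures discussed here):

  // Toy model of fxNewSlot()/fxCollect()/fxGrowSlots(); illustrative only.
  const HEAP_BYTES = 128 * 1024; // one "heap" slab
  const SLOT_BYTES = 32;         // approximate txSlot size

  function fxNewSlotModel(vm) {
    if (vm.freeList.length === 0) {
      fxCollectModel(vm); // mark-and-sweep rebuilds the free list
    }
    if (vm.freeList.length === 0) {
      fxGrowSlotsModel(vm); // malloc() another heap slab
    }
    return vm.freeList.pop(); // guaranteed if the malloc succeeded
  }

  function fxCollectModel(vm) {
    vm.garbageCollectionCount += 1;
    // every slot not reachable from the roots goes back on the free list
    vm.freeList = vm.allSlots.filter(slot => !vm.reachable.has(slot));
  }

  function fxGrowSlotsModel(vm) {
    const count = Math.floor(HEAP_BYTES / SLOT_BYTES);
    const newSlots = Array.from({ length: count }, () => ({}));
    vm.allSlots.push(...newSlots); // heap slabs are never freed
    vm.freeList.push(...newSlots);
  }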

Releasing an object does not immediately add the txSlot back onto the free list. XS does not do reference counting, so it does not know that the object is unreferenced until fxCollect() does the mark-and-sweep. So a simple let x = {}; x = null; will consume a slot, and the memory doesn't become usable again until GC happens, and that won't happen until the free list is empty. After GC, the number of entries on the free list is equal to the combined size of all the allocated heaps (128 KiB each), divided by the size of txSlot (maybe 32 bytes?), minus the number of slots still in active use. Most significantly, XS never free()s a heap slab during the lifetime of the XSEngine structure.

Therefore, for a given rate of "low amplitude" churn (let x = {}; x = null;, one live object at a time), the rate of GC sweeps is inversely proportional to the "headroom": the high-water mark of live objects ever simultaneously used by the process, minus the current live-object count. An XS process which briefly uses a lot of memory, then frees it, will forevermore have a lot of free txSlot entries, so it can go a long time between GC() calls. A process which has never had this spike in memory usage will not have allocated the extra heap slabs, will have a smaller free list, and must therefore perform GC more frequently.

Saving a snapshot performs GC() just before serialization. More importantly, loading a snapshot only allocates as many txSlot entries as it needs. The number of heap slabs used by the predecessor is not included in the snapshot. So the post-reload state will have all the same JS objects as before, and the same number of live txSlot structs as before, but far fewer free txSlot structs. It will have some, because the heap slabs are still allocated in large fixed sizes, so there will be some leftovers, but it won't allocate as many as the earlier process did. For 128KiB heaps and a 32-byte txSlot, there will be at most 8191 free slots available.

So the post-reload program will perform GC more frequently than the pre-snapshot program did, even though the non-GC observable program state is identical. This dashes our hopes of treating "organic" (non-forced) GC calls as being a deterministic function of program activity: it is also a function of snapshot save (easy enough to handle) and reload (not so easy).

So, we must treat GC as (once again) spontaneous and non-deterministic, and write our finalizer-handling code to be defensive against this. We cannot allow GC syscalls to appear just anywhere.

The basic defensive approach is to define a special dispatch.bringOutYourDead() delivery (as described in #1872 ), which is the only time GC syscalls are allowed to be made. Each validator (with a different snapshot policy) will observe the finalizers to run at different times, but it won't matter because we conceal the results (i.e. deadSet.add(vref), rather than an immediate syscall, which also mandates code that recognizes re-imports and removes the vref from deadSet). All validators will run bringOutYourDead() at the same time, which must force a GC (to accelerate any stragglers), allow their finalizers to run, and then finally emit the GC syscalls like syscall.dropImports/etc. This way the consensus-sensitive kernel state will always see the GC events at the same time. bringOutYourDead() must be careful to sort the vrefs, to remove sensitivity to the order in which the finalizers ran.
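
A sketch of that pattern, reusing the illustrative names from the earlier end-of-crank sketch:

  // Sketch of the defensive pattern: finalizers only record vrefs; GC
  // syscalls are emitted solely from dispatch.bringOutYourDead(). All names
  // are illustrative.
  const deadSet = new Set();

  function onFinalize(vref) {
    deadSet.add(vref); // no syscall yet
  }

  function onReimport(vref) {
    deadSet.delete(vref); // a re-imported vref is no longer dead
  }

  async function bringOutYourDead(syscall, forceGC) {
    forceGC(); // accelerate any stragglers
    await new Promise(resolve => setTimeout(resolve, 0)); // let finalizers run
    const vrefs = [...deadSet].sort(); // finalizer order is not deterministic
    deadSet.clear();
    if (vrefs.length > 0) {
      syscall.dropImports(vrefs);
      syscall.retireImports(vrefs);
    }
  }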

Our current approach, in which liveslots does a GC and emits GC syscalls at the end of every delivery, is equivalent to calling bringOutYourDead() after every delivery. This is sound, but it would be less expensive if we did it less often. On the other hand, the sooner we run it, the sooner we can inform the kernel of the drops, allowing the kernel to inform other vats of the drops, including the comms vat, which allows remote machines to drop their objects too.

So there is a spectrum of bringOutYourDead scheduling options, with tradeoffs between immediate CPU cost, longer-term storage costs, and complexity. What matters most is that all validators use the same schedule:

  • run it every delivery
  • run it every N deliveries
  • run it on every vat touched during a block, just before the first delivery is made (@dckc suggested this one; it makes the book-keeping pretty easy, but it clearly leaves garbage in the vat for an indeterminate amount of time)
  • same, but after the last delivery (this minimizes garbage as well as "every delivery" does, but requires the kernel to keep a list of "dirty vats", and the host must give the kernel a chance to clean up at the end of the block, for which it's hard to budget the scarce block time)
  • run it on every dirty vat after N blocks
  • wait for a special externally-sourced transaction, which tells the kernel which vats to run it on

All of these would result in all validators running bringOutYourDead at the same time.

I think this explains why our #3433 change didn't help: it removed one original-vs-reloaded behavior difference (the extra xs.evaluate('null') that wasn't run in the original pass), but couldn't prevent the variation in GC timing caused by the new, smaller free-list size. The latter was probably more significant anyway.


dckc commented Jul 1, 2021

warner added a commit that referenced this issue Jul 15, 2021
warner added a commit that referenced this issue Jul 15, 2021
warner added a commit that referenced this issue Jul 17, 2021

dckc commented Jul 19, 2021

@warner writes in #3428 (comment) Jun 29:

I'm going to define this ticket as dealing with the xs.evaluate('null') problem

We landed that fix in a0493d7 / #3433

dckc closed this as completed on Jul 19, 2021
warner added a commit that referenced this issue Aug 18, 2021
warner added a commit that referenced this issue Sep 22, 2021