Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider including GC calls in vat transcript #5564

Closed
mhofman opened this issue Jun 10, 2022 · 1 comment · Fixed by #5656
Closed

Consider including GC calls in vat transcript #5564

mhofman opened this issue Jun 10, 2022 · 1 comment · Fixed by #5656
Assignees
Labels
enhancement New feature or request SwingSet package: SwingSet
Milestone

Comments

@mhofman
Copy link
Member

mhofman commented Jun 10, 2022

What is the Problem Being Solved?

In #5557, @warner suggested:

I'm wondering if we should stop stripping the GC syscalls from the transcript. We did this when we wanted to tolerate different organic GC schedules between different vats (the "formal vs local GC" era). Given our need to have metering be identical, and our work to make GC consequences only observable during bringOutYourDead, I think at this point we could reasonably expect all GC activity to be in consensus, and thus it might no longer be necessary to ignore GC syscalls when replaying the transcript.

If we stopped ignoring them, this bug would probably have surfaced much faster: if the error caused a second GC syscall to not happen, we'd have gotten an AnacrophobiaError during replay.

Description of the Design

Include GC syscalls in the transcript and expect identical replay during BOYD.

Test Plan

TBD

@warner
Copy link
Member

warner commented Jun 15, 2022

I realized that an even better argument for this is our hope for deterministic heap snapshots. The contents of a heap snapshot are more sensitive to vat execution than the metering results, because we can (and do) disable metering during deserialization operations which might probe a WeakRef. But the state (full/empty) of that WeakRef depends upon whether GC has happened or not, and that state is obviously included in the heap snapshot. If we want all validators to get the same heap snapshot, we can't afford to allow them to perform GC at different times.

A related issue is transcript replay for debugging, where we take the original (xs-worker) transcript from the chain, and replay it locally under a Node.js worker (so you can point a real debugger at it). Our use of forced GC, coupled with our accumulation/suppression of finalizer results until an explicit bringOutYourDead, ought to result in the same observable GC activity (i.e. syscalls) on both the local worker and the original XS process. We're still seeing Node.js/V8 doing something weird (not finalizing or releasing some objects as fast when it should), which can cause that to break, but I think if we could find a fix for that, our local replay would match the xs-worker one.

Omitting/ignoring GC syscalls in the transcript used to be a sort of workaround for that bug: before we introduced vatstore-held refcounts, the only consequence of GC was a set of dropImports/retireImports/retireExports, and they always happened at the end of a crank, and none of them return a result. So if GC happened a during delivery 5 instead of delivery 3, then d3 would be missing some GC syscalls (but the filtered list was identical), and d5 would have some extra syscalls (but we could synthesize their (empty) results without needing to get them from a transcript, and again the filtered list was identical).

But once we introduced vatstore refcounts, that changed: now GC happening in delivery 3 means we get some refcount checks in d3, which does syscall.vatstoreGet, which demands a result. If that syscall gets delayed until delivery 5, we won't know what result to give it, since the d5 transcript doesn't have a record of the syscall.

So I think our "replay locally" (under Node) story depends upon finding and fixing the Node/V8 "GC sometimes happens too late" bug (#3240 , #5575, #5202), otherwise our local replays will fail as soon as they experience a late GC event. We can replay them locally under xs, of course, but the debugging experience leaves something to be desired.

So actions to take:

  • in src/kernel/vat-loader/transcript.js:
    • in addSyscall(), remove the gcSyscalls.has check
    • in simulateSyscall(), remove the gcSyscalls.has check
    • remove the gcSyscalls definition altogether
  • probably update some tests

FUDCo added a commit that referenced this issue Jun 23, 2022
FUDCo added a commit that referenced this issue Jun 23, 2022
FUDCo added a commit that referenced this issue Jun 23, 2022
michaelfig pushed a commit that referenced this issue Jun 25, 2022
@mergify mergify bot closed this as completed in #5656 Jun 25, 2022
@Tartuffo Tartuffo added this to the Mainnet 1 milestone Jul 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants