gcAndFinalize might be insufficient on Node.js in test-transcript.js #3240
Comments
Consider a …

(Blah blah blah ephemerons blah blah generational blah blah.)
Update test-transcript.js to wait for the combined run-queue and GC-actions queue to drain, now that c.run()/c.step() executes GC actions first. Also shut down the controller properly after each call to buildTrace().

This test exhibits odd GC behavior when:
* it lacks the wait for `gcActions.length`
* it uses the default 'local' vat worker
* liveslots calls `gcAndFinalize()` only once

This suggests that our `gcAndFinalize()` is insufficient, at least on Node.js. See #3240 for more details.
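A minimal sketch of that drain loop (names are illustrative; the real `buildTrace()` does more than this), assuming the controller `c` exposes `step()`, `dump()`, and `shutdown()` as used elsewhere in this thread:

```js
// Sketch only: keep cranking until both the run-queue and the GC-action
// queue are empty, then shut the controller down after each run.
async function drainAll(c) {
  while (c.dump().runQueue.length || c.dump().gcActions.length) {
    await c.step();
  }
}

// inside buildTrace():
//   await drainAll(c);
//   await c.shutdown();
```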
Oh. Oops. Guess I should read the docs :). Thank you!
Hm, I tracked down the header file (it doesn't seem to be documented anywhere else), and it appears that no argument is equivalent to …

The implementation seems to match: if … Rats, I was hoping that would explain it.
test-controller.js uses `c.step()` to examine the kernel one crank at a time, but was written before some of those cranks became GC actions. XS seems to do a pretty thorough job of GC when asked, so the kernel has several to do, and the test did not see the regular vat-delivery progress it was expecting. The fix is to step through all GC actions before taking the one vat-delivery `c.step()` it intended to do.

test-promises.js "refcount while queued" failed because the bootstrap method under test did not pin all vats in place. As a result, by the time the test tried to `c.queueToVatExport()` a message to vat-right, vat-right's root object had been collected. The fix is easy: make `bootstrap()` pin `vats.right` in place.

The interesting question is why this did not fail under Node.js (`managerType='local'`). Apparently Node.js is less inclined to release objects, even when a full `gc()` is invoked. I suspect some pattern of definition hoisting or optimization is keeping the `vats` variable around longer than the source code would suggest. Might be related to #3240, but I haven't been able to find a `gc()` mode that causes Node.js to collect it.
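A hedged sketch of the two fixes (helper and variable names are illustrative, not the tests' actual code):

```js
// 1) Step through any pending GC-action cranks first, so the next c.step()
//    observes the regular vat delivery the test expects.
async function stepPastGCActions(c) {
  while (c.dump().gcActions.length) {
    await c.step();
  }
}

// 2) Pin vat-right's root object inside bootstrap(), so a later
//    c.queueToVatExport() still has a live target.
let right;
function bootstrap(vats) {
  right = vats.right; // hold a strong reference so it cannot be collected
}
```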
Experimentally, with this bootstrap:

```js
// test/basedir-promises-3/bootstrap.js
// (imports assumed for illustration; the actual specifiers may differ)
import { E } from '@agoric/eventual-send';
import { Far } from '@agoric/marshal';
import { makePromiseKit } from '@agoric/promise-kit';

export function buildRootObject() {
  const pk1 = makePromiseKit();
  return Far('root', {
    bootstrap(vats) {
      const p2 = E(vats.right).one();
      E(p2).four(pk1.promise);
    },
    two() {
      pk1.resolve(3);
    },
  });
}
```

in which the …
I also tried the following without success:

```js
bootstrap({ right }) {
  const p2 = E(right).one();
  E(p2).four(pk1.promise);
},
```

The other members of …

@michaelfig suggested (maybe in an offline conversation prompted by #3207 (comment)) that a V8 optimization is probably moving the lifetime scope (but not the syntactic reachability scope) up a level or two, for some …
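As an illustrative (and entirely hypothetical, not-from-the-thread) way to probe that hypothesis, one could hold only a WeakRef to the `vats` argument and force GC after `bootstrap()` returns, e.g. as an ES module under `node --expose-gc`:

```js
// Hypothetical diagnostic: is the `vats` argument still reachable after
// bootstrap() returns, despite a forced full GC?
let vatsRef;

function bootstrap(vats) {
  vatsRef = new WeakRef(vats);
  // ...normal bootstrap work would go here...
}

bootstrap({ right: {} }); // stand-in for the real vats argument

await new Promise(setImmediate); // let pending turns finish
globalThis.gc();                 // requires --expose-gc
await new Promise(setImmediate); // give finalization a chance
console.log('vats still reachable?', vatsRef.deref() !== undefined);
```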
@michaelfig and I experimentally determined that V8 wants one additional `setImmediate`:

```js
// on Node.js, GC seems to work better if the promise queue is empty first
await new Promise(setImmediate);
// on xsnap, we must do it twice for some reason
await new Promise(setImmediate);
gcPower();
// this gives finalizers a chance to run
await new Promise(setImmediate);
// Node.js needs this to allow all promises to get collected
await new Promise(setImmediate);
```
When testing the upcoming #3482 fix, we observed that a vat method which simply delegates to a second vat, like this:

```js
getInvitationTarget: () => E(zoe).getInvitationZoe(),
```

would not always drop the "invitation" Presence returned by both methods when run under Node.js. It dropped the Presence correctly perhaps 20% of the time; the other 80% it failed to drop it. Under XS it dropped it all of the time.

Node.js started working correctly all of the time (N=8 or so) when we changed `gcAndFinalize` to do *two* `setImmediate`s after the `gc()`, instead of just one.

I'd like to add a unit test that fails with a similar probability, but I haven't been able to come up with one. Either they fail to collect the object all of the time, or none of the time.

refs #3240

Hopefully it fixes that, but I won't be sure until I run more load-generator tests and look for growth in the object counts over time. And I'd like to add that test before closing the issue.
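A hedged sketch of what that change amounts to (the factory name and the `gcPower` argument are assumptions about how the real helper is wired up, not a copy of it):

```js
// Sketch only: gcAndFinalize() with *two* setImmediate turns after gc(),
// per the change described above.
function makeGcAndFinalize(gcPower) {
  return async function gcAndFinalize() {
    await new Promise(setImmediate); // let the promise queue drain first
    gcPower(); // e.g. globalThis.gc under `node --expose-gc` (assumption)
    await new Promise(setImmediate); // one turn so finalizers can run
    await new Promise(setImmediate); // second turn so Node.js releases the promises
  };
}
```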
@FUDCo determined that AVA-parallelized unit tests on Node.js cause V8 to not run finalizers as soon as they ought to, so some GC-sensitive tests will fail (objects are released late, and/or finalizers are run late, so the …).

We don't know exactly what's going on, but it seems like a parallel test case interferes with …

The fix is to do one of: …
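One workaround that later comments in this thread settle on is serializing the affected tests; a minimal AVA sketch (the test name and body are placeholders):

```js
import test from 'ava';

// Run GC-sensitive cases serially so parallel test files cannot interfere
// with finalizer timing (sketch; the real test body is elided).
test.serial('transcript-one save', async t => {
  // ...build the swingset, force GC, compare transcripts...
  t.pass();
});
```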
@warner When I first started working on our ZH board, I marked this as being for the Mainnet 1 release because it was already in the Review/QA pipeline. The comment makes it seem like it is an open question, not really In Review or In QA. Should this be moved back to the MN-1 backlog, or taken out of the Mainnet 1 release altogether?
I think we updated enough tests with …
#4617 appears to be another instance of this.
We don't understand the root cause, but our `gcAndFinalize()` doesn't always work when 1: run on Node.js (not xsnap), and 2: AVA allows tests to run in parallel. The problem happens somewhat more frequently on 14.x, at least in CI.

These two tests exercise transcript replay (which needs GC to happen the same way for both the original delivery and the replay), and compare the activityHash (which, of course, is sensitive to all syscalls made, including GC syscalls). They sometimes failed in CI.

We don't fully understand why gcAndFinalize doesn't work, but serializing the tests (with test.serial) seems to address the problem.

refs #3240
closes #4617
@warner Since @michaelfig was able to reproduce this in v16.14.0, presumably this issue needs to go back into MN-1?
#5266 is another instance of this problem, but in Node v16.14.2. The specific test case with the problem had a …
This only seems to affect CI, so we're not going to worry about this for MN-1, for sure. This ticket is about keeping an eye on this problem.
I'm moving this back to Product Backlog: it is annoying, but doesn't affect product functionality, and certainly doesn't need to be a priority for the Vaults release.
I'm seeing CI failures under Node-18 in this test, presumably because we get GC variation between the first run and the replay. Super annoying, and not germane to what this test is supposed to be exercising. The `test.serial` wrapper wasn't enough to fix it, so I've just switched the test to only use xs-worker.

refs #3240
Sometimes, for reasons we don't entirely understand, Node.js doesn't garbage-collect objects when we tell it to, and we get flaky GC-checking tests. This applies our usual fix, which is to only run those tests under XS. It also stops attempting to use `test.serial` as a workaround.

refs #3240
refs #5575
fixes #9089
This "forward to fake zoe" in gc-vat.test was added to demonstrate a fix for #3482, in which liveslots was mishandling an intermediate promise by retaining it forever, which made us retain objects that appear in eventual-send results forever.

This problem was discovered while investigating an unrelated XS engine bug (#3406), so "is this specific to a single engine?" was on our mind, and I wasn't sure that we were dealing with two independent bugs until I wrote the test and showed that it failed on both V8 and XS. So the test was originally written with a commented-out `managerType:` option to make it easy to switch back and forth between `local` and `xs-worker`. That switch was left in the `local` state, probably because it's slightly faster.

What we've learned is that V8 sometimes holds on to objects despite a forced GC pass (see #5575 and #3240), and somehow it only seems to fail in CI runs (and only for people other than me). Our usual response is to make the test use XS instead of V8, either by setting `creationOptions.managerType: 'xs-worker'` on the individual vat, or by setting `defaultManagerType: 'xs-worker'` to set it for all vats.

This PR uses the first approach, changing just the one vat being exercised (which should be marginally cheaper than making all vats use XS).

closes #9392
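As a hedged illustration of the per-vat form (the vat name and source path are placeholders, and the config shape is assumed from the option names above):

```js
// Sketch: run only the exercised vat under xs-worker, leaving other vats on
// the default manager.
const config = {
  vats: {
    target: {
      sourceSpec: 'vat-target.js', // placeholder path
      creationOptions: { managerType: 'xs-worker' },
    },
  },
  // alternative: defaultManagerType: 'xs-worker' switches every vat
};
```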
While testing GC (so after #3109 lands), I found that `test-transcript.js` sometimes fails. The symptoms suggest that my `gcAndFinalize()` is insufficient to provoke enough GC in some circumstances.

To reproduce, land #3109, then modify `test-transcript.js` to remove the `c.dump().gcActions.length` check in the `buildTrace` while loop. Checking both `runQueue.length` and `gcActions.length` is the correct thing to do. The original code (which exhibited the bug) only checked `runQueue.length`, which means the test would end early (when userspace was done, but there were still GC actions to perform). Adding the proper `gcActions.length` check made the bug go away, but I can't think of a good reason for it. Maybe it provokes more object creation and thus accelerates a more-complete GC pass.

The problem manifests as a failure of either of the two unit tests in that file: `transcript-one save` (which launches two swingsets from the same starting state, runs both to completion, then compares their full state vectors), or `transcript-one load` (which records the full state vector after each crank, then replays the swingset from each recorded vector, and makes sure all replays emit a matching state at every crank). The problem appears on the 4th(?) crank, which delivers a `notify` to the bootstrap vat, which finishes its work and allows the entire scope frame to be dropped, which ought to drop all the vat root objects it was given in the `vats` argument. In some (most?) runs, this delivery emits a `syscall.dropImports` for all the vat root objects, but there is at least one run which does not. As a result, the `gcActions` state is different, as is the number of cranks being executed (since without `syscall.dropImports` the kernel won't schedule the resulting `dispatch.dropExports`).

The code did not originally do an `await c.shutdown()` after each run, but adding it did not fix the problem (and is only really necessary when using a non-local worker like `xs-worker`). Adding the `gcActions.length` check made the bug go away, as did switching to `xs-worker`, as did simply duplicating the `gcAndFinalize()` call in liveslots. The latter is the most disturbing, as it suggests that we need to finalize harder somehow.

I tried augmenting `test-gc-and-finalize.js` with a cyclic "victim" object graph, with the idea that maybe Node.js is able to notice a refcount dropping to zero immediately (instead of waiting for mark-and-sweep to see it), and so maybe introducing a deliberate cycle would expose an insufficiency in `gcAndFinalize`. But `test-gc-and-finalize` continued to pass.

To experiment with this, comment out the `gcActions.length` check, and `test-transcript.js` ought to fail pretty quickly.
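The cyclic-victim experiment described above can be sketched roughly like this (illustrative only, not the actual test-gc-and-finalize.js; it assumes Node is started with --expose-gc and the file is run as an ES module so top-level await works):

```js
// Illustrative sketch of the cyclic-victim idea: build a two-object cycle,
// keep only a WeakRef to it, force a full GC, and see whether the
// FinalizationRegistry callback fires.
async function gcAndFinalize() {
  await new Promise(setImmediate); // drain the promise queue first
  globalThis.gc();                 // full GC (only present under --expose-gc)
  await new Promise(setImmediate); // give finalizers a chance to run
  await new Promise(setImmediate); // extra turn that Node.js sometimes needs
}

let collected = false;
const registry = new FinalizationRegistry(() => {
  collected = true;
});

function makeVictimCycle() {
  const a = {};
  const b = { a };
  a.b = b; // deliberate cycle: a naive refcount would never reach zero
  registry.register(a, 'victim');
  return new WeakRef(a); // only a weak reference escapes this frame
}

const ref = makeVictimCycle();
await gcAndFinalize();
console.log('victim collected?', collected, 'still reachable?', ref.deref() !== undefined);
```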