gcAndFinalize might be insufficient on Node.js in test-transcript.js #3240
Comments
Consider a …

(Blah blah blah ephemerons blah blah generational blah blah.)
Update test-transcript.js to wait for the combined run-queue and GC-actions queue to drain, now that c.run()/c.step() executes GC actions first. Also shut down the controller properly after each call to buildTrace().

This test exhibits odd GC behavior when:
* it lacks the wait for `gcActions.length`
* it uses the default 'local' vat worker
* liveslots calls `gcAndFinalize()` only once

This suggests that our `gcAndFinalize()` is insufficient, at least on Node.js. See #3240 for more details.
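A minimal sketch of that drain loop (names are illustrative; the real `buildTrace()` does more than this), assuming the controller `c` exposes `step()`, `dump()`, and `shutdown()` as used elsewhere in this thread:

```js
// Sketch only: keep cranking until both the run-queue and the GC-action
// queue are empty, then shut the controller down after each run.
async function drainAll(c) {
  while (c.dump().runQueue.length || c.dump().gcActions.length) {
    await c.step();
  }
}

// inside buildTrace():
//   await drainAll(c);
//   await c.shutdown();
```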
Oh. Oops. Guess I should read the docs :). Thank you!
Hm, I tracked down the header file (it doesn't seem to be documented anywhere else), and it appears that no argument is equivalent to …

The implementation seems to match: if … Rats, I was hoping that would explain it.
test-controller.js uses `c.step()` to examine the kernel one crank at a time, but was written before some of those cranks became GC actions. XS seems to do a pretty thorough job of GC when asked, so the kernel has several to do, and the test did not see the regular vat-delivery progress it was expecting. The fix is to step through all GC actions before taking the one vat-delivery `c.step()` it intended to do.

test-promises.js "refcount while queued" failed because the bootstrap method under test did not pin all vats in place. As a result, by the time the test tried to `c.queueToVatExport()` a message to vat-right, vat-right's root object had been collected. The fix is easy: make `bootstrap()` pin `vats.right` in place.

The interesting question is why this did not fail under Node.js (`managerType='local'`). Apparently Node.js is less inclined to release objects, even when a full `gc()` is invoked. I suspect some pattern of definition hoisting or optimization is keeping the `vats` variable around longer than the source code would suggest. Might be related to #3240, but I haven't been able to find a `gc()` mode that causes Node.js to collect it.
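A hedged sketch of the two fixes (helper and variable names are illustrative, not the tests' actual code):

```js
// 1) Step through any pending GC-action cranks first, so the next c.step()
//    observes the regular vat delivery the test expects.
async function stepPastGCActions(c) {
  while (c.dump().gcActions.length) {
    await c.step();
  }
}

// 2) Pin vat-right's root object inside bootstrap(), so a later
//    c.queueToVatExport() still has a live target.
let right;
function bootstrap(vats) {
  right = vats.right; // hold a strong reference so it cannot be collected
}
```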
Experimentally, with this bootstrap:

```js
// test/basedir-promises-3/bootstrap.js
// (imports assumed for illustration; the actual specifiers may differ)
import { E } from '@agoric/eventual-send';
import { Far } from '@agoric/marshal';
import { makePromiseKit } from '@agoric/promise-kit';

export function buildRootObject() {
  const pk1 = makePromiseKit();
  return Far('root', {
    bootstrap(vats) {
      const p2 = E(vats.right).one();
      E(p2).four(pk1.promise);
    },
    two() {
      pk1.resolve(3);
    },
  });
}
```

in which the …
I also tried the following without success:

```js
bootstrap({ right }) {
  const p2 = E(right).one();
  E(p2).four(pk1.promise);
},
```

The other members of …

@michaelfig suggested (maybe in an offline conversation prompted by #3207 (comment)) that a V8 optimization is probably moving the lifetime scope (but not the syntactic reachability scope) up a level or two, for some …
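As an illustrative (and entirely hypothetical, not-from-the-thread) way to probe that hypothesis, one could hold only a WeakRef to the `vats` argument and force GC after `bootstrap()` returns, e.g. as an ES module under `node --expose-gc`:

```js
// Hypothetical diagnostic: is the `vats` argument still reachable after
// bootstrap() returns, despite a forced full GC?
let vatsRef;

function bootstrap(vats) {
  vatsRef = new WeakRef(vats);
  // ...normal bootstrap work would go here...
}

bootstrap({ right: {} }); // stand-in for the real vats argument

await new Promise(setImmediate); // let pending turns finish
globalThis.gc();                 // requires --expose-gc
await new Promise(setImmediate); // give finalization a chance
console.log('vats still reachable?', vatsRef.deref() !== undefined);
```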
@michaelfig and I experimentally determined that V8 wants one additional `setImmediate`:

```js
// on Node.js, GC seems to work better if the promise queue is empty first
await new Promise(setImmediate);
// on xsnap, we must do it twice for some reason
await new Promise(setImmediate);
gcPower();
// this gives finalizers a chance to run
await new Promise(setImmediate);
// Node.js needs this to allow all promises to get collected
await new Promise(setImmediate);
```
When testing the upcoming #3482 fix, we observed that a vat method which simply delegates to a second vat, like this:

```js
getInvitationTarget: () => E(zoe).getInvitationZoe(),
```

would not always drop the "invitation" Presence returned by both methods when run under Node.js. It dropped the Presence correctly perhaps 20% of the time; the other 80% it failed to drop it. Under XS it dropped it all of the time.

Node.js started working correctly all of the time (N=8 or so) when we changed `gcAndFinalize` to do *two* `setImmediate`s after the `gc()`, instead of just one.

I'd like to add a unit test that fails with a similar probability, but I haven't been able to come up with one. Either they fail to collect the object all of the time, or none of the time.

refs #3240

Hopefully it fixes that, but I won't be sure until I run more load-generator tests and look for growth in the object counts over time. And I'd like to add that test before closing the issue.
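A hedged sketch of what that change amounts to (the factory name and the `gcPower` argument are assumptions about how the real helper is wired up, not a copy of it):

```js
// Sketch only: gcAndFinalize() with *two* setImmediate turns after gc(),
// per the change described above.
function makeGcAndFinalize(gcPower) {
  return async function gcAndFinalize() {
    await new Promise(setImmediate); // let the promise queue drain first
    gcPower(); // e.g. globalThis.gc under `node --expose-gc` (assumption)
    await new Promise(setImmediate); // one turn so finalizers can run
    await new Promise(setImmediate); // second turn so Node.js releases the promises
  };
}
```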
@FUDCo determined that AVA-parallelized unit tests on Node.js cause V8 to not run finalizers as soon as they ought to, so some GC-sensitive tests will fail (objects are released late, and/or finalizers are run late, so the …).

We don't know exactly what's going on, but it seems like a parallel test case interferes with …

The fix is to do one of: …
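One workaround that later comments in this thread settle on is serializing the affected tests; a minimal AVA sketch (the test name and body are placeholders):

```js
import test from 'ava';

// Run GC-sensitive cases serially so parallel test files cannot interfere
// with finalizer timing (sketch; the real test body is elided).
test.serial('transcript-one save', async t => {
  // ...build the swingset, force GC, compare transcripts...
  t.pass();
});
```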
@warner When I first started working on our ZH board, I marked this as being for the Mainnet 1 release because it was already in the Review/QA pipeline. The comment makes it seem like it is an open question, not really In Review or In QA. Should this be moved back to the MN-1 backlog, or taken out of the Mainnet 1 release altogether?
I think we updated enough tests with …
#4617 appears to be another instance of this.
We don't understand the root cause, but our `gcAndFinalize()` doesn't always work when 1: run on Node.js (not xsnap), and 2: AVA allows tests to run in parallel. The problem happens somewhat more frequently on 14.x, at least in CI.

These two tests exercise transcript replay (which needs GC to happen the same way for both the original delivery and the replay), and compare the activityHash (which, of course, is sensitive to all syscalls made, including GC syscalls). They sometimes failed in CI.

We don't fully understand why gcAndFinalize doesn't work, but serializing the tests (with test.serial) seems to address the problem.

refs #3240
closes #4617
@warner Since @michaelfig was able to reproduce this in v16.14.0, presumably this issue needs to go back into MN-1?
#5266 is another instance of this problem, but in Node v16.14.2. The specific test case with the problem had a …
This only seems to affect CI, so we're not going to worry about this for MN-1, for sure. This ticket is about keeping an eye on this problem.
I'm moving this back to Product Backlog: it is annoying, but doesn't affect product functionality, and certainly doesn't need to be a priority for the Vaults release.
I'm seeing CI failures under Node-18 in this test, presumably because we get GC variation between the first run and the replay. Super annoying, and not germane to what this test is supposed to be exercising. The `test.serial` wrapper wasn't enough to fix it, so I've just switched the test to only use xs-worker.

refs #3240
Sometimes, for reasons we don't entirely understand, Node.js doesn't garbage-collect objects when we tell it to, and we get flaky GC-checking tests. This applies our usual fix, which is to only run those tests under XS. It also stops attempting to use `test.serial` as a workaround.

refs #3240
refs #5575
fixes #9089
This "forward to fake zoe" in gc-vat.test was added to demonstrate a fix for #3482, in which liveslots was mishandling an intermediate promise by retaining it forever, which made us retain objects that appear in eventual-send results forever.

This problem was discovered while investigating an unrelated XS engine bug (#3406), so "is this specific to a single engine?" was on our mind, and I wasn't sure that we were dealing with two independent bugs until I wrote the test and showed that it failed on both V8 and XS. So the test was originally written with a commented-out `managerType:` option to make it easy to switch back and forth between `local` and `xs-worker`. That switch was left in the `local` state, probably because it's slightly faster.

What we've learned is that V8 sometimes holds on to objects despite a forced GC pass (see #5575 and #3240), and somehow it only seems to fail in CI runs (and only for people other than me). Our usual response is to make the test use XS instead of V8, either by setting `creationOptions.managerType: 'xs-worker'` on the individual vat, or by setting `defaultManagerType: 'xs-worker'` to set it for all vats.

This PR uses the first approach, changing just the one vat being exercised (which should be marginally cheaper than making all vats use XS).

closes #9392
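As a hedged illustration of the per-vat form (the vat name and source path are placeholders, and the config shape is assumed from the option names above):

```js
// Sketch: run only the exercised vat under xs-worker, leaving other vats on
// the default manager.
const config = {
  vats: {
    target: {
      sourceSpec: 'vat-target.js', // placeholder path
      creationOptions: { managerType: 'xs-worker' },
    },
  },
  // alternative: defaultManagerType: 'xs-worker' switches every vat
};
```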
While testing GC (so after #3109 lands), I found that `test-transcript.js` sometimes fails. The symptoms suggest that my `gcAndFinalize()` is insufficient to provoke enough GC in some circumstances.

To reproduce, land #3109, then modify `test-transcript.js` to remove the `c.dump().gcActions.length` check in the `buildTrace` while loop. Checking both `runQueue.length` and `gcActions.length` is the correct thing to do. The original code (which exhibited the bug) only checked `runQueue.length`, which means the test would end early (when userspace was done, but there were still GC actions to perform). Adding the proper `gcActions.length` check made the bug go away, but I can't think of a good reason for it. Maybe it provokes more object creation and thus accelerates a more-complete GC pass.

The problem manifests as a failure of either of the two unit tests in that file: `transcript-one save` (which launches two swingsets from the same starting state, runs both to completion, then compares their full state vectors), or `transcript-one load` (which records the full state vector after each crank, then replays the swingset from each recorded vector, and makes sure all replays emit a matching state at every crank). The problem appears on the 4th(?) crank, which delivers a `notify` to the bootstrap vat, which finishes its work and allows the entire scope frame to be dropped, which ought to drop all the vat root objects it was given in the `vats` argument. In some (most?) runs, this delivery emits a `syscall.dropImports` for all the vat root objects, but there is at least one run which does not. As a result, the `gcActions` state is different, as is the number of cranks being executed (since without `syscall.dropImports` the kernel won't schedule the resulting `dispatch.dropExports`).

The code did not originally do an `await c.shutdown()` after each run, but adding it did not fix the problem (and is only really necessary when using a non-local worker like `xs-worker`). Adding the `gcActions.length` check made the bug go away, as did switching to `xs-worker`, as did simply duplicating the `gcAndFinalize()` call in liveslots. The latter is the most disturbing, as it suggests that we need to finalize harder somehow.

I tried augmenting `test-gc-and-finalize.js` with a cyclic "victim" object graph, with the idea that maybe Node.js is able to notice a refcount dropping to zero immediately (instead of waiting for mark-and-sweep to see it), and so maybe introducing a deliberate cycle would expose an insufficiency in `gcAndFinalize`. But `test-gc-and-finalize` continued to pass.

To experiment with this, comment out the `gcActions.length` check, and `test-transcript.js` ought to fail pretty quickly.
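The cyclic-victim experiment described above can be sketched roughly like this (illustrative only, not the actual test-gc-and-finalize.js; it assumes Node is started with --expose-gc and the file is run as an ES module so top-level await works):

```js
// Illustrative sketch of the cyclic-victim idea: build a two-object cycle,
// keep only a WeakRef to it, force a full GC, and see whether the
// FinalizationRegistry callback fires.
async function gcAndFinalize() {
  await new Promise(setImmediate); // drain the promise queue first
  globalThis.gc();                 // full GC (only present under --expose-gc)
  await new Promise(setImmediate); // give finalizers a chance to run
  await new Promise(setImmediate); // extra turn that Node.js sometimes needs
}

let collected = false;
const registry = new FinalizationRegistry(() => {
  collected = true;
});

function makeVictimCycle() {
  const a = {};
  const b = { a };
  a.b = b; // deliberate cycle: a naive refcount would never reach zero
  registry.register(a, 'victim');
  return new WeakRef(a); // only a weak reference escapes this frame
}

const ref = makeVictimCycle();
await gcAndFinalize();
console.log('victim collected?', collected, 'still reachable?', ref.deref() !== undefined);
```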