Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flake in SwingSet gc test #9392

Open
turadg opened this issue May 21, 2024 · 6 comments · Fixed by #9442
Open

flake in SwingSet gc test #9392

turadg opened this issue May 21, 2024 · 6 comments · Fixed by #9442
Assignees
Labels
flake flakey test

Comments

@turadg
Copy link
Member

turadg commented May 21, 2024

What is the Problem Being Solved?

ok 17 - gc-kernel › two importers: drop+retire, cross-import, drop+retire %ava-dur=556ms
🤞 UnhandledPromiseRejection: (TypeError#2)
TypeError#2: Cannot deliver then to target; typeof target is undefined
TypeError: Cannot deliver "then" to target; typeof target is "undefined"

ok 18 - gc-kernel › terminated vat %ava-dur=40813ms
ok 19 - gc-kernel › device transfer %ava-dur=49753ms
ok 20 - gc › gc-vat › drop presence (export retains) %ava-dur=23021ms
ok 21 - gc › gc-vat › drop presence (export drops) %ava-dur=221[46](https://github.com/Agoric/agoric-sdk/actions/runs/9177455292/job/25235221306?pr=9390#step:6:47)ms
invitation: ko27
targetID: v6
calling makeInvitation
not ok 22 - gc › gc-vat › forward to fake zoe %ava-dur=19744ms
  ---
    name: AssertionError
    assertion: is
    values:
      'Difference (- actual, + expected):': |-
        - 'o-[51](https://github.com/Agoric/agoric-sdk/actions/runs/9177455292/job/25235221306?pr=9390#step:6:52)'
        + undefined
    at: 'packages/SwingSet/test/gc/gc-vat.test.js:159:5'
  ...

Description of the Design

Security Considerations

Scaling Considerations

Test Plan

Upgrade Considerations

@turadg turadg added enhancement New feature or request flake flakey test and removed enhancement New feature or request labels May 21, 2024
@warner
Copy link
Member

warner commented Jun 1, 2024

Ok, so I think this is another case of V8 being not consistent about GC handling, which we've usually dealt with by forcing the test to use an XS/xsnap worker instead of a Node.js/V8/"local" worker.

This forward to fake zoe test was added to demonstrate the fix for #3482 , in which an object that appeared in the result of an eventual-send was being retained forever, due to mishandling of an intermediate promise. The test performs the same sequence as a contract would with Zoe (the problem was first observed in a loadgen contract exercise), lets the promise resolve, then looks at the c-lists to confirm that the object was released.

This depends upon all of:

  • 1: liveslots does not retain the intermediate promise (the pendingPromises retains interior result Promise by mistake #3482 fix)
  • 2: the Presence is no longer held by anything else (check)
  • 3: GC happens, either organic or forced by BOYD (yes, because the test leaves reapInterval at the default of 1, which causes BOYD after every delivery)
  • 4: GC actually releases the object (this is probably what sometimes doesn't happen under V8)
  • 5: the engine actually fires the finalizer callback
  • 6: BOYD happens, which notices the finalization and does a syscall.dropImports / syscall.retireImports, which removes the c-list entry

I think step 4 is sometimes failing under V8. When this same problem has caused other flaky tests, we've generally not been able to reproduce it locally, for some reason it only seems to fail in CI.

The #3482 issue was discovered while investigating an XS bug (#3406), and once that issue was fixed, we found the remaining problem. At first I thought it was a second XS bug, but further experimentation showed it happened on both XS and V8, exonerating XS, and driving the investigation into liveslots. That's why the test I added was using managerType: 'local' to force it to run on V8 (and there's a commented-out line to make it use XS instead, I was flopping back and forth to rule out an engine bug).

So I think the solution is to flip that switch to make it only use XS, where GC behaves more consistently than under Node. I'll make a PR.

warner added a commit that referenced this issue Jun 1, 2024
This "forward to fake zoe" in gc-vat.test was added to demonstrate a
fix for #3482, in which liveslots was mishandling an intermediate
promise by retaining it forever, which made us retain objects that
appear in eventual-send results forever.

This problem was discovered while investigating an unrelated XS engine
bug (#3406), so "is this specific to a single engine?" was on our
mind, and I wasn't sure that we were dealing with two independent bugs
until I wrote the test and showed that it failed on both V8 and XS. So
the test was originally written with a commented-out `managerType:`
option to make it easy to switch back and forth between `local` and
`xs-worker`. That switch was left in the `local` state, probably
because it's slightly faster.

What we've learned is that V8 sometimes holds on to objects despite a
forced GC pass (see #5575 and #3240), and somehow it only seems to
fail in CI runs (and only for people other than me). Our usual
response is to make the test use XS instead of V8, either by setting
`creationOptions.managerType: 'xs-worker'` on the individual vat, or
by setting `defaultManagerType: 'xs-worker'` to set it for all vats.

This PR uses the first approach, changing just the one vat being
exercised (which should be marginally cheaper than making all vats use
XS).

closes #9392
@warner
Copy link
Member

warner commented Jun 1, 2024

@turadg not sure how frequently this was happening, but try including this patch in the failing PRs, and see if it helps. It should be harmless to land on trunk, and follows a pattern we've used before with other test flakes, so we could also just land it and then allow your PRs to go on top of it.

@mergify mergify bot closed this as completed in #9442 Jun 1, 2024
mergify bot added a commit that referenced this issue Jun 1, 2024
)

This "forward to fake zoe" in gc-vat.test was added to demonstrate a fix
for #3482, in which liveslots was mishandling an intermediate promise by
retaining it forever, which made us retain objects that appear in
eventual-send results forever.

This problem was discovered while investigating an unrelated XS engine
bug (#3406), so "is this specific to a single engine?" was on our mind,
and I wasn't sure that we were dealing with two independent bugs until I
wrote the test and showed that it failed on both V8 and XS. So the test
was originally written with a commented-out `managerType:` option to
make it easy to switch back and forth between `local` and `xs-worker`.
That switch was left in the `local` state, probably because it's
slightly faster.

What we've learned is that V8 sometimes holds on to objects despite a
forced GC pass (see #5575 and #3240), and somehow it only seems to fail
in CI runs (and only for people other than me). Our usual response is to
make the test use XS instead of V8, either by setting
`creationOptions.managerType: 'xs-worker'` on the individual vat, or by
setting `defaultManagerType: 'xs-worker'` to set it for all vats.

This PR uses the first approach, changing just the one vat being
exercised (which should be marginally cheaper than making all vats use
XS).

closes #9392
@turadg
Copy link
Member Author

turadg commented Sep 11, 2024

@turadg turadg reopened this Sep 11, 2024
@mhofman
Copy link
Member

mhofman commented Sep 11, 2024

Some of these tests were updated in 9d3c496, but it looks like we didn't switch to using an xs worker for that bootstrap vat.

Changing the following line to xs-worker should fix.

creationOptions: { managerType: 'local' },

Will open a PR

@mhofman
Copy link
Member

mhofman commented Sep 11, 2024

Will address new issue in #10071

@mhofman mhofman closed this as completed Sep 11, 2024
@turadg
Copy link
Member Author

turadg commented Sep 24, 2024

Hit today,

Is this only in Node? Do we expect GC in Node to be deterministic? If not, could we test GC only in XS?

@turadg turadg reopened this Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flake flakey test
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants