Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nondeterministic SwingSet historical inaccuracy in CI #4617

Closed
kriskowal opened this issue Feb 21, 2022 · 5 comments · Fixed by #4644
Closed

Nondeterministic SwingSet historical inaccuracy in CI #4617

kriskowal opened this issue Feb 21, 2022 · 5 comments · Fixed by #4644
Assignees
Labels
bug Something isn't working SwingSet package: SwingSet
Milestone

Comments

@kriskowal
Copy link
Member

Describe the bug

At a5de8dd, the following SwingSet error occurred in CI but did not reappear after re-running the workflow.

  activityhash-vs-start › comms initialize is deterministic

  Rejected promise returned by test. Reason:

  Error {
    message: 'historical inaccuracy in replay of v1',
  }
@kriskowal kriskowal added the bug Something isn't working label Feb 21, 2022
@michaelfig
Copy link
Member

@warner will this mean nondeterminism between validators on the chain?

@michaelfig michaelfig added the SwingSet package: SwingSet label Feb 22, 2022
@warner
Copy link
Member

warner commented Feb 22, 2022

note to self: the test was added to fix #3726, so the problem may resemble something from there

@warner
Copy link
Member

warner commented Feb 22, 2022

One test failure looked like:

  ✔ activityhash-vs-start › restarting kernel does not change activityhash (47s)
anachrophobia strikes vat v1
delivery completed with 3 expected syscalls remaining
expected: {"0":"vatstoreGet","1":"vom.rc.o-51","length":2}
expected: {"0":"vatstoreGet","1":"vom.rc.o-52","length":2}
expected: {"0":"vatstoreGet","1":"vom.rc.o-53","length":2}
REJECTED from ava test: (Error#1)
Error#1: historical inaccuracy in replay of v1
  at Object.finishReplayDelivery (.../swingset-vat/src/kernel/vatManager/transcript.js:87:23)
  at replayOneDelivery (.../swingset-vat/src/kernel/vatManager/manager-helper.js:189:23)
  at async Object.replayTranscript (.../swingset-vat/src/kernel/vatManager/manager-helper.js:220:9)
  at async ensureVatOnline (.../swingset-vat/src/kernel/vatManager/vat-warehouse.js:134:5)
  at async Object.start (.../swingset-vat/src/kernel/vatManager/vat-warehouse.js:177:7)
  at async Object.start (.../swingset-vat/src/kernel/kernel.js:936:5)
  at async makeSwingsetController (packages/SwingSet/src/controller.js:290:3)
  at async packages/SwingSet/test/test-activityhash-vs-start.js:118:14

  ✖ activityhash-vs-start › comms initialize is deterministic Rejected promise returned by test

in https://github.com/Agoric/agoric-sdk/runs/5277256769?check_suite_focus=true

@warner
Copy link
Member

warner commented Feb 22, 2022

Huh, that's not even comms (vom is the virtual object manager, which comes from liveslots, and comms doesn't use liveslots). The only other vat in that test is the bootstrap vat. That looks like the initialization of the VatData collection kinds.

@warner
Copy link
Member

warner commented Feb 23, 2022

This seems to be another instance of #3240. If I run just yarn test test/activityhash-vs-start.js -m "comms initialize is deterministic" by itself in a loop, nothing fails despite an hour of runtime. If I run both test cases within that file (AVA runs them in parallel), I get a failure almost immediately.

The comms test creates the first swingset, lets the bootstrap run, records the state, runs a second message, records the activity hash, then terminates the swingset. Then it launches a second swingset from the recorded state, runs the same message as before, and compares the activity hash.

In both passes, the second message causes a setReceiver() to be sent to an object within the bootstrap vat, with an argument pointing to an object exported by the comms vat. setReceiver() ignores its argument, so this import should be eligible for collection right away.

In my failing case, a slog trace of the first pass shows that imported object being collected during the bringOutYourDead that followed, as expected:

{"type":"deliver","vatID":"v1","deliveryNum":8,"replay":false,"vd":["message","o+4",{"method":"setReceiver","args":{"body":"[{\"@qclass\":\"slot\",\"index\":0}]","slots":["o-55"]},"result":null}]}
{"type":"deliver","vatID":"v1","deliveryNum":9,"replay":false,"vd":["bringOutYourDead"]}
{"type":"syscall","vatID":"v1","deliveryNum":9,"syscallNum":0,"replay":false,"vsc":["vatstoreGet","vom.rc.o-55"]}
{"type":"syscall","vatID":"v1","deliveryNum":9,"syscallNum":1,"replay":false,"vsc":["dropImports",["o-55"]]}
{"type":"syscall","vatID":"v1","deliveryNum":9,"syscallNum":2,"replay":false,"vsc":["retireImports",["o-55"]]}

but in the second pass, it was not:

{"type":"deliver","vatID":"v1","deliveryNum":8,"replay":false,"vd":["message","o+4",{"method":"setReceiver","args":{"body":"[{\"@qclass\":\"slot\",\"index\":0}]","slots":["o-55"]},"result":null}]}
{"type":"deliver","vatID":"v1","deliveryNum":9,"replay":false,"vd":["bringOutYourDead"]}

I think that running both test functions in parallel, on V8, on Node 14, caused the finalizers to not fire during bringOutYourDead on the second pass, therefore the VOM refcount checkers weren't notified about the Presence going away, therefore they did not make the vatstoreGet syscalls to check the on-disk refcount, and liveslots did not emit syscall.dropImports. In my local reproduction, the misbehavior occurred during the live (non-replayed) delivery, so the divergence was detected by the activity hash (one live delivery did syscalls, the other live delivery did not). Most unit tests ignore the activity hash, but this test checks it explicitly, so it caught this form of divergence.

In CI, I think the misbehavior occurred during the replay of the earlier (bootstrap()) delivery, so it made different syscalls than the original (the live delivery made syscalls, the replay did not). That divergence gets caught by the transcript manager. We have a handful of unit tests that exercise replay (not a lot, but still more than the ones that check the activity hash), and this test is also one of those. So it was doubly sensitive.

The fix, until we figure out the root cause behind #3240, is to find all unit tests that are (sensitive to transcript divergence OR sensitive to activity hash) and change them all to either 1: use config.defaultManagerType = 'xsnap', or 2: change the tests to use test.serial.

For this case, I'm going to change test-activityhash-vs-start.js to use test.serial.

warner added a commit that referenced this issue Feb 23, 2022
We don't understand the root cause, but our `gcAndFinalize()` doesn't always
work when 1: run on Node.js (not xsnap) and 2: AVA allows tests to run in
parallel. The problem happens somewhat more frequently on 14.x, at least in
CI.

These two tests exercise transcript replay (which needs GC to happen the same
way for both the original delivery and the replay), and compare the
activityHash (which, of course, is sensitive to all syscalls made, including
GC syscalls). They sometimes failed in CI.

We don't fully understand why gcAndFinalize doesn't work, but serializing the
tests (with test.serial) seems to address the problem.

refs #3240
closes #4617
@mergify mergify bot closed this as completed in #4644 Feb 23, 2022
@Tartuffo Tartuffo added this to the Mainnet 1 milestone Mar 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants