Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrite delivery-result processing, rewind failed upgrades #5730

Merged
merged 9 commits into from
Jul 14, 2022

Conversation

warner
Copy link
Member

@warner warner commented Jul 7, 2022

This PR modifies vat-upgrade to respond correctly to failures during
the upgrade process, by rewinding the vat back to it's pre-upgrade
state.

From the outside world, it will appear as if upgrade was never
requested, except that upgradeVat() rejects.

It also deducts the computrons used during dispatch.startVat
(including the userspace buildRootObject() execution) from both the
vat's Meter and the host application's runPolicy. Metered vats
should plan to provide an additional (estimated) 8M computrons to pay
for the work done during startVat. For block-based host applications,
the blocks that include new vat creation should now be smaller and
closer to the policy-guided limits.

closes #5344

@warner warner added the SwingSet package: SwingSet label Jul 7, 2022
@warner warner added this to the Mainnet 1 milestone Jul 7, 2022
@warner warner self-assigned this Jul 7, 2022
@warner warner force-pushed the 5344-rewind-failed-upgrade branch 4 times, most recently from dbb1bf1 to 14fb81e Compare July 7, 2022 08:45
@warner warner marked this pull request as ready for review July 7, 2022 08:45
@warner
Copy link
Member Author

warner commented Jul 7, 2022

best reviewed one commit at a time, the meat is in the penultimate commit

@warner warner requested review from FUDCo and mhofman July 7, 2022 08:46
Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, albeit perhaps slightly too clever.

One minor stylistic nit, tweak or not according to your judgement.

packages/SwingSet/src/kernel/kernel.js Outdated Show resolved Hide resolved
@warner warner force-pushed the 5344-rewind-failed-upgrade branch from 14fb81e to be6c16f Compare July 9, 2022 20:22
Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general logic seems sound, and fairly well detailed. Thankfully it reduces the scope of some of the global state for vat termination / crank abort.

I am concerned that the no-op delivery cranks now report as 'crank' where they previously were 'none'.

Regarding type nits, in general I'd recommend letting types flow out and avoid explicitly specifying @returns unless needed.

For the vatWarehouse type loss however this is one of those cases where I wish we had noImplicitAny because that regression is way too easy to miss.

The vatAdminMethargs feels very ad-hoc, but I suppose until we have better/more use cases for the kernel generating these calls, it'd be too early to abstract.

The must change is adding an explicit type to vatWarehouse, but I'd also like to restore previous the policyInput behavior on no-delivery.

packages/SwingSet/src/kernel/kernel.js Show resolved Hide resolved
packages/SwingSet/src/kernel/kernel.js Show resolved Hide resolved
packages/SwingSet/src/kernel/vat-warehouse.js Show resolved Hide resolved
packages/SwingSet/src/kernel/kernel.js Show resolved Hide resolved
packages/SwingSet/src/kernel/kernel.js Show resolved Hide resolved
packages/SwingSet/src/kernel/kernel.js Show resolved Hide resolved
packages/SwingSet/src/kernel/kernel.js Show resolved Hide resolved
packages/SwingSet/src/kernel/kernel.js Show resolved Hide resolved
packages/SwingSet/src/kernel/kernel.js Show resolved Hide resolved
@warner warner force-pushed the 5344-rewind-failed-upgrade branch from be6c16f to 0e111cc Compare July 12, 2022 20:57
@warner warner requested a review from mhofman July 12, 2022 20:58
@warner warner force-pushed the 5344-rewind-failed-upgrade branch 2 times, most recently from 207e04f to fd33443 Compare July 12, 2022 23:39
Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@warner warner force-pushed the 5344-rewind-failed-upgrade branch from fd33443 to 1302d1f Compare July 13, 2022 23:06
@warner warner added the automerge:rebase Automatically rebase updates, then merge label Jul 13, 2022
warner added 7 commits July 14, 2022 07:07
Previously, `kernelKeeper.deductMeter()` both mutated the
meter (downwards) and returned a pair of flags. `overflow` says "you
exhausted the meter", and `notify` says "the value dropped below the
notification threshold, you should tell somebody".

This adds `checkMeter(proposedDeduction)`, which only returns a "the
meter could handle that deduction" flag, and leaves the value
unchanged.

`deductMeter(deduction)` still performs the deduction, but only
returns the `notify` flag.

This will make it easier for a refactored kernel.js to check the
deduction early, then apply it later, rather than needing to pass the
`notify` data across multiple steps.
The kernel has a function named `terminateVat()`, which is called from
both the delivery side (illegal syscall, `syscall.exit`, or meter
overflow), and the device side (when someone asks vat-admin to
terminate a vat from the outside).

Part of this function's job is to shut down any worker that was
online. Doing that is an async task.

Previously, `terminateVat` was synchronous, and when it shut down the
worker, it used `void` to disclaim interest on when exactly the
shutdown occurred (as well as any errors encountered).

This commit changes `terminateVat` to be `async`, and to wait for that
shutdown to finish.

This moves the `void` discomfort into `vat-admin-hooks.js`, where a
new TODO is a reminder that we need to propagate the async-ness
outwards, probably by converting the external shutdown request into a
run-queue event, like `create-vat` and `upgrade-vat`.
This modifies the vat-warehouse API slightly, to provide a better set
of methods for managing (killing) workers.
This commit modifies vat-upgrade to respond correctly to failures
during the upgrade process, by rewinding the vat back to it's
pre-upgrade state.

If a failure occurs either during the `stopVat` sent to the old
version, or during the `startVat` sent to the new version, the
consequences of both deliveries will be unwound, the 'upgrade-vat'
event popped from the delivery queue, and the worker abandoned. The
next delivery to this vat will cause a new worker to be created from
the previous state. The only state change that survives is a message
to vat-admin, instructing it to reject the
`E(adminFacet).upgradeVat()` result promise.

From the outside world, it will appear as if upgrade was never
requested, except that `upgradeVat()` rejects.

To implement this, I rewrote the way delivery results are handled in
kernel.js . `deliverAndLogToVat()` still returns `DeliveryStatus`, but
each `processXYZ` function (`processSend`, `processVatUpgrade`, etc)
is now responsible for transforming that `DeliveryStatus` into a new
`CrankResult`. This `CrankResult` tells `processDeliveryMessage` what
to do in a uniform way: abort the crankbuffer changes, terminate a
vat, deduct a meter, etc.

Most processor functions use a shared `deliveryCrankResults` helper
function to build their results, which takes flags to indicate whether
the delivery should deduct a meter, should decrement the
BOYD/reapCount counter, etc.

`processCreateVat` does its conversion in a different way, to merge
the results of the two deliveries (`stopVat`, `startVat`), and to
abandon the process in the middle if `stopVat` failed.

This will also make it trivial to begin charging a meter for the work
done during `startVat`, and to report this work to the
runPolicy. However it does not do so yet.

closes #5344
With this change, `startVat` computrons are now charged against the
vat's meter, if any. They are also reported to the `runPolicy`.

`startVat` includes both the swingset startup time and the userspace
`buildRootObject()` consumption. Previously, userspace could perform
arbitrary compute during `buildRootObject()` (up to the hard per-crank
limit) without consequence. For metered vats, this meant the meter was
not deducted for this early work.

By omitting this work from the runPolicy, blocks which included a vat
creation event were probably running for longer than we wanted.

Some quick experiments show that it takes 8M computrons to start a
vat, even for a fairly simple `buildRootObject()`. We still need to
experiment to find out what a good block threshold should be, but this
should give us more accurate information to work with.
warner added 2 commits July 14, 2022 07:07
Originally, we set `policyInput = ['none']` when a run-queue event did
not result in a vat delivery (e.g. sends to a terminated vat, or
promise-resolution notifications that were previously delivered as
part of a resolution-graph batch). This branch regressed on that
change slightly, using `['crank']` (with no computrons) even when no
delivery was made.

This commit fixes the branch to behave more like the original:
`['none']` when no delivery is made. It also adds computrons to
`['vat-terminated']`, which previously lacked computrons: this should
give the run policy more accurate information to make runtime
estimates.
There used to be three methods (killWorker, terminate, abandonWorker),
but really we want exactly one of two things:

* 1: make the worker stop, probably because we're terminating the vat,
     and we don't want the child process lingering
* 2: make the worker stop and reset it's snapshot/transcript, because
     we're upgrading the vat, and we want to start with a fresh+empty slate
     with new source code

The first case now uses `stopWorker`, which makes sure the child
process is gone (and is idempotent, because vat-admin might be asked
to terminate a vat which isn't currently paged-in). We can use this
when terminating a vat, but in the future we can also use it when
"pausing" a vat. In that case, some delivery will have triggered a
problem, and we want to unwind that delivery, but *not* terminate the
vat itself. To resume that vat, we'll need a worker that didn't
observe the problematic delivery, so we need to stop that worker, so
we can start a new one (from the previous snapshot/transcript).

The second case now uses `resetWorker`, which both stops the child
process, and also clears the snapshot/transcript settings. It doesn't
delete the transcript per se (to preserve history for some deep/clever
form of replay-based upgrade), but it does update the starting index,
so subsequent worker starts will not replay any of the "old"
deliveries).
@warner warner force-pushed the 5344-rewind-failed-upgrade branch from 1302d1f to 0a2350c Compare July 14, 2022 07:07
@mergify mergify bot merged commit 9b13ea1 into master Jul 14, 2022
@mergify mergify bot deleted the 5344-rewind-failed-upgrade branch July 14, 2022 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:rebase Automatically rebase updates, then merge SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

failed vat upgrades should be rewound
3 participants