rewrite delivery-result processing, rewind failed upgrades #5730

warner · 2022-07-07T01:43:10Z

This PR modifies vat-upgrade to respond correctly to failures during
the upgrade process, by rewinding the vat back to it's pre-upgrade
state.

From the outside world, it will appear as if upgrade was never
requested, except that upgradeVat() rejects.

It also deducts the computrons used during dispatch.startVat
(including the userspace buildRootObject() execution) from both the
vat's Meter and the host application's runPolicy. Metered vats
should plan to provide an additional (estimated) 8M computrons to pay
for the work done during startVat. For block-based host applications,
the blocks that include new vat creation should now be smaller and
closer to the policy-guided limits.

closes #5344

warner · 2022-07-07T08:45:55Z

best reviewed one commit at a time, the meat is in the penultimate commit

FUDCo

This looks good, albeit perhaps slightly too clever.

One minor stylistic nit, tweak or not according to your judgement.

packages/SwingSet/src/kernel/kernel.js

mhofman

The general logic seems sound, and fairly well detailed. Thankfully it reduces the scope of some of the global state for vat termination / crank abort.

I am concerned that the no-op delivery cranks now report as 'crank' where they previously were 'none'.

Regarding type nits, in general I'd recommend letting types flow out and avoid explicitly specifying @returns unless needed.

For the vatWarehouse type loss however this is one of those cases where I wish we had noImplicitAny because that regression is way too easy to miss.

The vatAdminMethargs feels very ad-hoc, but I suppose until we have better/more use cases for the kernel generating these calls, it'd be too early to abstract.

The must change is adding an explicit type to vatWarehouse, but I'd also like to restore previous the policyInput behavior on no-delivery.

packages/SwingSet/src/kernel/kernel.js

packages/SwingSet/src/kernel/state/kernelKeeper.js

packages/SwingSet/src/kernel/vat-warehouse.js

packages/SwingSet/src/kernel/kernel.js

mhofman

LGTM!

Previously, `kernelKeeper.deductMeter()` both mutated the meter (downwards) and returned a pair of flags. `overflow` says "you exhausted the meter", and `notify` says "the value dropped below the notification threshold, you should tell somebody". This adds `checkMeter(proposedDeduction)`, which only returns a "the meter could handle that deduction" flag, and leaves the value unchanged. `deductMeter(deduction)` still performs the deduction, but only returns the `notify` flag. This will make it easier for a refactored kernel.js to check the deduction early, then apply it later, rather than needing to pass the `notify` data across multiple steps.

The kernel has a function named `terminateVat()`, which is called from both the delivery side (illegal syscall, `syscall.exit`, or meter overflow), and the device side (when someone asks vat-admin to terminate a vat from the outside). Part of this function's job is to shut down any worker that was online. Doing that is an async task. Previously, `terminateVat` was synchronous, and when it shut down the worker, it used `void` to disclaim interest on when exactly the shutdown occurred (as well as any errors encountered). This commit changes `terminateVat` to be `async`, and to wait for that shutdown to finish. This moves the `void` discomfort into `vat-admin-hooks.js`, where a new TODO is a reminder that we need to propagate the async-ness outwards, probably by converting the external shutdown request into a run-queue event, like `create-vat` and `upgrade-vat`.

This modifies the vat-warehouse API slightly, to provide a better set of methods for managing (killing) workers.

This commit modifies vat-upgrade to respond correctly to failures during the upgrade process, by rewinding the vat back to it's pre-upgrade state. If a failure occurs either during the `stopVat` sent to the old version, or during the `startVat` sent to the new version, the consequences of both deliveries will be unwound, the 'upgrade-vat' event popped from the delivery queue, and the worker abandoned. The next delivery to this vat will cause a new worker to be created from the previous state. The only state change that survives is a message to vat-admin, instructing it to reject the `E(adminFacet).upgradeVat()` result promise. From the outside world, it will appear as if upgrade was never requested, except that `upgradeVat()` rejects. To implement this, I rewrote the way delivery results are handled in kernel.js . `deliverAndLogToVat()` still returns `DeliveryStatus`, but each `processXYZ` function (`processSend`, `processVatUpgrade`, etc) is now responsible for transforming that `DeliveryStatus` into a new `CrankResult`. This `CrankResult` tells `processDeliveryMessage` what to do in a uniform way: abort the crankbuffer changes, terminate a vat, deduct a meter, etc. Most processor functions use a shared `deliveryCrankResults` helper function to build their results, which takes flags to indicate whether the delivery should deduct a meter, should decrement the BOYD/reapCount counter, etc. `processCreateVat` does its conversion in a different way, to merge the results of the two deliveries (`stopVat`, `startVat`), and to abandon the process in the middle if `stopVat` failed. This will also make it trivial to begin charging a meter for the work done during `startVat`, and to report this work to the runPolicy. However it does not do so yet. closes #5344

With this change, `startVat` computrons are now charged against the vat's meter, if any. They are also reported to the `runPolicy`. `startVat` includes both the swingset startup time and the userspace `buildRootObject()` consumption. Previously, userspace could perform arbitrary compute during `buildRootObject()` (up to the hard per-crank limit) without consequence. For metered vats, this meant the meter was not deducted for this early work. By omitting this work from the runPolicy, blocks which included a vat creation event were probably running for longer than we wanted. Some quick experiments show that it takes 8M computrons to start a vat, even for a fairly simple `buildRootObject()`. We still need to experiment to find out what a good block threshold should be, but this should give us more accurate information to work with.

Originally, we set `policyInput = ['none']` when a run-queue event did not result in a vat delivery (e.g. sends to a terminated vat, or promise-resolution notifications that were previously delivered as part of a resolution-graph batch). This branch regressed on that change slightly, using `['crank']` (with no computrons) even when no delivery was made. This commit fixes the branch to behave more like the original: `['none']` when no delivery is made. It also adds computrons to `['vat-terminated']`, which previously lacked computrons: this should give the run policy more accurate information to make runtime estimates.

There used to be three methods (killWorker, terminate, abandonWorker), but really we want exactly one of two things: * 1: make the worker stop, probably because we're terminating the vat, and we don't want the child process lingering * 2: make the worker stop and reset it's snapshot/transcript, because we're upgrading the vat, and we want to start with a fresh+empty slate with new source code The first case now uses `stopWorker`, which makes sure the child process is gone (and is idempotent, because vat-admin might be asked to terminate a vat which isn't currently paged-in). We can use this when terminating a vat, but in the future we can also use it when "pausing" a vat. In that case, some delivery will have triggered a problem, and we want to unwind that delivery, but *not* terminate the vat itself. To resume that vat, we'll need a worker that didn't observe the problematic delivery, so we need to stop that worker, so we can start a new one (from the previous snapshot/transcript). The second case now uses `resetWorker`, which both stops the child process, and also clears the snapshot/transcript settings. It doesn't delete the transcript per se (to preserve history for some deep/clever form of replay-based upgrade), but it does update the starting index, so subsequent worker starts will not replay any of the "old" deliveries).

warner added the SwingSet package: SwingSet label Jul 7, 2022

warner added this to the Mainnet 1 milestone Jul 7, 2022

warner self-assigned this Jul 7, 2022

warner mentioned this pull request Jul 7, 2022

vatPowers.exitWithFailure() should override later vatPowers.exit() #5731

Open

warner force-pushed the 5344-rewind-failed-upgrade branch 4 times, most recently from dbb1bf1 to 14fb81e Compare July 7, 2022 08:45

warner marked this pull request as ready for review July 7, 2022 08:45

warner requested review from FUDCo and mhofman July 7, 2022 08:46

FUDCo approved these changes Jul 8, 2022

View reviewed changes

packages/SwingSet/src/kernel/kernel.js Outdated Show resolved Hide resolved

warner force-pushed the 5344-rewind-failed-upgrade branch from 14fb81e to be6c16f Compare July 9, 2022 20:22

mhofman requested changes Jul 12, 2022

View reviewed changes

warner mentioned this pull request Jul 12, 2022

hard metering faults should report computrons to run policy #5752

Open

warner force-pushed the 5344-rewind-failed-upgrade branch from be6c16f to 0e111cc Compare July 12, 2022 20:57

warner requested a review from mhofman July 12, 2022 20:58

warner force-pushed the 5344-rewind-failed-upgrade branch 2 times, most recently from 207e04f to fd33443 Compare July 12, 2022 23:39

mhofman approved these changes Jul 13, 2022

View reviewed changes

warner force-pushed the 5344-rewind-failed-upgrade branch from fd33443 to 1302d1f Compare July 13, 2022 23:06

warner added the automerge:rebase Automatically rebase updates, then merge label Jul 13, 2022

warner added 7 commits July 14, 2022 07:07

chore: move vatWarehouse definition earlier to avoid lint complaints

59be3de

chore(swingset): vat-warehouse API changes

394185d

This modifies the vat-warehouse API slightly, to provide a better set of methods for managing (killing) workers.

chore: feedback updates

a22e193

warner added 2 commits July 14, 2022 07:07

warner force-pushed the 5344-rewind-failed-upgrade branch from 1302d1f to 0a2350c Compare July 14, 2022 07:07

mergify bot merged commit 9b13ea1 into master Jul 14, 2022

mergify bot deleted the 5344-rewind-failed-upgrade branch July 14, 2022 08:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rewrite delivery-result processing, rewind failed upgrades #5730

rewrite delivery-result processing, rewind failed upgrades #5730

warner commented Jul 7, 2022

warner commented Jul 7, 2022

FUDCo left a comment

mhofman left a comment

mhofman left a comment

rewrite delivery-result processing, rewind failed upgrades #5730

rewrite delivery-result processing, rewind failed upgrades #5730

Conversation

warner commented Jul 7, 2022

warner commented Jul 7, 2022

FUDCo left a comment

Choose a reason for hiding this comment

mhofman left a comment

Choose a reason for hiding this comment

mhofman left a comment

Choose a reason for hiding this comment