-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rewrite delivery-result processing, rewind failed upgrades #5730
Conversation
dbb1bf1
to
14fb81e
Compare
best reviewed one commit at a time, the meat is in the penultimate commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, albeit perhaps slightly too clever.
One minor stylistic nit, tweak or not according to your judgement.
14fb81e
to
be6c16f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general logic seems sound, and fairly well detailed. Thankfully it reduces the scope of some of the global state for vat termination / crank abort.
I am concerned that the no-op delivery cranks now report as 'crank'
where they previously were 'none'
.
Regarding type nits, in general I'd recommend letting types flow out and avoid explicitly specifying @returns
unless needed.
For the vatWarehouse
type loss however this is one of those cases where I wish we had noImplicitAny
because that regression is way too easy to miss.
The vatAdminMethargs feels very ad-hoc, but I suppose until we have better/more use cases for the kernel generating these calls, it'd be too early to abstract.
The must change is adding an explicit type to vatWarehouse
, but I'd also like to restore previous the policyInput behavior on no-delivery.
be6c16f
to
0e111cc
Compare
207e04f
to
fd33443
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
fd33443
to
1302d1f
Compare
Previously, `kernelKeeper.deductMeter()` both mutated the meter (downwards) and returned a pair of flags. `overflow` says "you exhausted the meter", and `notify` says "the value dropped below the notification threshold, you should tell somebody". This adds `checkMeter(proposedDeduction)`, which only returns a "the meter could handle that deduction" flag, and leaves the value unchanged. `deductMeter(deduction)` still performs the deduction, but only returns the `notify` flag. This will make it easier for a refactored kernel.js to check the deduction early, then apply it later, rather than needing to pass the `notify` data across multiple steps.
The kernel has a function named `terminateVat()`, which is called from both the delivery side (illegal syscall, `syscall.exit`, or meter overflow), and the device side (when someone asks vat-admin to terminate a vat from the outside). Part of this function's job is to shut down any worker that was online. Doing that is an async task. Previously, `terminateVat` was synchronous, and when it shut down the worker, it used `void` to disclaim interest on when exactly the shutdown occurred (as well as any errors encountered). This commit changes `terminateVat` to be `async`, and to wait for that shutdown to finish. This moves the `void` discomfort into `vat-admin-hooks.js`, where a new TODO is a reminder that we need to propagate the async-ness outwards, probably by converting the external shutdown request into a run-queue event, like `create-vat` and `upgrade-vat`.
This modifies the vat-warehouse API slightly, to provide a better set of methods for managing (killing) workers.
This commit modifies vat-upgrade to respond correctly to failures during the upgrade process, by rewinding the vat back to it's pre-upgrade state. If a failure occurs either during the `stopVat` sent to the old version, or during the `startVat` sent to the new version, the consequences of both deliveries will be unwound, the 'upgrade-vat' event popped from the delivery queue, and the worker abandoned. The next delivery to this vat will cause a new worker to be created from the previous state. The only state change that survives is a message to vat-admin, instructing it to reject the `E(adminFacet).upgradeVat()` result promise. From the outside world, it will appear as if upgrade was never requested, except that `upgradeVat()` rejects. To implement this, I rewrote the way delivery results are handled in kernel.js . `deliverAndLogToVat()` still returns `DeliveryStatus`, but each `processXYZ` function (`processSend`, `processVatUpgrade`, etc) is now responsible for transforming that `DeliveryStatus` into a new `CrankResult`. This `CrankResult` tells `processDeliveryMessage` what to do in a uniform way: abort the crankbuffer changes, terminate a vat, deduct a meter, etc. Most processor functions use a shared `deliveryCrankResults` helper function to build their results, which takes flags to indicate whether the delivery should deduct a meter, should decrement the BOYD/reapCount counter, etc. `processCreateVat` does its conversion in a different way, to merge the results of the two deliveries (`stopVat`, `startVat`), and to abandon the process in the middle if `stopVat` failed. This will also make it trivial to begin charging a meter for the work done during `startVat`, and to report this work to the runPolicy. However it does not do so yet. closes #5344
With this change, `startVat` computrons are now charged against the vat's meter, if any. They are also reported to the `runPolicy`. `startVat` includes both the swingset startup time and the userspace `buildRootObject()` consumption. Previously, userspace could perform arbitrary compute during `buildRootObject()` (up to the hard per-crank limit) without consequence. For metered vats, this meant the meter was not deducted for this early work. By omitting this work from the runPolicy, blocks which included a vat creation event were probably running for longer than we wanted. Some quick experiments show that it takes 8M computrons to start a vat, even for a fairly simple `buildRootObject()`. We still need to experiment to find out what a good block threshold should be, but this should give us more accurate information to work with.
Originally, we set `policyInput = ['none']` when a run-queue event did not result in a vat delivery (e.g. sends to a terminated vat, or promise-resolution notifications that were previously delivered as part of a resolution-graph batch). This branch regressed on that change slightly, using `['crank']` (with no computrons) even when no delivery was made. This commit fixes the branch to behave more like the original: `['none']` when no delivery is made. It also adds computrons to `['vat-terminated']`, which previously lacked computrons: this should give the run policy more accurate information to make runtime estimates.
There used to be three methods (killWorker, terminate, abandonWorker), but really we want exactly one of two things: * 1: make the worker stop, probably because we're terminating the vat, and we don't want the child process lingering * 2: make the worker stop and reset it's snapshot/transcript, because we're upgrading the vat, and we want to start with a fresh+empty slate with new source code The first case now uses `stopWorker`, which makes sure the child process is gone (and is idempotent, because vat-admin might be asked to terminate a vat which isn't currently paged-in). We can use this when terminating a vat, but in the future we can also use it when "pausing" a vat. In that case, some delivery will have triggered a problem, and we want to unwind that delivery, but *not* terminate the vat itself. To resume that vat, we'll need a worker that didn't observe the problematic delivery, so we need to stop that worker, so we can start a new one (from the previous snapshot/transcript). The second case now uses `resetWorker`, which both stops the child process, and also clears the snapshot/transcript settings. It doesn't delete the transcript per se (to preserve history for some deep/clever form of replay-based upgrade), but it does update the starting index, so subsequent worker starts will not replay any of the "old" deliveries).
1302d1f
to
0a2350c
Compare
This PR modifies vat-upgrade to respond correctly to failures during
the upgrade process, by rewinding the vat back to it's pre-upgrade
state.
From the outside world, it will appear as if upgrade was never
requested, except that
upgradeVat()
rejects.It also deducts the computrons used during
dispatch.startVat
(including the userspace
buildRootObject()
execution) from both thevat's Meter and the host application's
runPolicy
. Metered vatsshould plan to provide an additional (estimated) 8M computrons to pay
for the work done during startVat. For block-based host applications,
the blocks that include new vat creation should now be smaller and
closer to the policy-guided limits.
closes #5344