Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tolerate vatstore syscalls during vat startup #2910

Closed
warner opened this issue Apr 18, 2021 · 12 comments · Fixed by #4575
Closed

tolerate vatstore syscalls during vat startup #2910

warner opened this issue Apr 18, 2021 · 12 comments · Fixed by #4575
Assignees
Labels
bug Something isn't working SwingSet package: SwingSet
Milestone

Comments

@warner
Copy link
Member

warner commented Apr 18, 2021

Describe the bug

I realized the other day that, with the introduction of virtual objects, vats might attempt to create some during their startup phase: at the top-level module context, or inside buildRootObject() but outside of any remote message handler (methods of that root object). This will cause vatstoreSet() syscalls to happen much earlier than before. Our vat managers aren't currently prepared to honor syscalls this early.

We need to fix that, since we want to encourage Issuers to use virtual objects, and it's entirely reasonable for contracts to create some sort of singleton infrastructural Purse during startup.

The state changes that result from these syscalls (the kerneldb vatstore writes) should be lumped into the same atomic transaction as the clist creation and scheduling of the resolution of the vatAdmin vat-creation promise (the one that introduces the caller to the new vat's root object).

@warner warner added bug Something isn't working SwingSet package: SwingSet labels Apr 18, 2021
@warner warner self-assigned this Apr 18, 2021
@warner
Copy link
Member Author

warner commented Apr 18, 2021

In addition to #2911, we also need to rearrange the vatManager creation so that it gets the syscallHandler earlier. Currently, because of some vestigal limitation, we build a vatManager and get back a callback with which to provide the syscallHandler in a second step. I have a branch where I was able to rearrange this and deliver the handler in the manager-creation arguments instead.

@warner
Copy link
Member Author

warner commented Apr 18, 2021

In #2911 (comment) @FUDCo said:

Given that syscalls during startup aren't strictly necessary (we worked around it for the comms vat, after all)

I'm not sure we can avoid supporting syscalls during startup. We wrote the comms vat, so we can impose arbitrary restrictions on ourselves. But userspace code can make virtual objects at the top-most module context, outside of any method invocation:

const { makePurse } = makeKind(...);
const allPurses = new SpecialWeakMapThing();
const specialPurse = makePurse();
allPurses.set(specialPurse, { balance: 0 }); // or whatever

export function buildRootObject() {
 ...
}

The syscall.vatstoreSet() triggered by makePurse and/or allPurses.set will happen in the same turn that evaluates the source bundle. That will happen before we send any messages to the new vat, so they must be handled really early. And since state is changing, we need to commit those changes in a sensible way, so making this into it's own crank-like context feels like a good approach.

@katelynsills
Copy link
Contributor

We need to fix that, since we want to encourage Issuers to use virtual objects, and it's entirely reasonable for contracts to create some sort of singleton infrastructural Purse during startup.

Just wanted to question these requirements, as I think the requirements for Zoe and ERTP are actually different than this. Zoe contracts don't get run until a function in the ZCF contract instance vat is called, which is after buildRootObject is completed. The only way users can use ERTP issuers on-chain is in a Zoe contract, so for both Zoe and ERTP, there's currently no requirement to have virtual objects created in buildRootObject.

Were you thinking that this would change in the future? Or that we are making issuers in cosmic-swingset during the startup phase?

@warner
Copy link
Member Author

warner commented Apr 19, 2021

Good point, with our current contract-installation model, the only code with the opportunity for building virtual objects that early are static vats and the initial code bundle for dynamic vats (including ZCF). User-contributed contract code does not get that opportunity.

When I put my swingset/kernel/vat-platform hat on, all userspace is adversarial: static vats, dynamic vats, ZCF, everything. From that perspective, I can't control what userspace does, so I can either allow it or forbid it. Forbidding it means killing the vat if it tries to use a syscall too early. Allowing it means wiring up syscalls early enough to handle them (and recording the resulting state changes correctly). Forbidding it makes the userspace model slightly harder to explain.. it adds a funny footnote to the virtual object docs ("BTW don't use this until after buildRootObject and any promise-turns it creates have finished, under pain of vat death") that I'd be hard-pressed to justify ("why? because it's difficult? really?").

But with a higher-layer-of-abstraction-hat on (the Agoric/contract-platform one), yeah, we don't strictly need this: the people who write the code that gets to run this early (us) can obey the hard-to-justify restrictions imposed by those stubborn kernel people (also us).

I guess I want to minimize the assumptions we build into the lower layers of the system, and avoid tying the hands of developers in a way that will cause problems.

Your point about issuers in cosmic-swingset startup is a good one. The "virtual purses" feature (purses which represent low-level cosmos-sdk Bank module balances) need a special Issuer with some sort of device access. That will probably live in a static vat, and will probably involve virtual objects (for all the non-magic only-in-JS purses/payments created by that issuer). I don't know that it will have a need to create any of those early, though.

I think I'll reduce the priority of this feature a bit. If it's convenient to support as a side-effect of fixing something else (e.g. #2908) then it might get fixed earlier.

@warner warner changed the title tolerate vatstore syscalls during vat startup tolerate (or cleanly reject) vatstore syscalls during vat startup Apr 20, 2021
warner added a commit that referenced this issue Aug 18, 2021
The comms vat has a bunch of counters that need to be initialized exactly
once in the lifetime of the vat's durable state. There is a DB flag named
`initialize` to track whether this has happened already or not. Since vats
can only do syscalls during deliveries, not startup (see #2910), the comms
vat must read this flag on every delivery, just in case this is the very
first one, and the DB needs initialization.

The comms vat was using an in-RAM flag (`needToInitializeState`) to cache the
result, to avoid a DB read for every single delivery. However this caused the
syscall behavior to change for the first delivery after a kernel restart.
There were two extra syscalls taking place: a `vatstoreGet('initialized')`,
and a `vatstoreSet('meta.o+0', 'true')`. The `vatstoreSet` was causing a DB
change, which was picked up by the new kernel activityhash, and caused a
consensus failure on the first post-restart comms message.

This changes the comms vat to always read the DB flag, on every delivery,
making its behavior consistent and independent of kernel restarts. It also
moves the `vatstoreSet` to be guarded by the results of the `initialized`
check, so it only happens once in the lifetime of the vat.

To remove the need for the `vatstoreGet` on each delivery, we'll need to
enhance the way `setup()` is called, to provide an external flag that tells
the vat whether this is the very first time ever, or if it's merely a kernel
restart that required the vat's `dispatch` object to be reconstructed.

fixes #3726
warner added a commit that referenced this issue Aug 18, 2021
The comms vat has a bunch of counters that need to be initialized exactly
once in the lifetime of the vat's durable state. There is a DB flag named
`initialize` to track whether this has happened already or not. Since vats
can only do syscalls during deliveries, not startup (see #2910), the comms
vat must read this flag on every delivery, just in case this is the very
first one, and the DB needs initialization.

The comms vat was using an in-RAM flag (`needToInitializeState`) to cache the
result, to avoid a DB read for every single delivery. However this caused the
syscall behavior to change for the first delivery after a kernel restart.
There were two extra syscalls taking place: a `vatstoreGet('initialized')`,
and a `vatstoreSet('meta.o+0', 'true')`. The `vatstoreSet` was causing a DB
change, which was picked up by the new kernel activityhash, and caused a
consensus failure on the first post-restart comms message.

This changes the comms vat to always read the DB flag, on every delivery,
making its behavior consistent and independent of kernel restarts. It also
moves the `vatstoreSet` to be guarded by the results of the `initialized`
check, so it only happens once in the lifetime of the vat.

To remove the need for the `vatstoreGet` on each delivery, we'll need to
enhance the way `setup()` is called, to provide an external flag that tells
the vat whether this is the very first time ever, or if it's merely a kernel
restart that required the vat's `dispatch` object to be reconstructed.

fixes #3726
warner added a commit that referenced this issue Aug 18, 2021
The comms vat has a bunch of counters that need to be initialized exactly
once in the lifetime of the vat's durable state. There is a DB flag named
`initialize` to track whether this has happened already or not. Since vats
can only do syscalls during deliveries, not startup (see #2910), the comms
vat must read this flag on every delivery, just in case this is the very
first one, and the DB needs initialization.

The comms vat was using an in-RAM flag (`needToInitializeState`) to cache the
result, to avoid a DB read for every single delivery. However this caused the
syscall behavior to change for the first delivery after a kernel restart.
There were two extra syscalls taking place: a `vatstoreGet('initialized')`,
and a `vatstoreSet('meta.o+0', 'true')`. The `vatstoreSet` was causing a DB
change, which was picked up by the new kernel activityhash, and caused a
consensus failure on the first post-restart comms message.

This changes the comms vat to always read the DB flag, on every delivery,
making its behavior consistent and independent of kernel restarts. It also
moves the `vatstoreSet` to be guarded by the results of the `initialized`
check, so it only happens once in the lifetime of the vat.

To remove the need for the `vatstoreGet` on each delivery, we'll need to
enhance the way `setup()` is called, to provide an external flag that tells
the vat whether this is the very first time ever, or if it's merely a kernel
restart that required the vat's `dispatch` object to be reconstructed.

fixes #3726
michaelfig pushed a commit that referenced this issue Aug 18, 2021
The comms vat has a bunch of counters that need to be initialized exactly
once in the lifetime of the vat's durable state. There is a DB flag named
`initialize` to track whether this has happened already or not. Since vats
can only do syscalls during deliveries, not startup (see #2910), the comms
vat must read this flag on every delivery, just in case this is the very
first one, and the DB needs initialization.

The comms vat was using an in-RAM flag (`needToInitializeState`) to cache the
result, to avoid a DB read for every single delivery. However this caused the
syscall behavior to change for the first delivery after a kernel restart.
There were two extra syscalls taking place: a `vatstoreGet('initialized')`,
and a `vatstoreSet('meta.o+0', 'true')`. The `vatstoreSet` was causing a DB
change, which was picked up by the new kernel activityhash, and caused a
consensus failure on the first post-restart comms message.

This changes the comms vat to always read the DB flag, on every delivery,
making its behavior consistent and independent of kernel restarts. It also
moves the `vatstoreSet` to be guarded by the results of the `initialized`
check, so it only happens once in the lifetime of the vat.

To remove the need for the `vatstoreGet` on each delivery, we'll need to
enhance the way `setup()` is called, to provide an external flag that tells
the vat whether this is the very first time ever, or if it's merely a kernel
restart that required the vat's `dispatch` object to be reconstructed.

fixes #3726
@warner
Copy link
Member Author

warner commented Dec 22, 2021

In the most recent upgrade plan, ZCF in all non-initial versions will need to install and instantiate contract code very early, during buildRootObject() (or upgrade() or whatever we decide should execute at the start of the replacement vat's lifetime), before any other messages can be sent to the vat. We need that code to re-connect all virtual-object Kinds to their new behavior before anybody is allowed to talk to their instances, so liveslots should assert that every Kind created by the predecessor has been taken care of at the end of buildRootObject().

This will require (at least) syscall.vatStoreGet calls during startup, and probably vatStoreSet (to register new Kinds that didn't exist in the first version). The "baggage" might contain imports, which means we probably can't prevent user-level code from triggering syscall.send. The first version of the vat won't have any baggage, so buildRootObject at that stage doesn't have anyone to syscall.send to yet. But later versions might. We don't intend for Promises to survive the upgrade, so I wouldn't expect syscall.resolve to happen there.

@warner warner changed the title tolerate (or cleanly reject) vatstore syscalls during vat startup tolerate vatstore syscalls during vat startup Dec 30, 2021
@warner warner assigned FUDCo and unassigned warner Jan 19, 2022
@warner
Copy link
Member Author

warner commented Jan 19, 2022

@FUDCo is working on this now. The goal is to enable syscalls during buildRootObject(), but not attempt (and error cleanly) if any are made during top-level module evaluation (importBundle(vatBundle)). That means things like makeKind and makeBigMapStore outside of any function will cause an error during vat launch, possibly killing the vat (although user code could catch the error).

warner added a commit that referenced this issue Jan 22, 2022
Chip's work on #2910 discovered that the supervisor was not told about
failures during liveslot's dispatch(). This could conceal some bugs in
liveslots, as well as hiding userspace-caused failures during
`buildRootObject()`.

The contract between `dispatch()` and the calling supervisor code has changed
through various bouts of refactoring, and it was ambiguous as to whether
`dispatch()` was supposed to protect against userspace errors or not. This
commit clears up the documentation to make this more explicit.
FUDCo pushed a commit that referenced this issue Jan 22, 2022
Chip's work on #2910 discovered that the supervisor was not told about
failures during liveslot's dispatch(). This could conceal some bugs in
liveslots, as well as hiding userspace-caused failures during
`buildRootObject()`.

The contract between `dispatch()` and the calling supervisor code has changed
through various bouts of refactoring, and it was ambiguous as to whether
`dispatch()` was supposed to protect against userspace errors or not. This
commit clears up the documentation to make this more explicit.
@warner
Copy link
Member Author

warner commented Jan 24, 2022

I realized that syscalls during top-level module evaluation need to actually kill the vat, not just throw an error, because the new collection-manager's obtainStoreKindID() might be invoked at that time, and it has internal state which could be confused if allowed to continue operation after the syscall.vatstoreSet it attempts throws. If userspace caught the error, and allowed to the vat to live, the collection manager would be left half-initialized.

@warner
Copy link
Member Author

warner commented Jan 24, 2022

Note #3552 , which points out that top-level module code might use Promise.resolve().then(..) to retain agency after the initial importBundle promise fires. For an xsnap worker, the delivery won't finish until those turns run out, but that won't help anything in the supervisor. So whatever guard we implement really wants to be on the manager/kernel-process side. Probably in manager-helper at the point where it might redirect the syscall to the transcript. There should be a flag in that helper that says "syscalls are enabled", and if one arrives while it's clear, we should both kill the vat and send back a "illegal syscall, prepare to die" error result.

@warner
Copy link
Member Author

warner commented Jan 28, 2022

@FUDCo and I have a plan. We going with the "allow syscalls during top-level module code just in case" option, which means that the importBundle(vatBundle) needs to happen in the context of a delivery, when a transcript is ready to record the details of any syscalls being made. We give liveslots a new dispatch.startup() (name TBD) delivery to perform both the import and the call to buildRootObject(), which will be called exactly once in any vat's lifetime.

The changes we need to make are:

  • liveslots is currently built with a call to makeLiveSlots(), which returns { vatGlobals, inescapableGlobalProperties, setBuildRootObject, dispatch }. We change this to return just dispatch, and instead makeLiveSlots() accepts a function named createVatNamespace(vatGlobals, inescapableGlobalProperties)
  • supervisor-subprocess-xsnap.js has a setBundle(bundle) handler which currently does makeLiveSlots, importBundle(bundle), setBuildRootObject(vatNS.buildRootObject), and stashes the dispatch function. We change this to create a createVatNamespace function that closes over the bundle and performs the importBundle, returning vatNS, pass this into buildRootObject, and stashing dispatch. We let liveslots decide when to call this function.
    • Note: to avoid keeping the large (>2MB) bundle source around forever, let's make sure the bundle object is not accidentally kept alive by anything else. Liveslots will get this function as an argument, which (I think) means the function will live forever. Either we should change liveslots to explicitly createVatNamespace = null after calling it, or createVatNamespace should null out the bundle variable, or something
    • I'm thinking createVatNamespace instead of createBuildRootObject for future flexibility, to let liveslots see other exports of the vat bundle
  • We change liveslots to react to the startup delivery by calling createVatNamespace, extracting vatNS.buildRootObject, and calling it.
    • Basically the definition of setBuildRootObject is moved to become a dispatch method
    • the startup delivery should to include vatParameters, which moves it out of the arguments to setBundle
  • The dispatch.startup() call needs to be scheduled as the very first delivery to the new vat, for both static and dynamic vats. We want this delivery to happen promptly. The easiest way to do this is to add it to the transcript of the newly created vat record, with vatKeeper.addToTranscript(['startup', vatParameters]), immediately after the vat record is created. There are two places where this happens: src/kernel/initializeKernel.js (for static vats), and processCreateVat() in src/kernel/kernel.js (for dynamic vats). The diff is probably:
index 5b8fbed9c..eee7cf875 100644
--- a/packages/SwingSet/src/kernel/initializeKernel.js
+++ b/packages/SwingSet/src/kernel/initializeKernel.js
@@ -85,6 +85,7 @@ export function initializeKernel(config, hostStorage, verbose = false) {
       logStartup(`assigned VatID ${vatID} for genesis vat ${name}`);
       const vatKeeper = kernelKeeper.provideVatKeeper(vatID);
       vatKeeper.setSourceAndOptions({ bundle, bundleName }, creationOptions);
+      vatKeeper.addToTranscript(['startup', vatParameters]);
       vatKeeper.initializeReapCountdown(creationOptions.reapInterval);
       if (name === 'vatAdmin') {
         // Create a kref for the vatAdmin root, so the kernel can tell it
diff --git a/packages/SwingSet/src/kernel/kernel.js b/packages/SwingSet/src/kernel/kernel.js
index e1eceb4b7..c7c312121 100644
--- a/packages/SwingSet/src/kernel/kernel.js
+++ b/packages/SwingSet/src/kernel/kernel.js
@@ -701,6 +701,7 @@ export default function buildKernel(
       options.reapInterval = kernelKeeper.getDefaultReapInterval();
     }
     vatKeeper.setSourceAndOptions(source, options);
+    vatKeeper.addToTranscript(['startup', options.vatParameters]);
     vatKeeper.initializeReapCountdown(options.reapInterval);

     function makeSuccessResponse() {
  • for dynamic vats, this transcript entry will be "replayed" immediately even on the first time around (even though there was no previous "first time" around for it to be "played" originally).. they are born with a single false memory already implanted
  • for static vats, the same trick is used: the transcript is updated during initializeSwingset without a worker or xsnap in sight, which arranges for the startup to happen each time the vat is loaded without a heap snapshot

This will allow syscalls to be executed (and, more importantly, have their results recorded in the transcript) ...

.. results recorded ..

..

oh fudge

Ok, so injecting a transcript entry doesn't help: what we need are the results of that delivery, and we can't get that until we have a worker to deliver them to. This plan wouldn't make that happen during processCreateVat, and we can't do it at all during initializeSwingset.

Hrm, we could push something onto the kernel run-queue, as @FUDCo 's #4358 PR does, but I don't want the arbitrary delay between worker creation and buildRootObject execution which would happen if that's a dynamic vat.

Ok, gotta think about this some more. I'll add this comment even though it's not going to work in the form that we figured out today, sorry @FUDCo .

@warner
Copy link
Member Author

warner commented Jan 28, 2022

So the requirement is to actually make that startup delivery at the right time.

For a dynamic vat, I think that time is right after the manager is created:

--- a/packages/SwingSet/src/kernel/vatManager/vat-warehouse.js
+++ b/packages/SwingSet/src/kernel/vatManager/vat-warehouse.js
@@ -126,6 +126,9 @@ export function makeVatWarehouse(kernelKeeper, vatLoader, policyOptions) {
       }
     };
     const manager = await chooseLoader()(vatID, source, translators, options);
+    if (!recreate) {
+      await manager.deliver(['startup', options.vatParameters]);
+    }

     // TODO(3218): persist this option; avoid spinning up a vat that isn't pipelined
     const { enablePipelining = false } = options;

For a static vat.. maybe the best answer is the run-queue event that @FUDCo already implemented, pushed onto the run-queue during initializeSwingset, before pushing the bootstrap message. When the kernel starts for the first time, each static vat will get its startup event, then the bootstrap vat gets bootstrap, then things proceed as originally.

An alternative would be for vat-warehouse.js to be aware that it is bringing a static vat online for the first time, and to perform the manager.deliver() at that moment. That'd probably require it to look at the transcript length or something.

@warner
Copy link
Member Author

warner commented Jan 28, 2022

Oh, the "first time" predicate should be made explicit

if (!vatKeeper.hasBeenInitialized()) {
  await manager.deliver(['startup', options.vatParameters]);
}

ensureVatOnline might be the right place for this, or maybe in the two places that feed into it (one for static, one for dynamic) which are not exclusively used for resuming an existing dynamic vat.

@warner
Copy link
Member Author

warner commented Feb 8, 2022

@FUDCo https://github.com/Agoric/agoric-sdk/tree/2910-dispatch-startvat has the work we paired on

The current plan is to use Chip's start-vat run-queue event for static vats, pushed onto the queue by initializeSwingset, just before the bootstrap() delivery gets pushed. But for dynamic vats, we'll do the work from within the create-vat event handler (processCreateVat()), rather than using a separate run-queue event (which would introduce an unpredictable gap between the vat being created and the vat actually being ready for messages). We'll use a shared subroutine for both.

We want to ensure (test) that failures at various points of dynamic vat creation all result in a rejected promise going back to the parent, and all the state getting cleaned up.

FUDCo added a commit that referenced this issue Feb 17, 2022
FUDCo added a commit that referenced this issue Feb 17, 2022
FUDCo added a commit that referenced this issue Feb 22, 2022
FUDCo added a commit that referenced this issue Feb 22, 2022
@mergify mergify bot closed this as completed in #4575 Feb 22, 2022
@Tartuffo Tartuffo added this to the Mainnet 1 milestone Mar 23, 2022
@Tartuffo Tartuffo modified the milestones: Mainnet 1, RUN Protocol RC0 Apr 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants