vat-container options: XS, Worker, WASM, etc #1127

Closed · warner opened this issue May 21, 2020 · 15 comments
Labels: metering (charging for execution; was: package: tame-metering and transform-metering), SwingSet (package: SwingSet), xsnap (the XS execution tool)


warner commented May 21, 2020

As part of moving swingset to new-SES (in particular where exactly SES gets initialized, and how Compartments and the metering transform figure into it), I'm drawing up a "Vat Container" taxonomy. Here are my thoughts so far:

Assumptions

  • We can get small changes into XS to support what we're doing, either by writing C code that uses the XS API to configure/connect XS instances, or by working with upstream to get changes into XS proper
  • We can't make any changes to Node.js, V8, or any browser. Writing browser extensions and C/C++-based plugins for Node.js is possible, but not preferred, because it's a hassle.

Use Cases

  • Agoric Chain Node: Must be fully deterministic, defensive against malice (both for integrity and for availability), fully metered (all nodes must agree about when exactly they give up on long-running computation), efficiently restartable (checkpoints). Can be a special binary: users (at least validators) are willing to invest considerable effort in building/running/monitoring.
  • Agoric local/wallet node: might be in-browser, might be standalone local binary. Must be fairly easy to get running (users will generally be less willing to invest effort than for chain nodes). Does not need to be fully deterministic: must avoid hangover inconsistency but does not have to participate in consensus with other nodes. Must be defensive for integrity, but maybe not for availability if it does not run externally-supplied code, so it might not need complete metering (and does not need metering to be defensive).
  • general-purpose Vat runner (beyond the Agoric chain ecosystem): like local wallet node but made for running untrusted code, so must be defensive for availability too.

Target Platforms

  • Node.js (like what we're doing now)
  • native XS (SwingSet on xs: exploratory prototypes #47): we build a binary that contains the swingset host, static vats, communication plugins, and all the devices it requires
  • web browser (Can SwingSet run on browser? #58): use browser-local storage for persistence, HTTP/WS/WebRTC for comms. Code comes from a web server or an addon. We could also build an Electron-style app with similar properties/technology.
    • within a web browser, vats might run directly as JS, or we might compile XS to WASM and run the vat JS inside the XS instance

Resulting Properties

Sync vs Async Vats, Large vs Small Capacity

The environment we build might support sync vats, or it might only support async vats. The difference is that sync vats are able to make a synchronous/blocking syscall.deviceRead (#55) call, which invokes a device immediately and returns the resulting data directly. async vats can make all other syscalls (including the async commit-later deviceWrite), but they cannot call deviceRead. We can rewrite most devices (in particular the Timer device) to only require deviceWrite, but the big thing that needs deviceRead is the #455 "hierarchical object identifiers", which enables large tables that live in secondary storage, for Mints with zillions of Purse objects. We've also considered a "bulk-storage API" (#512) and "large-string device" (#46) which would need synchronous reads.

We'll use "large capacity" to describe environments that support these large tables (because they support sync vats), and "small capacity" to describe ones that don't (because all the vat-side state must essentially be kept on the heap in RAM).

Checkpointing: Efficient Long-Term Vat State Management

Our current vats enjoy orthogonal persistence: a carefree belief that they live forever, without the need to explicitly save or restore their state. This is enabled by our kernel, which maintains a transcript of all messages into and out of the vat over time. When the kernel is restarted and the vat needs to be regenerated from stored data, it rebuilds the initial state (by evaluating the original code that created the vat's root object), and then replays the entire transcript, delivering all inbound messages just like they happened the first time around (and comparing+discarding the outbound messages).

To make this efficient enough to use on a chain that runs for years, we must be able to snapshot the JS heap state (#511), one vat at a time, and reload that data into a new instance later. This will also help with scalability, because idle vats can be serialized off to disk and evicted from RAM, then "rehydrated" later when a message arrives for them. We won't checkpoint after every crank or block, but rather we'll periodically flatten the previous checkpoint and subsequent transcript down into a new checkpoint and an empty transcript.
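
As a minimal sketch (all helper names here are hypothetical), reload would look something like:

// Rebuild a vat from storage: restore the newest checkpoint if one
// exists, otherwise re-evaluate the original source, then replay the
// (possibly empty) transcript tail on top of it.
async function reloadVat(storage, vatID) {
  const snapshot = await storage.getCheckpoint(vatID); // may be undefined
  const vat = snapshot
    ? rehydrateFromSnapshot(snapshot)          // restore saved heap state
    : await evaluateVatSource(storage, vatID); // rebuild initial state
  for (const entry of await storage.getTranscript(vatID)) {
    const outbound = vat.dispatch(entry.delivery); // replay inbound message
    compareAndDiscard(outbound, entry.syscalls);   // outputs must match
  }
  return vat;
}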

Some environments we build will support this checkpointing: it is certainly critical for the chain nodes. Other environments might not, and would be suitable for low-volume or short-lived nodes, but would become increasingly inefficient to reload as their history grows.

Metering

Vat code might run too large (memory consumption), or too long (infinite loops, infinite Promise chains). This might happen because of simple bugs in otherwise-trusted code, higher usage patterns than we expected, or malice.

To protect the availability of other code running in the same environment, we need a way to shut down the over-enthusiastic computation, either by cancelling the one message (with some semantics that lets both caller and callee recover), or by terminating the vat entirely (#516) (which, while harsh, is easier to reason about). A more sophisticated metering scheme could notify some "keeper" code and give it the opportunity to pay for additional space or CPU cycles.

When swingset nodes are running in a chain environment, the decision of exactly when to give up on runaway computation must be deterministic, and made identically across the validators. This requires more precise metering than a solo environment.

We describe the metering abilities of an environment as:

  • cpu-none: no ability to limit CPU usage
  • cpu-timeout: coarse non-deterministic limits: if a message doesn't finish processing within N seconds, cancel the message or terminate the vat
  • cpu-exact: deterministic metering suitable for a chain node
  • ram-none: no limit on memory usage
  • ram-coarse: use process-level or JS engine-level estimate of bytes used, highly sensitive to GC and other factors, not deterministic
  • ram-exact: deterministic memory-usage metering suitable for chain node

From an operator's point of view, it would be nice to assert a limit on MB of RAM in use, or CPU-seconds consumed per block. In particular, a chain's block time is limited by how quickly it can process messages, which is measured in seconds. However, the meters we use may not be denominated in bytes or seconds, especially if they are deterministic, since those low-level measures are highly sensitive to unrelated factors like GC pressure and OS-level CPU load.

Implementation Techniques

This is a collection of information about our target platforms; tools we can use to implement the features described above.

HostDB sync-vs-async

p.s. see #3171

The Node.js-based kernel currently enjoys synchronous access to secondary storage, in the form of LMDB. Our choices of database were limited by the requirement for synchronous access: we were unable to use the wider "LevelDB" ecosystem because they offer a purely asynchronous API.

This also prevents us from using secondary storage in a browser environment, where IndexedDB is purely async (and LocalStorage is limited to 10MB, too small to be useful), or even tertiary storage, where we push an encrypted copy of the bulk data to an external server via HTTP when necessary (which is less vulnerable to eviction or the 2GB storage limit).

To enable sync vats, either the host must offer synchronous storage access, or the vat must somehow be paused while it waits for a read-device syscall.

Workers

Our target platforms offer various kinds of "Workers":

  • Node.js: Worker Threads
  • Browser: Worker
  • XS: offers Worker, and/or our C code can create a new XS instance, which is basically the same thing

In most cases the Worker runs on a separate (concurrent) thread, although XS gives us more control over scheduling.

In some environments a Worker can be suspended while waiting on data from outside. We're told this suspend-the-instance feature won't be too hard to implement in XS.

The benefits of a Worker are:

  • some environments (Node) offer per-Worker usage limits, which could be used for coarse metering
  • all Workers can be preemptively terminated (from the outside), which might be triggered by meter overflow
  • if the Worker can be paused, we can accommodate async-only host storage, such as LevelDB or the browser's IndexedDB

The general idea is that the kernel thread sends a postMessage to the worker when it wants to initiate a crank (specifically when it wants to invoke one of the vat's dispatch methods, like dispatch.deliver or dispatch.notifyFulfillToData). Inside the worker, a supervisor uses setImmediate to track when the vat becomes quiescent, then invokes the vat code. While the vat runs, any asynchronous syscalls it makes are converted by the supervisor into postMessage calls back up to the kernel. When the vat becomes quiescent, the supervisor notifies the kernel with another postMessage, allowing the kernel to commit (or reject) the syscalls and finish the crank.
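
A sketch of the worker-side half of that protocol, in Node.js worker_threads terms (names and message shapes are illustrative):

// Runs inside the Worker. We schedule setImmediate *before* invoking
// dispatch, so it fires only after every promise turn spawned by the
// vat has drained: that is quiescence.
import { parentPort } from 'worker_threads';

function startSupervisor(dispatch) {
  parentPort.on('message', delivery => {
    setImmediate(() => parentPort.postMessage({ type: 'crank-done' }));
    dispatch(delivery); // may invoke syscall() any number of times
  });
}

// Asynchronous syscalls become postMessage calls back up to the kernel.
function syscall(vso) {
  parentPort.postMessage({ type: 'syscall', vso });
}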

To reduce the overhead of copying, we might take advantage of transferable ArrayBuffer objects. Node.js, at least, has an API to transfer ownership of a buffer entirely from one Worker to another thread. We could perhaps put the capdata.body string (created by JSON serialization) into one of these, and transfer it to the kernel rather than using the "structured copy" feature of postMessage. If the kernel could hang on to this buffer in the run-queue, it could transfer it to the target vat later, again without copying. It remains to be seen whether this would be useful.

To enable synchronous syscalls (deviceRead), the Worker must be suspended while waiting on the kernel's response. If we're not running in a modified XS engine, we anticipate using SharedArrayBuffer and Atomics.wait. The general idea is that the Worker and the kernel share a moderate-sized SharedArrayBuffer for the response, as well as a small one used for coordination. deviceRead uses postMessage to send the request and the arguments to the kernel, and then waits on the synchronization buffer with Atomics.wait. The kernel receives the request and invokes the device (which may involve various delays and Promises). When the response is ready, the kernel writes the response data into the shared response buffer, then writes a signal into the synchronization buffer. The Worker wakes up, reads the data out of the response buffer, and returns it to the caller in the vat. If the response could not fit in the buffer, the two sides can use a streaming protocol (one segment per round trip) to transfer everything out.
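
In code, the blocking path might look something like this (a sketch: the buffer layout and the encode/decode helpers are made up for illustration):

// Worker (vat) side: `sync` is an Int32Array on a small SharedArrayBuffer;
// slot 0 is a ready flag, slot 1 holds the response's byte length.
function deviceRead(port, sync, respBuf, request) {
  Atomics.store(sync, 0, 0);                  // clear the ready flag
  port.postMessage({ type: 'deviceRead', request });
  Atomics.wait(sync, 0, 0);                   // sleep while flag === 0
  return decodeResponse(respBuf, Atomics.load(sync, 1));
}

// Kernel side, once the device's (possibly Promise-based) work finishes:
function answerDeviceRead(sync, respBuf, data) {
  const len = encodeResponse(respBuf, data);  // copy into shared buffer
  Atomics.store(sync, 1, len);
  Atomics.store(sync, 0, 1);                  // raise the ready flag...
  Atomics.notify(sync, 0);                    // ...and wake the worker
}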

SharedArrayBuffer and Atomics.wait were disabled in many browsers a few years ago to mitigate Spectre/Meltdown attacks. They are likely to work on Node.js, and can probably be made to work on Chrome with a bunch of exciting headers. They remain disabled in Firefox for now. As a result, some platforms will support Worker-based synchronous syscalls, while some will not.

If we put each vat into a separate Worker (#1107), we can achieve the "cpu-timeout" style of metering by starting a timer when we dispatch a message into the Worker, and abandoning the delivery if the timer expires before the Worker becomes idle. This is not deterministic, and will vary depending upon CPU load and other external factors, but it would be faster (it can be implemented without the code transformations that inject metering, which add overhead), and might be good enough for development use.
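
A sketch of that policy with Node's worker_threads (the 5-second figure and the message shapes are placeholders):

// Deliver a message; if the worker doesn't report crank-done in time,
// terminate it. Non-deterministic, but needs no injected metering code.
async function deliverWithTimeout(worker, delivery, ms = 5000) {
  const done = new Promise(resolve => {
    const onMsg = m => {
      if (m.type === 'crank-done') {
        worker.off('message', onMsg);
        resolve('ok');
      }
    };
    worker.on('message', onMsg);
  });
  worker.postMessage(delivery);
  const expired = new Promise(r => setTimeout(r, ms, 'timeout'));
  if (await Promise.race([done, expired]) === 'timeout') {
    await worker.terminate(); // abandon the delivery (or kill the vat)
    return 'terminated';
  }
  return 'ok';
}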

SES

We want all our code (kernel, vats, untrusted guest code) to run under SES. We see three ways to get a SES environment:

  • import the SES-shim and invoke lockdown() very very early in the lifetime of the application, immediately after running any "vetted shims" (such as tame-metering which instruments global objects to count invocations and allocations)
  • run under XS, which is effectively all-SES all the time
  • run in some future JS engine, where the SES Proposal has been implemented, in which some as-yet-unspecified activation mechanism has been invoked

Each new JS environment needs to be SES-ified, so if we're using Workers, the SES shim must be invoked inside each new Worker before it loads any other code.
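
So the first lines run inside any new Worker look roughly like this (a sketch; the module names aside, the ordering is the important part):

// Worker bootstrap: vetted shims first, then lockdown(), then everything
// else. Nothing untrusted may load before lockdown() has run.
import '@agoric/tame-metering'; // vetted shim: instrument globals (metered vats only)
import 'ses';                   // provides lockdown() and Compartment
lockdown();                     // freeze the primordials: we are now in SES
// ...only now do we load the supervisor, liveslots, and the vat code.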

Metering Transforms

One way to apply exact metering (without making changes to the underlying JS engine) is to inject counters into all basic blocks (function calls, for loops, etc), and have them throw an exception when exceeded. try/catch blocks are similarly instrumented to prevent these meter-exhausted exceptions from being caught. The globals must also be modified, because some can be called in a way that consumes a lot of memory or CPU.

The injected code looks for a special name (currently getMeter) to acquire control over the counters. Most counters (function counts, memory usage) are increment-only, but the stack-frame counter is increment+decrement. As a result, to prevent confined code from fraudulently decrementing its stack counter (enabling it to perform infinite recursion), the transform must also prohibit the confined code from referencing this name. getMeter must be placed in the global lexical scope (where it will be visible to the injected code), but must not be added to the globalThis object (where confined code could reach it with a computed property lookup, which is impossible to prohibit: that would require solving the halting problem).

This doesn't tell us exactly how many bytes are being used, or how many CPU cycles have been consumed. Just as Ethereum's EVM assigns gas costs to each opcode, the injected counters assign a cost to each function call, loop iteration, or Object creation. These costs will have a vague, non-linear, but hopefully monotonically-increasing "good enough" relationship with the bytes or CPU-seconds that users really care about. But we describe injected metering as "exact" because the behavior should be consistent across JS engines (modulo the known non-determinisms of the JS specification), and insensitive to things like GC behavior or CPU load from unrelated processes.
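
For illustration (this is not the actual transform output), metered code might be rewritten along these lines:

// Before the transform:
function fib(n) {
  return n <= 1 ? n : fib(n - 1) + fib(n - 2);
}
// After (schematically): each basic block charges the meter, and the
// stack-frame counter is the one increment+decrement counter.
function fib(n) {
  getMeter().enterFrame();   // throws when the stack meter is exhausted
  try {
    getMeter().count(1);     // charge for this call's basic block
    return n <= 1 ? n : fib(n - 1) + fib(n - 2);
  } finally {
    getMeter().exitFrame();
  }
}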

This transform must be inescapable. The guest code being metered must be transformed before it is evaluated. In addition, any eval calls used by that code must be changed to apply the transform first. Other pathways to evaluation, such as creating a new Compartment or importing a module, must be similarly instrumented.

Transformed code runs slower, so we want to disable the counters when possible, and avoid injecting code at all unless necessary. We do not need to transform trusted code like the kernel (and perhaps the static vats). The kernel needs to enable the global-object instrumentation just before it gives control to a vat, and disable it when the kernel gets control back again.

Compartments

The Compartment Proposal (currently part of SES, but not strictly bound to it) defines a Compartment as a collection of:

  • a distinct global object
  • a set of evaluators which reference that global: c.evaluate(code), plus internal eval calls
  • a module loader, so c.import(what) can load a complete module graph into the Compartment
  • a list of transforms applied to all evaluated code

To enforce a transform on a Compartment, we must also wrap its own Compartment constructor with one that propagates the transforms option. We expect this to live in a library function, rather than being a feature of the Compartment API itself.
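
That library function might look roughly like this (the option names and the recursion detail are illustrative):

// Evaluate guest code only inside Compartments that apply our transforms,
// and make sure guest-created child Compartments get the same treatment.
function makeTransformedCompartment(endowments, transforms) {
  const c = new Compartment(endowments, {}, { transforms });
  function WrappedCompartment(end = {}, _mods = {}, opts = {}) {
    const merged = [...transforms, ...(opts.transforms || [])];
    return makeTransformedCompartment(end, merged); // children wrapped too
  }
  c.globalThis.Compartment = harden(WrappedCompartment);
  return c;
}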

We need to use at least three Compartments. The first is the all-powerful "start compartment", whose global provides access to platform authorities like filesystem and network access. The kernel and all non-metered vats should be loaded into a separate less-powerful Compartment, to apply POLA (the kernel does not need arbitrary filesystem access). And all metered vats should go into a third Compartment, where the metering transform is applied.

It is probably easier to use a distinct Compartment for each vat. We don't strictly need a separate global object for each vat (they'll all be the same: frozen and powerless), but they might have different module-loader configurations, and some will have the metering transform enforced.

Compartments do not provide any sort of preemptive termination, but we might run each vat inside a separate Worker, and then create a Compartment inside the Worker for the vat's code to live in.

XS

The Moddable XS platform offers a low-footprint JS engine written in C. This lacks many of the high-performance JIT features found in V8 or SpiderMonkey, but has several compelling benefits for our system.

The first is safety: the code is much much smaller, and is more likely to be auditable than something like V8. Both are written in non-memory-safe languages, unfortunately, but XS is written in plain C, whereas V8 uses C++ aggressively. The lack of a JIT makes XS's behavior far more predictable.

We believe XS will implement the JS specification more deterministically. The spec has many areas where the exact behavior is left up to the implementor. For chain nodes, where exact consensus must be reached, we need to define the precise behavior of our platform. We expect to define this as "whatever XS does", and build a list of details as we identify ways in which XS diverges from other engines. It should be possible to write vat code such that it does not trigger this divergent behavior; this will become part of our style guides and linting tools. Chain nodes cannot rely upon this, of course, but we can make sure that local vat applications (non-XS-based, probably running on Node.js or in a browser) behave the same way under non-malicious circumstances.

XS is far more easily extendable than a larger engine like V8. We anticipate adding metering instrumentation into the XS codebase (to count function calls and allocations), rather than continuing our source-to-source transformation at the Compartment level. This should have far less overhead and be easier to enforce without e.g. wrapping eval and Compartment.

XS already has most of the code necessary to serialize a Worker (specifically the underlying XS instance) to data, and we're working on the code to unserialize that data back into a running instance. With some touchups to a few objects, we think we can turn this into a new Worker, enabling an efficient save/restore persistence story.

With XS, we can control the communication pathways between Workers, and can probably suspend a Worker while it waits for a syscall to finish. This will allow the vat Worker to think it has a synchronous read-device syscall, while in fact the kernel side is making asynchronous calls to retrieve the necessary state.

XS defines modules much more explicitly than Node or a browser. It has eval (if enabled), but code cannot do arbitrary import statements and expect them to work. An XS application is built from a manifest that defines all the modules that can ever be referenced. Node.js lets you run node main.js and then the code can name other local files to import. For XS, you run a build step like make application that reads main.js and all the rest of your code, and compiles it all into a single executable. Then later you run that executable. This affects the way we build and launch our swingset applications, as well as requiring some careful integration with the Compartment module loader to accommodate the dynamic modules that will be installed into new vats.

Finally, XS is a C program, which means we might compile it down into WASM, and then execute it in a WASM instance. This is most interesting in a browser.

WASM

All modern browsers (as well as Node.js) offer WASM execution engines. Running a vat inside a WASM instance offers some intriguing properties:

  • the WASM specification is extremely deterministic, making consensus behavior somewhat easier to achieve
  • memory usage is explicitly limited by the ArrayBuffer that backs the instance, with a protocol between WASM instance and the host to ask for more, which may be useful for metering
  • persistence can be achieved by simply writing that memory buffer to disk somewhere

However WASM instances run WASM bytecode, not JavaScript. To use this, we'll need to compile XS (written in C) down into WASM (we know this works already, and WASM is an officially supported platform for XS). In doing so, we can apply other customizations, like suspending the vat while it makes an apparently-synchronous syscall (which turns into a WASM-to-host import invocation, which is synchronous but which can be answered by a subsequent host-to-WASM export invocation, to unsuspend the vat). This can enable synchronous vats in a host that only offers asynchronous storage (i.e. IndexedDB).

The Supervisor

If we use Workers, the first code we will execute inside each Worker will be a Supervisor. This must:

  • import tame-metering (if this vat is metered) to instrument the global objects for metering
  • import SES and call lockdown() (if the engine isn't already SES)
  • create a new Compartment for the liveslots and vat code, so it doesn't get access to start-compartment globals
  • install the metering transform into the Compartment (if this vat is metered), and wrap its Compartment object to enforce this transform on internal Compartments
  • prepare an outbound channel for asynchronous syscalls, wrapping the postMessage object available to the start compartment
  • prepare a SharedArrayBuffer for synchronous syscalls, if possible
  • import the vat root-object definition module into the Compartment. The code for this might come from a postMessage event.
  • import the liveslots module into the Compartment
  • invoke liveslots, let it get the root object, connect to its dispatch/syscall objects
  • wait for the kernel to deliver a request

On each delivery, the supervisor must use setImmediate to monitor quiescence, then configure metering, then invoke the liveslots dispatch function. While this runs, it may invoke various syscalls, which must be delivered to the kernel (and, for synchronous deviceRead calls, the results must be returned). When setImmediate fires, the supervisor knows the vat code has lost agency, and cannot regain control until the next delivery is made. At this point, it notifies the kernel that the crank is complete, and goes back to sleep waiting for a new postMessage delivery.


warner commented May 21, 2020

I put together a program to try and analyze our options:

from __future__ import print_function
import sys

# options for swingset in browser, node, and XS

headerstr = "case  XS    WASM  XSmet JSmet Worker Work/Vat Atomics SyncDB:   capacity save     meter-CPU   meter-RAM"
dashstr   = "----  --    ----  ----- ----- ------ -------- ------- ------    -------- ----     ---------   ---------"
formatstr = "{:<5} {:<5} {:<5} {:<5} {:<5} {:<5}  {:<5}    {:<5}   {:<5}     {:<5}    {:<8} {:<11} {}"
def bools(b):
    return "yes" if b else "no"

def describe(case, limitations,
             xs, wasm_and_xs, xs_metering, js_metering,
             vats_in_worker, worker_per_vat, atomics,
             sync_hostdb):
    if "browser" in limitations or "leveldb" in limitations:
        if xs and not wasm_and_xs:
            return
        if sync_hostdb:
            return
    if "firefox" in limitations:
        if xs and not wasm_and_xs:
            return
        if sync_hostdb or atomics:
            return
    if "node" in limitations:
        if xs or wasm_and_xs:
            return
    # wasm_and_xs, and we change XS to suspend the JS thread while waiting
    # for an async wasm-to-host call. Then the JS host can use an async DB.
    suspended_syscalls = (vats_in_worker and atomics) or wasm_and_xs
    sync_syscalls = (sync_hostdb and not vats_in_worker) or suspended_syscalls

    cpu_metering = "cpu-none"
    if (vats_in_worker and worker_per_vat):
        cpu_metering = "cpu-timeout"
    if (wasm_and_xs and xs_metering) or js_metering:
        cpu_metering = "cpu-exact"

    memory_metering = "ram-none"
    if (wasm_and_xs and xs_metering) or js_metering:
        memory_metering = "ram-exact"

    if sync_syscalls:
        capacity = "large" # large object tables can live in DB
    else:
        capacity = "small" # objects must live in RAM

    serialize = "save-no"
    if wasm_and_xs or (xs and vats_in_worker and worker_per_vat):
        serialize = "save-yes"

    print(formatstr.format(case,
                           bools(xs), bools(wasm_and_xs), bools(xs_metering), bools(js_metering),
                           bools(vats_in_worker), bools(worker_per_vat), bools(atomics),
                           bools(sync_hostdb),

                           capacity, serialize, cpu_metering, memory_metering))


case = 1
limitations = sys.argv[1].split(",") if len(sys.argv) > 1 else []
print(headerstr, file=sys.stderr)
print(dashstr, file=sys.stderr)
for xs in [False, True]:
    for wasm_and_xs in [False, True] if xs else [False]:
        for xs_metering in [False, True] if wasm_and_xs else [False]:
            for js_metering in [False, True]:
                for vats_in_worker in [False, True]:
                    for worker_per_vat in [False, True] if vats_in_worker else [False]:
                        # 'atomics' means SharedArrayBuffer, Atomics, which
                        # requires lots of headers to enable in a browser, and
                        # not available yet in some browsers. If vats aren't run
                        # in a worker, it doesn't matter
                        for atomics in [False, True] if vats_in_worker else [False]:
                            for sync_hostdb in [False, True]: # LMDB=true, leveldb=false
                                describe(case, limitations,
                                         xs, wasm_and_xs, xs_metering, js_metering,
                                         vats_in_worker, worker_per_vat, atomics,
                                         sync_hostdb)
                                case += 1


warner commented May 21, 2020

The columns are:

  • XS: do we use XS at all?
  • WASM: do we use WASM at all?
  • XSmet: do we instrument XS itself (which we then compile to WASM) to implement metering?
  • JSmet: do we use source-to-source transformations to inject metering into JS code?
  • Worker: do we use Workers at all?
  • Work/Vat: do we use one Worker per vat?
  • Atomics: can we use SharedArrayBuffer and Atomics.wait to pause a worker until the kernel can return data from a synchronous deviceRead syscall?
  • SyncDB: does the kernel have synchronous access to secondary storage (like LMDB)? "no" means it only has async access (like LevelDB or IndexedDB)

In each case, the properties of the resulting environment are:

  • capacity: "large" if the environment can use secondary storage for vat data, enabling large tables (hierarchical identifiers). "small" if it cannot, and RAM usage will grow with table size
  • save: "save-yes" if vat states can be checkpointed and saved, "save-no" if the transcript grows without bound
  • meter-CPU: "cpu-none" means we have no visibility or control into CPU usage, "cpu-timeout" means non-deterministic cutoff of cranks that take too long, "cpu-exact" means deterministic termination after some fixed number of high-level JS operations
  • meter-RAM: "ram-none" means no visibility or control of memory usage, "ram-exact" means deterministic termination after some fixed number of high-level JS allocations

The full output is:

% python vat-options.py
case  XS    WASM  XSmet JSmet Worker Work/Vat Atomics SyncDB:   capacity save     meter-CPU   meter-RAM
----  --    ----  ----- ----- ------ -------- ------- ------    -------- ----     ---------   ---------
1     no    no    no    no    no     no       no      no        small    save-no  cpu-none    ram-none
2     no    no    no    no    no     no       no      yes       large    save-no  cpu-none    ram-none
3     no    no    no    no    yes    no       no      no        small    save-no  cpu-none    ram-none
4     no    no    no    no    yes    no       no      yes       small    save-no  cpu-none    ram-none
5     no    no    no    no    yes    no       yes     no        large    save-no  cpu-none    ram-none
6     no    no    no    no    yes    no       yes     yes       large    save-no  cpu-none    ram-none
7     no    no    no    no    yes    yes      no      no        small    save-no  cpu-timeout ram-none
8     no    no    no    no    yes    yes      no      yes       small    save-no  cpu-timeout ram-none
9     no    no    no    no    yes    yes      yes     no        large    save-no  cpu-timeout ram-none
10    no    no    no    no    yes    yes      yes     yes       large    save-no  cpu-timeout ram-none
11    no    no    no    yes   no     no       no      no        small    save-no  cpu-exact   ram-exact
12    no    no    no    yes   no     no       no      yes       large    save-no  cpu-exact   ram-exact
13    no    no    no    yes   yes    no       no      no        small    save-no  cpu-exact   ram-exact
14    no    no    no    yes   yes    no       no      yes       small    save-no  cpu-exact   ram-exact
15    no    no    no    yes   yes    no       yes     no        large    save-no  cpu-exact   ram-exact
16    no    no    no    yes   yes    no       yes     yes       large    save-no  cpu-exact   ram-exact
17    no    no    no    yes   yes    yes      no      no        small    save-no  cpu-exact   ram-exact
18    no    no    no    yes   yes    yes      no      yes       small    save-no  cpu-exact   ram-exact
19    no    no    no    yes   yes    yes      yes     no        large    save-no  cpu-exact   ram-exact
20    no    no    no    yes   yes    yes      yes     yes       large    save-no  cpu-exact   ram-exact
21    yes   no    no    no    no     no       no      no        small    save-no  cpu-none    ram-none
22    yes   no    no    no    no     no       no      yes       large    save-no  cpu-none    ram-none
23    yes   no    no    no    yes    no       no      no        small    save-no  cpu-none    ram-none
24    yes   no    no    no    yes    no       no      yes       small    save-no  cpu-none    ram-none
25    yes   no    no    no    yes    no       yes     no        large    save-no  cpu-none    ram-none
26    yes   no    no    no    yes    no       yes     yes       large    save-no  cpu-none    ram-none
27    yes   no    no    no    yes    yes      no      no        small    save-yes cpu-timeout ram-none
28    yes   no    no    no    yes    yes      no      yes       small    save-yes cpu-timeout ram-none
29    yes   no    no    no    yes    yes      yes     no        large    save-yes cpu-timeout ram-none
30    yes   no    no    no    yes    yes      yes     yes       large    save-yes cpu-timeout ram-none
31    yes   no    no    yes   no     no       no      no        small    save-no  cpu-exact   ram-exact
32    yes   no    no    yes   no     no       no      yes       large    save-no  cpu-exact   ram-exact
33    yes   no    no    yes   yes    no       no      no        small    save-no  cpu-exact   ram-exact
34    yes   no    no    yes   yes    no       no      yes       small    save-no  cpu-exact   ram-exact
35    yes   no    no    yes   yes    no       yes     no        large    save-no  cpu-exact   ram-exact
36    yes   no    no    yes   yes    no       yes     yes       large    save-no  cpu-exact   ram-exact
37    yes   no    no    yes   yes    yes      no      no        small    save-yes cpu-exact   ram-exact
38    yes   no    no    yes   yes    yes      no      yes       small    save-yes cpu-exact   ram-exact
39    yes   no    no    yes   yes    yes      yes     no        large    save-yes cpu-exact   ram-exact
40    yes   no    no    yes   yes    yes      yes     yes       large    save-yes cpu-exact   ram-exact
41    yes   yes   no    no    no     no       no      no        large    save-yes cpu-none    ram-none
42    yes   yes   no    no    no     no       no      yes       large    save-yes cpu-none    ram-none
43    yes   yes   no    no    yes    no       no      no        large    save-yes cpu-none    ram-none
44    yes   yes   no    no    yes    no       no      yes       large    save-yes cpu-none    ram-none
45    yes   yes   no    no    yes    no       yes     no        large    save-yes cpu-none    ram-none
46    yes   yes   no    no    yes    no       yes     yes       large    save-yes cpu-none    ram-none
47    yes   yes   no    no    yes    yes      no      no        large    save-yes cpu-timeout ram-none
48    yes   yes   no    no    yes    yes      no      yes       large    save-yes cpu-timeout ram-none
49    yes   yes   no    no    yes    yes      yes     no        large    save-yes cpu-timeout ram-none
50    yes   yes   no    no    yes    yes      yes     yes       large    save-yes cpu-timeout ram-none
51    yes   yes   no    yes   no     no       no      no        large    save-yes cpu-exact   ram-exact
52    yes   yes   no    yes   no     no       no      yes       large    save-yes cpu-exact   ram-exact
53    yes   yes   no    yes   yes    no       no      no        large    save-yes cpu-exact   ram-exact
54    yes   yes   no    yes   yes    no       no      yes       large    save-yes cpu-exact   ram-exact
55    yes   yes   no    yes   yes    no       yes     no        large    save-yes cpu-exact   ram-exact
56    yes   yes   no    yes   yes    no       yes     yes       large    save-yes cpu-exact   ram-exact
57    yes   yes   no    yes   yes    yes      no      no        large    save-yes cpu-exact   ram-exact
58    yes   yes   no    yes   yes    yes      no      yes       large    save-yes cpu-exact   ram-exact
59    yes   yes   no    yes   yes    yes      yes     no        large    save-yes cpu-exact   ram-exact
60    yes   yes   no    yes   yes    yes      yes     yes       large    save-yes cpu-exact   ram-exact
61    yes   yes   yes   no    no     no       no      no        large    save-yes cpu-exact   ram-exact
62    yes   yes   yes   no    no     no       no      yes       large    save-yes cpu-exact   ram-exact
63    yes   yes   yes   no    yes    no       no      no        large    save-yes cpu-exact   ram-exact
64    yes   yes   yes   no    yes    no       no      yes       large    save-yes cpu-exact   ram-exact
65    yes   yes   yes   no    yes    no       yes     no        large    save-yes cpu-exact   ram-exact
66    yes   yes   yes   no    yes    no       yes     yes       large    save-yes cpu-exact   ram-exact
67    yes   yes   yes   no    yes    yes      no      no        large    save-yes cpu-exact   ram-exact
68    yes   yes   yes   no    yes    yes      no      yes       large    save-yes cpu-exact   ram-exact
69    yes   yes   yes   no    yes    yes      yes     no        large    save-yes cpu-exact   ram-exact
70    yes   yes   yes   no    yes    yes      yes     yes       large    save-yes cpu-exact   ram-exact
71    yes   yes   yes   yes   no     no       no      no        large    save-yes cpu-exact   ram-exact
72    yes   yes   yes   yes   no     no       no      yes       large    save-yes cpu-exact   ram-exact
73    yes   yes   yes   yes   yes    no       no      no        large    save-yes cpu-exact   ram-exact
74    yes   yes   yes   yes   yes    no       no      yes       large    save-yes cpu-exact   ram-exact
75    yes   yes   yes   yes   yes    no       yes     no        large    save-yes cpu-exact   ram-exact
76    yes   yes   yes   yes   yes    no       yes     yes       large    save-yes cpu-exact   ram-exact
77    yes   yes   yes   yes   yes    yes      no      no        large    save-yes cpu-exact   ram-exact
78    yes   yes   yes   yes   yes    yes      no      yes       large    save-yes cpu-exact   ram-exact
79    yes   yes   yes   yes   yes    yes      yes     no        large    save-yes cpu-exact   ram-exact
80    yes   yes   yes   yes   yes    yes      yes     yes       large    save-yes cpu-exact   ram-exact


warner commented May 21, 2020

If we limit the options to browsers that either don't enable Atomics/SharedArrayBuffer, or we don't figure out how to make them work, we get:

% python vat-options.py firefox
case  XS    WASM  XSmet JSmet Worker Work/Vat Atomics SyncDB:   capacity save     meter-CPU   meter-RAM
----  --    ----  ----- ----- ------ -------- ------- ------    -------- ----     ---------   ---------
1     no    no    no    no    no     no       no      no        small    save-no  cpu-none    ram-none
3     no    no    no    no    yes    no       no      no        small    save-no  cpu-none    ram-none
7     no    no    no    no    yes    yes      no      no        small    save-no  cpu-timeout ram-none
11    no    no    no    yes   no     no       no      no        small    save-no  cpu-exact   ram-exact
13    no    no    no    yes   yes    no       no      no        small    save-no  cpu-exact   ram-exact
17    no    no    no    yes   yes    yes      no      no        small    save-no  cpu-exact   ram-exact
41    yes   yes   no    no    no     no       no      no        large    save-yes cpu-none    ram-none
43    yes   yes   no    no    yes    no       no      no        large    save-yes cpu-none    ram-none
47    yes   yes   no    no    yes    yes      no      no        large    save-yes cpu-timeout ram-none
51    yes   yes   no    yes   no     no       no      no        large    save-yes cpu-exact   ram-exact
53    yes   yes   no    yes   yes    no       no      no        large    save-yes cpu-exact   ram-exact
57    yes   yes   no    yes   yes    yes      no      no        large    save-yes cpu-exact   ram-exact
61    yes   yes   yes   no    no     no       no      no        large    save-yes cpu-exact   ram-exact
63    yes   yes   yes   no    yes    no       no      no        large    save-yes cpu-exact   ram-exact
67    yes   yes   yes   no    yes    yes      no      no        large    save-yes cpu-exact   ram-exact
71    yes   yes   yes   yes   no     no       no      no        large    save-yes cpu-exact   ram-exact
73    yes   yes   yes   yes   yes    no       no      no        large    save-yes cpu-exact   ram-exact
77    yes   yes   yes   yes   yes    yes      no      no        large    save-yes cpu-exact   ram-exact

The simplest/weakest implementation (case 1) wouldn't use Workers at all: each vat goes into a separate Compartment, everything is under SES, but vat code could kill the swingset by allocating too much memory or going into an infinite loop. We'd have to use full transcripts for all persistence, and vats could not call deviceRead.

The barest minimum of metering we could do is case 7, which prevents runaway CPU usage, but could still kill the machine through excessive memory usage. This doesn't inject metering code, but does put each vat into a separate Worker. The kernel would simply apply a timeout to each delivery, and if it doesn't complete within a few seconds, the vat is killed. This might be enough for developers.

The simplest way to achieve coarse (non-deterministic) protection against CPU and memory exhaustion is case 11: inject metering code, but not use Workers. This doesn't enable secondary storage, deviceRead, or snapshots. This is the best we can do (in a non-Atomics-enabled browser) without involving XS.

To achieve long-term operation in a browser, we need snapshots, which we can achieve by compiling XS to WASM, running each vat in a separate WASM instance, and saving the WASM linear memory. This is case 41. Since we're using XS, we can modify it to suspend vat execution while the kernel does async storage work, so we don't need Atomics. If we also want full metering, we can add it by either instrumenting the XS code (case 61) or injecting metering into the JS code (case 51).


warner commented May 21, 2020

If we look at a browser that does have an Atomics.wait we can use (python vat-options.py browser), we get these additional options, all of which enable large-capacity vats:

case  XS    WASM  XSmet JSmet Worker Work/Vat Atomics SyncDB:   capacity save     meter-CPU   meter-RAM
----  --    ----  ----- ----- ------ -------- ------- ------    -------- ----     ---------   ---------
5     no    no    no    no    yes    no       yes     no        large    save-no  cpu-none    ram-none
9     no    no    no    no    yes    yes      yes     no        large    save-no  cpu-timeout ram-none
15    no    no    no    yes   yes    no       yes     no        large    save-no  cpu-exact   ram-exact
19    no    no    no    yes   yes    yes      yes     no        large    save-no  cpu-exact   ram-exact
45    yes   yes   no    no    yes    no       yes     no        large    save-yes cpu-none    ram-none
49    yes   yes   no    no    yes    yes      yes     no        large    save-yes cpu-timeout ram-none
55    yes   yes   no    yes   yes    no       yes     no        large    save-yes cpu-exact   ram-exact
59    yes   yes   no    yes   yes    yes      yes     no        large    save-yes cpu-exact   ram-exact
65    yes   yes   yes   no    yes    no       yes     no        large    save-yes cpu-exact   ram-exact
69    yes   yes   yes   no    yes    yes      yes     no        large    save-yes cpu-exact   ram-exact
75    yes   yes   yes   yes   yes    no       yes     no        large    save-yes cpu-exact   ram-exact
79    yes   yes   yes   yes   yes    yes      yes     no        large    save-yes cpu-exact   ram-exact

Case 5 adds large-capacity to our simple case 1 (no checkpointing, no metering), by running both the kernel and all vats in a single shared Worker, and suspending that worker when it wants the main thread to do an async storage operation. Case 9 adds large-capacity vats to our case 7 (basic runaway CPU protection).

We still cannot get vat snapshots without going to XS+WASM.


warner commented May 21, 2020

If we look at Node.js, without XS, we get:

% python vat-options.py node
case  XS    WASM  XSmet JSmet Worker Work/Vat Atomics SyncDB:   capacity save     meter-CPU   meter-RAM
----  --    ----  ----- ----- ------ -------- ------- ------    -------- ----     ---------   ---------
1     no    no    no    no    no     no       no      no        small    save-no  cpu-none    ram-none
2     no    no    no    no    no     no       no      yes       large    save-no  cpu-none    ram-none
3     no    no    no    no    yes    no       no      no        small    save-no  cpu-none    ram-none
4     no    no    no    no    yes    no       no      yes       small    save-no  cpu-none    ram-none
5     no    no    no    no    yes    no       yes     no        large    save-no  cpu-none    ram-none
6     no    no    no    no    yes    no       yes     yes       large    save-no  cpu-none    ram-none
7     no    no    no    no    yes    yes      no      no        small    save-no  cpu-timeout ram-none
8     no    no    no    no    yes    yes      no      yes       small    save-no  cpu-timeout ram-none
9     no    no    no    no    yes    yes      yes     no        large    save-no  cpu-timeout ram-none
10    no    no    no    no    yes    yes      yes     yes       large    save-no  cpu-timeout ram-none
11    no    no    no    yes   no     no       no      no        small    save-no  cpu-exact   ram-exact
12    no    no    no    yes   no     no       no      yes       large    save-no  cpu-exact   ram-exact
13    no    no    no    yes   yes    no       no      no        small    save-no  cpu-exact   ram-exact
14    no    no    no    yes   yes    no       no      yes       small    save-no  cpu-exact   ram-exact
15    no    no    no    yes   yes    no       yes     no        large    save-no  cpu-exact   ram-exact
16    no    no    no    yes   yes    no       yes     yes       large    save-no  cpu-exact   ram-exact
17    no    no    no    yes   yes    yes      no      no        small    save-no  cpu-exact   ram-exact
18    no    no    no    yes   yes    yes      no      yes       small    save-no  cpu-exact   ram-exact
19    no    no    no    yes   yes    yes      yes     no        large    save-no  cpu-exact   ram-exact
20    no    no    no    yes   yes    yes      yes     yes       large    save-no  cpu-exact   ram-exact

Here, we cannot get snapshots, but we can get large-capacity and deterministic metering in case 12, by injecting metering code into JS, and using a synchronous DB like LMDB.

If we want to use an async DB like LevelDB, we use case 15: inject JS metering, use a worker to suspend vats upon read. We can do this with one-worker-per-vat (19), or a shared worker for the kernel and all vats (15).


warner commented May 23, 2020

Endgame

This comment describes my half-baked plans for the targets I currently have in mind.

The plan involves a VatWorker object (which sometimes, but not always, involves a platform-level Worker). There will be one of these for each Vat. Its API will be something like p = vw.dispatch(dispatchData, syscall). The kernel will invoke this API each time a message needs to be sent to the vat. The details of the message (dispatch.deliver, dispatch.notifyFulfillToData, etc) go into dispatchData. The kernel provides the syscall argument: initially it blocks and returns data, but eventually we switch to one that returns a promise for whatever data it might return (i.e. for readDevice). All syscalls are single-file: once we move to async syscalls, it will become illegal for a VatWorker to call syscall again until the previous return Promise is resolved. The VatWorker will have an internal queue to avoid doing that.
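
The single-file queue might be as simple as a promise chain (a sketch):

// Each syscall waits for the previous one's Promise to settle before
// being forwarded, so the kernel never sees two syscalls in flight.
function makeSerialSyscall(kernelSyscall) {
  let tail = Promise.resolve();
  return function syscall(vso) {
    const result = tail.then(() => kernelSyscall(vso));
    tail = result.catch(() => {}); // a rejection must not wedge the queue
    return result;
  };
}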

The kernel provides this syscall to vw.dispatch each time. This is a reminder that the syscall object is not supposed to be called in between dispatches. The VatWorker will forget the syscall object at the same time it resolves the return promise. At some point it might make sense for the kernel to provide a different syscall object each time, which closes over the kernel state as of the start of the block, or something.

The vat code itself will live in some sort of platform-specific container: it might be a Worker across a postMessage boundary, with maybe a blocking Atomics.wait-based pathway, or an XS instance, with maybe a special blocking glue-code pathway, or a WASM instance, or it might just be a direct function call. @michaelfig tells me there are coroutine libraries that could help here. The VatWorker will interact with the vat code through this pathway, telling the vat what it needs to do, and giving it a way to make syscalls. This makes the VatWorker responsible for platform-specific threads, blocking, SharedArrayBuffer, special XS magic, etc.

The vw.dispatch Promise doesn't fire until the vat is idle once more, so the kernel knows that no more syscalls will be made until the next activation (the vat has "lost agency"). In the meantime, the VatWorker might invoke syscall some number of times. The Promise needs to resolve to some indication of how things went: success, illegal operation (terminating the vat), meter exhaustion (also terminating the vat, for now). We might have syscall return information about illegal operations to let the VatWorker shut things down earlier if the vat is being terminated.

These VatWorkers are made by a function named makeVatWorker(), which the controller (in the start compartment) provides to the kernel (confined in a SES compartment). makeVatWorker needs to take arguments that define the vat being loaded. Once we refactor vat configuration to make liveslots-based vats the common case (removing the setup(..) { return helpers.liveslots.createLiveSlots(..) } boilerplate), the makeVatWorker() arguments might be a bundle of the liveslots code, plus a bundle of the static vat code. Whatever they are, the arguments must be copyable data (because there may be a Worker/postMessage boundary involved).

The data that defines static vats will come from the controller: it passes something into the kernel with addGenesisVat, and then the kernel passes it back into the controller's makeVatWorker function. For our Node-based environment, these can be source bundles as created by bundleSource. For XS, where things are more statically defined, they might just be module names.

We still have to figure out how we're going to generate the static module records for dynamic vats when the (XS) platform really prefers pre-compiled modules. MFig points out that we need eval in XS anyways, for dynamic vats, so we should avoid using the static-compiled-modules thing and lean on importBundle and Compartment.eval for everything: kernel, static vats, and dynamic vats. To avoid a separate yarn build step (which we always forget), we can build these bundles in a Node.js program that then immediately launches the XS program, keeping the generated bundle files on disk somewhere the XS program can find them.

The goal is POLA, and we could put the controller in a separate Compartment to reduce the authority it gets. The start compartment is used to grab the bundles from disk, and set up the HostDB, and then everything else is done by the reduced-authority Controller Compartment. At that point it might make sense to merge the controller and kernel somehow.

Intermediate Steps

There will be some intermediate steps, to retain a working system despite the time it will take to figure out and implement the following:

  • XS: modifying XS to enable a synchronous/blocking cross-Worker call
  • browser: figuring out the headers necessary to enable a SharedArrayBuffer/Atomics.wait-based blocking cross-Worker call
  • figuring out coroutines or fibers or something to suspend workers
  • rewriting the Timer device to use only async calls
  • XS: serializing/reloading XS instances (i.e. Workers)
  • anything involving WASM

I currently think the sequence will be:

  • implement VatWorker in Node.js, without any actual Worker threads, just repartitioning the API. This changes vatManager.js where it instantiates the vat, and handles dispatch/syscall pathways. Most of the code is around using Compartment and metering.
  • rewrite the kernel's calls to vatManager to use VatWorker instead. Initially the syscall is invoked synchronously.
  • split callNow into readDevice and writeDevice, both synchronous. We change the timer device to (ab)use readDevice
  • rewrite the Timer Device to use writeDevice, not readDevice. This means having the caller allocate the ID, or allocate a temporary ID that can be used to correlate the create-timer request with the eventual response. We might need to change the dIBC device in a similar way. At this point nobody is calling readDevice, and nobody is paying attention to the return value of any syscall, but they are all still synchronous
  • next, we change syscalls to be degenerately asynchronous. The kernel provides an async syscall to vw.dispatch that just adds async to each function, or equivalently replaces the return x with return Promise.resolve(x) (at this point, x is always undefined). The kernel uses a flag to guard against reentrancy for later. The VatWorker adds a single-file queue to make sure it only makes one syscall at a time.

After that sequence, we can do a couple different things in parallel:

Async Kernel

Now that kernel syscalls are allowed to return a Promise, the kernel can start using async operations internally. The syscall handlers which do c-list lookups become async, then we change the Keepers to return Promises (initially degenerate Promise.resolve(retval)), and work our way down from the top.

That kernel reentrancy guard is now pretty important. We need a clear story for external inputs (inbox device calls from the host) to make them all single-file, probably some helper function which wraps the exported device function and delays invoking it if the kernel is already busy doing something. All possible pathways into the kernel should go through this queue. We should provide a clear place for pre-crank and post-crank operations to happen, especially the post-block commit point.
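
That helper might look like this (a sketch; the pre/post-crank hooks are elided):

// Wrap each externally-visible kernel entry point so calls line up
// single-file instead of re-entering a busy kernel.
function makeKernelGate() {
  let busy = Promise.resolve();
  return function single(fn) {
    return (...args) => {
      const turn = busy.then(() => fn(...args));
      busy = turn.catch(() => {}); // an error must not wedge the gate
      return turn;
    };
  };
}
// e.g. inboxDevice.deliverInbound = single(rawDeliverInbound);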

When this path is done, we can finally reap the benefits: an async HostDB. At this point we could move from LMDB to LevelDB (in Node), or move this to a browser and use IndexedDB.

XS

We write a C startup program that uses the XS API to create a new XS instance for the controller, and loads code into it. The startup program is also responsible for creating a (sync) HostDB object, probably using LevelDB (whose C/C++ API is synchronous, it's just the JS one that's async). (We can't use e.g. npm's leveldb module, because it's Node.js/gyp specific, but we can probably use other ecosystem modules that build on top of it). We might asyncify the kernel in a parallel development path, but the XS path won't take advantage of that: the XS HostDB will just return Promise.resolve(results) everywhere.

At first, the controller implements createVatWorker by just creating a new Compartment, hardening the global, and using import-bundle to load it with code. This "direct" so-called "worker" doesn't need much of a supervisor, just support for metering, single-file syscall execution, and a setImmediate-based idle detector.

Next, we change createVatWorker to make an actual Worker object and install a more sophisticated supervisor into it. This supervisor must install SES, create the same Compartment as before, and provide a syscall mechanism that uses the worker's start-compartment's postMessage to send syscall requests over to the kernel thread.

Now we have two XS improvements we can work on in parallel:

  • 1: Enhance the C startup program to expose serialization/snapshotting of the worker's XS instance to the VatWorker. Provide a load/restore feature too. Implement fancier save-vat-state mechanisms in the kernel. Go crazy with cool demand-paged vats.
  • 2: Implement Worker-suspending "blocking" calls in C, expose it to the VatWorker. Implement deviceRead() on the supervisor, have it call the kernel's (async) syscall object and get back a Promise for the results. Copy the results over into the worker's space, then unpause the worker, which thinks it just made a blocking read. Implement fancier hierarchical-identifier synthetic-object schemes. Go crazy with cool large-capacity DB-backed vats.

Browser

We can start making swingset work in a browser before fully async-ifying the kernel, but we'll be limited to a fake non-persistent synchronous HostDB that keeps everything in RAM (or stashing everything in LocalStorage, which is limited to something like 10MB).

We can achieve coarse metering by having the VatWorker create a real Worker for each vat, and timing out message deliveries. This might be enough for simple systems: they can guard against runaway vats but would still be killed by excess memory usage.

Once the kernel can use an async HostDB, we can switch to using IndexedDB for storage, which increases our available space to perhaps gigabytes. With some fancy batching, we could also accumulate deltas of state changes and ship them off-site, to some external server with as much storage as we want.

The awesome science-fiction future of this path is to visit a web page (or browser extension), type in a strong credential, retrieve an encrypted state bundle, decrypt just enough to figure out the contents of the run queue and the configuration of the IO channels, and then resume your previously-suspended swingset machine in your browser. The page can fetch specific vats as messages arrive for them, replaying their transcripts and delivering the message. At the end of each block, we re-encrypt the state deltas and commit them to off-site storage, then release any newly-generated messages to the IO channels.

XS+WASM=AWESOME

The VatWorker that runs under XS should also be compilable to WASM. The VatWorker we write for a browser should be able to instantiate that WASM code and install the vat bundle inside it. In that world, we can checkpoint the vat by saving the WASM instance's linear memory buffer. This improves our science-fiction "your vats, anywhere" world with efficient restores of those old vats. We can also achieve exact metering by instrumenting XS before compiling it to WASM.

And whatever C code we write to suspend the Worker can also be included here, so the WASM code makes a blocking invocation of an import, but then suspends the vat code until the kernel gets around to invoking an export with the results. Then we enable deviceRead and get large-capacity DB-backed vats in a browser too.

Target Systems

So, this is what I think things will look like when we're done:

Chain Node

The Agoric chain is our most stringent environment. The validators (and full-node followers) which run here must be fully deterministic, so they can always agree with the other validators. They must protect availability against arbitrarily malicious code, which requires complete (deterministic) metering. They will run for years, so they must be efficiently restartable (requiring snapshots/checkpoints). They require high-capacity vats to store all the Purse data tables in secondary storage.

On the plus side, validators can be asked to do more work to get the node running: we do not need to support arbitrary developers' workstations. So the program that implements a chain node can be more specialized.

This program will be a compiled binary, which links together the Golang-based Cosmos-SDK, the XS library (written in C), and a fair bit of agoric-specific glue code (also written in C). This has the pieces described above in the "XS" path, plus a lot of interaction with the Cosmos-SDK. The SDK gets to invoke devices to deliver inbound messages, and devices get to invoke the SDK to effect outbound changes.

Solo Nodes (desktop)

Initially, solo nodes will run under Node.js, not XS, for a better development cycle and integration with existing libraries. We can use an async kernel so it can use LevelDB. Coarse metering (Worker-per-Vat, timeout each dispatch, no memory-exhaustion protection) is probably sufficient at the start. We won't have checkpoints, so restarting a long-running vat will take a while.

Later we might switch solo nodes over to the XS-based system, for consistency, better metering and checkpointing.

Solo Nodes (browser)

Nodes in a browser will get more fully-featured as we finish the development steps described above. Initially they will be somewhat ephemeral, and have minimal metering or protection against runaway code. But eventually (XS-on-WASM) they should have the same features and defenses as the chain nodes.

@warner
Member Author

warner commented May 28, 2020

MarkM's "Slow Caps" Proposal

In today's meeting, @erights analyzed our likely use of "large-capacity vats" (aka "huge tables" aka "hierarchical object references") and proposed a scheme that would let us get away with purely async vats.

There are vat-side details that I didn't entirely grasp, so I'll let him fill in the blanks, but the overall scheme looks a lot like the hierarchical references we described earlier (#455), except the kernel figures out what data the vat will need ahead of time, and delivers it along with the dispatch message. The vat doesn't do any additional reads, but can synthesize the "virtual" objects with just the data that the kernel provided. The vat does do additional writes, to update the kernel with any changes to this extra data, but they go to the kernel, not some separate device. No extra device would be used (probably). In this scheme, the kernel is much more involved and aware of these special identifiers, whereas in our previous #455 thinking the vat makes device calls to get the data it needs (which go through the kernel, but the kernel is otherwise unaware of what's going on).

In more detail:

  • We define a new kind of kernel reference identifier, the slowcap. These will be strings like ksNN that live in the kernel tables, next to object references (koNN), promise references (kpNN), and device-node references (kdNN).
  • Slowcap references go into c-lists just like object references: the vat which "owns" the slowcap has a s+NN <-> ksNN mapping, and vats which are given a remote reference to it will have a s-NN <-> ksNN mapping
  • Each slowcap has some associated capdata (body and slots), just like promises that have been resolved to data. The slots could be objects, promises, devicenodes, or other slowcaps. Many use cases will only involve plain data in the body, and no additional slots.
  • When the kernel is preparing to dispatch a message into a vat (dispatch.deliver, notifyFulfillToData, or notifyReject), the message will include some capdata as arguments or the resolution value. If this capdata includes slowcaps in the .slots which are owned by the receiving vat (s+NN), the kernel must retrieve the associated data for those slowcaps (recursively: slowcaps which point to other slowcaps will cause more fetches). The capdata structure will be changed to include this slowcapData, somehow (see the sketch just after this list).
  • When sending a slowcap into some other vat (s-NN), it arrives as a plain remote reference, just like o-NN, and the user-level code gets a Presence as usual. The receiving vat gets no additional data. We might even consider giving the vat an o-NN reference instead of s-NN, and hiding the special nature completely.
  • The kernel can take as long as it wants to fetch this associated data: it can use an async HostDB to retrieve it. The vat is not invoked until all the data is ready.
  • On the vat side, when liveslots sees a s+NN slowcap in the .slots, the deserialization layer consults a pre-registered user-level function that knows how to synthesize a virtual object from the slowcapData. This function may involve some sort of schema for the associated data, a class definition (or moral equivalent thereof), a table of methods, upgrade policies, etc.
  • The NN in s+NN might be a composite identifier (hierarchical), with the earlier components used to tell liveslots which virtual table is being referenced. Or this type identifier might simply live in the slowdata.
  • If the virtual object is sent back out through the serialization layer, liveslots will emit the same s-NN/s+NN value that it received.
  • The virtual object must have some sort of "modify your contents" API (setters, or something more explicit.. think React or ORM objects and their .commit() methods). When invoked, the vat will invoke a new syscall (perhaps syscall.writeSlowData(slowcap, capdata), which returns nothing). The kernel will update the HostDB entries for the slowcap, so that the next time it is passed into the vat, the vat will be given the new associated data. The kernel can take as long as it wants to write these new HostDB entries, as they don't need to happen until after the vat has finished with the crank.
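To make the data flow concrete, here is a purely illustrative shape for a delivery that carries slowcap data; the real encoding is the "somehow" above, and the marshalling details are assumed:

```js
// Illustrative only: dispatch.deliver arguments into a vat that owns s+12
const args = {
  body: JSON.stringify([{ '@qclass': 'slot', index: 0 }, 10]),
  slots: ['s+12'], // a slowcap owned by the receiving vat
  // the new part: kernel-fetched data for every owned slowcap, recursively
  slowcapData: {
    's+12': { body: '{"balance":100}', slots: [] },
  },
};
```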

Within the user-level vat code, I think this looks mostly like #455. The main new data structure would behave like a virtual Map or maybe WeakMap, registered with the serialization layer during vat startup. This virtual map (maybe called a "HugeTable") pretends to hold a very large number of objects, but is very particular about the keys and values it holds. You can ask it to create a new row (key/value pair), and the key it creates for you (the "virtual object", or "ephemeral object", or maybe "slow object"??), e.g. a new Purse or Payment object, will be associated with a new slowcap. The value you add must conform to the pre-registered schema (e.g. the balance owned by this Purse).
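A hypothetical shape for that user-level API (every name below is a guess, not a settled design):

```js
// registered with the serialization layer at vat startup
const purses = makeHugeTable('purse', {
  schema: { balance: 'nat' }, // values must conform to this
  makeRow: state =>
    harden({
      getBalance: () => state.balance,
      deposit: amount => {
        state.balance += amount;
        state.commit(); // ends up as syscall.writeSlowData(slowcap, capdata)
      },
    }),
});

const purse = purses.create({ balance: 0 }); // backed by a fresh slowcap
```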

You can use the virtual object in the arguments of an outbound message send (or Promise resolution), which will be serialized as s+NN and cause the kernel to store the slowdata.

You can probably also use the virtual object in the value of a new row in some other virtual map. This will cause one chunk of slowdata to contain references to other slowcaps.

Normally we expect the virtual object to be dropped before the end of the crank (if it were kept around for a long time, we wouldn't save any RAM by keeping its data on disk). But you might keep the virtual object around in a non-virtual container for a while; perhaps all the active escrow purses of a contract that is expected to conclude in a reasonably short amount of time. This might cause identity problems in subsequent message sends, or maybe not. Without WeakRefs, liveslots may not be able to tell that it still has a virtual object for a given s+NN identifier (and we certainly don't want to keep them alive in a strong map). As a result, we plan to discourage/not-support identity comparison of these virtual objects. You can use them as keys in a virtual map, and they'll work as expected, but the javascript objects that contain them won't be guaranteed to compare as equal.

@zarutian
Contributor

zarutian commented Jun 4, 2020

https://github.com/NeilFraser/JS-Interpreter might be worth a look as yet another substrate to run a vat on top of.

One could make a zygote js-interp instance and load in the ses-shim and any other shims or packages wanted.
Then use their serialize.js to get a JSON string.
To decrease its size one might use https://pieroxy.net/blog/pages/lz-string/index.html .
When a new vat is needed, one simply deserializes the zygote JSON and evaluates the specific vat code into it.
When a vat snapshot is desired, use the same serialize as above.
One could use https://github.com/benjamine/jsondiffpatch to make deltas between the zygote and the vats, and between subsequent snapshots of the same vat, then compress them with lz-string as above.

@zarutian
Contributor

zarutian commented Jun 5, 2020

re "Slow caps" proposal: can this extra data be called cookies and the rehydration code inside the vat called cookie monsters?

Just as a reference to a short story by Vernor Vinge :3

@zarutian
Contributor

Having thought about it a bit, the slow caps idea could be simplified quite a bit.
This assumes that the caps in question are mainly used as keys into maps, that the NN part never changes, and that c-list storage is both persistent and cheap.
When the comms deserializer comes across an s+NN (it points into the receiving vat) in capdata, instead of mapping it directly to an object it stuffs it into a SlowCap record, which is mainly just a type signifier. This allows the vat to do (distributed) tries or other such radix sharding using the NN part as the key.
The kernel could provide a syscall so the vat can tell the kernel it may drop the slow cap.
If the NN might change, then just providing a (kernel-generated?) uuid as the body data could solve that.

@warner
Member Author

warner commented Jul 20, 2020

One concern I've heard (I think from @dtribble) about the slow-caps approach is that it's only easy to do for shallow/simple tables, where the vat is able to explain to the kernel, ahead of time, what data it will need for any given slowcap. If the vat had synchronous access to secondary storage, the vat could enact whatever complicated schema it wanted, like an auction handle referencing the purses of all the submitted bids, along with secondary tables referenced by those purses, etc. But if the vat needs to tell the kernel about the relationship between the handle and the tables rows ahead of time, it will be limited to a simpler and less-dynamic schema.

The relationship between the slowcap's ahead-of-time schema and giving the vat full synchronous access to secondary storage is a lot like the relationship between eventual-send and full mobile code. The first case is an optimized / easier-to-implement subset of the latter's more general / harder-to-implement case.

But that made me think about prepare-commit again, and how we might accomplish it without too much kernel involvement. So here's a proposal.

  • each vat gets a dedicated key-value store named "vat storage", whose data lives in secondary storage (on disk)
  • "dedicated" means no other vat can read or modify the contents, so there is no contention, and no locking needed
  • the vat/kernel API acquires three new methods:
    • syscall.queryStorage(handle, key)
    • syscall.writeStorage(key, value)
    • dispatch.storageResults(handle, key, value)
  • syscall.queryStorage queues the request, and will result in a subsequent dispatch.storageResults message (with a copy of the handle, for correlation) on some future crank
  • writeStorage and queryStorage have the normal reads-observe-writes coherency policy; no special writeLater (#55) semantics (which would require defining the block boundary where writes become observable)
  • all syscalls have an empty return value (enabling async syscalls, and vats in isolated workers, and easier atomic commit/abandonment of cranks)
  • Vat operations use a form of prepare/commit: the 'prepare' phase is where the necessary pieces of vat storage are transferred into the vat (into liveslots), and the 'commit' phase is where the vat code is given the message and all the associated stored data
  • liveSlots gets a "prepare agent" from the vat code, whose job is to look at the inbound message, look at the secondary data it has cached, and keep sending out queryStorage requests (and accumulating the storageResults) until it has enough to satisfy the vat code's requirements
  • once the requirements are met, liveslots releases the message to the vat code proper, with synthesized record objects as arguments. These objects can read from the liveslots vat-storage cache to satisfy their client's lookups.
  • the prepare phase will span multiple cranks, so liveslots must queue all other dispatch.deliver and dispatch.notify* messages for later.
  • Liveslots might need a way to know that the vat code has finished with C1 (the commit phase for message 1) before it starts gathering data for P2 (the prepare phase for the second message), so it can refrain from reading secondary storage until the previous commit's writes are complete. This is most significant when the final commit message into the vat code takes multiple turns to complete (do we want to effectively lock vat-storage until the C1 commit phase is complete?). The overall sequence would be P1/P1/P1/C1/P2/P2/C2/P3/P3/P3/C3.
  • The vat code can submit an arbitrarily complex prepare agent to liveslots, without the kernel being aware of the details.

From the vat code's point of view, it won't be notified until all the data it needs is available within liveslots for synchronous access. It is required to explain how to achieve this (to liveslots), but after that point, it no longer needs to be aware of the asynchronous nature of the secondary storage.

From liveslots' point of view, it has a powerless agent that came from the vat code (so it can be arbitrarily complex and tailored to the vat's needs), which is told the references of each inbound message, and returns additional keys to fetch from secondary storage. It keeps fetching more data and feeding the agent until the agent says the vat will be satisfied. Then the agent is purged and the message can be released to the vat code. Liveslots is responsible for preventing interleaving of multiple messages, making the prepare/commit phase look uninterrupted.
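To illustrate (this API was never pinned down; every name below is invented): the agent is a pure function over the inbound message and the data fetched so far.

```js
function makePursePrepareAgent() {
  return harden({
    // liveslots calls step() repeatedly, fetching whatever it asks for,
    // until it answers { ready: true }
    step(delivery, fetched) {
      const wanted = delivery.slots.filter(
        slot => slot.startsWith('s+') && !(slot in fetched),
      );
      // a deeper schema could chase references discovered in `fetched` here
      return wanted.length ? { ready: false, fetch: wanted } : { ready: true };
    },
  });
}
```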

From the kernel's point of view, it just sends dispatch.deliver into liveslots as usual. There might be a dozen deliveries queued up, and the first one might require multiple vat-storage roundtrips before it can be executed, but the kernel is not responsible for tracking that. The kernel sends all the deliveries into the vat, and liveslots queues them internally to keep everything in the right order.

Concerns:

  • atomicity of vat storage: I think it's sufficient to say that the crank is not complete until all queryStorage calls have been executed (with new storageResults messages pushed onto the run-queue), and writeStorage calls have been retired (modifying the secondary store). The secondary store must evolve/step/commit at the same time as the rest of the kernelDB, so restarting from saved state will get coherent/consistent data. All data coming into the vat is still recorded in the transcript, so we don't need old versions of the secondary storage to replay vat states. But we might need to track two versions, old+new, and store the generation number in the kernel state vector (the new vat-storage isn't "committed" until we write the new generation into the kernel state, at which point we can throw out the old generation and start accumulating changes for the next one, ping-ponging or stair-stepping or whatever your favorite metaphor is)
  • transcript size: this doesn't help us remove large data from the vat transcript (only JS engine checkpoints help us with transcript size). But it does mean the vat can forget the data once the operation is complete. Another reason for having the vat code inform liveslots when its commit phase completes is so that liveslots can empty its cache. Ideally liveslots knows exactly what data is needed at all times, so it can flush everything else.
  • insufficient generality: I know @dtribble was interested in vats having arbitrary device access, and described storage as merely a specific instance of what ought to be general-purpose devices. But I'm thinking that giving each vat a dedicated secondary-storage API, where it doesn't need to worry about any other vat changing things between cranks, would allow this prepare-commit scheme to work without making the data reads synchronous.

@erights
Member

erights commented Jul 21, 2020

Open question: Can @warner 's "prepare agent" proposal emulate the "slowcap" proposal by vat-side ("user") code?

(When I asked this question verbally, I got opposite answers. Hence "open".)

I do not have an opinion yet about whether emulating slowcaps would be useful, but it would help us understand the proposal.

@warner
Member Author

warner commented Jul 22, 2020

After today's discussion, I think we're back to synchronous syscalls to read secondary storage. The concerns with a "prepare agent" (coming mostly from @dtribble) are similar to those with kernel-based "slowcaps" and/or prepare-commit: the programming model changes too much. Needing to explain your schema (and most importantly the rules for what data needs to be fetched) in a separate object, distant from where the data is being accessed, feels like it will become a barrier for programmers to overcome. The "prepare agent" API would be a bundle of source (to construct a pure function, without access to mutable state), that must know enough about the vat code to correctly analyze incoming messages and figure out what state it needs. The "slowcaps" (kernel-managed) approach would need to express this same logic declaratively, for implementation in the kernel. Both sounded like more trouble than the hacks we have in mind to make blocking syscalls work.

So we're going to plan to have a synchronous/blocking "read data now" syscall, which operates on a chunk of secondary storage that is dedicated to the specific vat. There will be a synchronous "write data soon" syscall too: reads observe previous writes, but writes are not fully committed until/unless the crank finishes without error. The API will be in the form of a "hugetable" Store-like object, which needs a schema but is in cahoots with liveslots and is backed to secondary storage.
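In sketch form (the syscall names here are placeholders):

```js
// Blocking read / buffered write over the vat's dedicated secondary storage.
const vatstore = harden({
  get(key) {
    // blocks this vat until the kernel produces the value, even if the
    // kernel satisfies it with async DB work underneath
    return syscall.vatstoreGet(key);
  },
  set(key, value) {
    // "write data soon": later get()s observe it, but it only becomes
    // durable if the crank finishes without error
    syscall.vatstoreSet(key, value);
  },
});
```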

  • XS in a shared process will use low-level C hacks to block the worker while the kernel does DB work
  • XS in a separate process (build prototype XS-based VatWorker process, define kernel-vatworker protocol #1299) will use a blocking pipe read to stall the worker
  • Node.js in a shared process will use a plain blocking syscall
  • Node.js in a separate Worker will use Atomics to block the worker, as sketched below (this is the lowest-priority approach; I don't think Workers buy us much here)
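A minimal worker-side sketch of the Atomics variant, assuming a SharedArrayBuffer `sab` shared with the kernel thread and a made-up length-prefixed response layout (`decode` stands in for real unmarshalling):

```js
import { parentPort } from 'worker_threads';

const flag = new Int32Array(sab, 0, 1); // 0 = pending, 1 = response ready
const len = new Int32Array(sab, 4, 1);
const body = new Uint8Array(sab, 8);

function blockingSyscall(vso) {
  Atomics.store(flag, 0, 0);
  parentPort.postMessage(['syscall', vso]);
  // kernel side: write body and len[0], Atomics.store(flag, 0, 1),
  // then Atomics.notify(flag, 0)
  Atomics.wait(flag, 0, 0); // this worker sleeps until the kernel notifies
  return decode(body.subarray(0, len[0]));
}
```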

Web browsers will have the same huge-table API, but the data will stay in RAM (or in LocalStorage), and just won't be able to get so large (unless we explore option 41 and run the JS in an XS-based WASM box, allowing us to block the vat while we wait for an async kernel read). That's probably ok.

I'll write up more in #455 where the API work will happen, leaving this ticket to be about platform requirements.

@warner
Member Author

warner commented Jul 29, 2020

We identified a couple of properties that result from running a vat in a separate worker process:

  • If syscalls are implemented with a blocking read to the kernel process, then the vat can block even though the kernel does arbitrary async work to satisfy the syscall. This gives us more flexibility in our choice of kernel DB tools (we can use LevelDB, whose JS bindings are only async), while retaining the use of synchronous syscalls.
  • It's easier to implement the Vat Worker in a different language than the kernel. Our first example will be a Vat Worker written as an XS program, while the kernel uses Node.js. If this weren't in a separate process, we'd need FFI calls from Node.js out to a library containing the worker code, which is more complexity and work.
  • We can use coarse (ulimit/timer)-based metering on the vat: the ulimit kills the worker if it uses too much memory, and the Vat Manager gives up waiting on the worker if a Delivery takes more than like 5 seconds. This isn't deterministic (and of course is not something we could write up into a deterministic specification), but it's practically free, and could still be useful for solo machines that don't have to agree with anybody else about metering decisions.
  • It also enables some forms of parallelism. There's a lot of work to be done to take advantage of that, but:
    • two messages that go to separate vats could be run in parallel. Each vat will still be deterministic (our usual arrival-order nondeterminism), but kernel identifiers allocated by syscalls might interleave in arbitrary orders, so the kernel as a whole might not be deterministic
    • there might be a way to allocate these things hierarchically, to avoid that, but maybe not
  • A process boundary might make snapshotting the state easier to work with. In a pinch we could just take a coredump and call that the saved state (large, not portable in the slightest, would need something like emacs' undump to load back in, but it might work)

warner added a commit that referenced this issue Aug 7, 2020
Refactor the creation of VatManagers to use a single
function (vatManagerFactory), which takes "managerOptions", which include
everything needed to configure the new manager.

The handling of options was tightened up, with precondition checks on the
options-bag contents.

One new managerOption is `managerType`, which specifies where we want the new
vat to run. Only `local` is supported right now (our usual Compartment
sharing the main thread approach), but #1299 / #1127 will introduce new types
that use separate threads, or subprocesses.

The controller/kernel acquired a new `shutdown()` method, to use in unit
tests. This is unused now, but once we add workers that spawn threads or
subprocesses, we'll need to call+await this at the end of any unit test that
creates thread/subprocess-based workers, otherwise the worker thread would
keep the process from ever exiting.
warner added a commit that referenced this issue Aug 7, 2020
This adds a per-vat option to run the vat code in a separate thread, sharing
the process with the main (kernel) thread, sending VatDelivery and VatSyscall
objects over the postMessage channel. This isn't particularly useful by
itself, but it establishes the protocol for running vats in a
separate *process*, possibly written in a different language or using a
different JS engine (like XS, in #1299).

This 'nodeWorker' managertype has several limitations. The shallow ones are:

* vatPowers is missing transformTildot, which shouldn't be hard to add
* vatPowers.testLog is missing, only used for unit tests so we can probably
live without it
* vatPowers is missing makeGetMeter/transformMetering (and will probably
never get them, since they're only used for within-vat metering and we're
trying to get rid of that)
* metering is not implemented at all
* delivery transcripts (and replay) are not yet implemented

Metering shouldn't be too hard to add, although we'll probably make it an
option, to avoid paying the instrumented-globals penalty when we aren't using
it. We also need to add proper control over vat termination (via meter
exhaustion or manually).

The deeper limitation is that nodeWorkers cannot block to wait for a
syscall (like `callNow`), so they cannot invoke devices.

refs #1127
closes #1384
warner added a commit that referenced this issue Oct 1, 2020
This fixes the two ends of the netstring-based "kernel-worker" protocol: the
previous version failed to parse large inbound messages, such as non-trivial
vat bundles.

The replacement netstring parser is based on Node.js "Streams", in their
"object mode". We intend to replace this with one based on async iterators,
once I can figure out some other problems with that branch.

We re-enable test-worker.js for all worker types, now that the decoding
problem is fixed.

refs #1299
refs #1127
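For reference (this is not the code in that commit), netstring framing is just `<length>:<payload>,`; the hard part is exactly what the fix addresses, payloads split across stream chunks:

```js
// Node.js Buffer-based encoder; a real decoder must buffer partial input
// until a complete `<length>:<payload>,` frame has arrived.
function encodeNetstring(payload) {
  return Buffer.concat([
    Buffer.from(`${payload.length}:`),
    payload,
    Buffer.from(','),
  ]);
}
```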
@dckc dckc added the xsnap the XS execution tool label Apr 28, 2021
@dckc dckc added the metering charging for execution (was: package: tame-metering and transform-metering) label Jul 1, 2021
@dckc
Member

dckc commented Jul 1, 2021

I read this over and I think it's done; I don't see anything outstanding that's not covered by other open issues.

@dckc dckc closed this as completed Jul 1, 2021