how to parallelize new vat deliveries to improve throughput performance #5747
Labels: enhancement, needs-design, performance, SwingSet
What is the Problem Being Solved?
We could speed up chain throughput by a factor of maybe 4 or 8 by parallelizing multiple new vat deliveries, ideally making one delivery per CPU core.
(note: this is different from parallelizing replay: that's much easier, because all syscalls are simulated)
Our chain-side execution loop currently performs one delivery at a time: `controller.run()` pulls the next item off the run-queue, delivers it to the target vat's worker (an xsnap process), and waits for that delivery to finish before starting the next one.

With some extra work, we should be able to perform multiple deliveries in parallel. It's made a lot easier by the fact that each vat runs in its own worker process, so we don't even need threads: the host operating system already knows how to schedule multiple processes onto multiple CPU cores at the same time.
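A rough sketch of what a parallel step might look like, assuming a hypothetical kernel interface (`pullNextDeliveriesForDistinctVats`, `workerFor`, and `retireDelivery` are illustrative names, not the real SwingSet API):

```js
const PF = 4; // parallelism factor, roughly one delivery per CPU core

async function runParallelStep(kernel) {
  // pick up to PF pending deliveries, each targeting a different vat
  const deliveries = kernel.pullNextDeliveriesForDistinctVats(PF);
  // each vat has its own xsnap worker process, so the host OS can schedule
  // these deliveries onto separate CPU cores
  const results = await Promise.all(
    deliveries.map(d => kernel.workerFor(d.vatID).deliver(d)),
  );
  // retire the deliveries in queue order (not completion order), so state
  // changes are applied in a consensus-consistent order
  deliveries.forEach((d, i) => kernel.retireDelivery(d, results[i]));
}
```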
The main tricky part is isolating any state that's referenced by multiple vats. Our vats are pretty isolated, but they can all make syscalls, and those syscalls reference (or mutate) state, and some of that state is shared between multiple vats. Our vat syscalls break down into three categories:

- `vatstoreGet`/`vatstoreGetAfter`/`vatstoreSet`/`vatstoreDelete`
- `send`, `resolve`, the GC actions (`dropImports`/`retireImports`/`retireExports`/`abandonExports`), things involved with vat exit/upgrade (`exit`), and (to some lesser extent) `subscribe`
- `callNow`
Our plans for backpressure, prioritization, and pausing individual vats call for breaking the single kernel-wide run-queue into a separate input and output queue for each vat, plus probably an extra kernel-wide queue for actions like "create new vat" and "terminate vat". Many of a vat's syscalls can thus mutate only their own output queue, giving us a lot more flexibility around parallelism.
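A minimal sketch of that queue layout; the field and function names are illustrative, not the actual kernel schema:

```js
const kernelWideQueue = []; // e.g. create-vat / terminate-vat actions

const vatQueues = new Map(); // vatID -> { inputQueue, outputQueue }
function provideQueues(vatID) {
  if (!vatQueues.has(vatID)) {
    vatQueues.set(vatID, { inputQueue: [], outputQueue: [] });
  }
  return vatQueues.get(vatID);
}

// a syscall.send made by `fromVatID` only touches that vat's own output
// queue, so deliveries to distinct vats never contend for shared state
function enqueueSend(fromVatID, kernelMessage) {
  provideQueues(fromVatID).outputQueue.push(kernelMessage);
}
```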
The tricky part is access to devices, because `callNow` is synchronous. Two vats which each have access to a device node might both invoke `callNow`, and the results they return might depend upon the order in which they were invoked. In a single-threaded, one-delivery-at-a-time system, that order is deterministic. But if we allow two deliveries to run in parallel, the invocation order is no longer deterministic.

Most devices don't do this, but unless we can rule out the possibility, we must either exclude vats with device nodes in their c-lists from parallelism, or have some way to abort speculative deliveries that wind up making a device call. We might also want to mark certain device nodes as being mutating or sensitive to invocation order, so that we could continue to allow parallelism for vats which hold only insensitive device nodes, and deny it only to vats which hold the more sensitive ones.
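A sketch of how such a mark could be consulted; both the `orderSensitiveDevices` set and the `deviceNodesInClist` helper are assumptions, not existing kernel structures:

```js
const orderSensitiveDevices = new Set(['device-bridge']); // illustrative

function eligibleForParallelism(vatID, deviceNodesInClist) {
  const nodes = deviceNodesInClist(vatID);
  // a vat holding any order-sensitive device node must run one-at-a-time
  return nodes.every(node => !orderSensitiveDevices.has(node));
}
```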
We don't use a lot of device nodes in our system: we obviously cannot avoid them entirely (if we want vat code to influence the outside world at all), but most user-provided vats don't interact with them. Off the top of my head, the ones I can think of are:
- Bundlecaps, which need `D(bundlecap).getBundle() -> bundle` to run synchronously. Each contract vat holds at least one bundlecap (delivered in `vatParameters`), so the ZCF layer knows what contract code to execute. We don't really need the bundlecap past this point, but 1: vatParameters are not GCed very well right now, and 2: device nodes are not GCed very well right now. So all contract vats probably inadvertently hold on to a bundlecap device node forever, even if they never invoke it again after startup.
- A number of built-in or utility vats hold onto device nodes, so that userspace/contract vats don't have to. The contract vat references an object within the utility vat, and that object wraps the device node: when contracts want to `delay()` or set up a repeater, they send a message to `vat-timer`, and `vat-timer` then needs to interact with `device-timer` (see the sketch below).
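A minimal sketch of that wrapping pattern, assuming the usual SwingSet vat shape (`buildRootObject`, `vatPowers.D`, `Far`); the method names, including the device call, are illustrative rather than the real vat-timer/device-timer API:

```js
import { Far } from '@endo/far';

export function buildRootObject(vatPowers) {
  const { D } = vatPowers; // D() performs the synchronous device call (callNow)
  let timerDevice;

  return Far('root', {
    // bootstrap hands this vat the device node once, at startup
    setTimerDevice(node) {
      timerDevice = node;
    },
    // contract vats reach this via ordinary async messages (send/resolve);
    // only this utility vat ever performs the synchronous callNow
    readClock() {
      return D(timerDevice).getLastPolled();
    },
  });
}
```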
So a lot of contract operations that want to interact with timers, or send messages through the bridge device, will cause vats to be scheduled like `[contractVat, vat-timer (uses device-timer), contractVat]`, or `[contractVat, vat-bridge (uses device-bridge), contractVat]`. We can parallelize multiple contract vat deliveries together, but we'd need to serialize the resulting calls to vat-timer or vat-bridge. The utility vats are never doing very much work. That might suggest we want a scheduler that does some large batch of userspace/contract vat deliveries first (parallelizing heavily), then performs a large number of short-duration serialized deliveries to utility vats. Or, the scheduler could group the potential work by the shared resources it wants to access: put all vats that have timer/bridge device-node access in a single group, and serialize all deliveries within that group (while allowing parallelization between that group and the non-device-using contract vats).
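One way the grouping idea could be realized; `deviceAccessOf` is an assumed helper reporting which devices a vat can reach, and the grouping key is illustrative:

```js
function planNextBatch(runnableVatIDs, deviceAccessOf, PF = 4) {
  const batch = [];
  const usedDeviceGroups = new Set();
  for (const vatID of runnableVatIDs) {
    if (batch.length >= PF) break;
    const devices = [...deviceAccessOf(vatID)].sort();
    const key = devices.join('+'); // '' for device-free vats
    if (key === '') {
      batch.push(vatID); // device-free vats can always run in parallel
    } else if (!usedDeviceGroups.has(key)) {
      usedDeviceGroups.add(key); // at most one vat per device group per batch
      batch.push(vatID);
    }
    // other vats in an already-used group wait for a later batch
  }
  return batch; // these deliveries can safely run in parallel
}
```

Vats left out of a batch simply remain runnable and get picked up by a later batch.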
Description of the Design

The basic approach would be:

- define a parallelism factor `PF`, maybe 4 or so
  - making `PF` significantly more than the number of CPU cores in the least-powerful validator is probably a waste
- have the kernel pull up to `PF` deliveries that are going to distinct vats
- submit all `PF` deliveries to their vats at the same time, allowing them to run in parallel
- buffer each vat's syscall effects so they only touch that vat's own output queue and `vatStore`, maybe adding things to the kernel-wide extra queue for `createVat`/etc

The number of deliveries made in any given block will thus be dictated by both the runPolicy's computron limit and the `PF` parallelism factor: we'll do up to `PF` deliveries after the limit is reached. We already act this way (the computrons consumed by a block will always be greater than the runPolicy threshold: we always learn about exceeding its limit too late), but currently we act as if `PF = 1`.
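A sketch of how that accounting might look, reusing the same hypothetical kernel interface as the earlier sketch; computron counts are assumed to be BigInts, and we only learn them after each delivery completes:

```js
async function runBlock(kernel, computronLimit, PF = 4) {
  let used = 0n;
  while (used < computronLimit && kernel.hasWork()) {
    const deliveries = kernel.pullNextDeliveriesForDistinctVats(PF);
    if (deliveries.length === 0) break;
    const results = await Promise.all(
      deliveries.map(d => kernel.workerFor(d.vatID).deliver(d)),
    );
    for (const r of results) {
      // we learn about exceeding the limit too late, so the block may
      // overshoot by up to PF deliveries instead of one
      used += r.computrons;
    }
  }
  return used;
}
```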
We'll need to modify the syscall implementations to store their pending state changes in RAM until the delivery is retired, so that we aren't making DB changes in a non-consensus order. We currently use `vatTranslator.js` to convert vat-format syscalls into kernel-format syscalls, and this translation can change shared kernel state (allocation of newly-exported kernel objects, refcount increments). Then a shared `kernelSyscall.js` executes the kernel-format syscalls, which is where e.g. `syscall.send` appends new items to the shared run-queue. We'd need to rewrite this to enqueue vat-format syscall objects (`VatSyscallObject`) until the delivery can be retired, and perform both translation and execution only at that later point in time.
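A rough sketch of the deferred-translation idea, with hypothetical `translateSyscall` and `executeKernelSyscall` functions standing in for the real vatTranslator.js / kernelSyscall.js code. Read-type syscalls (like `vatstoreGet`) can't be deferred this way; they are handled separately, as the next paragraph describes.

```js
function makeVatSyscallBuffer(vatID, translateSyscall, executeKernelSyscall) {
  const pendingVSOs = []; // vat-format syscall objects, in issue order

  return {
    // called while the delivery is running, possibly in parallel with others
    record(vso) {
      pendingVSOs.push(vso);
      return ['ok', null]; // placeholder result for write-only syscalls
    },
    // called when the kernel retires this delivery, in deterministic order
    retire() {
      for (const vso of pendingVSOs) {
        const kso = translateSyscall(vatID, vso); // may allocate krefs, bump refcounts
        executeKernelSyscall(kso); // e.g. append to the vat's output queue
      }
      pendingVSOs.length = 0;
    },
  };
}
```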
The `vatstore` syscalls are all string-string KV store operations, so translation should not modify refcounts or introduce new objects. So our enqueue-VSO code could execute `vatstoreGet`/`vatstoreGetAfter` calls immediately (reading from the vat's portion of the kernel DB). The mutating `vatstoreSet`/`vatstoreDelete` calls need to have their changes enqueued.

One approach for this might be to create a separate `crankBuffer` for each vat. The `crankBuffer` only knows about kvStore writes, so a different approach would be to introduce a different kind of buffer (`vatBuffer`?) that knows more about syscalls than about kvStore writes.
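A sketch of what such a per-vat buffer might look like for the vatstore syscalls; `kvStore` is an assumed string-to-string store scoped to this vat's keyspace, and having reads consult pending writes first (so a get after a set in the same crank sees the new value) is an assumption beyond the text above:

```js
function makeVatstoreBuffer(kvStore) {
  const sets = new Map(); // key -> value, pending writes
  const deletes = new Set(); // keys pending deletion

  return {
    get(key) {
      if (sets.has(key)) return sets.get(key);
      if (deletes.has(key)) return undefined;
      return kvStore.get(key); // read-only: safe to do during the delivery
    },
    set(key, value) {
      deletes.delete(key);
      sets.set(key, value); // deferred: nothing touches the DB yet
    },
    delete(key) {
      sets.delete(key);
      deletes.add(key);
    },
    // applied when the kernel retires the delivery, in consensus order
    commit() {
      for (const [key, value] of sets) kvStore.set(key, value);
      for (const key of deletes) kvStore.delete(key);
      sets.clear();
      deletes.clear();
    },
  };
}
```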
Security Considerations

One of the biggest benefits of CSP (and the Actor model in general) is the complete elimination of shared-state concurrency hazards, so we must be careful to not re-introduce that hazard. Our scheduler needs to be careful to not allow device-node access or kernel DB changes to become dependent upon execution order.
The parallelism factor we choose will put validators with fewer cores at a disadvantage. Number of cores will become part of our minimum validator requirements.
Test Plan
Not sure; obviously some unit tests on the scheduler and on the code that merges parallel state changes back into a consensus order for application to the DB, but we also need some sort of stress test to make sure we get a consistent order even though some deliveries take longer wallclock time than others. Probably a randomized test harness that is given a set of parallel deliveries, executes each to completion, then reports a randomized finishing order. This test should assert that the applied state changes (`activityHash`) remain consistent among multiple runs (giving the randomizer a chance to explore a significant portion of the ordering space).
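A rough sketch of that randomized harness using AVA; `executeWithCompletionOrder` is a hypothetical fixture standing in for the kernel-under-test, and the delivery ids are placeholders:

```js
import test from 'ava';
import { executeWithCompletionOrder } from './parallel-fixture.js'; // hypothetical fixture

function shuffle(array, random = Math.random) {
  const a = [...array];
  for (let i = a.length - 1; i > 0; i -= 1) {
    const j = Math.floor(random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

test('parallel deliveries commit in a consistent order', async t => {
  const deliveries = ['d1', 'd2', 'd3', 'd4']; // placeholder delivery ids
  const hashes = new Set();
  for (let run = 0; run < 20; run += 1) {
    // randomize which delivery "finishes" first on each run
    const completionOrder = shuffle(deliveries);
    const { activityHash } = await executeWithCompletionOrder(deliveries, completionOrder);
    hashes.add(activityHash);
  }
  // the applied state changes must not depend on wallclock completion order
  t.is(hashes.size, 1);
});
```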