parallelize vat replay, e.g. at kernel restart #5754
Labels: enhancement, needs-design, performance, SwingSet
## What is the Problem Being Solved?
We could maybe speed up kernel startup by performing the transcript replay for multiple vats in parallel.
When a kernel is restarted, it has no workers. We have some flexibility around when exactly we should start workers, but each delivery needs a worker to receive it, so the laziest option is to load a worker on demand when we notice that a run-queue event is going to a vat that doesn't yet have one. The most aggressive option is to load all workers (for both static and dynamic vats) at kernel startup. We currently do the latter, although a size-limited LRU "worker replacement policy" evicts unused workers to cap their number at `maxVatsOnline` (currently 50), and we'll thrash unnecessarily if we have more than 50 vats at restart time.

We aggressively start all workers at kernel startup to reduce the "jank" latency that would occur if we lazy-loaded those workers later. It's a tradeoff between kernel startup time and being able to respond quickly later. For a consensus system, it's better to be unavailable a bit longer at startup than to have surprising pauses during block processing. If the jank is bad enough, a validator might fall behind and fail to meet its participation requirements (e.g. get jailed for not voting/proposing fast enough). The downside is that we use more host memory (and start more real OS processes) than we necessarily need, especially for vats which never receive traffic.
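A minimal sketch of the eviction side of that policy, assuming hypothetical `ensureWorker`/`launchWorker`/`worker.shutdown` shapes rather than the real vat-warehouse internals:

```js
// Hypothetical sketch of the LRU "worker replacement policy" described above.
const maxVatsOnline = 50;
const online = new Map(); // vatID -> worker, in least-recently-used order

async function ensureWorker(vatID, launchWorker) {
  if (online.has(vatID)) {
    const worker = online.get(vatID);
    online.delete(vatID); // refresh this vat's LRU position
    online.set(vatID, worker);
    return worker;
  }
  while (online.size >= maxVatsOnline) {
    // evict the least-recently-used worker to stay under the cap
    const [oldestID, oldest] = online.entries().next().value;
    online.delete(oldestID);
    await oldest.shutdown();
  }
  const worker = await launchWorker(vatID); // snapshot load + transcript replay
  online.set(vatID, worker);
  return worker;
}
```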
When launching a worker (i.e. bringing it online), we start by loading the most recent XS heap snapshot, then replaying transcript entries until we've brought the worker up to date.

If we find ourselves launching multiple workers at the same time, we could speed things up by parallelizing that transcript replay. This form of parallelization is even easier than #5747, because transcript replay does not really execute syscalls. The transcript itself supplies both the list of expected syscalls and their results, so the replay harness doesn't even have access to the kernelDB. It just needs a list of transcript entries (which come from the DB, but could be supplied by some read-only wrapper, or the entries could be fetched ahead of time and just held in RAM).
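To make that concrete, here is a hedged sketch of a per-vat replay loop and its parallel form; `startXSWorker`, `worker.loadSnapshot`, `worker.deliver`, and `assertSameSyscalls` are illustrative names, not the real SwingSet API:

```js
// Replay one vat from its snapshot plus transcript entries held in RAM.
async function replayVat(vatID, snapshot, transcriptEntries) {
  const worker = await startXSWorker(vatID);
  await worker.loadSnapshot(snapshot);
  for (const entry of transcriptEntries) {
    // the transcript records both the delivery and the expected syscalls
    // with their results, so no kernelDB access is needed during replay
    const syscalls = await worker.deliver(entry.delivery, entry.syscalls);
    assertSameSyscalls(syscalls, entry.syscalls); // detect replay divergence
  }
  return worker;
}

// Because each replay touches only its own transcript, multiple vats
// can be replayed concurrently:
async function replayAll(vatIDs, snapshots, transcripts) {
  return Promise.all(
    vatIDs.map(vatID =>
      replayVat(vatID, snapshots.get(vatID), transcripts.get(vatID)),
    ),
  );
}
```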
The available speedup depends upon how often we're launching multiple workers at once. If we were doing a maximally-lazy approach, we'd only ever start a single worker at a time, which means there'd be no opportunity for parallelism. A more sophisticated scheduler might look ahead in the run-queue to figure out which vats have messages coming up, subtract out the workers that are already running, and then bring up the rest in parallel (before performing any deliveries). Something even more clever might bring up a worker while executing other deliveries within the same block (similar to the previous idea, but without waiting for all new workers to come up before starting a delivery).
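A rough sketch of that lookahead, assuming hypothetical `runQueue`, `isOnline`, and `startWorker` shapes rather than existing kernel APIs:

```js
// Scan the run-queue for vats that will soon receive deliveries,
// subtract those already online, and prewarm the rest in parallel.
function vatsNeedingWorkers(runQueue, isOnline) {
  const needed = new Set();
  for (const event of runQueue) {
    if (event.vatID && !isOnline(event.vatID)) {
      needed.add(event.vatID);
    }
  }
  return needed;
}

async function prewarm(runQueue, isOnline, startWorker) {
  const needed = vatsNeedingWorkers(runQueue, isOnline);
  await Promise.all([...needed].map(startWorker));
}
```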
## Description of the Design
One option would be a `vatWarehouse.startWorkers(vatIDs)`. This might be the simplest: only use parallelism during the kernel startup process, before we start doing any other work, as we're aggressively pre-loading workers.

Another would be a `vatWarehouse.startWorker(vatID)` that returns a Promise which fires when the worker is ready to go. Then vat-warehouse could use this promise internally to stall deliveries until the worker was ready.

This needs caution, because for the most part the kernel is effectively single-threaded and does not indulge in parallelism. There are plenty of promises in the kernel, but we almost always do an immediate `await` on each one. The kernel docs instruct the host application to call `controller.run()` and then wait, not making any other kernel API calls until it fires, and we have a handful of reentrancy-preventing guards on those APIs.
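For flavor, a hedged sketch of the two API shapes; `launch` and the internal `pending` map are assumed vat-warehouse internals, not existing code:

```js
// Option 1: batch startup, used only during kernel restart.
async function startWorkers(vatIDs) {
  await Promise.all(vatIDs.map(vatID => launch(vatID)));
}

// Option 2: per-vat promise, cached so later deliveries can stall on it.
const pending = new Map(); // vatID -> Promise<worker>
function startWorker(vatID) {
  if (!pending.has(vatID)) {
    pending.set(vatID, launch(vatID));
  }
  return pending.get(vatID);
}

async function deliverToVat(vatID, delivery) {
  const worker = await startWorker(vatID); // stalls until replay finishes
  return worker.deliver(delivery);
}
```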
## Security Considerations
Large numbers of simultaneous workers might threaten the host computer's memory budget (think OOM killer), or might kick it into thrashing, but the relatively small size of each worker, and the "one worker at a time" execution model, makes me not worry about memory pressure very much.
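If memory pressure did become a concern, bounding the parallelism would be an easy mitigation. A minimal limiter sketch, with `replayVat` and the cap of 4 as assumptions:

```js
// Cap how many replays run at once with a small promise-based semaphore.
function makeLimiter(max) {
  let active = 0;
  const waiting = [];
  const release = () => {
    active -= 1;
    if (waiting.length > 0) {
      active += 1;
      waiting.shift()(); // wake the next waiter, keeping its slot reserved
    }
  };
  return async function limit(thunk) {
    if (active >= max) {
      await new Promise(resolve => waiting.push(resolve));
    } else {
      active += 1;
    }
    try {
      return await thunk();
    } finally {
      release();
    }
  };
}

async function replayInParallel(vatIDs, replayVat, maxParallelReplays = 4) {
  const limit = makeLimiter(maxParallelReplays);
  await Promise.all(vatIDs.map(vatID => limit(() => replayVat(vatID))));
}
```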
We've designed this so that `WarehousePolicyOptions` can differ between validators without causing consensus problems, but we haven't tested that part very thoroughly.

## Test Plan
Not sure; ideally some unit tests, but parallelism is traditionally hard to exercise deterministically in unit tests.
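One way to keep such a test deterministic is to replace real replays with manually-resolved promises, so the test controls the interleaving. A rough sketch (the `startWorkers` helper here is the hypothetical Option-1 shape from above, and the test uses AVA):

```js
import test from 'ava';

// batch-start helper matching the Option-1 shape sketched above
const startWorkers = (vatIDs, launch) =>
  Promise.all(vatIDs.map(launch)).then(() => undefined);

test('startWorkers waits for every replay', async t => {
  const resolvers = new Map();
  const fakeLaunch = vatID =>
    new Promise(resolve => resolvers.set(vatID, resolve));

  const done = startWorkers(['v1', 'v2'], fakeLaunch);
  let settled = false;
  done.then(() => {
    settled = true;
  });

  resolvers.get('v1')(); // finish only one fake replay
  await Promise.resolve(); // drain a microtask turn
  t.false(settled, 'should still be waiting for v2');

  resolvers.get('v2')();
  await done;
  t.true(settled);
});
```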