parallelize vat replay, e.g. at kernel restart #5754
Labels: enhancement, needs-design, performance, SwingSet
## What is the Problem Being Solved?
We could maybe speed up kernel startup by performing the transcript replay for multiple vats in parallel.
When a kernel is restarted, it has no workers. We have some flexibility around when exactly we should start workers, but each delivery needs a worker to receive it, so the laziest option is to load a worker on demand when we notice that a run-queue event is going to a vat that doesn't yet have one. The most aggressive option is to load all workers (for both static and dynamic vats) at kernel startup. We currently do the latter, although a size-limited LRU "worker replacement policy" evicts unused workers to cap their number at `maxVatsOnline` (currently 50), and we'll thrash unnecessarily if we have more than 50 vats at restart time.

We aggressively start all workers at kernel startup to reduce the "jank" latency that would occur if we lazy-loaded those workers later. It's a tradeoff between kernel startup time and being able to respond quickly later. For a consensus system, it's better to be unavailable a bit longer at startup than to have surprising pauses during block processing. If the jank is bad enough, a validator might fall behind and fail to meet its participation requirements (e.g. get jailed for not voting/proposing fast enough). The downside is that we use more host memory (and start more real OS processes) than we necessarily need, especially for vats which never receive traffic.
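A minimal sketch of the eviction side of that policy, assuming hypothetical `ensureWorker`/`launchWorker`/`worker.shutdown` shapes rather than the real vat-warehouse internals:

```js
// Hypothetical sketch of the LRU "worker replacement policy" described above.
const maxVatsOnline = 50;
const online = new Map(); // vatID -> worker, in least-recently-used order

async function ensureWorker(vatID, launchWorker) {
  if (online.has(vatID)) {
    const worker = online.get(vatID);
    online.delete(vatID); // refresh this vat's LRU position
    online.set(vatID, worker);
    return worker;
  }
  while (online.size >= maxVatsOnline) {
    // evict the least-recently-used worker to stay under the cap
    const [oldestID, oldest] = online.entries().next().value;
    online.delete(oldestID);
    await oldest.shutdown();
  }
  const worker = await launchWorker(vatID); // snapshot load + transcript replay
  online.set(vatID, worker);
  return worker;
}
```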
When launching a worker (i.e. bringing it online), we start by loading the most recent XS heap snapshot, then replaying transcript entries until we've brought the worker up to date.

If we find ourselves launching multiple workers at the same time, we could speed things up by parallelizing that transcript replay. This form of parallelization is even easier than #5747, because transcript replay does not really execute syscalls. The transcript itself supplies both the list of expected syscalls and their results, so the replay harness doesn't even have access to the kernelDB. It just needs a list of transcript entries (which come from the DB, but could be supplied by some read-only wrapper, or the entries could be fetched ahead of time and just held in RAM).
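To make that concrete, here is a hedged sketch of a per-vat replay loop and its parallel form; `startXSWorker`, `worker.loadSnapshot`, `worker.deliver`, and `assertSameSyscalls` are illustrative names, not the real SwingSet API:

```js
// Replay one vat from its snapshot plus transcript entries held in RAM.
async function replayVat(vatID, snapshot, transcriptEntries) {
  const worker = await startXSWorker(vatID);
  await worker.loadSnapshot(snapshot);
  for (const entry of transcriptEntries) {
    // the transcript records both the delivery and the expected syscalls
    // with their results, so no kernelDB access is needed during replay
    const syscalls = await worker.deliver(entry.delivery, entry.syscalls);
    assertSameSyscalls(syscalls, entry.syscalls); // detect replay divergence
  }
  return worker;
}

// Because each replay touches only its own transcript, multiple vats
// can be replayed concurrently:
async function replayAll(vatIDs, snapshots, transcripts) {
  return Promise.all(
    vatIDs.map(vatID =>
      replayVat(vatID, snapshots.get(vatID), transcripts.get(vatID)),
    ),
  );
}
```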
The available speedup depends upon how often we're launching multiple workers at once. If we were doing a maximally-lazy approach, we'd only ever start a single worker at a time, which means there'd be no opportunity for parallelism. A more sophisticated scheduler might look ahead in the run-queue to figure out which vats have messages coming up, subtract out the workers that are already running, and then bring up the rest in parallel (before performing any deliveries). Something even more clever might bring up a worker while executing other deliveries within the same block (similar to the previous idea, but without waiting for all new workers to come up before starting a delivery).
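A rough sketch of that lookahead, assuming hypothetical `runQueue`, `isOnline`, and `startWorker` shapes rather than existing kernel APIs:

```js
// Scan the run-queue for vats that will soon receive deliveries,
// subtract those already online, and prewarm the rest in parallel.
function vatsNeedingWorkers(runQueue, isOnline) {
  const needed = new Set();
  for (const event of runQueue) {
    if (event.vatID && !isOnline(event.vatID)) {
      needed.add(event.vatID);
    }
  }
  return needed;
}

async function prewarm(runQueue, isOnline, startWorker) {
  const needed = vatsNeedingWorkers(runQueue, isOnline);
  await Promise.all([...needed].map(startWorker));
}
```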
## Description of the Design
One option would be a `vatWarehouse.startWorkers(vatIDs)`. This might be the simplest: only use parallelism during the kernel startup process, before we start doing any other work, as we're aggressively pre-loading workers.

Another would be a `vatWarehouse.startWorker(vatID)` that returns a Promise which fires when the worker is ready to go. Then vat-warehouse could use this promise internally to stall deliveries until the worker was ready.

This needs caution, because for the most part the kernel is effectively single-threaded and does not indulge in parallelism. There are plenty of promises in the kernel, but we almost always do an immediate `await` on each one. The kernel docs instruct the host application to call `controller.run()` and then wait, not making any other kernel API calls until it fires, and we have a handful of reentrancy-preventing guards on those APIs.
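For flavor, a hedged sketch of the two API shapes; `launch` and the internal `pending` map are assumed vat-warehouse internals, not existing code:

```js
// Option 1: batch startup, used only during kernel restart.
async function startWorkers(vatIDs) {
  await Promise.all(vatIDs.map(vatID => launch(vatID)));
}

// Option 2: per-vat promise, cached so later deliveries can stall on it.
const pending = new Map(); // vatID -> Promise<worker>
function startWorker(vatID) {
  if (!pending.has(vatID)) {
    pending.set(vatID, launch(vatID));
  }
  return pending.get(vatID);
}

async function deliverToVat(vatID, delivery) {
  const worker = await startWorker(vatID); // stalls until replay finishes
  return worker.deliver(delivery);
}
```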
## Security Considerations
Large numbers of simultaneous workers might threaten the host computer's memory budget (think OOM killer), or might kick it into thrashing, but the relatively small size of each worker, and the "one worker at a time" execution model, makes me not worry about memory pressure very much.
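If memory pressure did become a concern, bounding the parallelism would be an easy mitigation. A minimal limiter sketch, with `replayVat` and the cap of 4 as assumptions:

```js
// Cap how many replays run at once with a small promise-based semaphore.
function makeLimiter(max) {
  let active = 0;
  const waiting = [];
  const release = () => {
    active -= 1;
    if (waiting.length > 0) {
      active += 1;
      waiting.shift()(); // wake the next waiter, keeping its slot reserved
    }
  };
  return async function limit(thunk) {
    if (active >= max) {
      await new Promise(resolve => waiting.push(resolve));
    } else {
      active += 1;
    }
    try {
      return await thunk();
    } finally {
      release();
    }
  };
}

async function replayInParallel(vatIDs, replayVat, maxParallelReplays = 4) {
  const limit = makeLimiter(maxParallelReplays);
  await Promise.all(vatIDs.map(vatID => limit(() => replayVat(vatID))));
}
```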
We've designed this so that `WarehousePolicyOptions` can differ between validators without causing consensus problems, but we haven't tested that part very thoroughly.

## Test Plan
Not sure; ideally some unit tests, but parallelism is traditionally hard to exercise deterministically in unit tests.
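One way to keep such a test deterministic is to replace real replays with manually-resolved promises, so the test controls the interleaving. A rough sketch (the `startWorkers` helper here is the hypothetical Option-1 shape from above, and the test uses AVA):

```js
import test from 'ava';

// batch-start helper matching the Option-1 shape sketched above
const startWorkers = (vatIDs, launch) =>
  Promise.all(vatIDs.map(launch)).then(() => undefined);

test('startWorkers waits for every replay', async t => {
  const resolvers = new Map();
  const fakeLaunch = vatID =>
    new Promise(resolve => resolvers.set(vatID, resolve));

  const done = startWorkers(['v1', 'v2'], fakeLaunch);
  let settled = false;
  done.then(() => {
    settled = true;
  });

  resolvers.get('v1')(); // finish only one fake replay
  await Promise.resolve(); // drain a microtask turn
  t.false(settled, 'should still be waiting for v2');

  resolvers.get('v2')();
  await done;
  t.true(settled);
});
```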