
rate-limited GC in BringOutYourDead #8417

Open · warner opened this issue Oct 1, 2023 · 4 comments

Labels: enhancement (New feature or request), liveslots, requires vat-upgrade to deploy changes

warner commented Oct 1, 2023

What is the Problem Being Solved?

If a vat deletes a large collection (or some other portion of the object graph), making a large number of objects unreachable, it may perform a large number of GC dropImport actions all at once, which could take a long time and swamp the kernel with GC activity. We should prevent that, and limit vats to deleting a moderate number of objects at a time.

Background

Inside each vat, normal operation results in some number of objects becoming unreachable. These include normal RAM-based objects ("ephemerals"), virtual objects and collections, durable objects and collections, and imported objects ("presences"). From the perspective of userspace, these objects should be garbage-collected according to standard JavaScript rules.

The JS engine is responsible for managing GC of objects in RAM, however many of these objects are actually stand-ins for more complex identities like virtual objects and imported presences. To enable our distributed object semantics, the kernel-provided "liveslots" layer is responsible for tracking the reachability status of these non-ephemeral things, and emitting syscalls (like syscall.dropImport) at the right time, to let the kernel coordinate GC among multiple vats.

The liveslots layer uses WeakRefs and a FinalizationRegistry to determine when a vref-identified object (Remotable, Presence, or Representative) is no longer reachable in RAM. It combines that with refcounts (tracking reachability from virtual/durable data) and an "export status" record (tracking reachability from the kernel, and other vats). When this data indicates that a virtual/durable object is no longer reachable, liveslots deletes its data from the vatstore, and propagates the deletion outwards through the object graph as appropriate. In addition to deleting data from the vatstore, this code will emit syscalls to inform the kernel of things that can be dropped.

Finalizers run at arbitrary times, but the RAM drops they observe are merely collected in a set named possiblyDeadSet for later. The actual processing does not happen until a special delivery named dispatch.bringOutYourDead (aka "reap") causes liveslots to examine possiblyDeadSet and check refcounts/export-status on everything named therein. A vref (actually a "baseref", which omits any facet information, and thus identifies a cohort of facets) might be placed in possiblyDeadSet, but then later resurrected by code that causes a new Representative to be created. Only vrefs that lack all forms of reference make it through to the actually-deadSet, and get deleted.
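
As a rough illustration of that collection mechanism (this is a sketch following the names in the prose above, not the actual liveslots source):

```js
// Illustrative sketch: finalizers only record candidates; all real GC work
// is deferred until the next dispatch.bringOutYourDead.
const slotToVal = new Map();       // baseref -> WeakRef of Remotable/Presence/Representative
const possiblyDeadSet = new Set(); // baserefs whose RAM pillar may have dropped

const vreg = new FinalizationRegistry(baseRef => {
  // Runs at an arbitrary time after the object is collected; just take note.
  possiblyDeadSet.add(baseRef);
});

function registerValue(baseRef, val) {
  slotToVal.set(baseRef, new WeakRef(val));
  vreg.register(val, baseRef);
}
```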

The problem we're investigating is that BOYD, as it is affectionately known, does not rate-limit the work it does. A single large virtual collection, which keeps thousands of imports or virtual objects alive, will trigger a large number of deletions and syscalls when it is finally dropped. We are worried about what happens when that much GC work gets triggered in a single crank, and how the kernel will react (#8402).

We have observed mainnet bugs (#8400) which are keeping 120,000 virtual objects alive in one vat, and 90k in another. Some of our remediation plans will allow the vat to rate-limit its own deletion process. But for some (#8401), userspace will have no control over the rate, so we expect about 50k cycles to be broken in a single event, causing hundreds of thousands of vatstore operations and eventually 50k kernel imports being dropped at the same time.

So the task is to somehow rate-limit the GC work that liveslots does, to tolerate large numbers of objects being deleted at the same time. We care both about the vatstore syscalls it uses to track refcount changes, and the GC syscalls (syscall.dropImport/etc) with which it tells the kernel about them.

Description of the Design

Inside liveslots, the scanForDeadObjects function is invoked by dispatch.bringOutYourDead. It does not execute user code, and we disable metering as it runs, to somewhat insulate the vat's syscall trace from the exact details of GC timing. We also force an engine-level GC as it starts, and give all finalizers a chance to execute, so that all unreachable user-visible objects have been collected and finalized by the time it performs its scan.

Currently, scanForDeadObjects starts with a walk through all of possiblyDeadSet, checking each to see if any of the "three pillars" are still standing:

  • RAM, as evidenced by presence in slotToVal
  • virtual data, evidenced by a non-zero vom.rc.${vref} refcount
  • export to kernel, evidenced by a vom.es.${vref} export-status record

If none are left, the vref is placed in deadSet. At the end of the loop, possiblyDeadSet is cleared.
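
In outline, and building on the names sketched earlier, the current (unbudgeted) scan looks something like the following; the vom.rc./vom.es. keys are the ones named above, and the helper shapes are approximate:

```js
const deadSet = new Set(); // vrefs confirmed dead, processed after the scan

// Approximate shape of today's scan: every candidate is examined in one
// pass, with no limit on how many vrefs can be declared dead per BOYD.
function scanForDeadObjects() {
  for (const baseRef of possiblyDeadSet) {
    const ramPillar = slotToVal.has(baseRef); // still reachable from RAM?
    const vdataPillar = Number(syscall.vatstoreGet(`vom.rc.${baseRef}`) || '0') > 0;
    const exportPillar = syscall.vatstoreGet(`vom.es.${baseRef}`) !== undefined;
    if (!ramPillar && !vdataPillar && !exportPillar) {
      deadSet.add(baseRef);
    }
  }
  possiblyDeadSet.clear();
  // deadSet is then processed: vatstore deletions, dropImports/retireImports, etc.
}
```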

Instead, our thought is to limit the number of items we place into deadSet. BOYD would start with a "budget of doom": it is only allowed to actually kill a limited number of vrefs. We change the loop to remove items from possiblyDeadSet as they are examined, decrement the budget for each item added to deadSet, and exit the loop once the budget is exhausted.

(As an optimization, we might inspect possiblyDeadSet.size ahead of time: when it is smaller than the doom budget, a single possiblyDeadSet.clear() is cheaper than N calls to possiblyDeadSet.delete(item).)

The resulting syscalls will be limited by the doom budget. Any vrefs not examined because of doom-budget constraints will remain in possiblyDeadSet for a future BOYD.
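
A sketch of that budgeted variant (budgetOfDoom and isStillReferenced are illustrative names; isStillReferenced stands in for the three-pillar check above):

```js
// Sketch of the proposed budget-limited scan: stop once the "budget of doom"
// is spent; anything unexamined stays in possiblyDeadSet for a later BOYD.
function scanForDeadObjectsBudgeted(budgetOfDoom) {
  let budgetLeft = budgetOfDoom;
  for (const baseRef of possiblyDeadSet) {
    if (budgetLeft <= 0) {
      break; // leave the rest for a future BOYD
    }
    possiblyDeadSet.delete(baseRef); // remove as we examine, not all at once
    if (!isStillReferenced(baseRef)) { // the three-pillar check from above
      deadSet.add(baseRef);
      budgetLeft -= 1;
    }
  }
  // report how much potential work remains, e.g. for the kernel to consider
  return { killed: budgetOfDoom - budgetLeft, leftover: possiblyDeadSet.size };
}
```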

Two considerations that need more analysis:

  • the refcount changes cause a lot of vatstore syscalls, possibly more than we want
  • the possiblyRetiredSet is also deeply involved, and I have not yet figured out how we should manage it
    • for sure, we must not remove vrefs from possiblyRetiredSet if they're still in possiblyDeadSet
    • the if (!deadSet.has(vref)) check cannot apply if the vref is missing from deadSet because we deferred considering that vref earlier

Rate-Limited Collection Deletion

Currently, when a virtual collection (e.g. a DurableWeakStore) is deleted, we use clearInternalFull to delete every entry, then we delete the metadata. As we delete entries, we decrement the refcount of whatever the entry used to point to. These refcount changes may then push new vrefs onto possiblyDeadSet for later examination.

However, if the collection is very large (like the #8400 bug which involves 120k recoverySet Payments), then deleting the collection will trigger a huge number of entry deletions all within the same crank, each of which requires several vatstore syscalls. This work may be prohibitive, even if the subsequent syscall.dropImport calls are rate-limited.

@erights's original idea was to rate-limit this clear-all-entries for large collections. We'd introduce a queue of collections that are no longer reachable, but which are not yet empty. When deadSet processing discovers that a collection has become unreachable, we check its size, and if it has more items than some threshold, we push it onto a special internal collection (managed like a queue, but really probably a DurableMapStore indexed by integer). That keeps the collection and everything inside it alive. Each time we do BOYD, we look at the top collection on this queue, iterate through the first N entries, and delete them. That limits the amount of syscall work we do (in addition to limiting the dropImports that get queued up).

When the top item on the queue is smaller than the threshold, we delete it normally.

This will require some assistance from the Collection Manager, which currently exposes a clearInternal(deleting = true) method for the benefit of GC code. We need two API calls: one like sizeInternal to report back the size (even for weak collections, which track the size internally but do not share it with userspace), and clearInternalSome(N), which would delete a limited number of entries (again, even for weak collections, where userspace doesn't even get .clear()).
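
A hedged sketch of how the dying-collection queue and those two new calls might fit together (all names here — deadCollectionQueue, sizeInternal, clearInternalSome, deleteCollectionMetadata, and the thresholds — are illustrative, not the real Collection Manager API):

```js
// Sketch only: dead-but-not-yet-empty collections are parked on a queue and
// drained a bounded amount per BOYD, instead of being cleared all at once.
const LARGE_COLLECTION_THRESHOLD = 1000; // illustrative
const ENTRIES_PER_BOYD = 500;            // illustrative
const deadCollectionQueue = [];          // really probably a DurableMapStore indexed by integer

function collectionBecameUnreachable(collectionID) {
  if (sizeInternal(collectionID) > LARGE_COLLECTION_THRESHOLD) {
    // too big to clear now: park it, which keeps it (and its contents) alive
    deadCollectionQueue.push(collectionID);
  } else {
    clearInternalFull(collectionID); // small enough to delete in one crank
    deleteCollectionMetadata(collectionID);
  }
}

function drainDeadCollections() {
  // called once per BOYD, alongside the budgeted vref scan
  if (deadCollectionQueue.length === 0) {
    return;
  }
  const collectionID = deadCollectionQueue[0];
  clearInternalSome(collectionID, ENTRIES_PER_BOYD); // delete at most N entries, decrefing as we go
  if (sizeInternal(collectionID) <= LARGE_COLLECTION_THRESHOLD) {
    deadCollectionQueue.shift(); // now small enough to finish off normally
    clearInternalFull(collectionID);
    deleteCollectionMetadata(collectionID);
  }
}
```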

This would provide rate-limiting for both #8400-type bugs (which a cooperative userspace could rate-limit themselves), and #8401-type bugs (where the cycles are only reachable from a weakmap, so userspace has no way to limit their own deletions).

Security Considerations

This reduces a security threat: a vat could build up a large number of objects over time and then deliberately delete them all at once to slow down the kernel.

Scaling Considerations

Test Plan

Unit tests for a variety of situations, using the MockGC framework to exercise precise control over when objects' RAM pillars are dropped.

Upgrade Considerations

We'll need this upgrade deployed before we can upgrade price-feed and Zoe vats to the forms that remediate the #8400/#8401 bugs. The new rate-limiting liveslots must be the current version on mainnet at the time the other vats are upgraded, so they'll pick up the new liveslots before performing their deletions.

warner commented Oct 7, 2023

more notes:

  • @erights suggested that liveslots might increase the doom budget in proportion to the number of queued objects to check, so we automatically give more time when necessary, without it being completely unbounded
  • we could pass the doom budget in as an argument of the BOYD delivery, and we could return the length of the remaining possiblyDeadSet as a return value, to give the kernel information to consider when it makes scheduling decisions
    • but note that possiblyDeadSet.size is changed spontaneously by finalizer execution, so we should not allow it to affect consensus, which means the kernel can't use it to make scheduling decisions
    • or it could make a special "I need some GC time" syscall, to deliver the queue length but not introduce a new information-return pathway
    • the kernel could decide to do more BOYD immediately, or it could defer it (somehow) to a more-idle block when nothing else is going on. maybe controller.doSomeGC(), called after controller.run(), but only if there is compute budget left
    • related to our plan to move from BOYD-every-N-cranks to BOYD-every-N-computrons, maybe reduce N if we get a signal that there are a significant number of objects to examine
  • at vat-upgrade time, we have to decide how much time we're willing to allocate to the final cleanups; at some point we may cut the vat off and leave it saddled with uncollectable storage entries

warner commented Nov 30, 2023

A data point: I wrote a quick test (following the pattern of test-liveslots-mock-gc.js) that populates a virtual MapStore with N virtual objects, then deletes the MapStore, and measures how long the subsequent BOYD takes. This uses mock-gc, so we're explicitly calling gcTools.kill(obj) to remove the memory pillar for each one, and then gcTools.flushAllFRs() to trigger the finalizers.

N        time      BOYD syscalls
10       3.7ms     157
100      12.9ms    1417
1000     375ms     14017
2000     1.39s     28017
4000     5.38s     56017
8000     21.93s    112017
10000    35.13s    140017
12000    51.40s    168017
14000    69.30s    196017
16000    91.29s    224017

I haven't fit the timing curve yet, but it looks roughly quadratic (the number of syscalls is linear, as expected).

(Curiously, my CPU meter didn't show any single core at 100%, even though this is certainly CPU-bound. I guess v8 (or the macOS virtualized linux kernel) is round-robining across multiple cores too fast.)

Oh, ignore that: the simulated kernel interface I use in that test is contributing quadratic slowdowns (it re-sorts the vatStore keys after every deletion, whereas the real swing-store's SQLite is smarter). I'm working on other tests which involve real kernel+liveslots, and they're showing linear time usage.
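
For reference, the shape of that mock-GC measurement (gcTools.kill and gcTools.flushAllFRs are the helpers named above; the wiring around them is approximate and the other names are illustrative):

```js
// Approximate sketch of the timing test: populate a virtual MapStore with N
// virtual objects, drop everything's RAM pillar via mock GC, run the
// finalizers, then time the resulting BOYD delivery.
async function measureBoyd(n, { dispatch, gcTools, makeStore, makeThing }) {
  const store = makeStore();
  const things = [];
  for (let i = 0; i < n; i += 1) {
    const thing = makeThing(i);       // virtual object, held only by the store
    store.init(`key-${i}`, thing);
    things.push(thing);
  }
  for (const thing of things) {
    gcTools.kill(thing);              // drop each object's RAM pillar
  }
  gcTools.kill(store);                // then drop the store itself
  gcTools.flushAllFRs();              // run finalizers so possiblyDeadSet fills up
  const start = Date.now();
  await dispatch(['bringOutYourDead']); // the delivery being measured
  return Date.now() - start;
}
```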

warner commented Jan 13, 2024

I've implemented that design in a branch. I'm now working on how to test it properly, and attempting to guess how it will behave with our #8401 cycle collection scenario. It's clear that this will be only one part of an overall solution (there are constraints at different levels that we need to meet).

If we only change liveslots, and leave the current kernel behavior alone, I think we'll observe:

  • deleting v29 will cause the kernel to abandon a large number of the late vat's exports, and delete (+decref) a large number of its imports, which will necessarily add a large batch of krefs to the gcAction queue. We aren't considering breaking this up.
  • the large gcAction queue will be processed and result in a large dispatch.dropExports into v9-zoe (as the exporter of objects that were held only by v29)
  • the dropExports will trigger about 2x vatstore syscalls, to update refcounts, and will push them all into the RAM-based possiblyDeadSet
  • the subsequent retireExports delivery will also trigger 2x syscalls
  • at some point after that (based on v9-zoe's other deliveries, the reapInterval, and the snapshotInterval), v9-zoe will receive a BOYD, with some limited budgetOfDoom
  • this begins "Phase 1"
  • this BOYD will spend its entire budget deleting a portion of the no-longer-reachable exported objects, and changing the refcounts of the slots they pointed to
  • it will take more normal deliveries to get v9-zoe to another BOYD
    • we need numCycles / budget BOYDs to get through all of Phase 1
    • (remember we process possiblyDeadSet in insertion order)
  • Phase 2 spends another numCycles / budget BOYDs on deleting zoeSeatAdmin objects, and decrefing the imported ExitObjects they hold
  • Phase 3 spends yet another numCycles / budget BOYDs on dropping the imports, issuing budget-sized syscall.dropImports() and retireImports() calls each time
  • as currently implemented, there will be a Phase 4 which finally processes possiblyRetiredSet (which, since it requires syscalls to check for virtual recognizers, must be guarded by the budget)
    • this needs at least numCycles / budget, maybe double that, I'm not sure yet

If we don't change the kernel to accelerate BOYDs, this will take a long time. From 21-Dec-2023 to 02-Jan-2024 (agreeable follower run-18), v9-zoe received 217,332 deliveries over 1M seconds (12.25 days), averaging 4.87s/delivery, so we'd hit snapshotInterval = 200 deliveries and trigger another BOYD once every 16 minutes. If we set a doom budget of 1000, the 200k-ish cycles we had as of 21-Dec-2023 would take 200 BOYDs per phase, 600 BOYDs total (assuming possiblyRetiredSet does not add more), and about 7 days.
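
Spelled out, the arithmetic behind that estimate:

```
217,332 deliveries / ~1.06M s (12.25 days)  ≈ 4.87 s per delivery
200 deliveries × 4.87 s                     ≈ 975 s ≈ 16 min between BOYDs
~200,000 cycles / 1,000 budget              = 200 BOYDs per phase
3 phases × 200 BOYDs                        = 600 BOYDs
600 BOYDs × 16 min                          ≈ 9,600 min ≈ 6.7 days, about a week
```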

On the plus side, this doesn't incur too much overhead, and might not slow down normal blocks very much. Once every 16 minutes we'll perform a few thousand DB calls, then commit the block. On the down side, apart from taking a week, all other GC inside Zoe would be queued behind this work, so nothing else would get freed until we'd finished all three phases (probably not a big deal).

To consider BOYD acceleration, we'd have the kernel look at the dispatch.bringOutYourDead() return value for a signal that liveslots exceeded its budget, and there is more cleanup needed. We could react to that by scheduling an immediate BOYD (which might do it again, etc). We'd want to introduce a "Reap Queue". We'd rewrite controller.reapAllVats to use this (push all vatIDs onto the reap-queue), as well as the reapInterval code, and we'd want to change the snapshotInterval and processVatUpgrade code to remove from reap-queue.
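
A sketch of that kernel-side reaction, assuming BOYD takes a budget argument and returns some "cleanup remaining" indicator (reapQueue and the surrounding helper names are illustrative, not the current kernel API):

```js
// Kernel-side sketch only: if a budgeted BOYD reports leftover work, push the
// vat onto a hypothetical reap-queue so it gets another BOYD soon, instead of
// waiting for the usual reapInterval.
async function doBringOutYourDead(vatID, budget) {
  const result = await deliverToVat(vatID, ['bringOutYourDead', budget]);
  if (result.leftover > 0 && !reapQueue.includes(vatID)) {
    reapQueue.push(vatID);
  }
}

function getNextAction() {
  // one possible priority order, discussed below: GC actions first, then
  // reap-queue BOYDs, then the normal run-queue
  return nextGCAction() ?? nextReapQueueEntry() ?? nextRunQueueMessage();
}
```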

We'd need to decide how to prioritize processing of the reap-queue, as well as where to perform the computron/bean limit checks. Currently we prioritize gc-action-queue above everything else, and it doesn't count towards the limits, so we'll drain the gc-action-queue immediately.

The most aggressive acceleration (the "just rip off the bandage already" approach) would be a priority list of [gc-action-queue, reap-queue, run-queue], and only checking limits after run-queue deliveries. If we deleted v29 from a chain upgrade handler, this would result in all the work being done in a single block, at the point in time when all validators are expecting downtime anyways.

However, I think that will run up against the other constraints:

  • each DB write gets put into a pending SQLite transaction
    • I'm guessing (trusting) these are written to the WAL file, and not held in RAM, but SQLite will do whatever it thinks is best
    • we'll have a large WAL file (temporary disk usage), and a long SQLite commit point
    • but those probably won't cause a problem
  • after the SQLite commit, cosmic-swingset will process the export-data stream
    • for every DB write/delete, we'll do a write/delete of something in IAVL
    • we think these changes get accumulated in RAM, so this is a significant RAM spike
  • then cosmic-swingset will perform the IAVL commit
    • this is effectively deleting 90% of the IAVL entries
    • the merkle tree must be recomputed, although in theory there shouldn't be much left to hash
  • some number of blocks later, different for each validator, cosmos will prune the last large IAVL block
    • that will delete a lot from the LevelDB
    • we observed problems after agoric-upgrade-11 rewrote the kvStore (massive IAVL churn)
    • not right away, but when pruning the pre-upgrade state
    • so we might see problems after deleting a lot of IAVL
  • in addition, if a single delivery has a lot of syscalls (or the delivery itself is large), we'll get a large transcript entry string
    • I observe V8 memory problems (hours spent doing GC, occasional OOMs) when vat-warehouse recordSyscalls accumulates a large number of syscalls into a list

A less aggressive approach would prioritize the reap-queue, but apply computron/bean limits before processing any queue, so that we might end a block with more BOYDs left to do. We could also count kernel DB usage as beans, to capture the work done on the gc-action-queue. The first block might be entirely spent on the deletion of v29, because of all the c-list deletions costing DB beans, and do no gc actions at all. The second block might perform a single GC action (the dispatch.dropExports). The third would do the retireExports. The fourth would begin a series of BOYD-only blocks, up to 600 of them depending on how many budget-limited BOYDs can fit within the bean limit. Only after all those had finished would we have room for normal deliveries.

The benefit would be that the DB commits and IAVL churn would be of a more reasonable size, and we could control just how much state change work happens per block. The downside is that we might have a few hours during which the chain is running furiously, but no other offers can be processed, which is an odd form of downtime (note that plain cosmos txns would be processed, but the actionQueue would not be serviced, so no timer updates, no oracle price updates, and no vault liquidations), and might not be acceptable.

I'm currently in favor of the slow approach, assuming that we're talking about a week of cleanup and not a year. Without some changes to our BOYD schedule, the rate depends upon how fast Zoe is receiving traffic, which is currently dominated by the oracle price updates. That's an odd dependency to have, but if it works, I'll use it.

Other places we might consider slowing down the rate of work:

  • apply a budget to processGCActionSet, so instead of creating a large action ("v9 dropExports(100k vrefs)"), it creates a smaller action ("v9 dropExports(1k vrefs)") and leaves the rest on the gc-action-queue
    • if gc-action-queue remains top-priority, and unmetered, this will spread the work out over multiple deliveries, but not multiple blocks, and the 600-ish BOYDs will start when it finishes
    • it would reduce the size of the dropExports/retireExports transcript entries, specifically their 2x vatstore syscalls, which might still run into the V8 problems if we do them all at once, and which remain large even with the internal BOYD rate-limiting this ticket is about
  • apply a per-block budget to gc-action or reap-queue work
  • have liveslots rate-limit collection deletion, as described at the top of this ticket

mhofman commented May 8, 2024

Reading "Rate-Limited Collection Deletion", that only works for deleting a collection, not clearing it, right? Unless the old "cleared" collection is "set aside" and some new entries are used to describe it, I don't think we can do such a slow deletion on cleared collection as new entries may be added after clearing which shouldn't be affected by the slow deletion.
