rate-limited GC in BringOutYourDead #8417
Comments
more notes:
Oh, ignore that: the simulated kernel interface I use in that test is contributing quadratic slowdowns (it re-sorts the …).
I've implemented that design in a branch. I'm now working on how to test it properly, and attempting to guess how it will behave with our #8401 cycle collection scenario. It's clear that this will be only one part of an overall solution (there are constraints at different levels that we need to meet). If we only change liveslots, and leave the current kernel behavior alone, I think we'll observe:
If we don't change the kernel to accelerate BOYDs, this will take a long time, from 21-Dec-2023 to 02-Jan-2024 (…). On the plus side, this doesn't incur too much overhead, and might not slow down normal blocks very much: once every 16 minutes we'll perform a few thousand DB calls, then commit the block. On the down side, apart from taking a week, all other GC inside Zoe would be queued behind this work, so nothing else would get freed until we'd finished all three phases (probably not a big deal).

To consider BOYD acceleration, we'd have the kernel look at the … We'd need to decide how to prioritize processing of the reap-queue, as well as where to perform the computron/bean limit checks. Currently we prioritize the gc-action-queue above everything else, and it doesn't count towards the limits, so we'll drain the gc-action-queue immediately. The most aggressive acceleration (the "just rip off the bandage already" approach) would be a priority list of … However, I think that will run up against the other constraints:
A less aggressive approach would prioritize the reap-queue, but apply computron/bean limits before processing any queue, so that we might end a block with more BOYDs left to do. We could also count kernel DB usage as beans, to capture the work done on the gc-action-queue. The first block might be entirely spent on the deletion of v29, because of all the c-list deletions costing DB beans, and do no gc actions at all. The second block might perform a single GC action (the …). The benefit would be that the DB commits and IAVL churn would be of a more reasonable size, and we could control just how much state-change work happens per block. The downside is that we might have a few hours during which the chain is running furiously, but no other offers can be processed, which is an odd form of downtime (note that plain cosmos txns would be processed, but the …).

I'm currently in favor of the slow approach, assuming that we're talking about a week of cleanup and not a year. Without some changes to our BOYD schedule, the rate depends upon how fast Zoe is receiving traffic, which is currently dominated by the oracle price updates. That's an odd dependency to have, but if it works, I'll use it. Other places we might consider slowing down the rate of work:
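To make the queue-priority and bean-limit trade-off above concrete, here is a rough sketch; the queue names, the kernel helper methods, and the bean accounting are assumptions for illustration, not the actual kernel or cosmic-swingset code:

```js
// Illustrative only: gcActionQueue/reapQueue/runQueue, processGCAction(),
// deliverBOYD(), and the bean costs are assumptions for this sketch, not the
// real kernel run policy.
function runBlock(kernel, { beanLimit, dbBeansPerCall }) {
  let beansUsed = 0;

  // Drain gc-actions first (as today), but charge their DB usage as beans so
  // a huge cleanup cannot produce an unbounded commit in a single block.
  while (kernel.gcActionQueue.length > 0 && beansUsed < beanLimit) {
    const { dbCalls } = kernel.processGCAction(kernel.gcActionQueue.shift());
    beansUsed += dbCalls * dbBeansPerCall;
  }

  // Spend any remaining budget on accelerated BOYDs from the reap-queue.
  while (kernel.reapQueue.length > 0 && beansUsed < beanLimit) {
    const { computrons } = kernel.deliverBOYD(kernel.reapQueue.shift());
    beansUsed += computrons;
  }

  // Ordinary run-queue deliveries get whatever budget is left over.
  while (kernel.runQueue.length > 0 && beansUsed < beanLimit) {
    beansUsed += kernel.processDelivery(kernel.runQueue.shift());
  }
  return beansUsed;
}
```

In the "less aggressive" variant described above, the limit is checked before every queue is serviced, so a block can legitimately end with BOYDs still pending on the reap-queue.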
Reading "Rate-Limited Collection Deletion", that only works for deleting a collection, not clearing it, right? Unless the old "cleared" collection is "set aside" and some new entries are used to describe it, I don't think we can do such a slow deletion on cleared collection as new entries may be added after clearing which shouldn't be affected by the slow deletion. |
What is the Problem Being Solved?
If a vat deletes a large collection (or other portion of the object graph) which causes the unreachability of a large number of objects, the vat may perform a large number of GC `dropImport` actions at the same time, which might take a long time, and might swamp the kernel with GC activity. We should prevent that, and limit vats to deleting moderate amounts of things at a time.

Background
Inside each vat, normal operation results in some number of objects becoming unreachable. These include normal RAM-based objects ("ephemerals"), virtual objects and collections, durable objects and collections, and imported objects ("presences"). From the perspective of userspace, these objects should be garbage-collected according to standard JavaScript rules.
The JS engine is responsible for managing GC of objects in RAM, however many of these objects are actually stand-ins for more complex identities like virtual objects and imported presences. To enable our distributed object semantics, the kernel-provided "liveslots" layer is responsible for tracking the reachability status of these non-ephemeral things, and emitting syscalls (like `syscall.dropImport`) at the right time, to let the kernel coordinate GC among multiple vats.

The liveslots layer uses `WeakRef`s and a `FinalizationRegistry` to determine when a vref-identified object (Remotable, Presence, or Representative) is no longer reachable in RAM. It combines that with refcounts (tracking reachability from virtual/durable data) and an "export status" record (tracking reachability from the kernel, and other vats). When this data indicates that a virtual/durable object is no longer reachable, liveslots deletes its data from the vatstore, and propagates the deletion outwards through the object graph as appropriate. In addition to deleting data from the vatstore, this code will emit syscalls to inform the kernel of things that can be dropped.

Finalizers run at arbitrary times, but the RAM drops they observe are merely collected in a set named `possiblyDeadSet` for later. The actual processing does not happen until a special delivery named `dispatch.bringOutYourDead` (aka "reap") causes liveslots to examine `possiblyDeadSet` and check refcounts/export-status on everything named therein. A vref (actually a "baseref", which omits any facet information, and thus identifies a cohort of facets) might be placed in `possiblyDeadSet`, but then later resurrected by code that causes a new Representative to be created. Only vrefs that lack all forms of reference make it through to the actual `deadSet`, and get deleted.

The problem we're investigating is that BOYD, as it is affectionately known, does not rate-limit the work it does. A single large virtual collection, which keeps thousands of imports or virtual objects alive, will trigger a large number of deletions and syscalls when it is finally dropped. We are worried about what happens when that much GC work gets triggered in a single crank, and how the kernel will react (#8402).
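As a simplified illustration of the mechanism described above (schematic only, not the actual liveslots source):

```js
// Schematic of how RAM-level drops feed possiblyDeadSet, and why nothing is
// actually deleted until the next dispatch.bringOutYourDead delivery.
const possiblyDeadSet = new Set();

// Each Remotable/Presence/Representative is registered under its baseref;
// when the engine collects it, the finalizer only records that fact.
const finalizationRegistry = new FinalizationRegistry(baseRef => {
  possiblyDeadSet.add(baseRef);
});

function registerValue(baseRef, obj) {
  finalizationRegistry.register(obj, baseRef);
}

function bringOutYourDead() {
  // liveslots forces an engine-level GC first, so all finalizers have run
  for (const baseRef of possiblyDeadSet) {
    // check refcounts, export status, and slotToVal here; only baserefs with
    // no remaining pillar move to deadSet and have their data deleted
  }
  possiblyDeadSet.clear();
}
```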
We have observed mainnet bugs (#8400) which are keeping 120,000 virtual objects alive in one vat, 90k in another. Some of our remediation plans will allow the vat to rate-limit its own deletion process. But for some (#8401), userspace will have no control over the rate, so we're expecting about 50k cycles to get broken in a single event, causing hundreds of thousands of vatstore operations, and eventually leading to 50k kernel imports getting dropped at the same time.
So the task is to somehow rate-limit the GC work that liveslots does, to tolerate large numbers of objects being deleted at the same time. We care both about the vatstore syscalls it uses to track refcount changes, and the GC syscalls (`syscall.dropImport`/etc) with which it tells the kernel about them.

Description of the Design
Inside liveslots, the `scanForDeadObjects` function is invoked by `dispatch.bringOutYourDead`. It does not execute user code, and we disable metering as it runs, to somewhat insulate the vat's syscall trace from the exact details of GC timing. We also force an engine-level GC as it starts, and give all finalizers a chance to execute, so that all unreachable user-visible objects have been collected and finalized by the time it performs its scan.

Currently, `scanForDeadObjects` starts with a walk through all of `possiblyDeadSet`, checking each to see if any of the "three pillars" are still standing:

- `slotToVal`
- the `vom.rc.${vref}` refcount
- the `vom.es.${vref}` export-status record

If none are left, the vref is placed in `deadSet`. At the end of the loop, `possiblyDeadSet` is cleared.

Instead, our thought is to limit the number of items we place into `deadSet`. BOYD would start with a "budget of doom": it is only allowed to actually kill a limited number of vrefs. We change the loop to remove items from `possiblyDeadSet` as they are examined, decrement the budget for each item added to `deadSet`, and exit the loop once the budget is exceeded (see the sketch below).

(As an optimization, we might inspect `possiblyDeadSet.size` ahead of time, and when it is smaller than the doom budget, use a single `possiblyDeadSet.clear()` instead of N calls to `possiblyDeadSet.delete(item)`, for efficiency.)

The resulting syscalls will be limited by the doom budget. Any vrefs not examined because of doom-budget constraints will remain in `possiblyDeadSet` for a future BOYD.

Two considerations that need more analysis:

- `possiblyRetiredSet` is also deeply involved, and I have not yet figured out how we should manage it
- vrefs should not be retired out of `possiblyRetiredSet` if they're still in `possiblyDeadSet`: the `if (!deadSet.has(vref))` check cannot apply if the vref is missing from `deadSet` because we deferred considering that vref earlier
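A minimal sketch of the budget-limited scan described above, including the clear-vs-delete optimization; the helper name `isStillReferenced`, the details of the pillar checks, and the default budget are assumptions for illustration, not the real liveslots code:

```js
// Sketch only: isStillReferenced() is our name for the pillar check, and the
// export-status interpretation is simplified; key shapes follow the text above.
function isStillReferenced(baseRef, { slotToVal, vatstore }) {
  // Pillar 1: a live Remotable/Representative/Presence still reachable in RAM
  const wr = slotToVal.get(baseRef);
  if (wr && wr.deref() !== undefined) {
    return true;
  }
  // Pillar 2: a non-zero refcount from virtual/durable data
  const rc = vatstore.get(`vom.rc.${baseRef}`);
  if (rc && Number(rc) > 0) {
    return true;
  }
  // Pillar 3: an export-status record saying the kernel can still reach it
  const es = vatstore.get(`vom.es.${baseRef}`);
  if (es && es.includes('r')) {
    return true;
  }
  return false; // no pillars left: candidate for deadSet
}

// Proposed budget-limited scan ("budget of doom"); the default budget is arbitrary.
function scanForDeadObjects(possiblyDeadSet, deadSet, pillars, budget = 1000) {
  if (possiblyDeadSet.size <= budget) {
    // small enough: examine everything, then clear the set in one call
    for (const baseRef of possiblyDeadSet) {
      if (!isStillReferenced(baseRef, pillars)) {
        deadSet.add(baseRef);
      }
    }
    possiblyDeadSet.clear();
    return;
  }
  let remaining = budget;
  for (const baseRef of possiblyDeadSet) {
    if (remaining <= 0) {
      break; // leftovers stay in possiblyDeadSet for a future BOYD
    }
    possiblyDeadSet.delete(baseRef); // removed one at a time as it is examined
    if (!isStillReferenced(baseRef, pillars)) {
      deadSet.add(baseRef);
      remaining -= 1;
    }
  }
}
```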
Rate-Limited Collection Deletion

Currently, when a virtual collection (e.g. a DurableWeakStore) is deleted, we use `clearInternalFull` to delete every entry, then we delete the metadata. As we delete entries, we decrement the refcount of whatever the entry used to point to. These refcount changes may then push new vrefs onto `possiblyDeadSet` for later examination.

However, if the collection is very large (like the #8400 bug which involves 120k `recoverySet` Payments), then deleting the collection will trigger a huge number of entry deletions all within the same crank, each of which requires several vatstore syscalls. This work may be prohibitive, even if the subsequent `syscall.dropImport` calls are rate-limited.

@erights's original idea was to rate-limit this clear-all-entries for large collections. We'd introduce a queue of collections that are no longer reachable, but which are not yet empty. When `deadSet` processing realizes a collection is deleted, we check its size, and if it has more items than some threshold, we push it onto a special internal collection (managed like a queue, but really probably a DurableMapStore indexed by integer). That keeps the collection and everything inside it alive. Each time we do BOYD, we look at the top collection on this queue, iterate through the first N entries, and delete them. That will limit the amount of syscall work we do (in addition to limiting the `dropImports` that get queued up).

When the top item on the queue is smaller than the threshold, we delete it normally.
This will require some assistance from the Collection Manager, which currently exposes a `clearInternal(deleting = true)` method for the benefit of GC code. We need two API calls: one like `sizeInternal` to report back the size (even for weak collections, which track the size internally but do not share it with userspace), and `clearInternalSome(N)`, which would delete a limited number of entries (again, even for weak collections, where userspace doesn't even get `.clear()`).

This would provide rate-limiting for both #8400-type bugs (which a cooperative userspace could rate-limit itself), and #8401-type bugs (where the cycles are only reachable from a weakmap, so userspace has no way to limit its own deletions).
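A sketch of the whole deferred-deletion flow under the proposed (not yet existing) `sizeInternal()`/`clearInternalSome(N)` APIs; the thresholds and the queue abstraction are illustrative assumptions:

```js
// Sketch only: THRESHOLD, ENTRIES_PER_BOYD, and the queue methods are
// illustrative; sizeInternal()/clearInternalSome(N) are the proposed APIs.
const THRESHOLD = 1000; // collections bigger than this get deferred deletion
const ENTRIES_PER_BOYD = 500; // bounded vatstore work per BOYD

function noteCollectionUnreachable(collection, doomedQueue) {
  if (collection.sizeInternal() > THRESHOLD) {
    doomedQueue.push(collection); // kept alive, emptied a little at a time
  } else {
    collection.clearInternalFull(); // small: delete everything right now
  }
}

function processDoomedQueue(doomedQueue) {
  // called once per bringOutYourDead
  const top = doomedQueue.peek();
  if (top === undefined) {
    return;
  }
  top.clearInternalSome(ENTRIES_PER_BOYD); // delete at most N entries
  if (top.sizeInternal() <= THRESHOLD) {
    top.clearInternalFull(); // now small enough: finish and drop the metadata
    doomedQueue.shift();
  }
}
```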
Security Considerations
This reduces a security threat: a vat could deliberately build up a lot of objects over time and then delete all of them at once, to slow down the kernel.
Scaling Considerations
Test Plan
Unit tests for a variety of situations, using the `MockGC` framework to exercise precise control over when objects' RAM pillars are dropped.

Upgrade Considerations
We'll need this upgrade deployed before we can upgrade price-feed and Zoe vats to the forms that remediate the #8400/#8401 bugs. The new rate-limiting liveslots must be the current version on mainnet at the time the other vats are upgraded, so they'll pick up the new liveslots before performing their deletions.