break large GC syscalls up into smaller pieces #8389
Comments
Interesting, but somewhat in conflict with my wish to move workers to a model where there is a single result at the end of the delivery containing all "distributed object operations" (once we move vatStore in-process and make device calls async-only). Maybe we should consider a streaming result when we get there, or we'll have already moved to parallelized workers, removing any IPC costs.
I think we can accommodate that; we just need to explicitly allow a large number of syscalls. (Maybe this will turn into a protocol where the worker knows its current references, and the worker+kernel go through some sort of interactive reconciliation process to let the kernel know what can be dropped: we might allow that process to take multiple steps, limiting the sizes of the messages it uses. For example, we might somehow incorporate both after-the-fact "you've just upgraded, what are you still using?" questions and cycle-breaking "are you only holding onto X because I'm holding onto Y?" questions into a better protocol, and we should design that protocol to handle large changes without size-limit problems.)
That GC protocol is orthogonal and partially described in #6793. But yeah, in general I want an async "emit event" from vat to kernel that doesn't need acking from the kernel. Then how these events are sent doesn't matter, whether streamed during execution or bundled up at result time.
By "async" I mean the vat doesn't need to wait for an ack from the syscall, just as it shouldn't have to for any other "distributed object operations". I very much want to avoid effects of a delivery appearing outside of the delivery window.
What is the Problem Being Solved?
I've been investigating some performance issues on mainnet, and one possible consequence of a fix is that a BOYD might GC something like 50,000 imports all at once. That might be more than can fit comfortably inside a single syscall.
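To get a feel for the scale, here's a quick back-of-envelope sketch. The per-entry size and the worker-message limit used below are the rough estimates discussed in this issue, not measured values:

```javascript
// Rough sizing sketch, not a measurement. All three figures are estimates:
// ~50,000 vrefs dropped in one BOYD, ~20 bytes per entry in a
// syscall.dropImports message, and a ~10MB netstring limit on worker messages.
const vrefCount = 50_000;
const bytesPerEntry = 20;
const netstringLimitBytes = 10 * 1024 * 1024;

const messageBytes = vrefCount * bytesPerEntry; // ~1MB for the whole drop
const safetyMarginX = netstringLimitBytes / messageBytes; // roughly 10x headroom
console.log({ messageBytes, safetyMarginX });
```

So even the worst case appears to fit in a single message, but with less margin than we'd like if the estimates are off.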
We'll need to simulate this to measure how long it would take or what limits might be hit, but the most important one would be the netstring length limit on worker messages. That's about 10MB, and `syscall.dropImports` messages will consume at most 20 bytes per entry, so we're talking about a 1MB netstring, leaving a 10x safety margin.

But I don't think I ever considered really large numbers of vrefs passing through the GC code at once. There might be some algorithmic slowdowns (the way we slice off the next GC Action to perform, in particular, might be O(N^2)
). We might or might not be able to dodge these slowdowns by submitting the syscalls in smaller chunks. (Note that I think we need to submit all the vrefs by the end of the BOYD, to avoid the risk of later confusion with vrefs being re-imported, which limits this ability to dodge.)

Description of the Design
In liveslots.js we currently deliver all the dropped vrefs in a single call:
agoric-sdk/packages/swingset-liveslots/src/liveslots.js
Lines 341 to 343 in ef466d3
I think we'd just break this up into groups of about 1,000 at a time. We'd do the same for all three GC calls (dropImports, retireImports, and retireExports).
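A minimal sketch of what the chunking could look like (the helper name and call sites are hypothetical; the real change would live inside liveslots.js):

```javascript
// Hypothetical sketch: emit GC syscalls in bounded chunks instead of one
// giant call. `emitOne` stands in for a single liveslots syscall; a
// CHUNK_SIZE of 1000 matches the ~1k grouping proposed in this issue.
const CHUNK_SIZE = 1000;

function emitInChunks(emitOne, vrefs) {
  // Advance by index so each syscall carries at most CHUNK_SIZE vrefs.
  for (let i = 0; i < vrefs.length; i += CHUNK_SIZE) {
    emitOne(vrefs.slice(i, i + CHUNK_SIZE));
  }
}

// Usage, mirroring the single-call site in liveslots.js (names assumed):
//   emitInChunks(vrefs => syscall.dropImports(vrefs), importsToDrop);
//   emitInChunks(vrefs => syscall.retireImports(vrefs), importsToRetire);
//   emitInChunks(vrefs => syscall.retireExports(vrefs), exportsToRetire);
```

All chunks are still emitted within the same BOYD delivery, which preserves the "everything submitted by the end of the BOYD" requirement noted above.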
Security Considerations
None: the same work is being done, just spread out over multiple calls.
Scaling Considerations
We need to investigate how the kernel reacts to large numbers of vrefs being GCed at the same time, see if there's a way to spread the work out, and look for algorithmic slowdowns.
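As an abstract illustration of how the "slice off the next GC Action" step could go quadratic (this is not the kernel's actual data structure, just the general pattern): if each step re-copies the remaining queue to take the next batch, total copying grows as O(N^2), whereas consuming by index stays O(N).

```javascript
// Count how many elements get copied if we repeatedly slice the front off
// a queue of n items in batches. Each `slice(batch)` re-copies the entire
// remainder, so total copies are ~n^2 / (2 * batch) — quadratic in n.
function countCopiedQuadratic(n, batch) {
  let queue = Array.from({ length: n }, (_, i) => i);
  let copied = 0;
  while (queue.length > 0) {
    queue = queue.slice(batch); // copies the whole remaining queue each step
    copied += queue.length;
  }
  return copied;
}
```

Doubling n quadruples the copying here, which is the kind of slowdown worth profiling in the kernel's GC-action handling.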
Test Plan
Unit tests with a synthetic GC framework that creates more than 1k vrefs to drop at the same time, then examines the syscalls to make sure they are size-limited.
Probably manual tests of the kernel GC code to investigate potential perf problems with large GC operations.
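The unit-test idea above can be sketched with a fake syscall recorder standing in for a real liveslots/vat setup (all names here are hypothetical):

```javascript
// Sketch of the proposed unit test: record GC syscalls, drive a synthetic
// drop of >1k vrefs in chunks, then verify every syscall is size-limited
// and that no vref was lost.
const MAX_VREFS_PER_SYSCALL = 1000;

const makeSyscallRecorder = () => {
  const log = [];
  return { log, dropImports: vrefs => log.push({ type: 'dropImports', vrefs }) };
};

// Simulate a BOYD that must drop 2,500 synthetic vrefs.
const syscall = makeSyscallRecorder();
const vrefs = Array.from({ length: 2500 }, (_, i) => `o-${i + 1}`);
for (let i = 0; i < vrefs.length; i += MAX_VREFS_PER_SYSCALL) {
  syscall.dropImports(vrefs.slice(i, i + MAX_VREFS_PER_SYSCALL));
}

// Examine the recorded syscalls: every call within the limit, total preserved.
const allWithinLimit = syscall.log.every(e => e.vrefs.length <= MAX_VREFS_PER_SYSCALL);
const totalDropped = syscall.log.reduce((n, e) => n + e.vrefs.length, 0);
```

The real test would also want to cover retireImports and retireExports, and the boundary cases (exactly 1,000 vrefs, 1,001 vrefs, zero vrefs).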
Upgrade Considerations
This fix will go into liveslots, which means it will only help vats that have been upgraded. Fortunately the fixes I'm looking at for allowing vats to drop their existing references would require a vat upgrade anyway. So the main sequencing requirement is that the kernel is relaunched (re-bundling liveslots with this smaller-pieces fix) before we attempt to upgrade the vats (with the change that releases those references).