What is the Problem Being Solved?

We have several bugs (#8400, #8401, #8404) which are causing mainnet to hold large quantities of objects. As of 01-Dec-2023:
#8400 consumes 291k Payments in v29, 218k in v46, 98k in v68, and 74k in v69
#8401 has 175k cycles, which consume space in both zoe and various contract vats (125k in v29, 34k in v68, fewer in 14 other vats)
#8404 consumes 4k in v9-zoe
The #8400 leftover Payments will be cleaned up incrementally: each time the price feed publishes a new quote, it will delete ten old ones. I estimated that this will finish cleaning up all 218k in v46 in about 15 days, and during this time it will trigger a BOYD (that will take an extra 1.2s) every 30 minutes, which would be quite sustainable. The v29 payments will take maybe 20 days to finish remediation.
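For illustration only, here is a minimal sketch of that incremental cleanup from the contract's point of view; the names (`oldQuotePayments`, `publishQuoteAndCleanup`) are made up, and a plain Set stands in for whatever durable store actually holds the stale QuotePayments. The point is that each update retires a fixed-size slice of the backlog (218k payments at ten per update is roughly 21,800 updates), so the GC work stays small per BOYD.

```js
// Hypothetical sketch: `oldQuotePayments` stands in for the durable store of
// stale QuotePayments held by the price-feed vat; a plain Set keeps the sketch
// self-contained. Names are illustrative, not the real contract code.
const oldQuotePayments = new Set();
const CLEANUP_BUDGET = 10;

// Called on every price update: publish the new quote, then retire up to ten
// stale payments so the backlog drains a little at a time instead of all at once.
const publishQuoteAndCleanup = newQuotePayment => {
  // ...publish `newQuotePayment` as usual (elided)...

  let deleted = 0;
  for (const stale of oldQuotePayments) {
    oldQuotePayments.delete(stale);
    deleted += 1;
    if (deleted >= CLEANUP_BUDGET) break;
  }
};
```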
However, the #8401 cycles are not easy for userspace to clean up incrementally. The remediation process is likely to have userspace delete the entire weakmap, causing all 175k objects to be dumped into liveslots for GC all at once, in "one fell swoop". If we do not implement #8417, then this will dump all 175k into the kernel at the same time. According to #8402, we might be able to survive this (as in I'm not yet seeing any superlinear execution time), but we need to be more confident than that.
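For concreteness, a sketch of what "one fell swoop" means here, assuming the cycles are held by a single durable collection reachable from vat-zoe's baggage (the key name `legacyInstanceStore` is made up): dropping the collection orphans every one of its ~175k entries in the same crank, and liveslots then has to check and decrement refcounts for all of them.

```js
// Hypothetical sketch of one-fell-swoop remediation in vat-zoe's upgrade path.
// `baggage` is the vat's durable baggage MapStore; the key name is made up.
const remediateInOneFellSwoop = baggage => {
  if (baggage.has('legacyInstanceStore')) {
    // Dropping the collection itself makes all ~175k entries unreachable at once;
    // liveslots must then process every resulting drop/retire, mostly during the
    // next bring-out-your-dead (BOYD) delivery.
    baggage.delete('legacyInstanceStore');
  }
};
```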
So the goal of this ticket is to use the "mainfork" tool to run an actual chain upgrade that will trigger this large GC operation all at once, and measure how long it takes. It might take half an hour or more.
If the measured time is short enough to be acceptable, then we can proceed with remediation of #8401 without doing additional work (like #8417). If it is too long, or if it has other problems (high memory usage, etc.), then we need to find another way: either building #8417 first, or going back to the drawing board and coming up with an entirely different workaround.
Note that this test does not require the new vat to be functional: e.g. it could clobber in-flight offers. This would not be acceptable for the real upgrade, but this test only cares about triggering a sufficiently problematic amount of GC work.
Description of the Design

use mainfork to create a clone of current mainnet state
in the clone, submit a CORE_EVAL proposal which upgrades vat-zoe to the form that does one-fell-swoop remediation (a hedged sketch of such a proposal appears after this list)
measure how long the resulting block takes (see the slog-timing sketch below)
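The following is a rough sketch of what that CORE_EVAL might look like. The bootstrap power names (`vatStore`, `vatAdminSvc`) and the shape of the vatStore entry are assumptions, not the exact proposal we would ship; the essential step is just E(adminNode).upgrade(bundleCap, ...) with a vat-zoe bundle whose upgrade path performs the one-fell-swoop deletion.

```js
// Sketch only: a core-eval behavior that upgrades vat-zoe to a remediation bundle.
// Power names (`vatStore`, `vatAdminSvc`) and the vatStore record shape are
// assumptions; check them against the real bootstrap space before using.
import { E } from '@endo/far';

export const upgradeZoeForRemediation = async (
  { consume: { vatStore, vatAdminSvc } },
  { bundleID },
) => {
  // Assumed: vatStore maps vat names to { root, adminNode } records.
  const { adminNode } = await E(vatStore).get('zoe');
  const bundleCap = await E(vatAdminSvc).getBundleCap(bundleID);
  await E(adminNode).upgrade(bundleCap, {
    upgradeMessage: 'one-fell-swoop #8401 remediation test',
  });
};
```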
Another variant is to perform the upgrade as a chain-halting upgrade, which will more closely match how we expect to deploy this.
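One way to measure the block is to time it from the node's slogfile. The sketch below assumes slog entries of type `cosmic-swingset-end-block-start` and `cosmic-swingset-end-block-finish` with `time` (seconds) and `blockHeight` fields (the names I remember cosmic-swingset emitting; worth confirming against a real slogfile) and reports the END_BLOCK duration per block.

```js
// Sketch: compute per-block END_BLOCK durations from a slogfile (one JSON object
// per line). Event/field names are from memory; verify against a real slogfile.
import { createInterface } from 'node:readline';
import { createReadStream } from 'node:fs';

const main = async slogPath => {
  const starts = new Map();
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line);
    if (entry.type === 'cosmic-swingset-end-block-start') {
      starts.set(entry.blockHeight, entry.time);
    } else if (entry.type === 'cosmic-swingset-end-block-finish') {
      const t0 = starts.get(entry.blockHeight);
      if (t0 !== undefined) {
        console.log(`block ${entry.blockHeight}: ${(entry.time - t0).toFixed(2)}s`);
      }
    }
  }
};

main(process.argv[2]).catch(err => {
  console.error(err);
  process.exit(1);
});
```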
Security Considerations
none
Scaling Considerations
this measures scaling concerns, to decide whether we can afford to use one-fell-swoop remediation or not
Test Plan
none, this is a one-shot manual test
Upgrade Considerations
I don't have detailed numbers to report, but I'm confident that one-fell-swoop would crash. I ran tests with increasingly large datasets (using more and more recent snapshots of the mainnet state), and deleting just the ATOM-USD price feed vats (v29/v46) from run-9, in which there were 50k zoe cycles and 120k QuotePayments, took over four hours and OOM-crashed during the subsequent transcript serialization/write/commit.
The current DB (run-53, 30-sep-2024) has 304k zoe cycles and 851k QuotePayments for v29, about 6x larger. So I think it's a complete non-starter to do a single one-fell-swoop deletion of the old vats. I think deleting any large collection will suffer similar problems.
I added a lot of instrumentation to the deletion process while doing that investigation, and learned that the large number of syscalls (refcount checking) during the subsequent BOYD was the big problem. This led to the design of slow-vat-deletion (#8928), which rate-limits the source of the cost at the earliest point the kernel can.
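As a conceptual sketch of that rate limiting (not the actual kernel implementation; the `kernelState` API here is made up): once a vat is marked as terminated, each cleanup opportunity removes at most a small budget of its leftover state (c-list entries, kv data, snapshots, transcripts) before yielding, so the refcount/syscall storm is spread across many blocks instead of landing in one.

```js
// Conceptual sketch of budgeted (slow) vat deletion; `kernelState` and its
// methods are hypothetical, not the real swingset kernel API.
const cleanupTerminatedVat = (kernelState, vatID, budget = 5) => {
  let work = 0;
  for (const item of kernelState.pendingCleanupItems(vatID)) {
    // Each removal may decrement refcounts and delete kv entries, exactly the
    // work that overwhelmed the post-deletion BOYD when done all at once.
    kernelState.deleteCleanupItem(vatID, item);
    work += 1;
    if (work >= budget) {
      return { done: false, work }; // leave the rest for a later block
    }
  }
  return { done: true, work };
};
```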
So I'm closing this ticket: we did enough investigation to make a design decision, which has since been implemented and will be (mostly) deployed as part of upgrade-17. (The slow-deletion code will not be enabled in upgrade-17 because we lack the cosmic-swingset integration code, #10165, which should land in time to be deployed in upgrade-18.)