What is the Problem Being Solved?

We have several bugs (#8400, #8401, #8404) which are causing mainnet to hold large quantities of objects. As of 01-Dec-2023:
#8400 consumes 291k Payments in v29, 218k in v46, 98k in v68, and 74k in v69
#8401 has 175k cycles, which consume space in both zoe and various contract vats (125k in v29, 34k in v68, fewer in 14 other vats)
#8404 consumes 4k in v9-zoe
The #8400 leftover Payments will be cleaned up incrementally: each time the price feed publishes a new quote, it will delete ten old ones. I estimated that this will finish cleaning up all 218k in v46 in about 15 days, and during this time it will trigger a BOYD (that will take an extra 1.2s) every 30 minutes, which would be quite sustainable. The v29 payments will take maybe 20 days to finish remediation.
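For illustration only, here is a minimal sketch of that incremental cleanup from the contract's point of view; the names (`oldQuotePayments`, `publishQuoteAndCleanup`) are made up, and a plain Set stands in for whatever durable store actually holds the stale QuotePayments. The point is that each update retires a fixed-size slice of the backlog (218k payments at ten per update is roughly 21,800 updates), so the GC work stays small per BOYD.

```js
// Hypothetical sketch: `oldQuotePayments` stands in for the durable store of
// stale QuotePayments held by the price-feed vat; a plain Set keeps the sketch
// self-contained. Names are illustrative, not the real contract code.
const oldQuotePayments = new Set();
const CLEANUP_BUDGET = 10;

// Called on every price update: publish the new quote, then retire up to ten
// stale payments so the backlog drains a little at a time instead of all at once.
const publishQuoteAndCleanup = newQuotePayment => {
  // ...publish `newQuotePayment` as usual (elided)...

  let deleted = 0;
  for (const stale of oldQuotePayments) {
    oldQuotePayments.delete(stale);
    deleted += 1;
    if (deleted >= CLEANUP_BUDGET) break;
  }
};
```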
However, the #8401 cycles are not easy for userspace to clean up incrementally. The remediation process is likely to have userspace delete the entire weakmap, causing all 175k objects to be dumped into liveslots for GC all at once, in "one fell swoop". If we do not implement #8417, then this will dump all 175k into the kernel at the same time. According to #8402, we might be able to survive this (as in I'm not yet seeing any superlinear execution time), but we need to be more confident than that.
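For concreteness, a sketch of what "one fell swoop" means here, assuming the cycles are held by a single durable collection reachable from vat-zoe's baggage (the key name `legacyInstanceStore` is made up): dropping the collection orphans every one of its ~175k entries in the same crank, and liveslots then has to check and decrement refcounts for all of them.

```js
// Hypothetical sketch of one-fell-swoop remediation in vat-zoe's upgrade path.
// `baggage` is the vat's durable baggage MapStore; the key name is made up.
const remediateInOneFellSwoop = baggage => {
  if (baggage.has('legacyInstanceStore')) {
    // Dropping the collection itself makes all ~175k entries unreachable at once;
    // liveslots must then process every resulting drop/retire, mostly during the
    // next bring-out-your-dead (BOYD) delivery.
    baggage.delete('legacyInstanceStore');
  }
};
```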
So the goal of this ticket is to use the "mainfork" tool to run an actual chain upgrade that will trigger this large GC operation all at once, and measure how long it takes. It might take half an hour or more.
If the measured time is short enough to be acceptable, then we can proceed with remediation of #8401 without doing additional work (like #8417). If it is too long, or if it has other problems (high memory usage, etc.), then we need to find another way: either building #8417 first, or going back to the drawing board and coming up with an entirely different workaround.
Note that this test does not require the new vat to be functional: e.g. it could clobber in-flight offers. This would not be acceptable for the real upgrade, but this test only cares about triggering a sufficiently problematic amount of GC work.
Description of the Design

use mainfork to create a clone of current mainnet state
in the clone, submit a CORE_EVAL proposal which upgrades vat-zoe to the form that does one-fell-swoop remediation (a hedged sketch of such a proposal appears after this list)
measure how long the resulting block takes (see the slog-timing sketch below)
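The following is a rough sketch of what that CORE_EVAL might look like. The bootstrap power names (`vatStore`, `vatAdminSvc`) and the shape of the vatStore entry are assumptions, not the exact proposal we would ship; the essential step is just E(adminNode).upgrade(bundleCap, ...) with a vat-zoe bundle whose upgrade path performs the one-fell-swoop deletion.

```js
// Sketch only: a core-eval behavior that upgrades vat-zoe to a remediation bundle.
// Power names (`vatStore`, `vatAdminSvc`) and the vatStore record shape are
// assumptions; check them against the real bootstrap space before using.
import { E } from '@endo/far';

export const upgradeZoeForRemediation = async (
  { consume: { vatStore, vatAdminSvc } },
  { bundleID },
) => {
  // Assumed: vatStore maps vat names to { root, adminNode } records.
  const { adminNode } = await E(vatStore).get('zoe');
  const bundleCap = await E(vatAdminSvc).getBundleCap(bundleID);
  await E(adminNode).upgrade(bundleCap, {
    upgradeMessage: 'one-fell-swoop #8401 remediation test',
  });
};
```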
Another variant is to perform the upgrade as a chain-halting upgrade, which will more closely match how we expect to deploy this.
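One way to measure the block is to time it from the node's slogfile. The sketch below assumes slog entries of type `cosmic-swingset-end-block-start` and `cosmic-swingset-end-block-finish` with `time` (seconds) and `blockHeight` fields (the names I remember cosmic-swingset emitting; worth confirming against a real slogfile) and reports the END_BLOCK duration per block.

```js
// Sketch: compute per-block END_BLOCK durations from a slogfile (one JSON object
// per line). Event/field names are from memory; verify against a real slogfile.
import { createInterface } from 'node:readline';
import { createReadStream } from 'node:fs';

const main = async slogPath => {
  const starts = new Map();
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line);
    if (entry.type === 'cosmic-swingset-end-block-start') {
      starts.set(entry.blockHeight, entry.time);
    } else if (entry.type === 'cosmic-swingset-end-block-finish') {
      const t0 = starts.get(entry.blockHeight);
      if (t0 !== undefined) {
        console.log(`block ${entry.blockHeight}: ${(entry.time - t0).toFixed(2)}s`);
      }
    }
  }
};

main(process.argv[2]).catch(err => {
  console.error(err);
  process.exit(1);
});
```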
Security Considerations
none
Scaling Considerations
this measures scaling concerns, to decide whether we can afford to use one-fell-swoop remediation or not
Test Plan
none, this is a one-shot manual test
Upgrade Considerations
I don't have detailed numbers to report, but I'm confident that one-fell-swoop would crash. I ran tests with increasingly large datasets (using more and more recent snapshots of the mainnet state), and deleting just the ATOM-USD price feed vats (v29/v46) from run-9, in which there were 50k zoe cycles and 120k QuotePayments, took over four hours and OOM-crashed during the subsequent transcript serialization/write/commit.
The current DB (run-53, 30-sep-2024) has 304k zoe cycles and 851k QuotePayments for v29, about 6x larger. So I think it's a complete non-starter to do a single one-fell-swoop deletion of the old vats. I think deleting any large collection will suffer similar problems.
I added a lot of instrumentation to the deletion process while doing that investigation, and learned that the large number of syscalls (refcount checking) during the subsequent BOYD was the big problem. This led to the design of slow-vat-deletion (#8928), which rate-limits the source of the cost at the earliest point the kernel can.
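As a conceptual sketch of that rate limiting (not the actual kernel implementation; the `kernelState` API here is made up): once a vat is marked as terminated, each cleanup opportunity removes at most a small budget of its leftover state (c-list entries, kv data, snapshots, transcripts) before yielding, so the refcount/syscall storm is spread across many blocks instead of landing in one.

```js
// Conceptual sketch of budgeted (slow) vat deletion; `kernelState` and its
// methods are hypothetical, not the real swingset kernel API.
const cleanupTerminatedVat = (kernelState, vatID, budget = 5) => {
  let work = 0;
  for (const item of kernelState.pendingCleanupItems(vatID)) {
    // Each removal may decrement refcounts and delete kv entries, exactly the
    // work that overwhelmed the post-deletion BOYD when done all at once.
    kernelState.deleteCleanupItem(vatID, item);
    work += 1;
    if (work >= budget) {
      return { done: false, work }; // leave the rest for a later block
    }
  }
  return { done: true, work };
};
```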
So I'm closing this ticket: we did enough investigation to make a design decision, which has since been implemented and will be (mostly) deployed as part of upgrade-17. (The slow-deletion code will not be enabled in upgrade-17 because we lack the cosmic-swingset integration code, #10165, which should land in time to be deployed in upgrade-18.)