investigating mainnet slowdown around block 7709170 (05-dec-2022) #6625
Here's the output of
blocks continue to be full (about 65M computrons each) until block 7709257, where the queue is finally drained and we're able to catch up:
We'll start by examining blocks 7709165 (which recorded 240M computrons), 7709213 (250Mc), and 7709243 (280Mc). It's possible that the Big JS Hammer evaluation included a single delivery that took a very long time to complete (not yielding its time slice until well beyond the per-block limit of 65Mc), followed by a long run-out of additional work (which was properly segmented into separate blocks). And maybe this happened three times.
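To make that segmentation behavior concrete, here is a minimal sketch of a per-block computron budget. This is illustrative only, not the actual SwingSet run-policy API, and the names (`PER_BLOCK_LIMIT`, `runBlock`, `runDelivery`) are invented. The key point is that the budget is only consulted between deliveries, so a single very long delivery can push a block far past 65Mc even though the follow-on work gets split across blocks correctly.

```js
// Illustrative sketch of a per-block computron budget (invented names,
// not the real SwingSet run policy).
const PER_BLOCK_LIMIT = 65_000_000; // 65Mc

function runBlock(queue, runDelivery) {
  let used = 0;
  while (queue.length > 0 && used < PER_BLOCK_LIMIT) {
    const delivery = queue.shift();
    // Each delivery runs to completion; it cannot be interrupted mid-crank.
    const { computrons } = runDelivery(delivery);
    used += computrons;
  }
  // The total may overshoot the limit if the last delivery was large.
  return used;
}
```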
We've identified three potential issues:
Yes... each smart wallet is waiting for new vbank assets such as DAI_axl and DAI_grv: see agoric-sdk/packages/smart-wallet/src/smartWallet.js, lines 523 to 524 in 2c812d2.
Notifiers / subscriptions add a new wrinkle to the "only do work proportional to the size of the incoming message" design constraint.
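As a rough illustration (this is not the actual smartWallet.js code; the subscriber and helper names are invented), the pattern looks something like the following: every provisioned wallet consumes the same asset subscription, so publishing one new denom fans out into one purse-creation step per wallet.

```js
// Sketch only: why one new vbank asset triggers work in every wallet.
// `assetSubscriber` and `addPurseForBrand` are hypothetical names.
async function watchVbankAssets(assetSubscriber, addPurseForBrand) {
  for await (const { issuerName, brand, issuer } of assetSubscriber) {
    // Each provisioned wallet creates (and publishes) a purse for the
    // new brand, so N wallets means N of these steps per new denom.
    await addPurseForBrand(issuerName, brand, issuer);
  }
}
```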
- voting period ended 2022-12-06T18:31:37Z: proposal 17 on NG has a pretty clean view of the code that started running
- 6 new vats: 3 each for DAI_axl and DAI_grv
- 2 new purses for each of ~400 wallets: analytics.inter.trade shows over 400 smart wallets provisioned, each of which was waiting to hear about new vbank assets (agoric-sdk/packages/smart-wallet/src/smartWallet.js, lines 523 to 524 in 2c812d2)
- block times spike but eventually recover: nifty chart of block times and missing validators (ack: polkachu)
The first block was 7709165, which handled the proposal execution. It consumed 240M computrons ("240Mc") and 20.942s of swingset execution time.
This is followed by a series of blocks which each use the full 65Mc. The wallclock time they take varies widely, however: from as little as 4s, to many 20-30s blocks, a pair of 55s blocks, and a single 122s outlier.

Each block is using the expected 65M-ish computrons, but some kinds of deliveries take more wallclock time than others. We always hoped the computrons/second ratio would be roughly constant; this is what happens when it is not, and when our mix of delivery types is not random enough to even things out. So some blocks take a lot more time than others, and of course we can't just end a block after five seconds without losing determinism.

I'm still digging into the longer blocks. I think we're seeing the same "long time to serialize" problem that we observed in an earlier testnet, which we thought we fixed by updating balances less frequently. But if we're updating a whole bunch of wallet balances at the same time, we're going to be performing the same kind of serialization over and over, which exacerbates the uneven computrons/second ratio problem.

p.s. see
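A back-of-the-envelope illustration of the uneven computrons/second problem, using invented rates rather than measured ones: two blocks that both stop at the 65Mc budget can differ widely in wallclock time when the rate depends on the delivery mix.

```js
// Sketch: same computron budget, very different wallclock time.
const LIMIT = 65_000_000; // 65Mc per block

function blockSeconds(deliveries) {
  let computrons = 0;
  let seconds = 0;
  for (const { c, cPerSec } of deliveries) {
    if (computrons >= LIMIT) break;
    computrons += c;
    seconds += c / cPerSec;
  }
  return seconds;
}

// Cheap deliveries vs. expensive ones (e.g. large serializations).
// The rates here are purely illustrative placeholders.
const cheap = Array.from({ length: 65 }, () => ({ c: 1_000_000, cPerSec: 15_000_000 }));
const slow = Array.from({ length: 65 }, () => ({ c: 1_000_000, cPerSec: 2_000_000 }));
console.log(blockSeconds(cheap).toFixed(1), 's vs', blockSeconds(slow).toFixed(1), 's');
```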
BTW, the total proposal execution (which introduced two denoms) required:
We're guessing this is proportional to the number of provisioned wallets. With 410 wallets provisioned, the execution requirements for a single denom will probably be:
per provisioned wallet.
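The scaling intuition can be written down as a tiny model (all numbers below are placeholders, not measurements, and `estimateDenomCost` is an invented helper): the cost of introducing one denom is roughly a fixed overhead plus a per-wallet term, so it grows linearly with the number of provisioned wallets.

```js
// Sketch of the scaling model; figures are hypothetical placeholders.
function estimateDenomCost({ wallets, perWalletComputrons, fixedComputrons }) {
  return fixedComputrons + wallets * perWalletComputrons;
}

const total = estimateDenomCost({
  wallets: 410,
  perWalletComputrons: 1_000_000, // hypothetical per-wallet cost
  fixedComputrons: 50_000_000, // hypothetical fixed overhead (new vats, etc.)
});
console.log(`~${total / 1_000_000}Mc for one new denom`);
```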
FYI, I've confirmed a slowdown of XS workers over time, which probably contributed to this issue. This specific degradation should not occur, and we're investigating it. That said, it's not the root cause in this case; it probably just exacerbated the core problems identified above.
I recall another from our discussions:
We had a postmortem meeting today; some of the conclusions:
The slowdown was caused by a multiplication of several factors, all of which are worth investigating and fixing:
To address the per-wallet work, we need to make things lazier, and/or look for ways to just do less work per vat. To address the XS metering precision, we should sort the cranks by their computrons-per-second ratio and examine the outliers.

The breadth-first scheduling doesn't have an obvious fix. When we have a more sophisticated kernel scheduler, we might be able to do better, but the small number of vats involved means the kernel wouldn't have many criteria with which it could distinguish between work done for user 1 vs user 2, etc.

We need to replay these vats and add GC instrumentation to measure how frequently organic GC is happening, how long it takes, and whether this time was a significant contributor to the long blocks.

Anything we do to speed things up or do less work will help with problem 1 (the chain being unavailable for 35 minutes); users care more about that. Anything we can do to make execution more uniform (and make the per-block computron limit more effective) will help with problem 2 (the handful of long blocks, causing voting/consensus/stability problems); validators care more about that.
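As a starting point for the "sort the cranks" analysis, here is a sketch that reads a swingset slog and ranks deliveries by computrons per wallclock second. The slog field names used here (type, crankNum, monotime, dr[2].compute) are assumptions about the slog format and may need adjusting against the real files.

```js
// Sketch: rank cranks by computrons-per-second to find the outliers.
import { createInterface } from 'node:readline';
import { createReadStream } from 'node:fs';

async function rankCranks(slogPath) {
  const starts = new Map(); // crankNum -> start monotime
  const cranks = [];
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    const entry = JSON.parse(line);
    if (entry.type === 'deliver') {
      starts.set(entry.crankNum, entry.monotime);
    } else if (entry.type === 'deliver-result') {
      const t0 = starts.get(entry.crankNum);
      if (t0 === undefined) continue;
      const seconds = entry.monotime - t0;
      // Assumed location of the metering info; adjust to the real schema.
      const computrons = entry.dr?.[2]?.compute ?? 0;
      cranks.push({ crankNum: entry.crankNum, computrons, seconds,
                    rate: computrons / seconds });
    }
  }
  // Slowest computrons/second first: these are the cranks that stretch blocks.
  return cranks.sort((a, b) => a.rate - b.rate);
}
```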
I believe @dtribble's suggestion to push new work generated by a vat onto the top of the run-queue, instead of the bottom, would have effectively prevented this breadth-first issue.
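A minimal sketch of the scheduling difference (not the kernel's actual data structures): appending newly generated work to the back of the run-queue interleaves all wallets (breadth-first), while pushing it onto the front lets each wallet's chain of work finish before the next wallet starts.

```js
// Sketch: breadth-first vs. depth-first handling of newly generated work.
function run(queue, execute, depthFirst) {
  while (queue.length > 0) {
    const item = queue.shift();
    const newWork = execute(item); // deliveries spawned by this crank
    if (depthFirst) {
      queue.unshift(...newWork); // finish this wallet's chain of work first
    } else {
      queue.push(...newWork); // current behavior: interleave all wallets
    }
  }
}
```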
Historical note: it looks like we knew about #6652 as far back as May 2022. We just didn't schedule that work for the release.
We're investigating a slowdown of the mainnet chain that happened around block 7709170, about 10:30am PST on Monday 05-dec-2022. Several validators fell behind during this time, with one block being delayed by over two minutes.
Preliminary investigation suggests that a governance proposal passed and became active at about this time, causing a large chunk of JS to be executed via the "Big JS Hammer" (core-bootstrap eval) upgrade pathway. This can be expected to take non-trivial time to complete; however, the work should have been broken up into multiple blocks. It's possible that the computron limit was not being enforced correctly, allowing the block to take longer than it should have.

We'll collect information and findings in this ticket.
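For context, a core-eval proposal generally boils down to a JS script that the bootstrap vat evaluates and invokes with a set of named powers granted by an accompanying permit. The sketch below shows only the general shape; the function name, the powers it consumes, and the body are illustrative, not taken from the proposal that actually ran in block 7709165.

```js
// Rough sketch of the shape of a core-eval ("Big JS Hammer") script.
// Power names and the body are illustrative only.
const addNewVbankAssets = async ({ consume: { bankManager, agoricNamesAdmin } }) => {
  // ...start the new issuer/mint vats, register the new denoms with the
  // bank manager, and publish the brands/issuers so wallets learn about them...
};
// The script evaluates to the function the bootstrap vat should run.
addNewVbankAssets;
```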