Tracking: Commit blocks to state using a separate task #4937

teor2345 · 2022-08-23T20:15:15Z

Motivation

Zebra takes 10-15 minutes to commit some blocks to the state while checkpointing, around blocks 1,718,000 to 1,772,000.

The slow blocks are different on different runs.

This is unacceptable performance, because:

it's much slower than zcashd
Zebra will appear to hang for 15 minutes, which is a usability and security issue
it causes warnings in the Zebra logs
if it's remotely triggerable, it could be a denial of service risk

Diagnosis

Zebra queues up to 1200 blocks, then commits them all in the same state request, after the missing block arrives. This can take up to 10 seconds per block.

Design

Add a block commit task to the state, which runs in a separate thread. The task should be between the block queue and the block verifier.

We'll need to move the shared mutable chain state into the block commit task, so we will also need to redirect StateService read requests to the concurrent ReadStateService.

Here is a diagram of the new state design:
https://docs.google.com/drawings/d/1FXpAUlenDAjl8nkftrypdAPsj0jr-Ut9gZlSP57nuyc/edit
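
As a rough illustration of the design above, here is a minimal sketch of a commit task that runs in its own thread and is the only owner of the mutable chain state. It uses only the standard library. `Block`, `CommitRequest`, `spawn_commit_task`, and `commit` are hypothetical names made up for this sketch; they are not Zebra's actual types or APIs.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a verified block; only the height matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Block {
    pub height: u32,
}

/// A commit request: the block, plus a channel for sending the result back.
pub struct CommitRequest {
    pub block: Block,
    pub respond_to: mpsc::Sender<Result<u32, String>>,
}

/// Spawns the commit thread and returns the request sender and the thread handle.
/// The thread returns the committed chain when every sender has been dropped.
pub fn spawn_commit_task() -> (mpsc::Sender<CommitRequest>, thread::JoinHandle<Vec<Block>>) {
    let (tx, rx) = mpsc::channel::<CommitRequest>();
    let handle = thread::spawn(move || {
        // The mutable chain state lives only in this thread, so it needs no lock.
        let mut chain: Vec<Block> = Vec::new();
        for request in rx {
            let CommitRequest { block, respond_to } = request;
            let expected = chain.last().map_or(0, |tip| tip.height + 1);
            let result = if block.height == expected {
                chain.push(block);
                Ok(expected)
            } else {
                Err(format!("expected height {}, got {}", expected, block.height))
            };
            // The caller may have stopped waiting; ignore a closed response channel.
            let _ = respond_to.send(result);
        }
        chain
    });
    (tx, handle)
}

/// Sends one block to the commit task and waits for that block's result.
pub fn commit(tx: &mpsc::Sender<CommitRequest>, block: Block) -> Result<u32, String> {
    let (respond_to, response) = mpsc::channel();
    tx.send(CommitRequest { block, respond_to })
        .map_err(|_| "commit task has exited".to_string())?;
    response
        .recv()
        .map_err(|_| "commit task dropped the request".to_string())?
}
```

Because each request carries its own response channel, a caller gets an answer as soon as its block is committed, rather than after a whole batch. Read requests would go to a separate concurrent read path, which this sketch leaves out.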

Implementation Plan

Stop Accessing Mutable Chain State

Make StateService requests into concurrent ReadStateService requests #5102

Set Up Channels

Set Up Block Commit Task

Add a new block commit task with unused channels

Add channels to send blocks to the task

Add a channel that handles finalized state CommitFinalizedBlock requests
Add a channel that handles non-finalized state CommitBlock requests
We want two channels so we can wait for the last finalized block before committing the first non-finalized block (by height)
- The current implementation of this has a bug: Avoid temporary failures verifying the first non-finalized block #5125
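
One way to picture the two-channel ordering described above is the sketch below. It assumes blocks can be represented by their heights, and that the last checkpoint height is known in advance. `run_commit_loop` is a hypothetical helper, not a Zebra function.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical commit loop: commits every finalized block up to and including
/// `last_checkpoint` before reading anything from the non-finalized channel.
/// Returns the heights in the order they were committed.
pub fn run_commit_loop(
    finalized: Receiver<u32>,
    non_finalized: Receiver<u32>,
    last_checkpoint: u32,
) -> Vec<u32> {
    let mut committed = Vec::new();
    let mut reached_checkpoint = false;

    // Phase 1: finalized blocks only. Non-finalized blocks that arrive early
    // simply wait in their own channel.
    for height in finalized {
        committed.push(height);
        if height == last_checkpoint {
            reached_checkpoint = true;
            break;
        }
    }

    // If the finalized channel closed early, do not start non-finalized commits.
    if !reached_checkpoint {
        return committed;
    }

    // Phase 2: non-finalized blocks, in arrival order.
    for height in non_finalized {
        committed.push(height);
    }
    committed
}
```

With a single shared channel, an early non-finalized block would have to be rejected or re-queued. With two channels it stays buffered until the finalized phase is done.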

Error Handling & Testing

Handle panics in the block commit task by panicking in the service
Testing - what new tests do we want
- Test that transitioning between finalized and non-finalized blocks works when non-finalized blocks arrive first #5315
- Test if checkpoint verifier and state service are correctly reset on block commit errors #2654
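
For the panic handling item above, a minimal sketch using only the standard library could look like this. It assumes the service holds the commit thread's `JoinHandle`. `propagate_task_panic` is a hypothetical helper name.

```rust
use std::panic;
use std::thread::JoinHandle;

/// Hypothetical helper: waits for the commit thread to finish. If the thread
/// panicked, the same panic is re-raised in the calling thread. Otherwise the
/// thread's return value is passed through.
pub fn propagate_task_panic<T>(handle: JoinHandle<T>) -> T {
    match handle.join() {
        Ok(value) => value,
        // `join` returns the panic payload as an error; resume unwinding with it.
        Err(payload) => panic::resume_unwind(payload),
    }
}
```

This blocks until the thread exits, so a real service would more likely poll the handle, or be notified over a channel, rather than call it on every request.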

Optional tasks:

Optional Cleanup Tasks

Bug fixes:

Avoid an AwaitUtxo race condition when switching to the non-finalized state #5126

Refactors:

Make pending_utxos.respond() async using a channel, so we can use ReadRequest::ChainUtxo in AwaitUtxo

Renames & Formatting:

Rename every instance of address * or transparent_* to address_*
Put the Request and Response enums in a consistent order

In Scope

Non-finalized state
Finalized state
Running the task in a separate thread

Out of Scope

We don't think we'll need to make these changes as part of this change:

Scale lookahead limit based on upcoming checkpoint sizes #5101
Check for downloaded hashes in a batch #5103
(this reduces the number of state requests from the syncer)

These are definitely out of scope:

Other state refactors
Other performance improvements
Note commitment tree performance improvements

teor2345 · 2022-08-25T06:45:33Z

I've run through all this height range, and I can't reproduce this bug locally. On my machine, all blocks commit in under 10 minutes.

I'm guessing that it's caused by running out of disk space, and PR #4945 will fix it.

teor2345 · 2022-08-25T19:36:40Z

I'm still seeing some of these warnings after increasing the disk size to 200 GB, so that wasn't the complete fix:

WARN {net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 12 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=96.856% current_height=Height(1729637) network_upgrade=Nu5 time_since_last_state_block=12m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
WARN state: run time: very long CommitFinalizedBlock time=12m 32s module=zebra_state::service line=685

https://github.com/ZcashFoundation/zebra/runs/8016091142?check_suite_focus=true#step:6:888

teor2345 · 2022-08-26T03:28:22Z

We might want to try the performance recommendations here:
https://github.com/facebook/rocksdb/wiki/Space-Tuning#block-size

Or:
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#basic-tuning-suggestions

But we need to check what's slow after merging PR #4952.

teor2345 · 2022-09-02T02:38:29Z

It appears that committing blocks hangs because we sometimes commit over 1000 blocks at the same time, then return the result to the state caller (checkpointer or non-finalized state block committer):

2022-09-01T22:50:25.650944Z WARN state: run time: very long queue_and_commit_finalized: 1774457..=1775638, 1181 blocks time=11m 24s module=zebra_state::service::finalized_state line=173
2022-09-01T22:50:25.650981Z WARN state: run time: very long CommitFinalizedBlock time=11m 24s module=zebra_state::service line=697

https://github.com/ZcashFoundation/zebra/runs/8147047219?check_suite_focus=true#step:6:165

I'll open some new tickets on Monday with alternative fixes, and add them to a list in this ticket.

upbqdn · 2022-09-05T16:02:11Z

I got these when I was syncing beta 14:

run time: long CommitFinalizedBlock time="7m 40s"
run time: long CommitFinalizedBlock time="8m 25s"
run time: long CommitFinalizedBlock time="6m 18s"
run time: long CommitFinalizedBlock time="5m 36s"
run time: long CommitFinalizedBlock time="8m 52s"
run time: very long CommitFinalizedBlock time=10m
run time: long CommitFinalizedBlock time="5m 14s"
run time: long CommitFinalizedBlock time="5m 39s"
run time: long CommitFinalizedBlock time="7m 12s"
run time: long CommitFinalizedBlock time="5m 5s" 
run time: long CommitFinalizedBlock time="5m 9s" 
run time: very long CommitFinalizedBlock time=9m 
run time: long CommitFinalizedBlock time="5m 2s" 
run time: long CommitFinalizedBlock time="7m 3s" 
run time: long CommitFinalizedBlock time="5m 22s"
run time: long CommitFinalizedBlock time="8m 21s"
run time: long CommitFinalizedBlock time="6m 53s"
run time: long CommitFinalizedBlock time="5m 22s"

And these when I was syncing Zebra with #4721:

run time: long CommitFinalizedBlock time="6m 21s"
run time: long CommitFinalizedBlock time="5m 36s"
run time: long CommitFinalizedBlock time="6m 38s"
run time: long CommitFinalizedBlock time="5m 45s"
run time: long CommitFinalizedBlock time="5m 26s"
run time: long CommitFinalizedBlock time="5m 37s"
run time: long CommitFinalizedBlock time="6m 28s"
run time: long CommitFinalizedBlock time="5m 50s"
run time: long CommitFinalizedBlock time="5m 27s"
run time: long CommitFinalizedBlock time="6m 9s" 
run time: long CommitFinalizedBlock time="6m 9s" 
run time: long CommitFinalizedBlock time="8m 44s"
run time: long CommitFinalizedBlock time="6m 13s"
run time: long CommitFinalizedBlock time="5m 5s"

EDIT: all of these entries occurred after the sync had passed 96%.

I just searched the logs for the string run time, and cropped the lines so that there are no line breaks. There are 18 entries in the first listing, and 14 in the second one. We don't currently log the heights that these entries belong to. It might be useful to add that, so we can check if the long times repeat for the same blocks. Note that none of the entries went higher than 10 minutes.
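
A sketch of what logging heights alongside slow commits might look like is below. The 5 minute threshold, the message format, and the name `slow_commit_warning` are assumptions made for this example, not Zebra's actual logging code.

```rust
use std::ops::RangeInclusive;
use std::time::Duration;

/// Hypothetical threshold: commits slower than this get a warning.
pub const SLOW_COMMIT: Duration = Duration::from_secs(5 * 60);

/// Builds a warning message that includes the committed height range,
/// or returns `None` if the commit was fast enough.
pub fn slow_commit_warning(heights: RangeInclusive<u32>, elapsed: Duration) -> Option<String> {
    if elapsed < SLOW_COMMIT {
        return None;
    }
    // Counting the range handles empty ranges without underflow.
    let blocks = heights.clone().count();
    let secs = elapsed.as_secs();
    Some(format!(
        "long commit: {}..={}, {} blocks time={}m {}s",
        heights.start(),
        heights.end(),
        blocks,
        secs / 60,
        secs % 60
    ))
}
```

With the heights in each message, two sync runs can be compared directly to see whether the same blocks are slow each time.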

upbqdn · 2022-09-05T16:17:47Z

I also got these for beta 14:

WARN chain updates have stalled, state height has not increased for 10 minutes. sync_percent=79.411% current_height=Height(1426706) time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
...
WARN chain updates have stalled, state height has not increased for 10 minutes. sync_percent=93.243% current_height=Height(1674039) time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false

And these for #4721:

WARN chain updates have stalled, state height has not increased for 10 minutes. sync_percent=47.462% current_height=Height(855249) time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
...
WARN chain updates have stalled, state height has not increased for 10 minutes. sync_percent=48.418% current_height=Height(872449) time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false

The heights seem unrelated, so the problem is likely to be in Zebra itself.

teor2345 · 2022-09-09T00:05:49Z

We want two channels so we can wait for the last finalized block before committing the first non-finalized block (by height)

Waiting for the last finalized block is currently handled by can_fork_chain_at(), which returns early if the tip hash doesn't match any blocks.

zebra/zebra-state/src/service.rs, lines 378 to 381 in 093d503:

    /// Returns `true` if `hash` is a valid previous block hash for new non-finalized blocks.
    fn can_fork_chain_at(&self, hash: &block::Hash) -> bool {
        self.mem.any_chain_contains(hash) || &self.disk.db().finalized_tip_hash() == hash
    }

However, this check has a bug: if the first non-finalized block arrives before the last finalized block, it times out and fails verification, because the fork point is only checked once per block. The block does verify correctly when it is retried.
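
One possible way to avoid the single check is to park blocks whose parent is not yet the tip, and re-check them whenever the tip changes. The sketch below assumes a single linear chain with no forks, and uses `u64` as a stand-in for a block hash. `PendingBlocks` and its methods are hypothetical names, not the fix that was actually merged.

```rust
use std::collections::HashMap;

/// Stand-in for a block hash; real block hashes are 32 bytes.
pub type Hash = u64;

#[derive(Debug, Clone, PartialEq)]
pub struct Block {
    pub hash: Hash,
    pub parent: Hash,
}

/// Blocks waiting for their parent to become the tip, keyed by parent hash.
/// Sketch only: keeps at most one child per parent, so forks are not handled.
pub struct PendingBlocks {
    tip: Hash,
    waiting: HashMap<Hash, Block>,
}

impl PendingBlocks {
    pub fn new(finalized_tip: Hash) -> Self {
        Self { tip: finalized_tip, waiting: HashMap::new() }
    }

    /// Queues a block, then returns every block that is now ready, in chain order.
    pub fn submit(&mut self, block: Block) -> Vec<Block> {
        self.waiting.insert(block.parent, block);
        self.drain_ready()
    }

    /// Called when a finalized block is committed: re-checks the parked blocks.
    pub fn finalized_tip_changed(&mut self, tip: Hash) -> Vec<Block> {
        self.tip = tip;
        self.drain_ready()
    }

    /// Repeatedly takes the block whose parent is the current tip.
    fn drain_ready(&mut self) -> Vec<Block> {
        let mut ready = Vec::new();
        while let Some(next) = self.waiting.remove(&self.tip) {
            self.tip = next.hash;
            ready.push(next);
        }
        ready
    }
}
```

In this sketch, a non-finalized block that arrives early is parked rather than timed out, and is released as soon as the last finalized block updates the tip.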

teor2345 · 2022-09-15T22:36:23Z

We haven't fixed the bug in this ticket yet.
(When we do, we'll add it to that PR so it auto-closes this ticket.)

mpguerra · 2022-09-28T08:57:17Z

removing epic from sprint, individual issues in epic should be added instead

teor2345 · 2022-10-11T00:03:07Z

@mpguerra I think we can close this now, there's only one PR left.

mpguerra · 2022-10-11T08:46:54Z

Do we want to do anything else here? I have converted this to a tracking issue, as all of the issues added to the epic were closed, and I have removed it from the release candidate epic.
If we don't want to do anything else here, we should close this.

teor2345 · 2022-10-11T19:42:22Z

We've achieved the goals of this ticket within the release candidate scope.

…to state using a separate task ZcashFoundation#4937), added HeightDiff and height ops fixed, several read requests forwarded to ReadStateService

teor2345 changed the title ~~Committing some blocks to the state takes 15 minutes~~ Committing a specific block to the state takes 15 minutes Aug 23, 2022

teor2345 mentioned this issue Aug 23, 2022

Epic: Zebra Release Candidate #3096

Closed

76 tasks

teor2345 added the C-security Category: Security issues label Aug 23, 2022

teor2345 changed the title ~~Committing a specific block to the state takes 15 minutes~~ Committing some blocks to the state takes 15 minutes Aug 24, 2022

teor2345 closed this as completed Aug 25, 2022

teor2345 reopened this Aug 25, 2022

teor2345 mentioned this issue Aug 25, 2022

Full sync with logging for very slow state block commits #4952

Closed

1 task

teor2345 assigned teor2345 and arya2 Aug 27, 2022

teor2345 mentioned this issue Aug 30, 2022

fix(ci): Split a long full sync job #5001

Merged

1 task

teor2345 assigned upbqdn Sep 5, 2022

teor2345 added the Epic Zenhub Label. Denotes a theme of work under which related issues will be grouped label Sep 6, 2022

teor2345 changed the title ~~Committing some blocks to the state takes 15 minutes~~ Epic: Commit blocks to state using a separate task Sep 6, 2022

This was referenced Sep 7, 2022

2. change(state): Run AwaitUtxo read requests without shared mutable chain state #5107

Merged

Make StateService requests into concurrent ReadStateService requests #5102

Closed

ftm1000 removed the Epic Zenhub Label. Denotes a theme of work under which related issues will be grouped label Sep 15, 2022

ftm1000 closed this as completed Sep 15, 2022

ftm1000 removed the S-needs-triage Status: A bug report needs triage label Sep 15, 2022

teor2345 reopened this Sep 15, 2022

teor2345 mentioned this issue Sep 19, 2022

Run some docker tests on smaller instances #5189

Closed

arya2 mentioned this issue Sep 20, 2022

Non finalized block commit channel #5210

Closed

3 tasks

mpguerra moved this to 🆕 New in Zebra Sep 22, 2022

mpguerra added this to Zebra Sep 22, 2022

mpguerra moved this from 🆕 New to 🏗 In progress in Zebra Sep 22, 2022

arya2 mentioned this issue Sep 26, 2022

change(state): Write non-finalized blocks to the state in a separate thread, to avoid network and RPC hangs #5257

Merged

5 tasks

This was referenced Oct 2, 2022

Avoid an AwaitUtxo race condition when switching to the non-finalized state #5126

Closed

Test that transitioning between finalized and non-finalized blocks works when non-finalized blocks arrive first #5315

Closed

mpguerra changed the title ~~Epic: Commit blocks to state using a separate task~~ Tracking: Commit blocks to state using a separate task Oct 11, 2022

teor2345 closed this as completed Oct 11, 2022

Repository owner moved this from 🏗 In progress to ✅ Done in Zebra Oct 11, 2022

This was referenced Nov 2, 2022

fix(sync): Make the syncer ignore some new block verification errors #5537

Merged

fix(sync): Pause new downloads when Zebra reaches the lookahead limit #5561

Merged

dimxy mentioned this issue Apr 13, 2023

Pull state writing performance improvement KomodoPlatform/zebra#56

Open
