Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking: Commit blocks to state using a separate task #4937

Closed
12 tasks done
Tracked by #3096
teor2345 opened this issue Aug 23, 2022 · 12 comments
Closed
12 tasks done
Tracked by #3096

Tracking: Commit blocks to state using a separate task #4937

teor2345 opened this issue Aug 23, 2022 · 12 comments
Assignees
Labels
A-state Area: State / database changes C-bug Category: This is a bug C-security Category: Security issues I-hang A Zebra component stops responding to requests I-slow Problems with performance or responsiveness I-usability Zebra is hard to understand or use S-needs-investigation Status: Needs further investigation

Comments

@teor2345
Copy link
Contributor

teor2345 commented Aug 23, 2022

Motivation

Zebra takes 10-15 minutes to commit some blocks to the state while checkpointing, around blocks 1,718,00 to 1,772,000.

The slow blocks are different on different runs.

This is unacceptable performance, because:

  • it's much slower than zcashd
  • Zebra will appear to hang for 15 minutes, which is a usability and security issue
  • it causes warnings in the Zebra logs
  • if it's remotely triggerable, it could be a denial of service risk

Diagnosis

Zebra queues up to 1200 blocks, then commits them all in the same state request, after the missing block arrives. This can take up to 10 seconds per block.

Design

Add a block commit task to the state, which runs in a separate thread. The task should be between the block queue and the block verifier.

We'll need to move the shared mutable chain state into the block commit task, so we will also need to redirect StateService read requests to the concurrent ReadStateService.

Here is a diagram of the new state design:
https://docs.google.com/drawings/d/1FXpAUlenDAjl8nkftrypdAPsj0jr-Ut9gZlSP57nuyc/edit

Implementation Plan

Stop Accessing Mutable Chain State

Set Up Channels

Setup Block Commit task

  • Add a new block commit task with unused channels

Add channels to send blocks to the task

  • Add a channel that handles finalized state CommitFinalizedBlock requests
  • Add a channel that handles non-finalized state CommitBlock requests
  • We want two channels so we can wait for the last finalized block before committing the first non-finalized block (by height)

Error Handling & Testing

Optional tasks:

Optional Cleanup Tasks

Bug fixes:

Refactors:

  • Make pending_utxos.respond() async using a channel, so we can use ReadRequest::ChainUtxo in AwaitUtxo

Renames & Formatting:

  • Rename every instance of address * or transparent_* to address_*
  • Put the Request and Response enums in a consistent order

In Scope

  • Non-finalized state
  • Finalized state
  • Running the task in a separate thread

Out of Scope

We don't think we'll need to make these changes as part of this change:

These are definitely out of scope:

  • Other state refactors
  • Other performance improvements
  • Note commitment tree performance improvements
@teor2345 teor2345 added C-bug Category: This is a bug S-needs-triage Status: A bug report needs triage S-needs-investigation Status: Needs further investigation P-High 🔥 I-hang A Zebra component stops responding to requests I-slow Problems with performance or responsiveness I-usability Zebra is hard to understand or use A-state Area: State / database changes labels Aug 23, 2022
@teor2345 teor2345 changed the title Committing some blocks to the state takes 15 minutes Committing a specific block to the state takes 15 minutes Aug 23, 2022
@teor2345 teor2345 added the C-security Category: Security issues label Aug 23, 2022
@teor2345 teor2345 changed the title Committing a specific block to the state takes 15 minutes Committing some blocks to the state takes 15 minutes Aug 24, 2022
@teor2345
Copy link
Contributor Author

I've run through all this height range, and I can't reproduce this bug locally. On my machine, all blocks commit in under 10 minutes.

I'm guessing that it's caused by running out of disk space, and PR #4945 will fix it.

@teor2345
Copy link
Contributor Author

I'm still seeing some of these warnings after increasing the disk size to 200 GB, so that wasn't the complete fix:

WARN {net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 12 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=96.856% current_height=Height(1729637) network_upgrade=Nu5 time_since_last_state_block=12m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
WARN state: run time: very long CommitFinalizedBlock time=12m 32s module=zebra_state::service line=685

https://github.com/ZcashFoundation/zebra/runs/8016091142?check_suite_focus=true#step:6:888

@teor2345
Copy link
Contributor Author

We might want to try the performance recommendations here:
https://github.com/facebook/rocksdb/wiki/Space-Tuning#block-size

Or:
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#basic-tuning-suggestions

But we need to check what's slow after merging PR #4952.

@teor2345
Copy link
Contributor Author

teor2345 commented Sep 2, 2022

It appears that committing blocks hangs because we sometimes commit over 1000 blocks at the same time, then return the result to the state caller (checkpointer or non-finalized state block committer):

2022-09-01T22:50:25.650944Z WARN state: run time: very long queue_and_commit_finalized: 1774457..=1775638, 1181 blocks time=11m 24s module=zebra_state::service::finalized_state line=173
2022-09-01T22:50:25.650981Z WARN state: run time: very long CommitFinalizedBlock time=11m 24s module=zebra_state::service line=697

https://github.com/ZcashFoundation/zebra/runs/8147047219?check_suite_focus=true#step:6:165

I'll open some new tickets on Monday with alternative fixes, and add them to a list in this ticket.

@upbqdn
Copy link
Member

upbqdn commented Sep 5, 2022

I got these when I was syncing beta 14:

run time: long CommitFinalizedBlock time="7m 40s"
run time: long CommitFinalizedBlock time="8m 25s"
run time: long CommitFinalizedBlock time="6m 18s"
run time: long CommitFinalizedBlock time="5m 36s"
run time: long CommitFinalizedBlock time="8m 52s"
run time: very long CommitFinalizedBlock time=10m
run time: long CommitFinalizedBlock time="5m 14s"
run time: long CommitFinalizedBlock time="5m 39s"
run time: long CommitFinalizedBlock time="7m 12s"
run time: long CommitFinalizedBlock time="5m 5s" 
run time: long CommitFinalizedBlock time="5m 9s" 
run time: very long CommitFinalizedBlock time=9m 
run time: long CommitFinalizedBlock time="5m 2s" 
run time: long CommitFinalizedBlock time="7m 3s" 
run time: long CommitFinalizedBlock time="5m 22s"
run time: long CommitFinalizedBlock time="8m 21s"
run time: long CommitFinalizedBlock time="6m 53s"
run time: long CommitFinalizedBlock time="5m 22s"

And these when I was syncing Zebra with #4721:

run time: long CommitFinalizedBlock time="6m 21s"
run time: long CommitFinalizedBlock time="5m 36s"
run time: long CommitFinalizedBlock time="6m 38s"
run time: long CommitFinalizedBlock time="5m 45s"
run time: long CommitFinalizedBlock time="5m 26s"
run time: long CommitFinalizedBlock time="5m 37s"
run time: long CommitFinalizedBlock time="6m 28s"
run time: long CommitFinalizedBlock time="5m 50s"
run time: long CommitFinalizedBlock time="5m 27s"
run time: long CommitFinalizedBlock time="6m 9s" 
run time: long CommitFinalizedBlock time="6m 9s" 
run time: long CommitFinalizedBlock time="8m 44s"
run time: long CommitFinalizedBlock time="6m 13s"
run time: long CommitFinalizedBlock time="5m 5s" 

EDIT: all of these entries occurred after the 96th percent of a synced chain.

I just searched for the string run time in the logs. I cropped the lines so that there are no line breaks. There are 18 entries in the first listing, and 14 in the second one. We don't currently log for which heights these entries are. It might be useful to add that so we can check if the long times repeat for the same blocks. Note that none of the entries went higher than 10 minutes.

@upbqdn
Copy link
Member

upbqdn commented Sep 5, 2022

I also got these for beta 14:

WARN chain updates have stalled, state height has not increased for 10 minutes. sync_percent=79.411% current_height=Height(1426706) time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
...
WARN chain updates have stalled, state height has not increased for 10 minutes. sync_percent=93.243% current_height=Height(1674039) time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false

And these for #4721:

WARN chain updates have stalled, state height has not increased for 10 minutes. sync_percent=47.462% current_height=Height(855249) time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
...
WARN chain updates have stalled, state height has not increased for 10 minutes. sync_percent=48.418% current_height=Height(872449) time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false

The heights seem unrelated, so the problem is likely to be in Zebra itself.

@teor2345 teor2345 added the Epic Zenhub Label. Denotes a theme of work under which related issues will be grouped label Sep 6, 2022
@teor2345 teor2345 changed the title Committing some blocks to the state takes 15 minutes Epic: Commit blocks to state using a separate task Sep 6, 2022
@teor2345
Copy link
Contributor Author

teor2345 commented Sep 9, 2022

  • We want two channels so we can wait for the last finalized block before committing the first non-finalized block (by height)

Waiting for the last finalized block is currently handled by can_fork_chain_at(), which returns early if the tip hash doesn't match any blocks.

/// Returns `true` if `hash` is a valid previous block hash for new non-finalized blocks.
fn can_fork_chain_at(&self, hash: &block::Hash) -> bool {
self.mem.any_chain_contains(hash) || &self.disk.db().finalized_tip_hash() == hash
}

But it has a bug: if the first non-finalized block arrives before the last finalized block, it will time out and fail verification, because the fork point is only checked once per block. But it will verify correctly when it gets retried.

@teor2345
Copy link
Contributor Author

We haven't fixed the bug in this ticket yet.
(When we do, we'll add it to that PR so it auto-closes this ticket.)

@mpguerra
Copy link
Contributor

removing epic from sprint, individual issues in epic should be added instead

@teor2345
Copy link
Contributor Author

@mpguerra I think we can close this now, there's only one PR left.

@mpguerra mpguerra changed the title Epic: Commit blocks to state using a separate task Tracking: Commit blocks to state using a separate task Oct 11, 2022
@mpguerra
Copy link
Contributor

Do we want to do anything else here? I have converted to a tracking issue as all of the issues added to the epic were closed and I have removed from the release candidate epic.
If we don't want to do anything else here we should close this.

@teor2345
Copy link
Contributor Author

We've achieved the goals of this ticket within the release candidate scope.

Repository owner moved this from 🏗 In progress to ✅ Done in Zebra Oct 11, 2022
dimxy added a commit to dimxy/komodo-zebra that referenced this issue Apr 22, 2023
…to state using a separate task ZcashFoundation#4937), added HeightDiff and height ops fixed, several read requests forwarded to ReadStateService
dimxy added a commit to dimxy/komodo-zebra that referenced this issue Jun 21, 2023
…to state using a separate task ZcashFoundation#4937), added HeightDiff and height ops fixed, several read requests forwarded to ReadStateService
dimxy added a commit to dimxy/komodo-zebra that referenced this issue Jul 4, 2023
…to state using a separate task ZcashFoundation#4937), added HeightDiff and height ops fixed, several read requests forwarded to ReadStateService
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-state Area: State / database changes C-bug Category: This is a bug C-security Category: Security issues I-hang A Zebra component stops responding to requests I-slow Problems with performance or responsiveness I-usability Zebra is hard to understand or use S-needs-investigation Status: Needs further investigation
Projects
Archived in project
Development

No branches or pull requests

5 participants