Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change(state): Check database format is valid every 5 minutes, to catch format errors in new block code #7602

Merged
merged 9 commits into from
Sep 22, 2023

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Sep 21, 2023

Motivation

We want to catch format errors in new block code. But currently that code is only tested when Zebra adds new blocks, then shuts down, then opens the same state. So PRs can merge with bugs in that code.

Instead, we check the format is valid every 5 minutes while Zebra is running.

This is the periodic part of #7570.

Complex Code or Requirements

We can't run this code too often, or it will impact performance. But we want to make sure it gets run during most tests, so we catch bugs early.

Solution

Continue running the format change thread after the initial change finishes, then periodically run a format check:

  • Turn the format change task into a continuous loop
  • Move the max height check into the format checks
  • Make slower format checks cancellable on shutdown

Related changes:

  • Refactor the format change/check code so that "no change" has an enum variant
  • Add a variant for checking the format of new blocks
  • Refactor the format checks so they can be called using a single method
  • Make all the format checks run completely, and log their errors, then panic
  • Time format upgrades and format checks
  • Store the format change flag in the database struct for debugging

Example Logs

On startup, it takes 2 seconds to open the database, and 7 seconds to check the format is valid:

2023-09-21T02:45:10.692975Z  INFO zebra_state::service::finalized_state::disk_format::upgrade: trying to open newer database format: data should be compatible running_version=25.2.1 disk_version=25.2.2
2023-09-21T02:45:12.388329Z  INFO zebra_state::service::finalized_state::disk_format::upgrade: marked database format as downgraded running_version=25.2.1 disk_version=25.2.2 
2023-09-21T02:45:19.485921Z  INFO zebra_state::service::finalized_state::disk_format::upgrade: database format is valid running_version=25.2.1 inital_disk_version=25.2.2

So the format check is running approximately 2% of the time. Since it runs in parallel with the rest of the code, and it will be faster when the data is in the database memory cache, this is acceptable performace.

Review

This is on the critical path for finishing off the subtree work. It's not strictly required for the 1.3.0 release, but it would help to find any other data bugs.

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
    • Does it change concurrent code, unsafe code, or consensus rules?
  • How do you know it works? Does it have tests?

@teor2345 teor2345 added P-Medium ⚡ C-testing Category: These are tests A-state Area: State / database changes A-concurrency Area: Async code, needs extra work to make it work properly. labels Sep 21, 2023
@teor2345 teor2345 requested a review from a team as a code owner September 21, 2023 03:21
@teor2345 teor2345 self-assigned this Sep 21, 2023
@teor2345 teor2345 requested review from oxarbitrage and removed request for a team September 21, 2023 03:21
@mpguerra
Copy link
Contributor

Does this run on CI only or on any running zebra instance?

@oxarbitrage oxarbitrage requested review from upbqdn and removed request for oxarbitrage September 21, 2023 12:44
@teor2345
Copy link
Contributor Author

Does this run on CI only or on any running zebra instance?

This runs on all Zebra instances, because if the database format is wrong, then we can break a consensus rule, or send incorrect data to wallets.

(We already run these checks at startup, but that's not enough to detect all format bugs.)

@mergify mergify bot merged commit b737ccf into main Sep 22, 2023
@mergify mergify bot deleted the flsync-valid branch September 22, 2023 01:33
@upbqdn upbqdn mentioned this pull request Oct 13, 2023
38 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-concurrency Area: Async code, needs extra work to make it work properly. A-state Area: State / database changes C-testing Category: These are tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants