services/horizon/internal/ingest: Prevent redundant and concurrent state verification runs #4821

Merged
merged 2 commits into from
Apr 6, 2023

Conversation

tamirms
Contributor

@tamirms tamirms commented Mar 26, 2023

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Close #4833

As a (most likely unintentional) side effect of #4204, multiple ingesting nodes can perform state verification concurrently. This commit introduces a state verification lock which ensures that only one ingesting node can perform state verification at a time.

This commit also adds two new configuration variables:

  • --ingest-state-verification-frequency, which specifies how often state verification runs, measured in checkpoints
  • --ingest-state-verification-timeout, which specifies the maximum time a state verification run may take

Also, state verification now runs on the read replica if one is configured.

Why

We have noticed that state verification takes more than 30 minutes to run. During this time, state verification holds an open repeatable read transaction. Keeping several of these transactions open for such an extended period may hurt ingestion performance. Moreover, multiple state verification runs for the same checkpoint ledger are wasteful because they perform the same check multiple times.

Also, if state verification has a negative impact on ingestion performance, it is important that we have configuration knobs to limit how frequently it runs. For example, instead of running state verification on every checkpoint, we can run it on every 288th checkpoint (approximately once per day).

Given that state verification is a read-only operation, we can also move it to the read replica, which will minimize its impact on the read-write database used by ingestion.

Known limitations

We need to verify that running state verification on read replicas will not significantly impact the performance of request-serving Horizon instances, which also use the read replica database.

@tamirms tamirms force-pushed the remove-redundant-state-verify branch from e9802d5 to 1655de0 Compare March 26, 2023 15:21
@tamirms tamirms changed the title services/horizon/internal/ingest: Disable concurrent state verification services/horizon/internal/ingest: Prevent concurrent state verification Mar 26, 2023
@tamirms tamirms changed the title services/horizon/internal/ingest: Prevent concurrent state verification services/horizon/internal/ingest: Prevent redundant and concurrent state verification runs Mar 26, 2023
@tamirms tamirms force-pushed the remove-redundant-state-verify branch 3 times, most recently from bbc91a9 to 3018fcd Compare March 26, 2023 17:17
@tamirms tamirms requested a review from a team March 27, 2023 09:05
@tamirms tamirms force-pushed the remove-redundant-state-verify branch from 3018fcd to bf49aaf Compare March 30, 2023 11:10
@sreuland sreuland added horizon ingest New ingestion system labels Mar 30, 2023
@sreuland
Contributor

This is a significant feature addition; it seems like it qualifies for an issue/ticket to provide visibility of the effort on the Stability & Performance board. Maybe we can retro-point it too, so the effort gets included in sprint velocity?

Contributor

@sreuland sreuland left a comment

nice work!

@tamirms tamirms force-pushed the remove-redundant-state-verify branch 4 times, most recently from b2b5a5d to 688a8fe Compare April 4, 2023 20:16
@tamirms
Contributor Author

tamirms commented Apr 6, 2023

After some tests in staging, I discovered that state verification does not work on the read replica because it keeps failing with the following error:

cancelling statement due to conflict with recovery

So, I have removed that feature from this PR. Apart from that, everything looked good.

to specify how often state verification runs and a timeout for
capping the duration of a state verification run.
@tamirms tamirms force-pushed the remove-redundant-state-verify branch from 688a8fe to fff78d8 Compare April 6, 2023 06:41
@tamirms tamirms enabled auto-merge (squash) April 6, 2023 06:56
@tamirms tamirms merged commit 6c5193d into stellar:master Apr 6, 2023
@tamirms tamirms deleted the remove-redundant-state-verify branch April 6, 2023 07:08
Labels
horizon ingest New ingestion system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve State Verification Efficiency
2 participants