services/horizon: Add middleware for checking read-replica lag #3574

Merged
merged 1 commit into from
May 11, 2021

bartekn
Contributor

@bartekn bartekn commented Apr 28, 2021

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the name of the package that is most changed in the PR, e.g.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc.) affected by this change. Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

  • Adds a new RO_DATABASE_URL flag that sets the connection to a read-replica DB.
  • When the flag is set:
    • A new middleware is enabled that checks whether the read replica is behind the primary DB. If the check fails three times, the stale_history error is returned. The middleware sleeps 20ms after the first failure and 40ms after the second.
    • All DB queries are sent to the read replica (except the one that checks the primary sequence number in the middleware above).
  • Adds a new horizon_http_replica_lag_errors_count metric to track the number of error responses caused by replica lag.
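
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `checkReplicaLag`, `ledgerFetcher`, and `errStaleHistory` are hypothetical names, the HTTP layer and the metric are left out, and the fetcher stands in for the two queries that read the latest ingested ledger from the primary and from the replica.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStaleHistory is a stand-in for the stale_history error response
// (hypothetical name, not Horizon's actual type).
var errStaleHistory = errors.New("stale_history")

// ledgerFetcher is a hypothetical hook that returns the latest ingested
// ledger sequence seen on the primary and on the replica.
type ledgerFetcher func() (primary, replica uint32, err error)

// backoffs holds the sleeps after the first and second failed checks,
// as stated in the PR description.
var backoffs = []time.Duration{20 * time.Millisecond, 40 * time.Millisecond}

const maxAttempts = 3

// checkReplicaLag returns the attempt on which the replica was found to be
// caught up, or errStaleHistory if all three checks fail.
func checkReplicaLag(fetch ledgerFetcher) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		primary, replica, err := fetch()
		if err != nil {
			return attempt, err
		}
		if replica >= primary {
			return attempt, nil
		}
		// Sleep only between attempts, not after the final failure.
		if attempt < maxAttempts {
			time.Sleep(backoffs[attempt-1])
		}
	}
	return maxAttempts, errStaleHistory
}

func main() {
	// Simulated replica that is one ledger behind on the first check
	// and caught up on the second.
	replica := uint32(98)
	fetch := func() (uint32, uint32, error) {
		replica++
		return 100, replica, nil
	}
	attempts, err := checkReplicaLag(fetch)
	fmt.Println(attempts, err)
}
```

In a real middleware the returned error would be mapped to the stale_history response and counted in horizon_http_replica_lag_errors_count. Returning the attempt number is an extra assumption of this sketch, added so a caller could observe which check succeeded.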

Why

Horizon should provide clients with strongly consistent responses when using read replicas. Specifically, for two consecutive HTTP requests A and B, the response for B must always return data at a ledger sequence greater than or equal to that of the response for A.

Known limitations

When replica lag is high, the middleware may increase the HTTP error rate.

@bartekn bartekn requested a review from a team April 28, 2021 20:31
@bartekn bartekn linked an issue Apr 28, 2021 that may be closed by this pull request
}

if replicaIngestLedger >= primaryIngestLedger {
	break
Contributor

Would it make sense to have metrics on the distribution of requests that satisfy this condition on attempt 1, 2, 3, and 4?

@2opremio 2opremio changed the base branch from master to release-horizon-v2.3.0 May 11, 2021 16:49
@2opremio 2opremio merged commit d48da5b into release-horizon-v2.3.0 May 11, 2021
@2opremio 2opremio deleted the replica-middleware branch May 11, 2021 16:50
Successfully merging this pull request may close these issues.

Create replica lag middleware
3 participants