
services/horizon: Add horizon health check endpoint #3435

Merged 9 commits into stellar:master, Mar 4, 2021

Conversation

tamirms (Contributor) commented Mar 2, 2021

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the name of the package that is most changed in the PR, e.g.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Close #3396

This commit adds a health check endpoint which can be used to check whether horizon is operational. Fully operational is defined as being able to submit transactions to stellar core and being able to access the Horizon DB. On success the health check responds with a 200 HTTP status code; on failure it responds with a 503.
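A rough sketch of the handler's shape (illustrative only; the field names and wiring are assumptions, not the PR's final code, which lives in services/horizon/internal/health.go):

```go
package health

import (
	"encoding/json"
	"net/http"
)

// healthResponse mirrors the kind of payload the endpoint returns;
// the exact field set here is an assumption, not the PR's final schema.
type healthResponse struct {
	DatabaseConnected bool `json:"database_connected"`
	CoreUp            bool `json:"core_up"`
	CoreSynced        bool `json:"core_synced"`
}

// healthCheck would hold a Horizon DB session and a stellar core client
// in the real code; it is stubbed out here for illustration.
type healthCheck struct{}

func (h healthCheck) runCheck() healthResponse {
	// The real implementation pings the Horizon DB and queries
	// stellar core's info endpoint.
	return healthResponse{DatabaseConnected: true, CoreUp: true, CoreSynced: true}
}

func (h healthCheck) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	resp := h.runCheck()

	status := http.StatusOK // 200: fully operational
	if !resp.DatabaseConnected || !resp.CoreUp || !resp.CoreSynced {
		status = http.StatusServiceUnavailable // 503: degraded
	}

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(resp)
}
```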

Why

This endpoint can be queried by load balancers to determine whether traffic should be routed away from a horizon instance if it is unhealthy.

Known limitations

I think it's worth adding a rule on the ops side so that the endpoint is only accessible internally (e.g. from the load balancer or monitoring agents).

@tamirms tamirms requested review from jacekn and a team March 2, 2021 12:36
@tamirms tamirms marked this pull request as ready for review March 2, 2021 12:36
jacekn (Contributor) commented Mar 2, 2021

I think it's worth adding a rule on the ops side so that the endpoint is only accessible internally (e.g. from the load balancer or monitoring agents).

@tamirms above is a really good point, probably worth further discussion. Maybe /health should be exposed on the admin port instead, next to /metrics? But at the same time I think there is some value in anyone being able to hit /health to confirm the service should be responding. Do you think there is a performance impact if the endpoint is used heavily?

tamirms (Contributor, Author) commented Mar 2, 2021

@jacekn

Maybe /health should be exposed on the admin port instead, next to /metrics? But at the same time I think there is some value in anyone being able to hit /health to confirm the service should be responding.

Yes, I had the same thought and came to the conclusion that it would be better to serve the endpoint from the same port used by the rest of the horizon API.

Do you think there is a performance impact if the endpoint is used heavily?

I think it's possible that stellar core could be overwhelmed with info requests if /health is hit excessively, but there should be rate limiting in place to mitigate that. Still, I think it would make sense to restrict access to /health just to be on the safe side.

@brahman81 brahman81 requested a review from paulbellamy March 2, 2021 22:11
brahman81 (Contributor)

@tamirms I am not sure I understand the motivation for having the /health endpoint exposed on the main socket if we are suggesting the operator restrict and/or rate limit access to it.

Would it be best to expose this internally only, on the same socket as /metrics?

tamirms (Contributor, Author) commented Mar 3, 2021

@brahman81 the benefit of having /health on the main socket is that it verifies that the HTTP server on the main socket is operational and can respond to HTTP requests. Imagine if we exposed /health on the internal socket and, for whatever reason, the internal socket was working fine but the main socket was unable to receive requests. In that scenario the health check requests would succeed even though requests to the horizon API would fail.

jacekn (Contributor) commented Mar 3, 2021

@tamirms I can see your point about validating the main HTTP server; it's definitely a good idea to verify it's running.
But at the same time, if we say that access to the /health endpoint should be restricted by ops, then I think we should not expose it by default. It's very easy to miss the fact that restrictions are needed, and I'm sure many operators would end up with the endpoint exposed unintentionally.

Maybe we should decide if we consider /health safe to be exposed globally without restrictions and work from there?
If we decide it's safe then we expose it by default on the main port and don't require ops to add custom rules.
If we decide it's unsafe, I think the safest default behaviour is for /health not to be exposed. In this case we can either make it optional via a config flag (so that ops must explicitly enable it after adding restrictions) or decide to run it on the admin port and accept the risk that the main HTTP server could be dead while the admin port is up.

ire-and-curses (Member)

Can we just rate limit this endpoint in the implementation? e.g. just cache the stellar core response in Horizon with a lifetime of 1s or something?
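A minimal sketch of that suggestion, reusing the hypothetical healthResponse/runCheck shapes from the sketch above and assuming sync and time imports. Note the expiry race that the next comment points out:

```go
// Naive TTL cache around the health check. The lock only guards the
// cached value, so when the entry expires every concurrent request
// misses at once and all of them run the check against the backend.
type cachedCheck struct {
	mu       sync.Mutex
	cached   healthResponse
	expires  time.Time
	ttl      time.Duration // e.g. time.Second, per the suggestion above
	runCheck func() healthResponse
}

func (c *cachedCheck) get() healthResponse {
	c.mu.Lock()
	if time.Now().Before(c.expires) {
		resp := c.cached
		c.mu.Unlock()
		return resp
	}
	c.mu.Unlock()

	// Many goroutines can reach this line simultaneously: thundering herd.
	resp := c.runCheck()

	c.mu.Lock()
	c.cached, c.expires = resp, time.Now().Add(c.ttl)
	c.mu.Unlock()
	return resp
}
```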

paulbellamy (Contributor)

The cache implemented here still has a thundering-herd problem when it expires. It's probably overkill, but the cache lock should be held around the whole health check, so the backend doesn't get a traffic spike every 500ms.

tamirms (Contributor, Author) commented Mar 3, 2021

@paulbellamy good catch! Would 5196749 fix the problem?

paulbellamy (Contributor)

Yeah, that looks ideal, I'd say. You could keep an RWMutex for performance, but it's probably not worth it. We'll still have the herd problem across multiple horizon instances, but it's also not worth fixing that now.
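For reference, a sketch of the fix being discussed, with the same hypothetical names as above: hold the mutex across the entire check, so concurrent callers block until the single in-flight refresh completes and then serve the fresh cached value.

```go
func (c *cachedCheck) get() healthResponse {
	c.mu.Lock()
	defer c.mu.Unlock()

	if time.Now().Before(c.expires) {
		return c.cached // still fresh: serve from the cache
	}

	// Only one goroutine runs the check; the others block on the mutex
	// and take the fresh-cache branch above once it is released.
	c.cached = c.runCheck()
	c.expires = time.Now().Add(c.ttl)
	return c.cached
}
```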

Resolved review threads (outdated): services/horizon/internal/health.go, support/db/session.go
@ire-and-curses ire-and-curses requested a review from bartekn March 4, 2021 19:00
```diff
@@ -67,11 +67,11 @@ func (h healthCheck) runCheck() healthResponse {
 		CoreSynced: true,
 	}
 	if err := h.session.Ping(dbPingTimeout); err != nil {
-		log.WithField("component", "healthCheck").Warnf("could not ping db: %s", err)
+		log.WithField("service", "healthCheck").Warnf("could not ping db: %s", err)
```

Last NIT: you can create a local log with the field set and reuse it later to avoid duplication:

```go
localLog := log.WithField("service", "healthCheck")
// ...
localLog.Warnf("could not ping db: %s", err)
```

@tamirms tamirms merged commit df2d6e4 into stellar:master Mar 4, 2021
@tamirms tamirms deleted the healthz branch March 4, 2021 20:55