services/horizon: Add horizon health check endpoint #3435
Conversation
@tamirms above is a really good point, probably worth further discussion. Maybe
Yes, I had the same thought and came to the conclusion that it would be better to serve the endpoint out of the same port used by the rest of the Horizon API.
I think Stellar Core could be overwhelmed with info requests if /health is hit excessively, but there should be rate limiting in place to mitigate that. Still, I think it would make sense to restrict access to /health just to be on the safe side.
@tamirms I'm not sure I understand the motivation of having the Would it be best to expose this internally only on the same socket as
@brahman81 the benefit of having
@tamirms I can see your point about validating the main HTTP server; it would definitely be a good idea to verify it's running. Maybe we should decide if we consider
Can we just rate limit this endpoint in the implementation? E.g. just cache the Stellar Core response in Horizon with a lifetime of 1s or something?
The cache implemented here still has a thundering-herd problem when it expires. It's probably overkill, but the cache lock should be around the whole health check, so the backend doesn't get a traffic spike every 500ms.
@paulbellamy good catch! Would 5196749 fix the problem?
Yeah that looks ideal, I'd say. You could keep a RWMutex for performance, but probably not worth it. We'll still have the herd problem across multiple Horizon instances, but also not worth fixing that now.
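For readers following the thread, here is a minimal sketch of the idea discussed above: the mutex is held around both the cache lookup and the check itself, so concurrent requests arriving at expiry wait for a single refresh instead of all hitting the backend. This is not the code from commit 5196749; `cachedCheck`, its fields, and `runDemo` are hypothetical names, and the check is reduced to a boolean for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cachedCheck is a hypothetical wrapper (not the PR's type) that caches the
// result of an expensive health check for ttl.
type cachedCheck struct {
	mu      sync.Mutex
	ttl     time.Duration
	check   func() bool      // expensive check, e.g. DB ping plus core info request
	now     func() time.Time // injectable clock, to keep the demo deterministic
	last    bool
	expires time.Time
}

// Healthy holds the lock across the whole check, so callers that arrive while
// a refresh is running block and then read the fresh cached value.
func (c *cachedCheck) Healthy() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.now().Before(c.expires) {
		return c.last
	}
	c.last = c.check()
	c.expires = c.now().Add(c.ttl)
	return c.last
}

// runDemo fires a burst of concurrent requests, then one more after the TTL
// has passed, and reports how many times the backend check actually ran.
func runDemo() (afterBurst, afterExpiry int) {
	clock := time.Unix(0, 0)
	calls := 0
	c := &cachedCheck{
		ttl:   500 * time.Millisecond,
		now:   func() time.Time { return clock },
		check: func() bool { calls++; return true }, // only runs under c.mu
	}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Healthy()
		}()
	}
	wg.Wait()
	afterBurst = calls
	clock = clock.Add(time.Second) // move past the TTL
	c.Healthy()
	return afterBurst, calls
}

func main() {
	burst, expired := runDemo()
	fmt.Println(burst, expired) // prints "1 2"
}
```

A plain `sync.Mutex` serializes cache hits too; as noted above, an `RWMutex` could reduce that contention, at the cost of a more careful double-check on refresh.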
services/horizon/internal/health.go
@@ -67,11 +67,11 @@ func (h healthCheck) runCheck() healthResponse {
 		CoreSynced: true,
 	}
 	if err := h.session.Ping(dbPingTimeout); err != nil {
-		log.WithField("component", "healthCheck").Warnf("could not ping db: %s", err)
+		log.WithField("service", "healthCheck").Warnf("could not ping db: %s", err)
Last NIT: you can create a local log with field and reuse it later to avoid duplication:
localLog := log.WithField("service", "healthCheck")
// ...
localLog.Warnf("could not ping db: %s", err)
PR Checklist
PR Structure
otherwise).
services/friendbot, or all or doc if the changes are broad or impact many packages.
Thoroughness
.md files, etc... affected by this change). Take a look in the docs folder for a given service, like this one.
Release planning
needed with deprecations, added features, breaking changes, and DB schema changes.
semver, or if it's mainly a patch change. The PR is targeted at the next release branch if it's not a patch change.
What
Close #3396
This commit adds a health check endpoint which can be used to check if Horizon is operational. Fully operational is defined as being able to submit transactions to Stellar Core and being able to access the Horizon DB. On success the health check responds with a 200 HTTP status code. On failure the health check responds with a 503.
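The status code behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the handler from this PR: only `CoreSynced` appears in the reviewed diff, so `DatabaseConnected`, `CoreUp`, the JSON field names, `healthHandler`, and `statusFor` are hypothetical. `httptest` is used only to exercise the handler without starting a server.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthResponse is a simplified stand-in. Only CoreSynced is taken from the
// PR diff; the other fields and all JSON names are assumptions.
type healthResponse struct {
	DatabaseConnected bool `json:"database_connected"`
	CoreUp            bool `json:"core_up"`
	CoreSynced        bool `json:"core_synced"`
}

// ok reports whether every individual check passed.
func (r healthResponse) ok() bool {
	return r.DatabaseConnected && r.CoreUp && r.CoreSynced
}

// healthHandler answers 200 when all checks pass and 503 otherwise, so a load
// balancer can decide from the status code alone.
func healthHandler(check func() healthResponse) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		resp := check()
		status := http.StatusOK
		if !resp.ok() {
			status = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		_ = json.NewEncoder(w).Encode(resp)
	})
}

// statusFor runs one request against the handler and returns the status code.
func statusFor(resp healthResponse) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	healthHandler(func() healthResponse { return resp }).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	healthy := healthResponse{DatabaseConnected: true, CoreUp: true, CoreSynced: true}
	degraded := healthResponse{DatabaseConnected: true, CoreUp: false, CoreSynced: false}
	fmt.Println(statusFor(healthy))  // prints 200
	fmt.Println(statusFor(degraded)) // prints 503
}
```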
Why
This endpoint can be queried by load balancers to determine whether traffic should be routed away from a horizon instance if it is unhealthy.
Known limitations
I think it's worth adding a rule on the ops side so that the endpoint is only accessible internally (e.g. from the load balancer or monitoring agents).