
services/horizon: Add horizon health check endpoint #3435

Merged 9 commits into stellar:master, Mar 4, 2021

Conversation

tamirms (Contributor) commented Mar 2, 2021

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the name of the package that is most changed in the PR, e.g.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Close #3396

This commit adds a health check endpoint which can be used to check whether horizon is operational. Fully operational is defined as being able to submit transactions to stellar core and being able to access the Horizon DB. On success the health check responds with a 200 HTTP status code; on failure it responds with a 503.
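A rough sketch of the handler's shape (illustrative only; the field names and wiring are assumptions, not the PR's final code, which lives in services/horizon/internal/health.go):

```go
package health

import (
	"encoding/json"
	"net/http"
)

// healthResponse mirrors the kind of payload the endpoint returns;
// the exact field set here is an assumption, not the PR's final schema.
type healthResponse struct {
	DatabaseConnected bool `json:"database_connected"`
	CoreUp            bool `json:"core_up"`
	CoreSynced        bool `json:"core_synced"`
}

// healthCheck would hold a Horizon DB session and a stellar core client
// in the real code; it is stubbed out here for illustration.
type healthCheck struct{}

func (h healthCheck) runCheck() healthResponse {
	// The real implementation pings the Horizon DB and queries
	// stellar core's info endpoint.
	return healthResponse{DatabaseConnected: true, CoreUp: true, CoreSynced: true}
}

func (h healthCheck) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	resp := h.runCheck()

	status := http.StatusOK // 200: fully operational
	if !resp.DatabaseConnected || !resp.CoreUp || !resp.CoreSynced {
		status = http.StatusServiceUnavailable // 503: degraded
	}

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(resp)
}
```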

Why

This endpoint can be queried by load balancers to determine whether traffic should be routed away from a horizon instance if it is unhealthy.

Known limitations

I think it's worth adding a rule on the ops side so that the endpoint is only accessible internally (e.g. from the load balancer or monitoring agents).

@tamirms tamirms requested review from jacekn and a team March 2, 2021 12:36
@tamirms tamirms marked this pull request as ready for review March 2, 2021 12:36
jacekn (Contributor) commented Mar 2, 2021

I think it's worth adding a rule on the ops side so that the endpoint is only accessible internally (e.g. from the load balancer or monitoring agents).

@tamirms above is a really good point, probably worth further discussion. Maybe /health should be exposed on the admin port instead, next to /metrics? But at the same time I think there is some value in anyone being able to hit /health to confirm the service should be responding. Do you think there is a performance impact if the endpoint is used heavily?

tamirms (Contributor, Author) commented Mar 2, 2021

@jacekn

Maybe /health should be exposed on the admin port instead, next to /metrics? But at the same time I think there is some value in anyone being able to hit /health to confirm the service should be responding.

Yes, I had the same thought and came to the conclusion that it would be better to serve the endpoint from the same port used by the rest of the horizon API.

Do you think there is a performance impact if the endpoint is used heavily?

I think it's possible that stellar core could be overwhelmed with info requests if /health is hit excessively, but there should be rate limiting in place to mitigate that. Still, I think it would make sense to restrict access to /health just to be on the safe side.

@brahman81 brahman81 requested a review from paulbellamy March 2, 2021 22:11
brahman81 (Contributor)

@tamirms I am not sure I understand the motivation for having the /health endpoint exposed on the main socket if we are suggesting the operator restrict and/or rate limit access to it.

Would it be best to expose this internally only, on the same socket as /metrics?

tamirms (Contributor, Author) commented Mar 3, 2021

@brahman81 the benefit of having /health on the main socket is that it verifies that the HTTP server on the main socket is operational and can respond to HTTP requests. Imagine if we exposed /health on the internal socket and, for whatever reason, the internal socket was working fine but the main socket was unable to receive requests. In that scenario the health check requests would succeed even though requests to the horizon API would fail.

jacekn (Contributor) commented Mar 3, 2021

@tamirms I can see your point about validating the main HTTP server; it's definitely a good idea to verify it's running.
But at the same time, if we say that access to the /health endpoint should be restricted by ops, then I think we should not expose it by default. It's very easy to miss the fact that restrictions are needed, and I'm sure many operators would end up with the endpoint exposed unintentionally.

Maybe we should decide if we consider /health safe to be exposed globally without restrictions and work from there?
If we decide it's safe then we expose it by default on the main port and don't require ops to add custom rules.
If we decide it's unsafe, I think the safest default behaviour is for /health not to be exposed. In this case we can either make it optional via a config flag (so that ops must explicitly enable it after adding restrictions) or decide to run it on the admin port and accept the risk that the main HTTP server could be dead while the admin port is up.

ire-and-curses (Member)

Can we just rate limit this endpoint in the implementation? e.g. just cache the stellar core response in Horizon with a lifetime of 1s or something?
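A minimal sketch of that suggestion, reusing the hypothetical healthResponse/runCheck shapes from the sketch above and assuming sync and time imports. Note the expiry race that the next comment points out:

```go
// Naive TTL cache around the health check. The lock only guards the
// cached value, so when the entry expires every concurrent request
// misses at once and all of them run the check against the backend.
type cachedCheck struct {
	mu       sync.Mutex
	cached   healthResponse
	expires  time.Time
	ttl      time.Duration // e.g. time.Second, per the suggestion above
	runCheck func() healthResponse
}

func (c *cachedCheck) get() healthResponse {
	c.mu.Lock()
	if time.Now().Before(c.expires) {
		resp := c.cached
		c.mu.Unlock()
		return resp
	}
	c.mu.Unlock()

	// Many goroutines can reach this line simultaneously: thundering herd.
	resp := c.runCheck()

	c.mu.Lock()
	c.cached, c.expires = resp, time.Now().Add(c.ttl)
	c.mu.Unlock()
	return resp
}
```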

paulbellamy (Contributor)

The cache implemented here still has a thundering-herd problem when it expires. It's probably overkill, but the cache lock should be held around the whole health check, so the backend doesn't get a traffic spike every 500ms.

tamirms (Contributor, Author) commented Mar 3, 2021

@paulbellamy good catch! Would 5196749 fix the problem?

paulbellamy (Contributor)

Yeah, that looks ideal, I'd say. You could keep an RWMutex for performance, but it's probably not worth it. We'll still have the herd problem across multiple horizon instances, but it's also not worth fixing that now.
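For reference, a sketch of the fix being discussed, with the same hypothetical names as above: hold the mutex across the entire check, so concurrent callers block until the single in-flight refresh completes and then serve the fresh cached value.

```go
func (c *cachedCheck) get() healthResponse {
	c.mu.Lock()
	defer c.mu.Unlock()

	if time.Now().Before(c.expires) {
		return c.cached // still fresh: serve from the cache
	}

	// Only one goroutine runs the check; the others block on the mutex
	// and take the fresh-cache branch above once it is released.
	c.cached = c.runCheck()
	c.expires = time.Now().Add(c.ttl)
	return c.cached
}
```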

Resolved review threads (outdated): services/horizon/internal/health.go, support/db/session.go
@ire-and-curses ire-and-curses requested a review from bartekn March 4, 2021 19:00
```diff
@@ -67,11 +67,11 @@ func (h healthCheck) runCheck() healthResponse {
 		CoreSynced: true,
 	}
 	if err := h.session.Ping(dbPingTimeout); err != nil {
-		log.WithField("component", "healthCheck").Warnf("could not ping db: %s", err)
+		log.WithField("service", "healthCheck").Warnf("could not ping db: %s", err)
```

Last NIT: you can create a local log with the field set and reuse it later to avoid duplication:

```go
localLog := log.WithField("service", "healthCheck")
// ...
localLog.Warnf("could not ping db: %s", err)
```

@tamirms tamirms merged commit df2d6e4 into stellar:master Mar 4, 2021
@tamirms tamirms deleted the healthz branch March 4, 2021 20:55