diff --git a/docs/pages/setup/admin/troubleshooting.mdx b/docs/pages/setup/admin/troubleshooting.mdx
index a34e70d1f31e5..a25257ebdedf4 100644
--- a/docs/pages/setup/admin/troubleshooting.mdx
+++ b/docs/pages/setup/admin/troubleshooting.mdx
@@ -15,27 +15,9 @@ run with verbose logging enabled by passing it `-d` flag.
 It is not recommended to run Teleport in production with verbose logging as it
 generates a substantial amount of data.
 
-Sometimes you may want to reset [`teleport`](../reference/cli.mdx#teleport) to a clean
-state. This can be accomplished by erasing everything under `"data_dir"`
-directory. Assuming the default location, `rm -rf /var/lib/teleport/*` will do.
-
-Teleport also supports HTTP endpoints for monitoring purposes. They are disabled
-by default, but you can enable them:
-
-```code
-$ sudo teleport start --diag-addr=127.0.0.1:3000
-```
-
-Now you can see the monitoring information by visiting several endpoints:
-
-- `http://127.0.0.1:3000/metrics` is the list of internal metrics Teleport is tracking. It is compatible with [Prometheus](https://prometheus.io/)
-  collectors. For a full list of metrics review our [metrics reference](../reference/metrics.mdx).
-- `http://127.0.0.1:3000/healthz` returns "OK" if the process is healthy or
-  `503` otherwise.
-- `http://127.0.0.1:3000/readyz` is similar to `/healthz`, but it returns "OK"
-  *only after* the node successfully joined the cluster, i.e.it draws the difference between "healthy" and "ready".
-- `http://127.0.0.1:3000/debug/pprof/` is Golang's standard profiler. It's only
-  available when `-d` flag is given in addition to `--diag-addr`
+Sometimes you may want to reset [`teleport`](../reference/cli.mdx#teleport) to a
+clean state. This can be accomplished by erasing everything under the `data_dir`
+directory, which defaults to `/var/lib/teleport/`.
 
 ## Debug dump
 
diff --git a/docs/pages/setup/reference/metrics.mdx b/docs/pages/setup/reference/metrics.mdx
index 3501ef5325f4c..754c0b7d15ec8 100644
--- a/docs/pages/setup/reference/metrics.mdx
+++ b/docs/pages/setup/reference/metrics.mdx
@@ -1,30 +1,79 @@
 ---
-title: Teleport Metrics
-description: How to set up Prometheus to monitor Teleport for SSH and Kubernetes access
-h1: Metrics
+title: Teleport Diagnostics
+description: How to use Teleport's health, readiness, profiling, and monitoring endpoints.
 ---
 
-## Teleport Prometheus endpoint
-
-Teleport provides HTTP endpoints for monitoring purposes. They are disabled
-by default, but you can enable them using the `--diag-addr` flag to `teleport start`:
+Teleport provides HTTP endpoints for monitoring purposes. They are disabled by
+default, but you can enable them using the `--diag-addr` flag when running
+`teleport start`:
 
 ```code
 $ sudo teleport start --diag-addr=127.0.0.1:3000
 ```
 
-Now you can see the monitoring information by visiting several endpoints:
-
-- `http://127.0.0.1:3000/metrics` is the list of internal metrics Teleport is
-  tracking. It is compatible with [Prometheus](https://prometheus.io/)
-  collectors.
-- `http://127.0.0.1:3000/healthz` returns "OK" if the process is healthy or
-  `503` otherwise.
-- `http://127.0.0.1:3000/readyz` is similar to `/healthz`, but it returns "OK"
-  *only after* the node successfully joined the cluster, i.e.it draws the
-  difference between "healthy" and "ready".
-- `http://127.0.0.1:3000/debug/pprof/` is Golang's standard profiler. It's only
-  available when `-d` flag is given in addition to `--diag-addr`
+Now you can collect monitoring information from several endpoints.
+
+## `/healthz`
+
+The `http://127.0.0.1:3000/healthz` endpoint responds with a body of
+`{"status":"ok"}` and an HTTP 200 OK status code if the process is running.
+
+This is a simple check, suitable for determining if the Teleport process is
+still running.
+
+## `/readyz`
+
+The `http://127.0.0.1:3000/readyz` endpoint is similar to `/healthz`, but its
+response includes information about the state of the process.
+
+The response body is a JSON object of the form:
+
+```
+{ "status": "a status message here"}
+```
+
+### `/readyz` and heartbeats
+
+If a Teleport component fails to execute its heartbeat procedure, it will enter
+a degraded state. Teleport will begin recovering from this state when a
+heartbeat completes successfully.
+
+The first successful heartbeat will transition Teleport into a recovering state.
+
+A second consecutive successful heartbeat will cause Teleport to transition to
+the OK state, so long as at least 10 seconds have elapsed since the
+first successful heartbeat.
+
+Teleport heartbeats run every 5 seconds. This means that depending on the timing
+of heartbeats, it can take 10-20 seconds after connectivity is restored for
+`/readyz` to start reporting healthy again.
+
+### Status codes
+
+The status code of the response can be one of:
+
+- HTTP 200 OK: Teleport is operating normally
+- HTTP 503 Service Unavailable: Teleport has encountered a connection error and
+  is running in a degraded state. This happens when a Teleport heartbeat fails.
+- HTTP 400 Bad Request: Teleport is either entering its initial startup phase or
+  has begun recovering from a degraded state.
+
+The same state information is also available via the `process_state` metric
+under the `/metrics` endpoint.
+
+## `/debug/pprof`
+
+The `http://127.0.0.1:3000/debug/pprof/` endpoint is Go's standard pprof
+profiler. This endpoint is only available if the `--debug` (or `-d`) flag is
+supplied (in addition to `--diag-addr`).
+
+## `/metrics`
+
+The `http://127.0.0.1:3000/metrics` endpoint serves the internal metrics
+Teleport is tracking. It is compatible with
+[Prometheus](https://prometheus.io/) collectors.
+
+The following metrics are available:
 
 | Name | Type | Component | Description |
 | - | - | - | - |
diff --git a/lib/service/state.go b/lib/service/state.go
index c391ae487d3b4..bc6e18685a87b 100644
--- a/lib/service/state.go
+++ b/lib/service/state.go
@@ -88,7 +88,7 @@ func (f *processState) update(event Event) {
 
 	component, ok := event.Payload.(string)
 	if !ok {
-		f.process.log.Errorf("TeleportDegradedEvent broadcasted without component name, this is a bug!")
+		f.process.log.Errorf("%v broadcasted without component name, this is a bug!", event.Name)
 		return
 	}
 	s, ok := f.states[component]
@@ -118,7 +118,7 @@ func (f *processState) update(event Event) {
 			s.recoveryTime = f.process.Clock.Now()
 			f.process.log.Infof("Teleport component %q is recovering from a degraded state.", component)
 		case stateRecovering:
-			if f.process.Clock.Now().Sub(s.recoveryTime) > defaults.HeartbeatCheckPeriod*2 {
+			if f.process.Clock.Since(s.recoveryTime) > defaults.HeartbeatCheckPeriod*2 {
 				s.state = stateOK
 				f.process.log.Infof("Teleport component %q has recovered from a degraded state.", component)
 			}