Add /health endpoint #49
Yes, it's an expensive health check when there are lots of metrics to return. I'll give this some thought. Thanks for creating the issue.
@seglo I would like to work on this. Can you assign it to me?
@anbarasantr I've assigned it to you. Thanks for volunteering to work on it!
Another idea is to write a file to the container and use a shell command to check for its presence. Then we wouldn't need to add a library to create an endpoint.
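A minimal sketch of the file-based idea in Python. It assumes a hypothetical marker path (`/tmp/kafka-lag-exporter-healthy`) and a hypothetical hook where the exporter calls `mark_healthy()` after each poll; neither exists in kafka-lag-exporter today. The probe side only checks that the file is present.

```python
from pathlib import Path

# Hypothetical marker path; kafka-lag-exporter does not define one.
HEALTH_FILE = Path("/tmp/kafka-lag-exporter-healthy")


def mark_healthy(path: Path = HEALTH_FILE) -> None:
    """Exporter side (assumed hook): call after each successful poll."""
    path.touch(exist_ok=True)  # creates the file, or bumps its mtime if it exists


def is_healthy(path: Path = HEALTH_FILE) -> bool:
    """Probe side: healthy as long as the marker file exists."""
    return path.exists()
```

In a container the probe would not need Python at all: an exec-style probe running `test -f /tmp/kafka-lag-exporter-healthy` gives the same answer, which is the point of avoiding an extra HTTP library.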
@seglo I thought the health endpoint would be useful for configuring liveness and readiness probes for containers. So shouldn't we return success only when the Prometheus client is exposing metrics and the application is polling Kafka successfully?
@anbarasantr Yes, that's the reason for the health check. Right now the check returns a full scrape of all metrics, which is not ideal. I would like to avoid adding another HTTP library and endpoint to the project, or adding too much logic to assert "healthiness". I think touching a file every poll interval, and updating the health checks to assert that the file exists (or has been updated recently), is good enough. Another idea would be to add a kafka-lag-exporter meta metric. For example, in #46 I describe adding a metric that tracks how long poll intervals take and exporting it to the Prometheus endpoint. The container health check could then make an HTTP call that filters for just that metric (https://github.com/lightbend/kafka-lag-exporter#filtering-metrics-without-prometheus-server), and optionally fail the health check if the value is greater than some threshold (longer than the poll interval itself?).
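Checking that the file "has been updated recently" can be sketched by comparing its modification time against a maximum age. The function name and the `max_age_seconds` threshold are assumptions; a value somewhat longer than the exporter's poll interval seems a reasonable starting point. The optional `now` parameter exists only to make the check easy to test.

```python
import time
from pathlib import Path
from typing import Optional


def is_fresh(path: Path, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """Healthy only if the marker file exists and was touched within max_age_seconds."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return False  # never touched: no poll has completed yet
    current = time.time() if now is None else now
    # A stale file means polling has stopped or is running far behind.
    return current - mtime <= max_age_seconds
```

Unlike a plain existence check, this also catches an exporter that is still running but whose poll loop has stalled.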
@anbarasantr OK, great. I'll assign that issue to you as well. Thanks.
Can this issue get some attention? We use kafka-lag-exporter with Consul, which is sensitive to health check output size (snapshots), so returning the full metrics adds unnecessary load to every health check.
@analytically Sure. I will raise a pull request in a few days.
Thanks @anbarasantr!
Thanks. So what is the canonical way to use this as an HTTP health check?
/metrics?name[]=kafka_consumergroup_poll_time_ms
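A sketch of a health check built on that filtered query, in Python. It assumes the exporter is reachable at a base URL such as `http://localhost:8000` (adjust to your configured port), that the response is in the Prometheus text exposition format, and that a poll slower than `threshold_ms` (for example, the poll interval) should count as unhealthy. `parse_samples` and `poll_time_ok` are hypothetical helpers, not part of the project.

```python
import urllib.request
from typing import List


def parse_samples(body: str, metric: str) -> List[float]:
    """Return every sample value for `metric` in a Prometheus text-format body."""
    values: List[float] = []
    for raw in body.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        # Sample lines look like: name{label="v",...} value [timestamp]
        if "{" in line:
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]
        else:
            name, _, rest = line.partition(" ")
        if name == metric:
            values.append(float(rest.split()[0]))
    return values


def poll_time_ok(base_url: str, threshold_ms: float, timeout: float = 5.0) -> bool:
    """Hypothetical probe: healthy if every reported poll time is <= threshold_ms."""
    metric = "kafka_consumergroup_poll_time_ms"
    url = f"{base_url}/metrics?name[]={metric}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
        values = parse_samples(body, metric)
    except (OSError, ValueError, IndexError):
        return False  # unreachable endpoint or unparsable output counts as unhealthy
    # No samples yet means no completed poll, so report unhealthy as well.
    return bool(values) and max(values) <= threshold_ms
```

For example, `poll_time_ok("http://localhost:8000", threshold_ms=30_000)` would treat any poll slower than 30 seconds as a failure. The response stays small because only the one metric is returned, which addresses the Consul snapshot-size concern.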
Currently, both / and /metrics return the full set of metrics. It would be handy to have a simple /health endpoint that returns 200 OK to indicate service availability, instead of always returning the metrics.
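For comparison, the endpoint requested here can be sketched with only the Python standard library: `GET /health` returns `200 OK` with a tiny body, and any other path returns 404. This illustrates the request rather than how kafka-lag-exporter (a Scala project) would implement it; the bind address and port are assumptions, and the handler reports availability only, not whether polling Kafka is succeeding.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlsplit


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Ignore any query string so "/health?x=1" is treated like "/health".
        if urlsplit(self.path).path == "/health":
            status, body = 200, b"OK"
        else:
            status, body = 404, b"Not Found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # keep frequent probe requests out of the logs


def serve(port: int = 0) -> ThreadingHTTPServer:
    """Start the health server in a background thread (port 0 picks a free port)."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The trade-off discussed above still applies: this is exactly the extra HTTP listener the maintainer wanted to avoid, in exchange for a response of a couple of bytes.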