feat(health): add check endpoint and loop control #2575

jfroy · 2024-11-08T20:09:51Z

On Kubernetes, it makes more sense to use a liveness probe than the health server loop (i.e. only have one loop). This patch introduces a flag to disable the health server loop, and a new /check/ endpoint for such probes.

When the connection is saturated, health checks can take a long time and therefore fail if the timeout is too short. Gradually increasing the timeout, as the health server loop does, is not very useful because the upper bound on the timeout is ultimately what you are willing to tolerate before declaring the connection unhealthy. A static probe with a long timeout and a failure count, as implemented in Kubernetes, will therefore be more stable (i.e. less flapping), especially if more than one consecutive failure is allowed.

The above argument aside, the two health/probe loops also do not work well together because they can get out of phase. Kubernetes probes usually must be used to sequence containers in a pod.

Signed-off-by: Jean-Francois Roy <[email protected]>

qdm12 · 2024-11-18T22:15:36Z

A 10-second timeout for a simple TCP dial + TLS handshake should be plenty for a working connection, even if bottlenecked. What do you think?
Your code addresses two things: a liveness endpoint doing a health check (TCP dial + TLS) that times out when the request gets canceled, AND disabling the auto-healing. I'm quite opposed to the second one, so let's focus on the first (I can explain more if you explain why you need it disabled).
The HTTP request to this new liveness endpoint: would it ever be canceled? Something to check, otherwise it might leave a bunch of goroutines hanging (a small memory leak).
Another solution, such that there is only one endpoint, would be to pass a timeout URL query parameter to the healthcheck endpoint. The handler would keep waiting for the "health loop" to report a nil error, and would only return the last error after the given timeout has elapsed. Would that be possible from the orchestrator's point of view? Alternatively, just pass a liveness=true parameter so that the HTTP handler hangs instead of returning an error; that way we keep the health loop independent of how it is used externally.
Just in case you did not notice, you can adjust the timeout values for the auto-healing.

jfroy · 2024-11-18T23:39:14Z

> A 10-second timeout for a simple TCP dial + TLS handshake should be plenty for a working connection, even if bottlenecked. What do you think?

My intuition agrees, but I've seen this fail when 1000s of TCP connections are going through the VPN and the link is fairly loaded (80%+ of expected throughput).

> Your code addresses two things: a liveness endpoint doing a health check (TCP dial + TLS) that times out when the request gets canceled, AND disabling the auto-healing. I'm quite opposed to the second one, so let's focus on the first (I can explain more if you explain why you need it disabled).

The liveness endpoint uses the same code as the health loop on purpose, to avoid code duplication and maintain as much of the behavior of the loop as possible.

As I wrote in the commit message, when using a K8S liveness probe, the gluetun health loop and the K8S probe loop interfere with each other: they can easily get out of phase. Before this patch, you can end up in a situation where the gluetun loop fails and sets the status error, the K8S liveness probe comes in and samples the error (thus failing the probe), the gluetun loop then succeeds and then fails again, the K8S probe comes in and samples an error again, etc. This can lead to K8S considering the container failed, even though the health server is only flapping because the connection is loaded. Because the gluetun loop has an adaptive timeout, it's not possible to prevent the 2 loops from going out of phase. Even if you were to change the gluetun loop to have a fixed period and match it to the K8S probe, they would still eventually get out of phase.

Disabling the gluetun health loop only makes sense when using a K8S liveness probe. When using gluetun from just Docker, it is a bad idea to disable it, indeed. Perhaps the environment variable could be named something more specific ("disable for kubernetes", "disable I know what I am doing", "disable footguns are fun", etc). Alternatively, perhaps if the liveness endpoint is used it could disable the gluetun loop.

> The HTTP request to this new liveness endpoint: would it ever be canceled? Something to check, otherwise it might leave a bunch of goroutines hanging (a small memory leak).

Ah yes, very good point. I will fix the code. I don't think cancellation is likely, but it's just better to handle it.

> Another solution, such that there is only one endpoint, would be to pass a timeout URL query parameter to the healthcheck endpoint. The handler would keep waiting for the "health loop" to report a nil error, and would only return the last error after the given timeout has elapsed. Would that be possible from the orchestrator's point of view? Alternatively, just pass a liveness=true parameter so that the HTTP handler hangs instead of returning an error; that way we keep the health loop independent of how it is used externally.

That's an interesting alternative, but it feels better, overall, not to have 2 loops.

> Just in case you did not notice, you can adjust the timeout values for the auto-healing.

Yeah, and I did try that before writing this patch, but I always got into the unstable regime I described above.
