Add health checks for each service #802
.read()
.as_deref()
.unwrap()
.store(true, Ordering::SeqCst);
With the current implementation, this will drop a proxy connected to a management server in and out of the load balancer, since liveness is tied to readiness - and that seems less than ideal.
The proxy should still process packets even while it's retrying the connection - up until the point where it has to drop entirely to restart.
As a note - with this change, if a backing xDS server goes down, this will take down all clients (proxies, relays, etc) that are connected to it when they eventually restart. Are we happy with that? That's a pretty big area of effect (this reads like a potential regional outage to me).
Or maybe this is more of a documentation issue? i.e. "make sure you set health checks long enough to allow for the system to retry connections"?
So just to reiterate this properly, now that my understanding is more correct:
On initial creation of the Quilkin process, we don't report Ready until we have a successful connection to whatever backing process we are working with (xDS, K8s, whatever).
But if that connection drops, eventually all Quilkin instances handling that connection will take themselves out of the load balancer - which means the whole system stops processing data - and we have a single point of failure that can cause a regional outage.
Is that okay?
For example: if you had a Proxy connected to a Relay, and the Relay goes down, should the proxies continue to process data packets for as long as it takes for the Relay to come back up? Or should it eventually stop processing (Ready: false) altogether, because continuing to process data with stale configuration data would be bad?
Assuming we want a Proxy to eventually stop serving traffic if a relay goes down, we have some good controls at hand on the Kubernetes side. Since we would probably want fast readiness on success, I expect we would advocate for a periodSeconds of 1 (the minimum) with a successThreshold of 1, but a failureThreshold of something like 30 (or some other larger number) - so end users can control this.
I just want to talk this through, since the implications here can be wide, so we would need to document this really well.
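To make that concrete, here is a rough sketch of what those numbers could look like as a readiness probe on a proxy Pod. The /ready path and port 8000 are assumptions based on the admin server defaults (check the admin docs), and the values are just the ones suggested above, not tested defaults:

# Illustrative only: fast readiness on success, slow removal on failure,
# so a proxy retrying its provider connection isn't dropped from the
# load balancer immediately.
readinessProbe:
  httpGet:
    path: /ready        # assumed admin endpoint; check the admin docs
    port: 8000          # assumed default admin port
  periodSeconds: 1      # probe every second (the minimum)
  successThreshold: 1   # become Ready again after a single success
  failureThreshold: 30  # tolerate ~30s of failures before going non-Ready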
Or should it eventually stop processing (Ready: false) altogether, because continuing to process data with stale configuration data would be bad?
It should eventually stop processing, because it will not be able to serve new servers. A proxy should have a long gracefulTerminationSeconds so that it has plenty of time to continue to serve existing servers and recover. It's not just that processing with stale data would be bad; it will not be able to process any new traffic if it doesn't have a working provider.
we have some good controls at hand, on the Kubernetes side.
Relatedly, this is another place where having #673 would be useful. We'd be able to define good defaults for these thresholds, instead of requiring users to figure it out themselves.
It should eventually stop processing, because it will not be able to serve new servers.
Okay, cool - so we are definitely saying that if a relay goes down, at some point any connected proxies should stop serving traffic altogether. (similar for other connected scenarios).
Relatedly, this is another place where having #673 would be useful. We'd be able to define good defaults for these thresholds, instead of requiring users to figure it out themselves.
We can handle some of this with our examples, but we're definitely entering a land where Helm charts are starting to get useful (especially when combo'd with rbac rules etc).
I also feel like it's probably game dependent (most likely tied to session length), so documentation on when and where a Ready would end up being false is very important here.
For example, if I have a game session that lasts <=30m, I would probably want my proxies to end up non-Ready after > 30m (maybe 35m?) to ensure any existing sessions have time to finish. After that, if the proxy is not getting any new data, keeping it running becomes a risk, and it should be shut down at that point.
For something like a lobby server or something longer running, I expect the window before removal would end up being much longer (a few hours? a day?).
A proxy should have a long gracefulTerminationSeconds so that it has plenty of time to continue to serve existing servers and recover.
gracefulTerminationSeconds has more to do with things like node shutdowns, upgrades, rolling update deployments etc -- so did you mean this in more of a "we should have more documentation around termination" (100% agree btw), or I may be missing a point here on liveness/readiness.
Also another great reason for some suggested values / Helm charts here as well.
I also feel like it's probably game dependent (most likely tied to session length), so documentation on when and where a Ready would end up being false is very important here.
Definitely, if we were building a Helm chart, I would expose an expectedSessionLength variable, and use that as the basis of the calculations we set for thresholds.
For example, if I have a game session that lasts <=30m, I would probably want my proxies to end up non-Ready after > 30m (maybe 35m?) to ensure any existing sessions have time to finish
You probably want it to be at least 1.5x to 2.5x your session length, because you want to account for variability in session length, and a session starting at the same time that the provider went down.
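As a purely hypothetical sketch of that calculation (expectedSessionLength and sessionMultiplier are made-up Helm values, not existing chart options), using the 30 minute session from the earlier example and a 2x multiplier:

# Hypothetical Helm values -- nothing here exists in a published chart.
# With a 30 minute session and a 2x multiplier, the budget before going
# non-Ready is 1800s * 2 = 3600s of provider downtime.
expectedSessionLength: 1800   # seconds
sessionMultiplier: 2          # somewhere in the 1.5x - 2.5x range

# Rendered readiness values: failureThreshold = (1800 * 2) / periodSeconds
readinessProbe:
  periodSeconds: 10       # probing every 1s for an hour-long window is overkill
  failureThreshold: 360   # 360 * 10s = 3600s tolerated before non-Ready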
so did you mean this in more of a "we should have more documentation around termination" (100% agree btw), or I may be missing a point here on liveness/readiness.
Sorry, I meant the similarly named terminationGracePeriodSeconds.
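For reference, that field sits on the Pod spec rather than on the probes; a minimal sketch with a purely illustrative value and a hypothetical container name:

# Illustrative only: give existing sessions time to drain after SIGTERM,
# e.g. roughly one 30 minute session length.
spec:
  terminationGracePeriodSeconds: 1800
  containers:
    - name: quilkin-proxy   # hypothetical container name
      # ...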
Only thing blocking for me then (after the recent discussion) is updates to documentation at https://googleforgames.github.io/quilkin/main/book/deployment/admin.html on readiness and liveness, with potentially some good default values.
I'm happy to take those values and port them into examples as well.
This implements the integration tests for `quilkin relay` and `quilkin agent` for Agones integration. This includes refactoring of the integration test libraries to consolidate and improve aspects of orchestrating and running proxies in a Kubernetes cluster for integration tests. Also includes a bug fix on readiness for Agent that will eventually be replaced by #802, but allows for tests to pass. Closes #806
Co-authored-by: Mark Mandel <[email protected]>
Build Succeeded 🥳 Build Id: e2da28dc-311c-4a47-a0b5-cb557c379168 The following development images have been built, and will exist for the next 30 days: To build this version:
This PR refactors the health checks so that we have service-specific health checks, such that a service only marks itself as ready if its provider has given a valid response. The agent service is only healthy when it has an active connection from the relay, and will retry its connection if it hasn't heard from it in a while.