
Add health checks for each service #802

Merged: 5 commits into main from ep/add-health-checks on Oct 11, 2023

Conversation

XAMPPRocky (Collaborator)

This PR refactors the health checks so that we have service-specific health checks, such that a service only marks itself as ready if its provider has given a valid response. The agent service is only healthy when it has an active connection from the relay, and will retry its connection if it hasn't heard from it in a while.

Review threads:

src/cli/admin.rs (outdated, resolved)
src/cli/proxy.rs (resolved)
src/cli/relay.rs (resolved)
src/config/watch/agones.rs (resolved), with the discussion below on this snippet:

```rust
.read()
.as_deref()
.unwrap()
.store(true, Ordering::SeqCst);
```

Member

With the current implementation, this will drop a proxy connected to a management server in and out of the load balancer, since liveness is tied to readiness - and that seems less than ideal.

The proxy should still process packets even if it's retrying the connection - up until the point where it has to drop entirely to restart.

As a note - with this change, if a backing xDS server goes down, this will take down all clients (proxies, relays, etc) that are connected to it when they eventually restart. Are we happy with that? That's a pretty big area of effect (this reads like a potential regional outage to me).

Or maybe this is more of a documentation issue? i.e. "make sure you set your health check thresholds long enough to allow the system to retry connections"?

Member

So just to reiterate this properly, now that my understanding is more correct:

On initial creation of the Quilkin process, we don't become Ready until we have a successful connection to whatever backing process we are working with (xDS, K8s, whatever).

But if that connection drops, eventually all Quilkin instances handling that connection will take themselves out of the load balancer - which means the whole system stops processing data - and we have a single point of failure that can cause a regional outage.

Is that okay?

For example: if you had a Proxy connected to a Relay, and the Relay goes down, should the proxies continue to process data packets for as long as it takes for the Relay to come back up? Or should it eventually stop processing (Ready: false) altogether, because continuing to process data with stale configuration data would be bad?

Assuming we want a Proxy to eventually stop serving traffic if a relay goes down, we have some good controls at hand on the Kubernetes side. Since we would probably want fast readiness on success, I expect we would advocate for a periodSeconds of 1 (the minimum) with a successThreshold of 1, but a failureThreshold of something like 30 (or some other larger number) - so end users can control this.
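
To make that concrete, something roughly like the sketch below is what I have in mind; it's only a sketch, and I'm assuming the admin server's default port and /ready path here (worth double checking against the admin docs):

```yaml
# Sketch only - the port and path assume the admin server defaults.
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  periodSeconds: 1      # probe every second, so recovery is picked up quickly
  successThreshold: 1   # a single good response marks the proxy Ready again
  failureThreshold: 30  # ~30s of consecutive failures before it is pulled from the load balancer
```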

I just want to talk this through, since the implications here can be wide, so we would need to document this really well.

XAMPPRocky (Collaborator, Author) Oct 9, 2023

> Or should it eventually stop processing (Ready: false) altogether, because continuing to process data with stale configuration data would be bad?

It should eventually stop processing, because it will not be able to serve new servers. A proxy should have a long gracefulTerminationSeconds so that it has plenty of time to continue to serve existing servers and recover. It's not just that processing with stale data would be bad; it will not be able to process any new traffic if it doesn't have a working provider.

> we have some good controls at hand on the Kubernetes side.

Relatedly, this is another place where having #673 would be useful. We'd be able to define good defaults for these thresholds, instead of requiring users to figure them out themselves.

Member

> It should eventually stop processing, because it will not be able to serve new servers.

Okay, cool - so we are definitely saying that if a relay goes down, at some point any connected proxies should stop serving traffic altogether. (similar for other connected scenarios).

> Relatedly, this is another place where having #673 would be useful. We'd be able to define good defaults for these thresholds, instead of requiring users to figure them out themselves.

We can handle some of this with our examples, but we're definitely entering a land where Helm charts are starting to get useful (especially when combined with RBAC rules etc).

I also feel like it's probably game dependent (most likely tied to session length), so documentation on when and where Ready would end up being false is very important here.

For example, if I have a game session that lasts <=30m, I would probably want my proxies to end up non-Ready after >30m (maybe 35m?) to ensure any existing sessions have time to finish. After that, if the proxy is not getting any new data, keeping it running becomes a risk, and it should be shut down at that point.

For something like a lobby server or something longer running, removal would probably end up being much longer, I expect (a few hours? a day?).

> A proxy should have a long gracefulTerminationSeconds so that it has plenty of time to continue to serve existing servers and recover.

gracefulTerminationSeconds has more to do with things like node shutdowns, upgrades, rolling update deployments etc -- so did you mean this in more of a "we should have more documentation around termination" sense (100% agree btw), or am I missing a point here on liveness/readiness?

Also another great reason for some suggested values / Helm charts here as well.

XAMPPRocky (Collaborator, Author)

> I also feel like it's probably game dependent (most likely tied to session length), so documentation on when and where Ready would end up being false is very important here.

Definitely, if we were building a Helm chart, I would expose an expectedSessionLength variable, and use that as the basis of the calculations we set for thresholds.

> For example, if I have a game session that lasts <=30m, I would probably want my proxies to end up non-Ready after >30m (maybe 35m?) to ensure any existing sessions have time to finish

You probably want it to be at least 1.5x to 2.5x your session length, because you want to account for variability in session length, and for a session starting at the same time that the provider went down.
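
As a rough sketch of what that could look like (the chart and the expectedSessionLength value are hypothetical, and I've pre-computed the numbers for a 30 minute session, a 10s probe period, and a 2x multiplier):

```yaml
# values.yaml (hypothetical chart value)
expectedSessionLength: 1800   # seconds

# readiness probe the chart might render from it
# (2 * 1800s / 10s period = 360 failed probes before non-Ready)
readinessProbe:
  httpGet:
    path: /ready    # assumes the admin server's readiness endpoint
    port: 8000
  periodSeconds: 10
  failureThreshold: 360
```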

> so did you mean this in more of a "we should have more documentation around termination" sense (100% agree btw), or am I missing a point here on liveness/readiness?

Sorry, I meant the similarly named terminationGracePeriodSeconds.
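
i.e. something like this at the pod level (the value is only illustrative; it would be sized so in-flight sessions have time to drain):

```yaml
spec:
  terminationGracePeriodSeconds: 3600   # illustrative value
  containers:
    - name: quilkin
      image: quilkin:example            # placeholder image
```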

markmandel (Member) left a comment

Only thing blocking for me then (after the recent discussion) is updates to documentation at https://googleforgames.github.io/quilkin/main/book/deployment/admin.html on readiness and liveness, with potentially some good default values.

I'm happy to take those values and port them into examples as well.

markmandel added a commit to markmandel/quilkin that referenced this pull request Oct 9, 2023
This implements the integration tests for `quilkin relay` and `quilkin
agent` for Agones integration.

This includes refactoring of the integration test libraries to
consolidate and improve aspects of orchestrating and running
proxies in a Kubernetes cluster for integration tests.

Also includes a bug fix on readiness for Agent that will eventually be
replaced by googleforgames#802, but allows for tests to pass.

Closes googleforgames#806
XAMPPRocky pushed a commit that referenced this pull request Oct 10, 2023
This implements the integration tests for `quilkin relay` and `quilkin
agent` for Agones integration.

This includes refactoring of the integration test libraries to
consolidate and improve aspects of orchestrating and running
proxies in a Kubernetes cluster for integration tests.

Also includes a bug fix on readiness for Agent that will eventually be
replaced by #802, but allows for tests to pass.

Closes #806

quilkin-bot (Collaborator)

Build Succeeded 🥳

Build Id: e2da28dc-311c-4a47-a0b5-cb557c379168

The following development images have been built, and will exist for the next 30 days:

To build this version:

```sh
git fetch git@github.com:googleforgames/quilkin.git pull/802/head:pr_802 && git checkout pr_802
cargo build
```

XAMPPRocky merged commit b7a6c51 into main Oct 11, 2023
5 checks passed
XAMPPRocky deleted the ep/add-health-checks branch October 11, 2023 07:19
markmandel added the kind/feature (New feature or request) label Oct 13, 2023
Labels: kind/feature (New feature or request), size/l
4 participants