Add health checks for each service #802
.read()
.as_deref()
.unwrap()
.store(true, Ordering::SeqCst);
With the current implementation, this will drop a proxy connected to a management server in and out of the load balancer, since liveness is tied to readiness - and that seems less than ideal.
The proxy should still process packets even while it's retrying the connection - up until the point where it has to drop entirely to restart.
As a note - with this change, if a backing xDS server goes down, this will take down all clients (proxies, relays, etc) that are connected to it when they eventually restart. Are we happy with that? That's a pretty big area of effect (this reads like a potential regional outage to me).
Or maybe this is more of a documentation issue? i.e. "make sure you set health checks long enough to allow for the system to retry connections"?
So just to reiterate this properly, now that my understanding is more correct:
On initial creation of the Quilkin process, we don't report Ready until we have a successful connection to whatever backing process we are working with (xDS, K8s, whatever).
But if that connection drops, eventually all Quilkin instances handling that connection will take themselves out of the load balancer - which means the whole system stops processing data - and we have a single point of failure that can cause a regional outage.
Is that okay?
For example: if you had a Proxy connected to a Relay, and the Relay goes down, should the proxies continue to process data packets for as long as it takes for the Relay to come back up? Or should it eventually stop processing (Ready: false) altogether, because continuing to process data with stale configuration data would be bad?
Assuming we want a Proxy to eventually stop serving traffic if a relay goes down, we have some good controls at hand on the Kubernetes side. Since we would probably want fast readiness on success, I expect we would advocate for a periodSeconds of 1 (the minimum) with a successThreshold of 1, but a failureThreshold of something like 30 (or some other larger number) - so end users can control this.
I just want to talk this through, since the implications here can be wide, so we would need to document this really well.
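To make that concrete, here is a rough sketch of what those numbers could look like as a readiness probe on a proxy Pod. The /ready path and port 8000 are assumptions based on the admin server defaults (check the admin docs), and the values are just the ones suggested above, not tested defaults:

# Illustrative only: fast readiness on success, slow removal on failure,
# so a proxy retrying its provider connection isn't dropped from the
# load balancer immediately.
readinessProbe:
  httpGet:
    path: /ready        # assumed admin endpoint; check the admin docs
    port: 8000          # assumed default admin port
  periodSeconds: 1      # probe every second (the minimum)
  successThreshold: 1   # become Ready again after a single success
  failureThreshold: 30  # tolerate ~30s of failures before going non-Ready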
Or should it eventually stop processing (Ready: false) altogether, because continuing to process data with stale configuration data would be bad?
It should eventually stop processing, because it will not be able to serve new servers. A proxy should have a long gracefulTerminationSeconds so that it has plenty of time to continue to serve existing servers and recover. It's not just that processing with stale data would be bad; it will not be able to process any new traffic if it doesn't have a working provider.
we have some good controls at hand, on the Kubernetes side.
Relatedly, this is another place where having #673 would be useful. We'd be able to define good defaults for these thresholds, instead of requiring users to figure it out themselves.
It should eventually stop processing, because it will not be able to serve new servers.
Okay, cool - so we are definitely saying that if a relay goes down, at some point any connected proxies should stop serving traffic altogether. (similar for other connected scenarios).
Relatedly, this is another place where having #673 would be useful. We'd be able to define good defaults for these thresholds, instead of requiring users to figure it out themselves.
We can handle some of this with our examples, but we're definitely entering a land where Helm charts are starting to get useful (especially when combo'd with rbac rules etc).
I also feel like it's probably game dependent (most likely tied to session length), so documentation on when and where a Ready would end up being false is very important here.
For example, if I have a game session that lasts <=30m, I would probably want my proxies to end up non-Ready after > 30m (maybe 35m?) to ensure any existing sessions have time to finish. After that, if the proxy is not getting any new data, keeping it running becomes a risk, and it should be shut down at that point.
For something like a lobby server or something longer running, I expect the window before removal would end up being much longer (a few hours? a day?).
A proxy should have a long gracefulTerminationSeconds so that it has plenty of time to continue to serve existing servers and recover.
gracefulTerminationSeconds has more to do with things like node shutdowns, upgrades, rolling update deployments etc -- so did you mean this in more of a "we should have more documentation around termination" (100% agree btw), or I may be missing a point here on liveness/readiness.
Also another great reason for some suggested values / Helm charts here as well.
I also feel like it's probably game dependent (most likely tied to session length), so documentation on when and where a Ready would end up being false is very important here.
Definitely, if we were building a Helm chart, I would expose an expectedSessionLength variable, and use that as the basis of the calculations we set for thresholds.
For example, if I have a game session that lasts <=30m, I would probably want my proxies to end up non-Ready after > 30m (maybe 35m?) to ensure any existing sessions have time to finish
You probably want it to be at least 1.5x to 2.5x your session length, because you want to account for variability in session length, and a session starting at the same time that the provider went down.
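As a purely hypothetical sketch of that calculation (expectedSessionLength and sessionMultiplier are made-up Helm values, not existing chart options), using the 30 minute session from the earlier example and a 2x multiplier:

# Hypothetical Helm values -- nothing here exists in a published chart.
# With a 30 minute session and a 2x multiplier, the budget before going
# non-Ready is 1800s * 2 = 3600s of provider downtime.
expectedSessionLength: 1800   # seconds
sessionMultiplier: 2          # somewhere in the 1.5x - 2.5x range

# Rendered readiness values: failureThreshold = (1800 * 2) / periodSeconds
readinessProbe:
  periodSeconds: 10       # probing every 1s for an hour-long window is overkill
  failureThreshold: 360   # 360 * 10s = 3600s tolerated before non-Ready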
so did you mean this in more of a "we should have more documentation around termination" (100% agree btw), or I may be missing a point here on liveness/readiness.
Sorry, I meant the similarly named terminationGracePeriodSeconds.
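For reference, that field sits on the Pod spec rather than on the probes; a minimal sketch with a purely illustrative value and a hypothetical container name:

# Illustrative only: give existing sessions time to drain after SIGTERM,
# e.g. roughly one 30 minute session length.
spec:
  terminationGracePeriodSeconds: 1800
  containers:
    - name: quilkin-proxy   # hypothetical container name
      # ...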
Only thing blocking for me then (after the recent discussion) is updates to documentation at https://googleforgames.github.io/quilkin/main/book/deployment/admin.html on readiness and liveness, with potentially some good default values.
I'm happy to take those values and port them into examples as well.
This implements the integration tests for `quilkin relay` and `quilkin agent` for Agones integration. This includes refactoring of the integration test libraries to consolidate and improve aspects of orchestrating and running proxies in a Kubernetes cluster for integration tests. Also includes a bug fix on readiness for Agent that will eventually be replaced by #802, but allows for tests to pass. Closes #806
Co-authored-by: Mark Mandel <[email protected]>
Build Succeeded 🥳 Build Id: e2da28dc-311c-4a47-a0b5-cb557c379168 The following development images have been built, and will exist for the next 30 days: To build this version:
This PR refactors the health checks so that we have service-specific health checks, such that a service only marks itself as ready if its provider has given a valid response. The agent service is only healthy when it has an active connection from the relay, and will retry its connection if it hasn't heard from it in a while.