Implement a control plane #131
I'm thinking it makes sense to break things down into the following scenarios, and work things through to make sure there is the most generic applicability.
I'm trying to think if Agones player tracking helps here, but I don't think it does.
I wonder if this matters? As long as the reconciliation timeframe is a matter of seconds, and the initial connection timeout on the game takes that into account, it's probably not a concern? One other thought - can Open Match use the control plane to be aware of all the proxies? It will need to send that information to the game client as well.
1 + 3 sounds like primarily the control plane needs to keep track of gameserver churn (from its point of view, say allocated vs not allocated) and send those changes as add/delete events to the proxies.
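To make that concrete, here is a minimal sketch (in Go, with made-up type and field names - nothing like this exists in Quilkin yet) of the kind of add/delete event the control plane could push to proxies:

```go
package events

// EventKind says whether an endpoint was added or removed.
// The names here are illustrative, not an existing API.
type EventKind int

const (
	EndpointAdded EventKind = iota
	EndpointDeleted
)

// EndpointEvent is one unit of game server churn sent to the proxies:
// the allocated game server's address plus the tokens allowed to reach it.
type EndpointEvent struct {
	Kind    EventKind
	Address string   // e.g. "10.0.0.12:7777"
	Tokens  [][]byte // connection tokens for players assigned to this server
}
```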
I think from a game client's point of view this would be a bug or annoyance, since we should only drop packets due to network or config issues - requiring clients to update their code might not be easy, especially if they're doing fire-and-forget with no connection/retries, which sounds likely since it's all UDP?
Yeah it sounds like in some cases this information could be needed. Some cases where it might not be:
As a fallback, we could probably add a gRPC endpoint to the control plane that streams updates when a proxy joins/leaves.
Because then their authentication token will need to be added to the endpoint, so they have access to that game server.
I agree - but it might be worth waiting and seeing real data before implementing a solution. It may not actually be a big deal. Or it may. Just thinking it's the sort of thing we should probably have some real data on before making decisions.
Since you can't assume that a UDP packet ever reaches its destination - the packets will have to retry by default until they receive their ACK response. I would expect that determining "no response" is more of an "I didn't get a value back in x time" rather than being able to rely on something stable like a TCP direct connection. I'm just positing that a delay in our control plane is technically no different from a small network blip, so it may well be handled anyway.
That is possible - but seems unlikely. If we're talking about a global game - proxies are distributed around the world. So I think we should assume separate clusters by default.
This would be the case for us at Embark: we have game server clusters spread out in different regions, but a game only takes place in exactly one of those, so we don't have a need for inter-cluster/region communication. To clarify what I have in mind for a setup:
Ideally all of these physically run in the same cluster. In a case like ours at Embark, we can still create a cluster (GS, proxies, control plane) in multiple regions and have them independent of each other. For use cases that do need to be distributed, the same initial setup can still be used: say, for example, a game client that should talk to a proxy in a different cluster or region than the game server it's assigned to.
It shouldn't matter where OM runs since it shouldn't need to talk to the control plane.
I think this ties closely to how the control plane discovers game server addresses? At the time of allocating a GS, the tokens should be known, so the control plane should be able to watch a single resource that contains both token and address (e.g. appending the tokens to the GameServer, Endpoint or some other object). I.e. OM provides tokens when requesting a GS, and the service it talks to updates the appropriate object with both pieces of info - otherwise we'd need to think about races.
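As a rough illustration of that shape, a minimal sketch assuming the tokens are stored as a comma-separated annotation on the Agones `GameServer` (the annotation key is made up for this example):

```go
package discovery

import (
	"strings"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
)

// tokensAnnotation is a made-up key for this sketch; the real key/format
// would be whatever the allocation service writes.
const tokensAnnotation = "quilkin.dev/connection-tokens"

// Endpoint is what the control plane would hand to the proxies.
type Endpoint struct {
	Address string
	Port    int32
	Tokens  []string
}

// endpointFromGameServer extracts the proxy-facing endpoint from an
// allocated GameServer: its address/port from the status, and the
// connection tokens from an annotation written at allocation time.
func endpointFromGameServer(gs *agonesv1.GameServer) Endpoint {
	ep := Endpoint{Address: gs.Status.Address}
	if len(gs.Status.Ports) > 0 {
		ep.Port = gs.Status.Ports[0].Port
	}
	if raw, ok := gs.Annotations[tokensAnnotation]; ok && raw != "" {
		ep.Tokens = strings.Split(raw, ",")
	}
	return ep
}
```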
I think we have an interesting difference of opinions 💡 . Correct me if I'm wrong though! I'm assuming each player has their own identifying token. It sounds like you are thinking there is a single token for each game server? Is that correct? If so, that explains differences in architectural approaches.
I don't think that's the case - I also had in mind multiple tokens per gameserver. My OM knowledge is still minimal currently, but my assumption was that at the time OM decides to allocate a GS, it already knows what players to assign to that particular GS (that should be the case, right?), and as a result all tokens should be known or generatable at that point - regardless of whether it's one token per player. E.g. when OM requests that a GS is allocated, it also provides any tokens alongside.
Ah awesome, then we are on the same page!
Aha! Yes, this would be true in some games, but definitely not all. Some games will allow you to backfill game servers that are currently active - for example, adding new players (and therefore new tokens) to a game server after it has already been allocated and a match is in progress.
More details on Open Match: googleforgames/open-match#1240. So Open Match (or any matchmaker) will need to be part of the process of allowing access to specific game servers.
Ah, I see! That's good to know, I'll take a look at that issue!
👍 this clarifies why a player being assigned to a server is important to the control plane! Essentially, at any time we should be able to change the set of tokens that are associated with the game server, if I understand correctly. That should work mostly the same as my previous proposal: the control plane still watches the associated k8s resources containing the tokens & addresses (not necessarily the same resource) and updates the proxies whenever there are changes.
You are right - if you stored the tokens on the Agones GameServer (maybe also through something the SDK could change as well, if players disconnect?) it could be the central repository for the information? Combine this with something like googleforgames/agones#1239 -- and/or some of the tools in https://agones.dev/site/docs/third-party-content/libraries-tools/ -- that could work out? I have some worries about some race condition type stuff though, but they can likely be worked through? One thought - depending on how much performance we get out of the proxy - it is possible that you could serve multiple Game Server clusters from a single set of proxies, maybe with some kind of message queue moving the data between clusters?
Yeah, implementation wise I would imagine there would be some sort of provider-like interface, such that the control plane's actual logic is separate from the source of events - e.g. polling the k8s API vs some other endpoint. Then adding a new source can be a matter of adding a struct that can poll that source for info and convert it to a set of addresses and tokens that the control plane consumes.
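Something like this minimal sketch (the names are made up for illustration):

```go
package controlplane

import "context"

// Endpoint is the unit the control plane pushes to proxies: a game
// server address plus the connection tokens allowed to reach it.
type Endpoint struct {
	Address string
	Tokens  []string
}

// Provider abstracts where endpoint information comes from, so the
// control plane's logic stays independent of the event source
// (Agones/k8s API polling, a message queue, a file, ...).
type Provider interface {
	// Watch streams the full desired set of endpoints whenever it changes,
	// until ctx is cancelled.
	Watch(ctx context.Context) (<-chan []Endpoint, error)
}
```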
😮 which race conditions?
I would expect the proxies not to be an issue perf-wise, but rather latency if the clusters are far enough from them - would be interesting to find out for sure. What did you have in mind re the message queue, what's sent on the queues?
Sounds like a message queue of some kind would be good. https://github.com/Octops/agones-event-broadcaster may be a good fit.
For example: if a game server binary tracks that a client has disconnected / been kicked out of a game, it will want to remove that access token from itself, and have that propagate out to the proxy. One way it could do that is through an annotation it edits through the Agones SDK. If, at the same time, a re-allocation happens that adds several new players and updates the same annotation, the SDK may still have the old list rather than the new one, and overwrite the new player connection data. In the SDK, there's no concept of an increment / delta change -- but maybe that's something we can add to Agones itself. If we had a "delta change" operation on the SDK for labels and annotations, Kubernetes generational resource locking would save us here (although I'm not 100% sure how that would work, but we can probably come up with something 🤔).
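To illustrate that race in (hypothetical) code: both writers compute the full token list from a stale read and write the whole value back, so whichever `SetAnnotation` lands last silently discards the other's change. A sketch using the Agones Go SDK; the annotation key and the comma-separated-list format are assumptions for the example:

```go
package main

import (
	"log"
	"strings"

	sdk "agones.dev/agones/sdks/go"
)

// Key and value format are made up for this sketch. (Agones also prefixes
// SDK-set annotation keys on the underlying GameServer object.)
const tokensKey = "player-tokens"

// writeTokens overwrites the whole token list in one annotation.
// SetAnnotation is a full value swap, so two callers that each computed
// their list from a stale read will clobber each other: last one in wins,
// and the other writer's add/remove is silently lost.
func writeTokens(s *sdk.SDK, tokens []string) {
	if err := s.SetAnnotation(tokensKey, strings.Join(tokens, ",")); err != nil {
		log.Printf("failed to set annotation: %v", err)
	}
}

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to the Agones SDK server: %v", err)
	}

	// Writer A (game server binary): removes a disconnected player's token
	// based on the list it read earlier.
	writeTokens(s, []string{"tokenB", "tokenC"})

	// Writer B (allocation path): adds new players based on an equally old
	// read that still contained tokenA - this write wins and resurrects it.
	writeTokens(s, []string{"tokenA", "tokenB", "tokenC", "tokenD"})
}
```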
From a GCP perspective - I could have 3 clusters running in the same GCP region/zone, depending on the size of my game. So latency shouldn't be an issue at that point.
So the project I mentioned sends out GameServer changes over pubsub (or other message queues) - so our xDS service could subscribe to that, and translate that into our xDS format of choice, to be sent out to all proxies.
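A minimal sketch of that subscription side, assuming GCP Pub/Sub with made-up project/subscription names and a guessed message shape (the broadcaster's actual payload format would drive the real decoding):

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"cloud.google.com/go/pubsub"
)

// gameServerEvent is a guess at the interesting bits of a broadcast
// message: which GameServer changed and its current address.
type gameServerEvent struct {
	Name    string `json:"name"`
	Address string `json:"address"`
	Deleted bool   `json:"deleted"`
}

func main() {
	ctx := context.Background()

	// Project and subscription IDs are placeholders for this sketch.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatalf("pubsub client: %v", err)
	}
	sub := client.Subscription("gameserver-events")

	// Receive blocks, invoking the callback for each broadcast message.
	// A real implementation would translate each event into an xDS
	// snapshot update instead of just logging it.
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		var ev gameServerEvent
		if err := json.Unmarshal(m.Data, &ev); err != nil {
			log.Printf("skipping undecodable message: %v", err)
			m.Nack()
			return
		}
		log.Printf("gameserver %q changed (address=%q, deleted=%v)", ev.Name, ev.Address, ev.Deleted)
		m.Ack()
	})
	if err != nil {
		log.Fatalf("receive: %v", err)
	}
}
```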
k8s APIs have support for optimistic locking to avoid these types of issues - there's the `resourceVersion` field on every object, for example, which makes conflicting writes fail so the caller can re-read and retry.
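For reference, the usual client-go pattern for this is `retry.RetryOnConflict`: re-read the object, re-apply the change, and write again if the update hits a conflict. A minimal sketch (the annotation key is made up, and the Get/Update signatures shown are those of the current Agones generated clientset):

```go
package main

import (
	"context"

	versioned "agones.dev/agones/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"
)

// addToken re-reads the GameServer and re-applies the change on every
// attempt, so a conflict caused by a concurrent writer never overwrites
// that writer's update with stale data.
func addToken(ctx context.Context, client versioned.Interface, ns, name, token string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		gs, err := client.AgonesV1().GameServers(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if gs.Annotations == nil {
			gs.Annotations = map[string]string{}
		}
		// Recompute the value from the freshly read object: append the new
		// token to the comma-separated list (annotation key made up here).
		const key = "quilkin.dev/connection-tokens"
		if existing := gs.Annotations[key]; existing != "" {
			gs.Annotations[key] = existing + "," + token
		} else {
			gs.Annotations[key] = token
		}
		_, err = client.AgonesV1().GameServers(ns).Update(ctx, gs, metav1.UpdateOptions{})
		return err
	})
}
```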
Ah yes, this would be possible with the provider interface.
You are correct about the locking (it is the default, and you can somewhat opt out of it with patch statements, but even then they can sometimes fail with a conflict error). It's tricky if you are doing some kind of increment operation - i.e. if you had a single annotation whose value is the whole list of tokens, the SDK can only swap in a complete new value, so two writers updating that list can stomp on each other. One way you could handle this though is to make every connection token unique as an annotation! Then Agones doesn't have to worry about deltas! Adding an extra token is just adding another annotation key, rather than rewriting a shared list.
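A minimal sketch of that idea via the Agones Go SDK (the key prefix and value are made up; note Agones also applies its own prefix to SDK-set annotation keys on the underlying GameServer object):

```go
package main

import (
	"log"

	sdk "agones.dev/agones/sdks/go"
)

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to the Agones SDK server: %v", err)
	}

	// Each connection token gets its own annotation key, so adding a
	// player never rewrites a shared list - no delta logic and no lost
	// updates between concurrent writers.
	for _, token := range []string{"tokenA", "tokenB"} {
		if err := s.SetAnnotation("token-"+token, "allowed"); err != nil {
			log.Printf("failed to record token %q: %v", token, err)
		}
	}
}
```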
I didn't catch the problem from the example - if the first attempt fails then the writer could call the API again with the latest version of the object and retry the change?
The way `SDK.SetAnnotation()` works, it can only do complete value swaps for labels and annotations - so it's a case of last-one-in-wins - it has no concept of list deltas. So if, while it tries to set an annotation value, another writer has changed that value in the meantime, the retry just re-applies the original value and the other writer's change is lost. So either Agones would need to add the concept of list deltas to its SDK for annotations (and is that a good idea, maybe? What format, etc.? Maybe you can provide JSON patches to JSON values stored in annotations?), or you would need to do individual unique keys for the appropriate annotation.
If I understand correctly, the issue is more of an API limitation on the SDK, which currently only lets the caller pass a fixed value rather than recompute it when a retry happens?
It does retry, but it's only ever going to retry with the initial value that was passed.
I see, can we add support for this to the SDK? E.g. a new API where the caller passes in a function to recompute a new value on retry.
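Purely as a hypothetical shape for such an API (nothing like this exists in the Agones SDK today):

```go
package sdkidea

// AnnotationUpdater is a hypothetical SDK surface: instead of taking a
// fixed value, UpdateAnnotation takes a function that recomputes the value
// from the current one, and it is re-invoked whenever a write conflicts.
type AnnotationUpdater interface {
	UpdateAnnotation(key string, recompute func(current string) string) error
}

// Example caller: append a token to whatever the list looks like at
// write time, so a conflicting concurrent change is never overwritten.
func addToken(u AnnotationUpdater, token string) error {
	return u.UpdateAnnotation("player-tokens", func(current string) string {
		if current == "" {
			return token
		}
		return current + "," + token
	})
}
```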
That's an interesting question. The SDK is a wrapper around a gRPC client, and the gRPC server talks to K8s -- so it'll be tricky to have a callback here, although not impossible.
The xDS control planes out there today seem to be directed at proxying HTTP and TCP traffic, so it seems simpler to roll our own rather than attempt to adapt one of them to work for our use case.
go-control-plane seems to make writing an xDS control plane really easy, thankfully - it handles running a gRPC server that speaks the xDS protocol with proxies, and is backed by a cache which an implementation needs to populate.
The rough workflow would be that our code finds out what gameservers/upstream endpoints are available in the cluster and updates the cache. Each Quilkin proxy then watches resources by contacting go-control-plane's gRPC server, which feeds it data from the cache whenever it is updated.
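A minimal sketch of that skeleton with go-control-plane's v3 APIs (the port is arbitrary, and building the snapshot itself is left out since the snapshot-construction helpers have changed signature across go-control-plane versions):

```go
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"

	discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
)

func main() {
	ctx := context.Background()

	// The snapshot cache holds per-node xDS resources; our discovery code
	// (Agones watcher, pubsub subscriber, ...) would write endpoint
	// snapshots into it as game servers come and go.
	snapshotCache := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, nil)

	// The xDS server streams whatever is in the cache to connected proxies.
	xdsServer := serverv3.NewServer(ctx, snapshotCache, nil)

	grpcServer := grpc.NewServer()
	discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, xdsServer)

	lis, err := net.Listen("tcp", ":18000")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	log.Println("xDS control plane listening on :18000")
	if err := grpcServer.Serve(lis); err != nil {
		log.Fatalf("grpc serve: %v", err)
	}
}
```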
Currently wondering what a simple workflow would be with e.g. Agones + Open Match - say a loop that watches `GameServer`s that have been `Allocated` and populates the cache with their addresses as `Endpoint`s. One difference in this case is that we'll need to ensure that all connected proxies have acknowledged the update containing a `GameServer`'s address before Open Match lets any game client talk to them, otherwise a race condition can cause any initial packets from clients to be dropped, since the proxies won't know about the new address. So this would need some kind of synchronization between the control plane and Open Match? (Say a CRD that will be updated by the control plane server and watched by the director?)
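A sketch of that loop, polling for simplicity (a real implementation would use an informer/watch); the listing calls shown are the Agones generated clientset, and the actual cache update is elided since it depends on the xDS resource types chosen:

```go
package main

import (
	"context"
	"log"
	"time"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	versioned "agones.dev/agones/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("cluster config: %v", err)
	}
	client, err := versioned.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("agones client: %v", err)
	}

	ctx := context.Background()
	for {
		// List every GameServer and keep only the Allocated ones; their
		// addresses are what we'd turn into Endpoint resources in the
		// go-control-plane snapshot cache.
		list, err := client.AgonesV1().GameServers("default").List(ctx, metav1.ListOptions{})
		if err != nil {
			log.Printf("list gameservers: %v", err)
		} else {
			for _, gs := range list.Items {
				if gs.Status.State == agonesv1.GameServerStateAllocated {
					log.Printf("allocated: %s -> %s", gs.Name, gs.Status.Address)
					// TODO: build and set the xDS snapshot from these addresses.
				}
			}
		}
		time.Sleep(2 * time.Second)
	}
}
```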
Would this make sense? Is this missing something? Thoughts?