
Proposal: RefCount allocation to host many room in same pod #1197

Closed
neuecc opened this issue Nov 29, 2019 · 20 comments
Labels
kind/feature (New features for Agones), stale (Pending closure unless there is a strong objection.), wontfix (Sorry, but we're not going to do that.)

Comments

@neuecc

neuecc commented Nov 29, 2019

Is your feature request related to a problem? Please describe.

The current Agones allocation model only supports assigning a single room per Pod.
[image: one room per Pod]

This provides a complete isolation model, but at the cost of duplicating the runtime and game engine per room:
shareable memory is lost, and each process spends much of its time idle.
If a room does not host a large number of users (as in a battle royale) but instead a small number (2-8)
with low command frequency, we could host multiple rooms on one game engine (pod).

[image: multiple rooms per Pod]

Rooms would not share state, but would share the execution engine (immutable game data loaded in memory, runtime JIT output, dynamic assemblies, etc.).
This can result in significant cost savings.

We're creating a mobile game that runs with a .NET Core server and a Unity client.
On the server we're using MagicOnion, our open-source network engine built on .NET Core/gRPC/HTTP2.
The cost of a single room's game loop on the server is relatively small, so many rooms can coexist on a single engine.
We want to host this real-time server on Agones, but we need to host many rooms in the same pod to reduce costs.

Many mobile games require a stateful, realtime server, but don't generate heavy traffic per room.
Since the runtime cost dominates in that case, we want to host multiple rooms per runtime.
(If the runtime cost were relatively low, a single room per pod would be fine.)

Describe the solution you'd like

I suggest RefCount (virtual allocation).
Allocate requests return the same pod and increment a reference count.
When a game finishes, a shutdown request decrements the reference count.
If the reference count would exceed the value set in the configuration,
a server backed by a newly created pod is returned instead.
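
A minimal sketch of the intended behavior, in Go. Every name here (VirtualAllocator, GameServer, the create hook) is a hypothetical illustration, not an existing Agones API:

```go
// Hypothetical sketch of RefCount (virtual) allocation. The capacity value
// would come from configuration; create performs a real pod-backed allocation.
package virtualalloc

import "sync"

type GameServer struct {
	Address string
	Port    int
}

type VirtualAllocator struct {
	mu       sync.Mutex
	capacity int         // max rooms per pod, from configuration
	refCount int         // rooms currently hosted on `current`
	current  *GameServer // pod currently being packed
	create   func() (*GameServer, error)
}

// Allocate returns the same pod until capacity is reached, then a new one.
func (v *VirtualAllocator) Allocate() (*GameServer, error) {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.current == nil || v.refCount >= v.capacity {
		gs, err := v.create() // real allocation backed by a new pod
		if err != nil {
			return nil, err
		}
		v.current, v.refCount = gs, 0
	}
	v.refCount++
	return v.current, nil
}

// Shutdown is called when a game finishes; it decrements the reference count.
func (v *VirtualAllocator) Shutdown() {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.refCount > 0 {
		v.refCount--
	}
}
```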

@neuecc added the kind/feature (New features for Agones) label Nov 29, 2019
@roberthbailey
Member

I've heard a few requests for this sort of enhancement so it's definitely worth exploring and seeing how it would work.

There are a few things I'm concerned about:

  1. If we want packed allocations and we keep replacing connections into the same engine, we need a way to "drain" everyone off the game server instance (e.g. for upgrades or node maintenance). With the current design this can be easily achieved using k8s primitives with the assumption that each pod has a reasonably short lifespan (order of minutes).
  2. We need a way to keep track of capacity for each engine instance (room). We could assume each virtual allocation uses the same resources, but it would probably be better to be able to do something based on player tracking.
  3. We have a single SDK for the whole pod, so we need a backwards compatible way to introduce virtual allocations into the existing SDK.
  4. We need to think hard about the lifecycle of a virtual allocation, since it will be a bit different from an allocation where we know the lifecycle of the engine is tied to the lifecycle of a container. How do healthchecks work? Is it possible to detect that one virtual allocation is unhealthy while the others on the same game server are fine? Is it possible to rectify this situation if we can detect it?

There are probably other things that will come up, but thinking through these will be a good place to start.

@logixworx

logixworx commented Dec 1, 2019

Definitely need this. I wrote a multi-room system within a single server instance for Unity/Mirror.

@roberthbailey
Member

@logixworx - Are there any details about that system that you can share (or point to any public documentation)? Do you have any requirements that weren't described in the original post? Do you have any insights that we can leverage in Agones? Do you have any thoughts about the questions I posed above?

@logixworx

My requirements are exactly as described in the original post, as well as in your response. I can't think of anything more to contribute to the discussion at the moment.

@markmandel
Member

So I have quite a few thoughts, and they come under several categories. This is a topic that has come up quite regularly over the years.

I'm going to use "Game Session" as the term for a separate session / room within a Game Server - it's the term I've grown comfortable with.

There are some implementation details below, but they're mostly there to illustrate my ideas - please consider them sacrificial drafts.

Configuration

  • How do we configure how many Sessions are available per Game Server? I feel like a simple ref count is going to be too simplistic, especially when we want to expand it down the line (feature creep over time is always a reality).
    • An easy option might be to have a configuration number attribute on GameServer CRDs.
  • To @roberthbailey's point, we likely want lifecycle states for each Session, and the ability to manage them separately.
  • A solution could be to have a GameSession CRD resource that gets created for each Session of a GameServer (a hypothetical sketch follows this list). This would put extra load on the K8s API/etcd, but would give us maximum flexibility and visibility. (Is there a max number of CRD records we can create in a K8s cluster?)
    • Each session could have Ready, Allocated, etc. states individually, and be queryable by all the tools available.
    • This gives us tracking across each GameServer->Session using standard K8s ownership metadata and labels.
    • I think we could reuse lots of the existing SDKs and life-cycle - e.g. at first pass, SDK.Ready() means that all Sessions are Ready at start, and then expand as needed with SDK.SessionReady(sessionId) (as a thought).
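
As a purely hypothetical illustration of the GameSession CRD idea, sketched in the style of Agones' existing Go API types (nothing below exists in Agones today):

```go
// Hypothetical GameSession CRD type. A GameSession would carry an owner
// reference to its backing GameServer, so sessions are garbage collected
// along with it and are queryable with standard tools.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type GameSessionState string

const (
	GameSessionStateReady     GameSessionState = "Ready"
	GameSessionStateAllocated GameSessionState = "Allocated"
)

type GameSession struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Status GameSessionStatus `json:"status"`
}

type GameSessionStatus struct {
	// State tracks each session's lifecycle independently of its GameServer.
	State GameSessionState `json:"state"`
	// GameServerName links back to the backing GameServer (also expressed
	// via owner references and labels for querying).
	GameServerName string `json:"gameServerName"`
}
```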

Data Communication Routing

  • The Game Server binary will need a way to get all its Sessions and their states, so it can take in traffic and route it to the appropriate internal game session process. This can probably be done by providing this information through the SDK.
  • I think there are two potential routing options (anyone have anything else?):
    1. A token is provided in the data packet indicating which session to send data through/back from (option 1 is sketched after this list)
    2. Have separate ports for each session, and route to a separate session process based on the port.
  • The onus is on the game developer to handle this routing (i.e. not Agones), but we should probably keep this in mind, to make sure we don't block either approach.
  • Open question - do we need to assign specific ports to specific sessions on a GameServer? I'm not sure how to do this. Or can we pass this responsibility up to the user?
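
A minimal game-side sketch of option 1, assuming the protocol prefixes each datagram with a fixed-length session token. All names are illustrative; this routing code lives in the game server binary, not in Agones:

```go
// Token-based routing (option 1): each datagram starts with a fixed-length
// session ID that selects the internal session to deliver the payload to.
package router

import (
	"log"
	"net"
)

const sessionIDLen = 16

// Route reads datagrams and hands each payload to its session's game loop.
func Route(conn *net.UDPConn, sessions map[string]chan<- []byte) {
	buf := make([]byte, 64*1024)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			log.Printf("read error: %v", err)
			return
		}
		if n < sessionIDLen {
			continue // malformed: too short to carry a session token
		}
		ch, ok := sessions[string(buf[:sessionIDLen])]
		if !ok {
			continue // unknown session token: drop the packet
		}
		payload := make([]byte, n-sessionIDLen)
		copy(payload, buf[sessionIDLen:n])
		ch <- payload
	}
}
```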

Allocation

  • Question: Should a Session allocation happen on a different API path than GameServer Allocation?
  • My current thinking is "no".
  • As we expand, we're already talking about being able to allocate based on player capacity (Player Tracking for each GameServer #1033) down the line as well. I can only assume we'll have more allocation requirements as we continue to grow.
  • So we should work out if there is a nice way we can extend the current allocation resource API to include this feature, and potentially more down the line. Let's not box ourselves into a corner with an implementation that works specifically for this, but not for others (another point against a refcount).
  • I expect there will also be good code reuse in keeping allocation down a single code path.
  • The idea with Allocation would be: when a Session is requested, pack sessions appropriately, allocate backing GameServers when needed, mark GameSessions as allocated, and return that data when done.
  • To @roberthbailey's point as well - I feel like we will need a "Drain" (or similar) state for a GameServer that basically means "stop allocating to this GameServer, and shut it down once it's empty".
    • This would also be useful for player-capacity-based allocation (although we should be smart about what "empty" means, and how it is defined).

Other Thoughts

  • Originally, I had thought this might be better as a separate framework that sat on top of Agones. The more I think about it, the more I think that's non-optimal: this functionality is better woven into Agones so that it is opt-in as needed. There is a lot of Agones core we can reuse here that we would otherwise have to rebuild, and the value isn't there.

Would love comments, thoughts and questions on the above.

@theminecoder

> How do we configure how many Sessions are available per Game Server? I feel like a simple ref count is going to be too simplistic, especially when we want to expand it down the line (feature creep over time is always a reality).

From an API standpoint it would be nice to just be able to call something like RequestSession() as often as I need. I have a system that allows us to fully clean the server after each session, so being able to run the server for as long as possible is better than having to wait for a new container to boot.

We could still have a config option for the maximum total sessions the instance is allowed to create, which would make the sidecar return an error when a session is requested after the limit has been reached. This could fit into draining/updating as well, by returning the error once the instance has been told to drain/update.

@markmandel
Member

> From an API standpoint it would be nice to just be able to call something like RequestSession() as often as I need.

What is "it" in the above? What exactly would "RequestSession()" be doing here? Is this a SDK level API, or is more something like Allocation ?

One thing I didn't mention above: I think we would also need a way to self-allocate a GameSession through the SDK - much like we do for a GameServer - so something like SDK.SessionAllocate(sessionId), to support certain workflows. Is that in the vein of what you are thinking?

Then you can move the GameSession back with SDK.SessionReady(sessionId) when you want it to return to the pool of available allocatable GameSessions.

Actually, SDK.SessionReady(sessionId) to return a session to a Ready state would be useful if it was allocated through the Allocation endpoint as well (and if all sessions are Ready, the GameServer should return to Ready too, so it could be scaled down if needed). A sketch of this hypothetical workflow follows.
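
To illustrate - SessionAllocate and SessionReady are hypothetical calls that do not exist in any Agones SDK today:

```go
package session

// SessionSDK is a hypothetical extension of the Agones SDK; neither method
// below exists in any real Agones SDK today.
type SessionSDK interface {
	SessionAllocate(sessionID string) error
	SessionReady(sessionID string) error
}

// runSession self-allocates one session, runs the match, and returns the
// session to the allocatable pool when the match ends.
func runSession(s SessionSDK, sessionID string, playMatch func()) error {
	if err := s.SessionAllocate(sessionID); err != nil { // like SDK.Allocate(), but per-session
		return err
	}
	playMatch() // game-specific loop for this session
	// If every session on this GameServer is Ready again, the GameServer
	// itself would return to Ready, so it could be scaled down if needed.
	return s.SessionReady(sessionID)
}
```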

I think we are on the same page: the assumption is that, in this instance, a GameServer will likely be up and running for longer than a single game Session 👍

@logixworx

Is there an ETA on this feature? I need it asap. Thanks!

@markmandel
Member

markmandel commented Jan 10, 2020

@logixworx no ETA as of yet. We've got a variety of users who are interested in this as well, so I would expect it to come at some point this year.

We still need to do a complete design document on this, as well as implementation - it's a pretty substantial piece of work, so I expect it will take some months to complete.

@TheCactusBlue

Any updates on this feature?

@castaneai
Collaborator

castaneai commented Oct 20, 2020

Hi.
We are developing a Custom Allocator Service to solve this problem.

First of all, our GameServer process hosts multiple Rooms.
We determine which Room a player can enter by having the connection pass a token containing a Room ID.

Our custom allocator works as follows:

Room Assignment

  • Allocate one GameServer from Agones upon receiving an allocation request. At the same time, we cache the information of the GameServer.
  • The next time we receive an allocation request, we return the same cached GameServer and increment RefCount in the custom allocator.
  • When RefCount exceeds the capacity, it allocates one GameServer from Agones again.
  • When the specification of the GameServer changes (e.g., deploying a new version), we clear the cached GameServer and reset RefCount. (This uses a GameServer informer; see the sketch after this list.)
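
That reset step might look like the following, extending the hypothetical VirtualAllocator refcount sketch from earlier in this thread; the informer wiring that would invoke it is omitted:

```go
// Hypothetical cache-reset hook, called from a GameServer informer when the
// fleet's GameServer spec changes (e.g. a new version is deployed). Field
// names follow the VirtualAllocator sketch earlier in this thread.
func (v *VirtualAllocator) OnSpecChange() {
	v.mu.Lock()
	defer v.mu.Unlock()
	// Drop the cached GameServer and reset the count; the next Allocate()
	// will request a fresh GameServer from Agones.
	v.current, v.refCount = nil, 0
}
```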

Shutdown

This approach is simple, as the custom allocator only knows about a RefCount and one cached GameServer. Even if the custom allocator loses its state, Agones will simply assign it a new GameServer; the service as a whole will not go down.

However, this approach requires the GameServer itself to implement multi-room support.
Whether Agones should provide this, or it should be developed as a separate library, is a matter of debate.

@markmandel
Member

Thanks for the feedback - we're curious to see if the updated "Re-Allocation" design we have been working on in #1239 (comment) would be useful for your needs?

> However, this approach requires the GameServer itself to implement multi-room support. Whether Agones should provide this, or it should be developed as a separate library, is a matter of debate.

Yeah, I'm not sure if Agones can help with the internals of the multi-room implementation inside the game server binary. We're an orchestration platform, so this seems like something that would be game specific and best left outside of Agones.

@castaneai
Collaborator

@markmandel I just read it - the idea of re-allocation sounds great!

Once Agones implements re-allocation, we may be able to fulfill this request without a custom allocator service.

There are a couple of things to worry about.

  • What is the behavior when a new version is deployed to the fleet?
  • Is it possible for capacity to overflow if a large number of allocation requests arrive in a short period of time, and can the k8s label mechanism withstand such spiky access?

@markmandel
Member

> What is the behavior when a new version is deployed to the fleet?

The whole structure is built so that you can safely allocate while doing rolling updates across Fleets (in fact, how Kubernetes handles resources makes this quite simple) -- so none of that changes. We also have lots of tests to ensure this stays safe.

> Is it possible for capacity to overflow if a large number of allocation requests arrive in a short period of time, and can the k8s label mechanism withstand such spiky access?

We're doing a bunch of allocation request performance tweaks (#1852 and #1856), but we're already seeing throughput of 100+ allocations per second in load testing. We have to update the GameServer when allocating anyway, so we make the label/annotation changes part of that update request -- it's all the same operation.

As for the label change from the SDK -- this is a newer operation. It's backed by a queue, so it will apply eventually; there could be some delay if the K8s API gets backed up for some reason, but it will be eventually consistent.

So having something that is "count" based may be faster - but we could potentially go down that path as we gather real world data on the speeds people actually need.

Does that answer your question?

@castaneai
Collaborator

@markmandel Thank you for the quick answer.
We're looking forward to the implementation of re-allocation, and would love to adopt it when it is released.

I have no further concerns at the moment.
If I have any new questions in the future, I will ask them again.

@sisso

sisso commented Apr 19, 2021

A similar use case is to run multiple game server processes in the same pod, each exposing its own ports as a game server.

This would be an easy out-of-the-box solution if you want to pack more games per node, as it doesn't require any game server changes, just port configuration. It's especially useful to work around the limit of pods per node (110 in GKE) when you want to scale vertically.

Ideally, it would create a 1-N relationship between pod and game server; in other words, each pod could host multiple game servers.

For the original request, you would just configure your listener to listen on multiple ports for the same server, one for each "room".

@sisso

sisso commented Apr 20, 2021

Another thing comes to mind. Let me know if I'm lost in the concepts - I'm still new here.

I can already run multiple game sessions on the same server by switching back to Ready after being allocated, until I can no longer accept a new session (see the sketch below).

Obviously, a Ready game server can be evicted. But if this proves to be a workable approach, a new state like AllocatedAndReady could manage the case of being both ready and in use.
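
Roughly, with the real Agones Go SDK - NewSDK and Ready are real calls; hasCapacity and acceptSession are illustrative stand-ins for game-specific code:

```go
// The "flip back to Ready" pattern: after each session is accepted, the
// server re-enters the allocatable pool. The catch, noted in the reply
// below, is that a Ready GameServer may be deleted at any time.
package main

import (
	"log"

	sdk "agones.dev/agones/sdks/go"
)

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to SDK server: %v", err)
	}
	if err := s.Ready(); err != nil {
		log.Fatalf("could not mark GameServer Ready: %v", err)
	}
	for hasCapacity() {
		acceptSession()
		if err := s.Ready(); err != nil { // return to the allocatable pool
			log.Printf("could not return to Ready: %v", err)
		}
	}
}

func hasCapacity() bool { return true } // illustrative capacity check
func acceptSession()    {}              // illustrative session handling
```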

@markmandel
Member

> I can already run multiple game sessions on the same server by switching back to Ready after being allocated, until I can no longer accept a new session.

As you pointed out, a Ready GameServer can be deleted at any point.

Rather than adding a whole new state, see the linked design above for utilising labels to manage whether the GameServer is in the pool to be allocated from - that adds the same functionality, but is very flexible. The game-server side of that label flip is sketched below.
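
A minimal sketch of that flip, assuming the Go SDK; SetLabel is a real Agones SDK call, while markAllocatable and the key name (the one used by the high-density gameservers pattern linked later in this thread) are illustrative:

```go
package highdensity

import sdk "agones.dev/agones/sdks/go"

// markAllocatable puts the GameServer back into the allocatable pool by
// flipping a label. SetLabel prefixes the key with "agones.dev/sdk-", so a
// GameServerAllocation can select on "agones.dev/sdk-gs-session-ready" = "true".
func markAllocatable(s *sdk.SDK) error {
	return s.SetLabel("gs-session-ready", "true")
}
```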

@markmandel added the stale (Pending closure unless there is a strong objection.) label Jun 23, 2022
@markmandel
Member

Just reviewing this issue - I think we can close this ticket now, since we have:
https://agones.dev/site/docs/integration-patterns/high-density-gameservers/

@markmandel added the wontfix (Sorry, but we're not going to do that.) label Jul 28, 2022
@markmandel
Member

This has been stale for a month, so I'll close this issue.
