
Proposal: RefCount allocation to host many room in same pod #1197

Closed
neuecc opened this issue Nov 29, 2019 · 20 comments
Labels
kind/feature (New features for Agones), stale (Pending closure unless there is a strong objection.), wontfix (Sorry, but we're not going to do that.)

Comments

@neuecc

neuecc commented Nov 29, 2019

Is your feature request related to a problem? Please describe.

The current Agones allocation model only supports assigning a single room per Pod.
[image: one room per Pod]

This provides a complete isolation model, but at the cost of duplicating the runtime and game engine per room:
shareable memory is lost, and each process spends much of its time idle.
If a room does not host a large number of users (as in a battle royale) but instead a small number (2-8)
with low command frequency, we could host multiple rooms on one game engine (pod).

[image: multiple rooms per Pod]

Rooms would not share state, but would share the execution engine (immutable game data loaded in memory, runtime JIT output, dynamic assemblies, etc.).
This can result in significant cost savings.

We're creating a mobile game that runs with a .NET Core server and a Unity client.
On the server we're using MagicOnion, our open-source network engine built on .NET Core/gRPC/HTTP2.
The cost of a single room's game loop on the server is relatively small, so many rooms can coexist on a single engine.
We want to host this real-time server on Agones, but we need to host many rooms in the same pod to reduce costs.

Many mobile games require a stateful, realtime server, but don't generate heavy traffic per room.
Since the runtime cost dominates in that case, we want to host multiple rooms per runtime.
(If the runtime cost were relatively low, a single room per pod would be fine.)

Describe the solution you'd like

I suggest RefCount (virtual allocation).
Allocate requests return the same pod and increment a reference count.
When a game finishes, a shutdown request decrements the reference count.
If the reference count would exceed the value set in the configuration,
a server backed by a newly created pod is returned instead.
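
A minimal sketch of the intended behavior, in Go. Every name here (VirtualAllocator, GameServer, the create hook) is a hypothetical illustration, not an existing Agones API:

```go
// Hypothetical sketch of RefCount (virtual) allocation. The capacity value
// would come from configuration; create performs a real pod-backed allocation.
package virtualalloc

import "sync"

type GameServer struct {
	Address string
	Port    int
}

type VirtualAllocator struct {
	mu       sync.Mutex
	capacity int         // max rooms per pod, from configuration
	refCount int         // rooms currently hosted on `current`
	current  *GameServer // pod currently being packed
	create   func() (*GameServer, error)
}

// Allocate returns the same pod until capacity is reached, then a new one.
func (v *VirtualAllocator) Allocate() (*GameServer, error) {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.current == nil || v.refCount >= v.capacity {
		gs, err := v.create() // real allocation backed by a new pod
		if err != nil {
			return nil, err
		}
		v.current, v.refCount = gs, 0
	}
	v.refCount++
	return v.current, nil
}

// Shutdown is called when a game finishes; it decrements the reference count.
func (v *VirtualAllocator) Shutdown() {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.refCount > 0 {
		v.refCount--
	}
}
```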

@neuecc added the kind/feature (New features for Agones) label Nov 29, 2019
@roberthbailey
Member

I've heard a few requests for this sort of enhancement so it's definitely worth exploring and seeing how it would work.

There are a few things I'm concerned about:

  1. If we want packed allocations and we keep replacing connections into the same engine, we need a way to "drain" everyone off the game server instance (e.g. for upgrades or node maintenance). With the current design this can be easily achieved using k8s primitives with the assumption that each pod has a reasonably short lifespan (order of minutes).
  2. We need a way to keep track of capacity for each engine instance (room). We could assume each virtual allocation uses the same resources, but it would probably be better to be able to do something based on player tracking.
  3. We have a single SDK for the whole pod, so we need a backwards compatible way to introduce virtual allocations into the existing SDK.
  4. We need to think hard about the lifecycle of a virtual allocation, since it will be a bit different from an allocation where we know the lifecycle of the engine is tied to the lifecycle of a container. How do healthchecks work? Is it possible to detect that one virtual allocation is unhealthy while the others on the same game server are fine? Is it possible to rectify this situation if we can detect it?

There are probably other things that will come up, but thinking through these will be a good place to start.

@logixworx

logixworx commented Dec 1, 2019

Definitely need this. I wrote a multi-room system within a single server instance for Unity/Mirror.

@roberthbailey
Member

@logixworx - Are there any details about that system that you can share (or point to any public documentation)? Do you have any requirements that weren't described in the original post? Do you have any insights that we can leverage in Agones? Do you have any thoughts about the questions I posed above?

@logixworx

My requirements are exactly as described in the original post, as well as in your response. I can't think of anything more to contribute to the discussion at the moment.

@markmandel
Member

So I have quite a few thoughts, and they come under several categories. This is a topic that has come up quite regularly over the years.

I'm going to use "Game Session" as the term for a separate session / room within a Game Server - it's the term I've grown comfortable with.

There are some implementation details below, but they're mostly there to illustrate my ideas - please consider them sacrificial drafts.

Configuration

  • How do we configure how many Sessions are available per Game Server? I feel like a simple ref count is going to be too simplistic, especially when we want to expand it down the line (feature creep over time is always a reality).
    • An easy option might be to have a configuration number attribute on GameServer CRDs.
  • To @roberthbailey's point, we likely want lifecycle states for each Session, and the ability to manage them separately.
  • A solution could be to have a GameSession CRD resource that gets created for each Session of a GameServer (a hypothetical sketch follows this list). This would put extra load on the K8s API/etcd, but would give us maximum flexibility and visibility. (Is there a max number of CRD records we can create in a K8s cluster?)
    • Each session could have Ready, Allocated, etc. states individually, and be queryable by all the tools available.
    • This gives us tracking across each GameServer->Session using standard K8s ownership metadata and labels.
    • I think we could reuse lots of the existing SDKs and life-cycle - e.g. at first pass, SDK.Ready() means that all Sessions are Ready at start, and then expand as needed with SDK.SessionReady(sessionId) (as a thought).
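
As a purely hypothetical illustration of the GameSession CRD idea, sketched in the style of Agones' existing Go API types (nothing below exists in Agones today):

```go
// Hypothetical GameSession CRD type. A GameSession would carry an owner
// reference to its backing GameServer, so sessions are garbage collected
// along with it and are queryable with standard tools.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type GameSessionState string

const (
	GameSessionStateReady     GameSessionState = "Ready"
	GameSessionStateAllocated GameSessionState = "Allocated"
)

type GameSession struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Status GameSessionStatus `json:"status"`
}

type GameSessionStatus struct {
	// State tracks each session's lifecycle independently of its GameServer.
	State GameSessionState `json:"state"`
	// GameServerName links back to the backing GameServer (also expressed
	// via owner references and labels for querying).
	GameServerName string `json:"gameServerName"`
}
```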

Data Communication Routing

  • The Game Server binary will need a way to get all its Sessions and their states, so it can take in traffic and route it to the appropriate internal game session process. This can probably be done by providing this information through the SDK.
  • I think there are two potential routing options (anyone have anything else?):
    1. A token is provided in the data packet indicating which session to send data through/back from (option 1 is sketched after this list)
    2. Have separate ports for each session, and route to a separate session process based on the port.
  • The onus is on the game developer to handle this routing (i.e. not Agones), but we should probably keep this in mind, to make sure we don't block either approach.
  • Open question - do we need to assign specific ports to specific sessions on a GameServer? I'm not sure how to do this. Or can we pass this responsibility up to the user?
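
A minimal game-side sketch of option 1, assuming the protocol prefixes each datagram with a fixed-length session token. All names are illustrative; this routing code lives in the game server binary, not in Agones:

```go
// Token-based routing (option 1): each datagram starts with a fixed-length
// session ID that selects the internal session to deliver the payload to.
package router

import (
	"log"
	"net"
)

const sessionIDLen = 16

// Route reads datagrams and hands each payload to its session's game loop.
func Route(conn *net.UDPConn, sessions map[string]chan<- []byte) {
	buf := make([]byte, 64*1024)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			log.Printf("read error: %v", err)
			return
		}
		if n < sessionIDLen {
			continue // malformed: too short to carry a session token
		}
		ch, ok := sessions[string(buf[:sessionIDLen])]
		if !ok {
			continue // unknown session token: drop the packet
		}
		payload := make([]byte, n-sessionIDLen)
		copy(payload, buf[sessionIDLen:n])
		ch <- payload
	}
}
```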

Allocation

  • Question: Should a Session allocation happen on a different API path than GameServer Allocation?
  • My current thinking is "no".
  • As we expand, we're already talking about being able to allocate based on player capacity (Player Tracking for each GameServer #1033) down the line as well. I can only assume we'll have more allocation requirements as we continue to grow.
  • So we should work out if there is a nice way we can extend the current allocation resource API to include this feature, and potentially more down the line. Let's not box ourselves into a corner with an implementation that works specifically for this, but not for others (another point against a refcount).
  • I expect there will also be good code reuse in keeping allocation down a single code path.
  • The idea with Allocation would be: when a Session is requested, pack sessions appropriately, allocate backing GameServers when needed, mark GameSessions as allocated, and return that data when done.
  • To @roberthbailey's point as well - I feel like we will need a "Drain" (or similar) state for a GameServer that basically means "stop allocating to this GameServer, and shut it down once it's empty".
    • This would also be useful for player-capacity-based allocation (although we should be smart about what "empty" means, and how it is defined).

Other Thoughts

  • Originally, I had thought this might be better as a separate framework that sat on top of Agones. The more I think about it, the more I think that's non-optimal: this functionality is better woven into Agones so that it is opt-in as needed. There is a lot of Agones core we can reuse here that we would otherwise have to rebuild, and the value isn't there.

Would love comments, thoughts and questions on the above.

@theminecoder

> How do we configure how many Sessions are available per Game Server? I feel like a simple ref count is going to be too simplistic, especially when we want to expand it down the line (feature creep over time is always a reality).

From an API standpoint it would be nice to just be able to call something like RequestSession() as often as I need. I have a system that allows us to fully clean the server after each session, so being able to run the server for as long as possible is better than having to wait for a new container to boot.

We could still have a config option for the maximum total sessions the instance is allowed to create, which would make the sidecar return an error when a session is requested after the limit has been reached. This could fit into draining/updating as well, by returning the error once the instance has been told to drain/update.

@markmandel
Member

> From an API standpoint it would be nice to just be able to call something like RequestSession() as often as I need.

What is "it" in the above? What exactly would "RequestSession()" be doing here? Is this a SDK level API, or is more something like Allocation ?

One thing I didn't mention above: I think we would also need a way to self-allocate a GameSession through the SDK - much like we do for a GameServer - so something like SDK.SessionAllocate(sessionId), to support certain workflows. Is that in the vein of what you are thinking?

Then you can move the GameSession back with SDK.SessionReady(sessionId) when you want it to return to the pool of available allocatable GameSessions.

Actually, SDK.SessionReady(sessionId) to return a session to a Ready state would be useful if it was allocated through the Allocation endpoint as well (and if all sessions are Ready, the GameServer should return to Ready too, so it could be scaled down if needed). A sketch of this hypothetical workflow follows.
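
To illustrate - SessionAllocate and SessionReady are hypothetical calls that do not exist in any Agones SDK today:

```go
package session

// SessionSDK is a hypothetical extension of the Agones SDK; neither method
// below exists in any real Agones SDK today.
type SessionSDK interface {
	SessionAllocate(sessionID string) error
	SessionReady(sessionID string) error
}

// runSession self-allocates one session, runs the match, and returns the
// session to the allocatable pool when the match ends.
func runSession(s SessionSDK, sessionID string, playMatch func()) error {
	if err := s.SessionAllocate(sessionID); err != nil { // like SDK.Allocate(), but per-session
		return err
	}
	playMatch() // game-specific loop for this session
	// If every session on this GameServer is Ready again, the GameServer
	// itself would return to Ready, so it could be scaled down if needed.
	return s.SessionReady(sessionID)
}
```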

I think we are on the same page: the assumption is that, in this instance, a GameServer will likely be up and running for longer than a single game Session 👍

@logixworx

Is there an ETA on this feature? I need it asap. Thanks!

@markmandel
Member

markmandel commented Jan 10, 2020

@logixworx no ETA as of yet. We've got a variety of users who are interested in this as well, so I would expect it to come at some point this year.

We still need to do a complete design document on this, as well as implementation - it's a pretty substantial piece of work, so I expect it will take some months to complete.

@TheCactusBlue

Any updates on this feature?

@castaneai
Collaborator

castaneai commented Oct 20, 2020

Hi.
We are developing a Custom Allocator Service to solve this problem.

First of all, our GameServer process hosts multiple Rooms.
We determine which Room a player can enter by having the connection pass a token containing a Room ID.

Our custom allocator works as follows:

Room Assignment

  • Allocate one GameServer from Agones upon receiving an allocation request. At the same time, we cache the information of the GameServer.
  • The next time we receive an allocation request, we return the same cached GameServer and increment RefCount in the custom allocator.
  • When RefCount exceeds the capacity, it allocates one GameServer from Agones again.
  • When the specification of the GameServer changes (e.g., deploying a new version), we clear the cached GameServer and reset RefCount. (This uses a GameServer informer; see the sketch after this list.)
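
That reset step might look like the following, extending the hypothetical VirtualAllocator refcount sketch from earlier in this thread; the informer wiring that would invoke it is omitted:

```go
// Hypothetical cache-reset hook, called from a GameServer informer when the
// fleet's GameServer spec changes (e.g. a new version is deployed). Field
// names follow the VirtualAllocator sketch earlier in this thread.
func (v *VirtualAllocator) OnSpecChange() {
	v.mu.Lock()
	defer v.mu.Unlock()
	// Drop the cached GameServer and reset the count; the next Allocate()
	// will request a fresh GameServer from Agones.
	v.current, v.refCount = nil, 0
}
```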

Shutdown

This approach is simple, as the custom allocator only knows about a RefCount and one cached GameServer. Even if the custom allocator loses its state, Agones will simply assign it a new GameServer; the service as a whole will not go down.

However, this approach requires the GameServer itself to implement multi-room support.
Whether Agones should provide this, or it should be developed as a separate library, is a matter of debate.

@markmandel
Member

Thanks for the feedback - we're curious to see if the updated "Re-Allocation" design we have been working on in #1239 (comment) would be useful for your needs?

> However, this approach requires the GameServer itself to implement multi-room support. Whether Agones should provide this, or it should be developed as a separate library, is a matter of debate.

Yeah, I'm not sure if Agones can help with the internals of the multi-room implementation inside the game server binary. We're an orchestration platform, so this seems like something that would be game specific and best left outside of Agones.

@castaneai
Collaborator

@markmandel I just read it - the idea of re-allocation sounds great!

Once Agones implements re-allocation, we may be able to fulfill this request without a custom allocator service.

There are a couple of things to worry about.

  • What is the behavior when a new version is deployed to the fleet?
  • Is it possible for capacity to overflow if a large number of allocation requests arrive in a short period of time, and can the k8s label mechanism withstand such spiky access?

@markmandel
Member

> What is the behavior when a new version is deployed to the fleet?

The whole structure is built so that you can safely allocate while doing rolling updates across Fleets (in fact, how Kubernetes handles resources makes this quite simple) -- so none of that changes. We also have lots of tests to ensure this stays safe.

> Is it possible for capacity to overflow if a large number of allocation requests arrive in a short period of time, and can the k8s label mechanism withstand such spiky access?

We're doing a bunch of allocation request performance tweaks (#1852 and #1856), but we're already seeing throughput of 100+ allocations per second in load testing. We have to update the GameServer when allocating anyway, so we make the label/annotation changes part of that update request -- it's all the same operation.

As for the label change from the SDK -- this is a newer operation. It's backed by a queue, so it will apply eventually; there could be some delay if the K8s API gets backed up for some reason, but it will be eventually consistent.

So having something that is "count" based may be faster - but we could potentially go down that path as we gather real world data on the speeds people actually need.

Does that answer your question?

@castaneai
Collaborator

@markmandel Thank you for the quick answer.
We're looking forward to the implementation of re-allocation, and would love to adopt it when it is released.

I have no further concerns at the moment.
If I have any new questions in the future, I will ask them again.

@sisso

sisso commented Apr 19, 2021

A similar use case is to run multiple game server processes in the same pod, each exposing its own ports as a game server.

This would be an easy out-of-the-box solution if you want to pack more games per node, as it doesn't require any game server changes, just port configuration. It's especially useful to work around the limit of pods per node (110 in GKE) when you want to scale vertically.

Ideally, it would create a 1-N relationship between pod and game server; in other words, each pod could host multiple game servers.

For the original request, you would just configure your listener to listen on multiple ports for the same server, one for each "room".

@sisso

sisso commented Apr 20, 2021

Another thing comes to mind. Let me know if I'm lost in the concepts - I'm still new here.

I can already run multiple game sessions on the same server by switching back to Ready after being allocated, until I can no longer accept a new session (see the sketch below).

Obviously, a Ready game server can be evicted. But if this proves to be a workable approach, a new state like AllocatedAndReady could manage the case of being both ready and in use.
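
Roughly, with the real Agones Go SDK - NewSDK and Ready are real calls; hasCapacity and acceptSession are illustrative stand-ins for game-specific code:

```go
// The "flip back to Ready" pattern: after each session is accepted, the
// server re-enters the allocatable pool. The catch, noted in the reply
// below, is that a Ready GameServer may be deleted at any time.
package main

import (
	"log"

	sdk "agones.dev/agones/sdks/go"
)

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to SDK server: %v", err)
	}
	if err := s.Ready(); err != nil {
		log.Fatalf("could not mark GameServer Ready: %v", err)
	}
	for hasCapacity() {
		acceptSession()
		if err := s.Ready(); err != nil { // return to the allocatable pool
			log.Printf("could not return to Ready: %v", err)
		}
	}
}

func hasCapacity() bool { return true } // illustrative capacity check
func acceptSession()    {}              // illustrative session handling
```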

@markmandel
Member

> I can already run multiple game sessions on the same server by switching back to Ready after being allocated, until I can no longer accept a new session.

As you pointed out, a Ready GameServer can be deleted at any point.

Rather than adding a whole new state, see the linked design above for utilising labels to manage whether the GameServer is in the pool to be allocated from - that adds the same functionality, but is very flexible. The game-server side of that label flip is sketched below.
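
A minimal sketch of that flip, assuming the Go SDK; SetLabel is a real Agones SDK call, while markAllocatable and the key name (the one used by the high-density gameservers pattern linked later in this thread) are illustrative:

```go
package highdensity

import sdk "agones.dev/agones/sdks/go"

// markAllocatable puts the GameServer back into the allocatable pool by
// flipping a label. SetLabel prefixes the key with "agones.dev/sdk-", so a
// GameServerAllocation can select on "agones.dev/sdk-gs-session-ready" = "true".
func markAllocatable(s *sdk.SDK) error {
	return s.SetLabel("gs-session-ready", "true")
}
```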

@markmandel added the stale (Pending closure unless there is a strong objection.) label Jun 23, 2022
@markmandel
Member

Just reviewing this issue - I think we can close this ticket now, since we have:
https://agones.dev/site/docs/integration-patterns/high-density-gameservers/

@markmandel added the wontfix (Sorry, but we're not going to do that.) label Jul 28, 2022
@markmandel
Member

This has been stale for a month, so I'll close this issue.
