Use batching in GameServerAllocation controller to improve throughput. #536

jkowalski · 2019-01-30T21:06:02Z

To get better throughput in the GSA controller we could do batching: group together N allocation requests, assign a GS to each of them, and commit each one individually in parallel.
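A minimal sketch of the batching idea above, in Python for brevity. This is not Agones code: `allocate_batch`, `commit`, and the request/server names are hypothetical, and the commit step is assumed to be an independent call per request/server pair.

```python
from concurrent.futures import ThreadPoolExecutor


def allocate_batch(requests, ready_servers, commit):
    """Pair each request with a distinct ready server, then commit the pairs in parallel.

    `commit(request, server)` is a caller-supplied (hypothetical) function that
    persists one allocation. Requests left over when the batch has more requests
    than ready servers are returned separately.
    """
    # zip stops at the shorter list, so no server is handed out twice.
    pairs = list(zip(requests, ready_servers))
    unassigned = list(requests[len(pairs):])
    # One worker per pair; max(1, ...) keeps the pool valid for an empty batch.
    with ThreadPoolExecutor(max_workers=max(1, len(pairs))) as pool:
        # pool.map returns results in the same order as `pairs`.
        results = list(pool.map(lambda pair: commit(*pair), pairs))
    return results, unassigned
```

Because the assignment happens once, up front, for the whole batch, requests inside one batch cannot collide with each other; only the commits run concurrently.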

markmandel · 2019-01-30T21:07:29Z

As a first step, I'm going to look into moving GameServerAllocation into an Aggregate API so we can have more control over what happens with the API.

In theory, there should be no/minimal change to how GSAs work -- at least that's the plan 😄

markmandel · 2019-01-31T23:23:27Z

Just keeping a list of resources on building Aggregate API Server:

Current work: https://github.com/markmandel/agones/tree/feature/gsa-api-server

markmandel · 2019-02-16T21:46:59Z

The good news is - I have (some cleanup to do) an API extension working to do gameserver allocations.

I've only implemented and supported the CREATE (HTTP: POST) method on the API, as without storage, it's really the only one needed. If people want to watch for create events, I could look into the Watch function as well.

The annoying news is - each create API call has 60s to provide a response (although it can keep processing in the background) -- which makes one of the long-term goals (having an SDK.Ack() function for blocking on Allocation return) a little trickier. Or at least, with a shorter timeout than I may have liked.

Asking the community for feedback on that aspect (slack):

I'm wondering if a 30 second timeout (to give everything else some buffer) for that ack() to come back is reasonable or not? I can't imagine you would want to keep your players waiting that long anyway -- or is that too short a time? (could probably bump that out to 40 or 50 seconds if need be, but 30 is super conservative)

(Basically, we'd wait for 30 seconds, and if we didn't get the ack back, we'd either delete the gameserver, or maybe kick it back to ready - haven't decided. Probably delete -- it's cleaner)
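To make the proposed ack behaviour concrete, here is a rough Python sketch. It is illustrative only: `await_ack` and `on_timeout` are hypothetical names, and the cleanup action (delete versus return to ready) is left to the caller, since that was still undecided.

```python
import threading


def await_ack(ack_event, on_timeout, timeout_seconds=30.0):
    """Block until the game server acks the allocation or the timeout passes.

    `ack_event` is a threading.Event that the (hypothetical) ack handler sets.
    `on_timeout` is the cleanup to run if no ack arrives, for example deleting
    the game server. Returns True when the ack arrived in time.
    """
    # Event.wait returns True if the event was set before the timeout expired.
    if ack_event.wait(timeout_seconds):
        return True
    on_timeout()
    return False
```

With the 30 second default, the call would return well inside the 60s the API server allows for a response.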

Regardless, this will now also allow us to batch, skip storage for the GSA, etc. And also make it easier if we decide to also provide a gRPC interface as well for allocation.

This moves the implementation of GameServerAllocation (GSA) to a [Kubernetes API Extension](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/) instead of using CRDs. This was essentially done for performance reasons, but to break it down:

1. GSA is now create only, since we had no need for GSA storage and don't want the performance hit.
2. This removes the mutation and validation webhooks, which have a 30s timeout and are run in serial.
3. The API server still cuts off a response after 60s, but the API can continue processing (60s gives us enough time, I think, for an SDK.Ack() on Allocate, which I don't think we had before).
4. Validation now happens in the request.
5. We can now do batching of requests for higher throughput (googleforgames#536), since we control the entire HTTP request.
6. Sets us up if we decide we also want to have an alternative (HTTP and/or gRPC) endpoint for allocation, based on feedback from this implementation.

The breaking changes are:

1. GameServerAllocation's group is now `allocation.agones.dev` rather than `stable.agones.dev`, because a CRD group can't overlap with an API server.
2. Since there is only the `create` verb for GSA, there are no get/list/watch options for GameServerAllocations, so no informers/listers either. But this could be added at a later date, if needed.

This also includes some libraries for building further API server extension points.

ilkercelikyilmaz · 2019-02-19T02:59:00Z

What number of allocations per second do we need to reach? I am running some load tests against the recent allocation changes I made (they haven't been checked in yet). I ran 60 concurrent clients and the system can allocate around 60 gs per sec (allocated 2999 gs in 50 secs).

markmandel · 2019-02-19T03:10:17Z

@ilkercelikyilmaz that's a huge improvement over what we had previously 🔥 ( @pm7h do you have those numbers on hand? I can't seem to find them). Once we also incorporate #600 - I wonder if we might be very close to what we might need to be.

pm7h · 2019-02-20T02:01:26Z

Yes, it's a huge improvement. Last I ran my load tests, it took over a minute for 100 allocations. You can see those results here: #412 (comment)

ilkercelikyilmaz · 2019-02-20T06:15:22Z

I made a couple of changes after talking to Jarek: use Update instead of Patch to prevent multiple allocations, and select a random gs from the top N (=20) of the available list to reduce the number of collisions.
I also changed the test client to increase the QPS of the Kubernetes client.
With these changes I ran the test a few times. With 16 concurrent clients, the system can allocate around 100 gs per sec (2800 gs allocated in 28 seconds).
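The two changes can be sketched roughly as follows, in Python and against an in-memory stand-in rather than the real Kubernetes client. `Store`, `ConflictError`, and `allocate` are hypothetical names. The version check mimics how a Kubernetes Update is rejected with a conflict when the object's resourceVersion is stale, which is what stops two allocators from claiming the same gs.

```python
import random


class ConflictError(Exception):
    """Raised when the stored version no longer matches the caller's copy."""


class Store:
    """In-memory stand-in for the API server (an assumption for illustration)."""

    def __init__(self, names):
        self.servers = {n: {"state": "Ready", "version": 1} for n in names}

    def update(self, name, seen_version, new_state):
        current = self.servers[name]
        # Reject the write if someone else changed the object since we read it.
        if current["version"] != seen_version:
            raise ConflictError(name)
        current["state"] = new_state
        current["version"] += 1


def allocate(store, top_n=20, rng=random, max_attempts=5):
    """Pick a random Ready server from the first top_n and claim it with a checked update."""
    for _ in range(max_attempts):
        ready = sorted(n for n, s in store.servers.items() if s["state"] == "Ready")
        if not ready:
            return None
        # Random choice spreads concurrent allocators across the top of the list.
        name = rng.choice(ready[:top_n])
        seen = store.servers[name]["version"]
        try:
            store.update(name, seen, "Allocated")
            return name
        except ConflictError:
            continue  # Another allocator won the race; try a different server.
    return None
```

Picking from the top N instead of always taking the first entry means concurrent clients rarely target the same gs, so fewer updates fail and need a retry.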

This both cleans up the webhook component, and makes it easier to test, but also sets us up to reuse the https server with the given cert pair -- which we will want to do as we work on googleforgames#536 and setup an api server extension which needs exactly the same self signed certificate setup.

markmandel · 2019-05-07T19:06:31Z

/cc @ilkercelikyilmaz @jkowalski how do we feel about closing this issue, given the performance we have now?

ilkercelikyilmaz · 2019-05-08T17:16:17Z

I think this can be a good improvement but there is no urgency so we should keep it open. Not a blocker for 1.X though.
@markmandel , did you see my comment/findings on the recent change on PodList improvement?

markmandel · 2019-05-08T18:40:20Z

> I think this can be a good improvement but there is no urgency so we should keep it open. Not a blocker for 1.X though.

Good call 👍 I've moved it off the next milestone, but leaving it open.

> @markmandel , did you see my comment/findings on the recent change on PodList improvement?

I did, but it's hard to determine why that is happening - it would be useful to have the performance testing suite in open source in some way, so we can all test things. Might be good to do a CPU flame graph to see where the bottlenecks are.

ilkercelikyilmaz · 2019-05-08T20:08:56Z

I will try to check in my load test in 0.11.

markmandel · 2019-06-15T01:09:19Z

I think this can be closed now! If you have objections, please say so, otherwise I will close on Tuesday!

markmandel · 2019-06-18T16:23:54Z

No response! Closing! 😄

jkowalski added the area/performance Anything to do with Agones being slow, or making it go faster. label Jan 30, 2019

jkowalski mentioned this issue Jan 30, 2019

Replace global allocation mutex with fine-grained concurrency controls. #535

Closed

markmandel self-assigned this Feb 3, 2019

markmandel added the kind/design Proposal discussing new features / fixes and how they should be implemented label Feb 16, 2019

markmandel added this to the 0.9.0 milestone Feb 18, 2019

markmandel mentioned this issue Feb 19, 2019

[Breaking Change] Move GameServerAllocation to an API Extension Server. #600

Closed

markmandel mentioned this issue Feb 20, 2019

Infrastructure for Agones performance reporting #573

Closed

markmandel mentioned this issue Mar 8, 2019

Refactor https server into its own component #643

Merged

markmandel modified the milestones: 0.9.0, 0.10.0 Mar 26, 2019

markmandel mentioned this issue Mar 31, 2019

Implement GameServerAlocation as API Extension #682

Merged

markmandel modified the milestones: 0.10.0, 0.11.0 May 7, 2019

markmandel removed this from the 0.11.0 milestone May 8, 2019

ilkercelikyilmaz mentioned this issue May 28, 2019

Request to become an Approver on Agones #796

Closed

markmandel closed this as completed Jun 18, 2019

markmandel added this to the 0.11.0 milestone Jun 18, 2019
