Support Agones on GKE Autopilot #2777

zmerlynn · 2022-10-26T16:56:12Z

Is your feature request related to a problem? Please describe.

Agones has a number of friction points on GKE Autopilot:

Arbitrary host port selection is not allowed (denied by autogke-no-host-port)
Use of the cluster-autoscaler.kubernetes.io/safe-to-evict=false to minimize disruption (denied by autogke-node-affinity-selector-limitation):
- On game server pods, Agones defaults to cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation - this can be mitigated with Allow cluster autoscaler to scale down game server pods #2747
- The Agones controller also defaults to `cluster-autoscaler.kubernetes.io/safe-to-evict=false', presumably because it is not HA and would like to minimize disruption, e.g. Split agones controller into leader elected pods #1296

Note that Autopilot issues were previously brought up in #2249 - the state of the world has changed considerably since then (mutating webhooks are now allowed), but there are still friction points.

Describe the solution you'd like

I propose the following:

We've recently added a feature to allow for Autopilot host port assignment using a special annotation. In order to allow this to be used with Agones, I propose:
- A small cloud-provider library to allow for per-provider abstractions
- In said cloud provider abstraction, we add a game server pod interceptor that can add the appropriate annotation
- We also add a validator that the portPolicy is Dynamic (we may be able to support Passthrough later, but let's start here)
- (This section is complete with Sync Pod host ports back to GameServer in GCP #2782 and Split port allocators, implement Autopilot port allocation/policies #2789)
The controllers/webhooks/allocator can be made HA, at which point we won't need to annotate safe-to-evict=false (Design: Make it possible to evict agones-controller by separating into controller/extensions #2797).
So much testing.
Add docs!

The text was updated successfully, but these errors were encountered:

roberthbailey · 2022-10-26T17:58:18Z

One question on

On game server pods, Agones defaults to cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation - this can be mitigated with #2747

The current mitigation will require that any fleet spec being moved from non-autopilot to autopilot will need to be updated to add extra metadata. It would be nice in the longer term if we could better support upgrades (across all platforms) and switch the default behavior to allow a safe eviction but give folks the chance to prevent it if that doesn't work for their game server.

zmerlynn · 2022-10-26T18:03:45Z

One question on

On game server pods, Agones defaults to cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation - this can be mitigated with #2747

The current mitigation will require that any fleet spec being moved from non-autopilot to autopilot will need to be updated to add extra metadata. It would be nice in the longer term if we could better support upgrades (across all platforms) and switch the default behavior to allow a safe eviction but give folks the chance to prevent it if that doesn't work for their game server.

I debated this, and it again comes to explicit intent vs transparently working: One possible option is that we just set this annotation to be true on Autopilot, but that doesn't seem necessarily correct either.

This is the start of the implementation for #2777: * Most of this is mechanical and implements a thin cloud product abstraction layer in pkg/cloud, instantiated with New(product). The product abstraction provides a single function so far: SyncPodPortsToGameServer. * SyncPodPortsToGameServer is inserted as a hook while syncing IP/ports, to let different cloud providers handle port allocation slightly differently (in this case, GKE Autopilot) * In GKE Autopilot, we look for a JSON string like `{"min":7000,"max":8000,"portsAssigned":{"7001":7737,"7002":7738}}` as an indication that the host ports were reassigned (per policy). As a side note to anyone watching, this is currently an unreleased feature. If we see this, we use the provided mapping to map the host ports in the GameServer.Spec. With this change, it's possible to launch a GameServer and get a healthy GameServer Pod by adding the following annotation: ``` annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: "true" autopilot.gke.io/host-port-assignment: '{"min": 7000, "max": 8000}' ``` If this PR causes any issues, the cloud product auto detection can be disabled by setting `agones.cloudProduct=generic`, or forced to GKE Autopilot using `agones.cloudProduct=gke-autopilot`. In a future PR, I will add the host-port-assignment annotation automatically on Autopilot Co-authored-by: Mark Mandel <[email protected]>

nodepools and regional clusters Updates to release checklist. (googleforgames#2772) * Updates to release checklist. Adding items that showed up in the recent release that were not written down or required better clarification. * Review updates, and some other small tweaks. Co-authored-by: Robert Bailey <[email protected]> Release 1.27.0 (googleforgames#2776) * Release 1.27.0 * Update FAQ on ExternalDNS (googleforgames#2773) The feature flag it points to have been moved to stable, so the link is not useful any more. Also removed notes on ipv6, since they aren't 100% accurate, as we were discussing in googleforgames#2767. * Updates to release checklist. (googleforgames#2772) * Updates to release checklist. Adding items that showed up in the recent release that were not written down or required better clarification. * Review updates, and some other small tweaks. Co-authored-by: Robert Bailey <[email protected]> * Release-changes * Review comment * Review changes Co-authored-by: Mark Mandel <[email protected]> Co-authored-by: Robert Bailey <[email protected]> Version updates (googleforgames#2778) Players in-game metric for when PlayerTracking is enabled (googleforgames#2765) * Check for DeletionTimestamp of fleet and gameserverset before scaling * Add metric to track player count in gameservers * check PlayerStatus is not nil * Update metrics available in docs * Wrong relref path * typo * Change name for players in game metric to player connected. Add player capacity metric. Hide docs until next agones release. * Duplicate metrics table * add gameserver player tracking metrics to fleetViews Co-authored-by: Mark Mandel <[email protected]> Remove generation for swagger Go code and Add static swagger codes for test (googleforgames#2757) Co-authored-by: Mark Mandel <[email protected]> Updated allocation yaml files under examples/ to use selectors Show how to set graceful termination in a game server that is safe to (googleforgames#2780) evict. Avoid retry from allocateFromLocalCluster under context kill. (googleforgames#2783) * Version updates * issue-2736-changes Co-authored-by: Mark Mandel <[email protected]> Bring SDK base image to debian:bullseye (googleforgames#2769) * Bring SDK base image to debian:bullseye The upgrade to gRPC solved one issue, and I also added a limit to number of processes that could run for `make -j` otherwise the whole thing would fall over (also would crash my dev machine!). Closes googleforgames#2224 * Force refresh of cpp cache on Cloud Build. * Fixes for CI: * Revert CI cache increment (don't think we need it) * Add shell to cpp image for debugging. * Fix formatting issue that is breaking CI. Co-authored-by: Robert Bailey <[email protected]> Update health-checking.md (googleforgames#2785) Fixed spell error: spec.health.failureTheshold to spec.health.failureThreshold Updated allocation yaml files under examples/ to use selectors (googleforgames#2787) Cleanup of load tests (googleforgames#2784) * issue-2744 updated changes with new description * 2744 review changes Sync Pod host ports back to GameServer in GCP (googleforgames#2782) This is the start of the implementation for googleforgames#2777: * Most of this is mechanical and implements a thin cloud product abstraction layer in pkg/cloud, instantiated with New(product). The product abstraction provides a single function so far: SyncPodPortsToGameServer. * SyncPodPortsToGameServer is inserted as a hook while syncing IP/ports, to let different cloud providers handle port allocation slightly differently (in this case, GKE Autopilot) * In GKE Autopilot, we look for a JSON string like `{"min":7000,"max":8000,"portsAssigned":{"7001":7737,"7002":7738}}` as an indication that the host ports were reassigned (per policy). As a side note to anyone watching, this is currently an unreleased feature. If we see this, we use the provided mapping to map the host ports in the GameServer.Spec. With this change, it's possible to launch a GameServer and get a healthy GameServer Pod by adding the following annotation: ``` annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: "true" autopilot.gke.io/host-port-assignment: '{"min": 7000, "max": 8000}' ``` If this PR causes any issues, the cloud product auto detection can be disabled by setting `agones.cloudProduct=generic`, or forced to GKE Autopilot using `agones.cloudProduct=gke-autopilot`. In a future PR, I will add the host-port-assignment annotation automatically on Autopilot Co-authored-by: Mark Mandel <[email protected]> Update gke terraform files to allow autoscaling Fix (not really) problems reported by VSCode (googleforgames#2790) VSCode reports `main redeclared` between allocationload.go and runscenario.go due to the fact that they both look like `package main` binaries in the same directory, similar e.g. [this poster on a different project](https://stackoverflow.com/questions/66970531/vs-code-go-main-redeclared-in-this-block) To fix it, it's easy enough to just give these binaries their own package path and fix up the calling scripts. Along the way, fix a lint complaint in runscenario.go Add location variable for cluster location argument Minor fix changed default of location var to empty string GameServerRestartBeforeReadyCrash: Run serially (googleforgames#2791) Narrow the race in googleforgames#2445 by running GameServerRestartBeforeReadyCrash serially. See googleforgames#2445 (comment) for a detailed analysis. Does not fix the issue - this is stopgap until we understand how to fix it. Enable fieldalignment linter, then mostly ignore it (googleforgames#2795) Enable the fieldalignment linter by enabling all `govet` checks except shadowing. Ignore large swaths of code (tests, cmd/, APIs), and nolint'd existing complaints that seemed irrelevant. Along the way: * removed existing nolint:maligned, as `maligned` is no more. * disabled `structcheck` and `deadcode` as they are deprecated (and I think have been subsumed by other linters?) * changed `gameServerCacheEntry` to `gameServerCache`. It is the cache, not just an entry. * fixed alignment of `gameServerSetCacheEntry`. Add fswatch library to watch and batch filesystem events, use in allocator (googleforgames#2792) This pull refactors the fsnotify code in allocator/main out to a shared library, and in that shared library implements a batched notification processor. Closes googleforgames#1816: This takes a slightly different approach than specified in the issue, instead choosing to just delay processing until after a batch processing period. I chose 1s - it's far longer than necessary, but still much shorter than it takes for the secret changes to propagate to the container anyways. I considered the approach in googleforgames#1816 of trying to parse the actual events, but it's too fiddly to get exactly right: e.g. maybe you only refresh on "write", but then "chmod" could make the file readable whereas it wasn't before, "rename" could expose a file that wasn't there before, etc. Cloud product: Split port allocators, implement Autopilot port allocation/policies (googleforgames#2789) In the Agones on GKE Autopilot implementation, we have no need for the port allocator - the informer/etc. is an unnecessary moving piece. This PR allows for cloud products to provide their own port allocation implementation, and implements the GKE Autopilot "allocator". We do this by: * Splitting portallocator off to its own package. It was basically self-sufficient anyways, except it was a little too friendly with controller_test.go. I solved that by introducing a TestInterface for controller_test.go to upcast to. * Allow cloud product implementations to define their own port allocator. * Defining a new port allocator for GKE that does a simple per-port HostPort allocation, and adds the host-port-assignment annotation to the pod template. * Extend cloudproduct again to add a GameServer validator * And in Autopilot, reject if the PortPolicy is not `Dynamic` Release: Note to switch away from `agones-images` (googleforgames#2809) Since we have few guardrails on accidentally touching `agones-images` project, adding a note in the release checklist to switch back to a local development project after running a release. Flake: TestControllerGameServerCount (googleforgames#2805) Made it deterministic in the test, and got rod of the potential race conditions. Also fix it such that the util function for generating GameServer names always produce a unique name. Closes googleforgames#2804 Co-authored-by: Robert Bailey <[email protected]> Remove Windows FAQ Entry (googleforgames#2811) The contents are no longer accurate, and are covered in the installation section now. Makefile changes for adding location variable added autoscale parameters to Makefile and README Markdown fix in readme Changed LOCATION to always be set with ZONE as default use only if the variable has a value fixed extraneous characters update gke terraform exmaple module Update Node.js dependencies and package (googleforgames#2815) * Update all dependencies and Node,js to LTS version * Update other docker images that use Node.js Added autoscale to example cluster and added to website docs Added defaults and feature expiry Remove zone from gke/variable.tf file.

This PR does a couple of things towards googleforgames#2777: * Enforce that on Autopilot the scheduling strategy is `Packed` * If the user does not provide a nodeSelector or tolerations in the PodSpec, we add: ``` nodeSelector: agones.dev/role: gameserver tolerations: - effect: NoSchedule key: agones.dev/role operator: Equal value: gameserver ``` This has the effect of splitting the game server pods off to their own nodes ala https://wdenniss.com/autopilot-workload-separation. If the user has either defined, we assume they know what they're doing and skip this. Note: In `Packed` we already set an affinity: ``` affinity: podAffinity: preferredDuringSchedulingIgnoredDuringExecution: - podAffinityTerm: labelSelector: matchLabels: agones.dev/role: gameserver topologyKey: kubernetes.io/hostname weight: 100 ``` We keep this affinity, but it becomes mostly a no-op after using a `nodeSelector`.

This PR does a couple of things towards #2777: * Enforce that on Autopilot the scheduling strategy is `Packed` * If the user does not provide a nodeSelector or tolerations in the PodSpec, we add: ``` nodeSelector: agones.dev/role: gameserver tolerations: - effect: NoSchedule key: agones.dev/role operator: Equal value: gameserver ``` This has the effect of splitting the game server pods off to their own nodes ala https://wdenniss.com/autopilot-workload-separation. If the user has either defined, we assume they know what they're doing and skip this. Note: In `Packed` we already set an affinity: ``` affinity: podAffinity: preferredDuringSchedulingIgnoredDuringExecution: - podAffinityTerm: labelSelector: matchLabels: agones.dev/role: gameserver topologyKey: kubernetes.io/hostname weight: 100 ``` We keep this affinity, but it becomes mostly a no-op after using a `nodeSelector`.

This PR does a couple of things towards googleforgames#2777: * Enforce that on Autopilot the scheduling strategy is `Packed` * If the user does not provide a nodeSelector or tolerations in the PodSpec, we add: ``` nodeSelector: agones.dev/role: gameserver tolerations: - effect: NoSchedule key: agones.dev/role operator: Equal value: gameserver ``` This has the effect of splitting the game server pods off to their own nodes ala https://wdenniss.com/autopilot-workload-separation. If the user has either defined, we assume they know what they're doing and skip this. Note: In `Packed` we already set an affinity: ``` affinity: podAffinity: preferredDuringSchedulingIgnoredDuringExecution: - podAffinityTerm: labelSelector: matchLabels: agones.dev/role: gameserver topologyKey: kubernetes.io/hostname weight: 100 ``` We keep this affinity, but it becomes mostly a no-op after using a `nodeSelector`.

Workload separation on Autopilot requires 0.5 vCPU per pod vs the 0.25 vCPU per pod normally required (see https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#compute-class-min-max vs https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#workload-separation) A lot of users have smaller pods. Revert the automatic use of workload separation. I will work internally to figure out how we might change this. For the time being, I will work towards how we might document this. It's relatively straightforward if a user needs to, as this block was just adding the equivalent of: ``` tolerations: - key: agones.dev/role operator: Equal value: gameserver effect: NoSchedule nodeSelector: agones.dev/role: gameserver ``` But with documentation, we can show other options as well, i.e. how to have workload separation between different gameserver workloads if desired. Towards #2777

Most of the "wait for state" functions use a hardcoded 5m. To allow for autoscaling variance on Autopilot, allow for 10m on Autopilot. In addition, the WaitForGameServerState function can understand which states are terminal to save some time in the case that a GameServer has already moved into a terminal state. Along the way: Remove New/NewWithRates as they are unused or unnecessary wrappers - in particular, they definitely don't need to be exported. Towards googleforgames#2777

…2893) Most of the "wait for state" functions use a hardcoded 5m. To allow for autoscaling variance on Autopilot, allow for 10m on Autopilot. In addition, the WaitForGameServerState function can understand which states are terminal to save some time in the case that a GameServer has already moved into a terminal state. Along the way: Remove New/NewWithRates as they are unused or unnecessary wrappers - in particular, they definitely don't need to be exported. Towards #2777

…rage On Autopilot, ephemeral storage defaults to a request/limit of 1GiB, but on Agones we assume there's no ephemeral storage limit. The default persistent log limit is 10GB. In this PR, we adds a default resources.{limits,requests} block to the controller if the user has not already specified `resources` and set it to the persistent log maximum plus a fudge factor of 100MB to allow for the certs and some overhead. By default this comes out to 10100MB, which slides under the Autopilot 10GiB limit by virtue of GiB/MB differences. Towards googleforgames#2777

…rage (#2900) On Autopilot, ephemeral storage defaults to a request/limit of 1GiB, but on Agones we assume there's no ephemeral storage limit. The default persistent log limit is 10GB. In this PR, we add a default resources.{limits,requests} block to the controller if the user has not already specified `resources` and set it to the persistent log maximum plus a fudge factor of 100MB to allow for the certs and some overhead. By default this comes out to 10100MB, which slides under the Autopilot 10GiB limit by virtue of GiB/MB differences. Towards #2777 Co-authored-by: Robert Bailey <[email protected]>

Adds a cloud product hook specific to the scheduling field, for use in the fleet / gameserverset APIs. This might seem narrowly scoped, but this field is the only one that propagates down outside the template. Modifies one e2e to test that Autopilot rejects appropriately. Adds several skips to other tests. Towards #2777

markmandel · 2023-03-07T21:41:47Z

@zmerlynn I think this can be closed now, yeah?

zmerlynn added the kind/feature New features for Agones label Oct 26, 2022

gongmax assigned zmerlynn and unassigned zmerlynn Oct 26, 2022

zmerlynn self-assigned this Oct 26, 2022

This was referenced Oct 28, 2022

Sync Pod host ports back to GameServer in GCP #2782

Merged

Split port allocators, implement Autopilot port allocation/policies #2789

Merged

This was referenced Nov 11, 2022

Immutable replicas field would allow PodDisruptionBudget on selected GameServer Pods #2806

Closed

GameServer: Implement (immutable) scale subresource, add pdb #2807

Merged

zmerlynn mentioned this issue Nov 18, 2022

Explicit flag for "can be disrupted": eviction.safe #2794

Closed

This was referenced Dec 1, 2022

Agones game server pods should use a non-preempting PriorityClass #2793

Closed

GKE Autopilot: Separate game server workloads #2840

Merged

Support smooth upgrade #2843

Closed

zmerlynn mentioned this issue Dec 16, 2022

Revert workload separation for Autopilot #2876

Merged

This was referenced Jan 12, 2023

controller/extensions: Add explicit requests/limits for ephemeral storage #2900

Merged

GKE Autopilot: Add terraform module, users #2912

Merged

zmerlynn mentioned this issue Jan 19, 2023

Add GKE Autopilot to e2e #2913

Merged

zmerlynn mentioned this issue Jan 26, 2023

Graduate SafeToEvict to Beta #2931

Closed

zmerlynn mentioned this issue Feb 23, 2023

Add integration guide for Google Cloud Managed Service for Prometheus #2990

Merged

zmerlynn closed this as completed Mar 7, 2023

Kalaiselvi84 added this to the 1.30.0 milestone Apr 10, 2023

zmerlynn mentioned this issue Feb 5, 2024

[FeatureRequest]: GameServer can run on aws fargate? #3633

Closed

zmerlynn mentioned this issue Mar 20, 2024

Support Passthrough on GKE Autopilot #3721

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Agones on GKE Autopilot #2777

Support Agones on GKE Autopilot #2777

zmerlynn commented Oct 26, 2022 •

edited

Loading

roberthbailey commented Oct 26, 2022

zmerlynn commented Oct 26, 2022 •

edited

Loading

markmandel commented Mar 7, 2023

Support Agones on GKE Autopilot #2777

Support Agones on GKE Autopilot #2777

Comments

zmerlynn commented Oct 26, 2022 • edited Loading

roberthbailey commented Oct 26, 2022

zmerlynn commented Oct 26, 2022 • edited Loading

markmandel commented Mar 7, 2023

zmerlynn commented Oct 26, 2022 •

edited

Loading

zmerlynn commented Oct 26, 2022 •

edited

Loading