Explicit flag for "can be disrupted": eviction.safe #2794
Comments
In general, a …

There is a big difference in how the system can treat the pod depending on what that cleanup time is, e.g. a game server that can clean up in 10 seconds could run on a spot VM whereas one that needs more than 10 minutes cannot be marked as ….

Are you proposing that the …?
Maybe bad idea: use …

Basically, if the game server had a way to express "I would like to stay running for X time", I feel like we could have policies flow from that rather than having people manually discover the right policies for their particular workload / product. The name is more of a bikeshed discussion. WDYT?
One wrinkle we need to consider about the "I would like to stay running for X time" case is that a game server configuration probably wants to express that time as time once the server has been allocated (e.g. ready servers should be able to be reaped by the system since they have no active players on them). So it would be something like, "once I'm allocated and have active players, I need to keep running for at least 30 minutes to let them finish their session before I can be disrupted." Right now this can be achieved by setting the graceful termination period to 30 minutes, but that only handles some types of voluntary disruptions and not others (e.g. the autoscaler evictions that override the graceful termination value and set it to 10 minutes). Getting this right likely requires some cooperation from the game server itself though, e.g. to exit quickly if it is not allocated even if it has a long graceful termination period set or to exit voluntarily after a session finishes before the time it wants to stay running expires.
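For reference, setting that 30 minute graceful termination period today is a pod-template tweak; a minimal sketch, with an illustrative name, port and image:

```
apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  generateName: "example-"
spec:
  ports:
    - name: default
      containerPort: 26000
  template:
    spec:
      # Graceful termination window covering the expected session length (30 minutes).
      terminationGracePeriodSeconds: 1800
      containers:
        - name: example
          image: example/game-server:0.1
```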
You're right that I was very optimistic about how easy this was to plug into the existing lifecycle. It seems like we would probably need to build a field that has some indication of whether the game server itself knows anything about graceful termination, and maybe build off there.

One thing I note is that the SDK seems to have no control over the ….

Let me spend a little more time with this and try to figure out something reasonable.
@markmandel @roberthbailey Full proposal in top comment, PTAL!
Also on-prem. All this stuff is tuneable for custom installations -- so we'd also need to account for that. So if a user has tuned their GKE autoscaler to be longer than 10m, how do we let this system know? Helm config? 😄
🤔 I feel a bit.... icky? about making Shutdown() do different things depending on policy here. Feels like a lot of magic. I'd rather we were explicit. This also takes away the ability for the GameServer to explicitly shut itself down, if it actually determines it should shut down properly rather than restart. Which ties into GameServers in allocated state can respond to fleet updating #2682, in which on an update a GameServer would likely choose to shut down before its full session length is required. Maybe we add a …. If we do that, I don't think we need a ….

Not sure we can find a solution for this one (or should for this design), but something to think about, having had conversations around "shut this down only after n sessions have occurred or there are no players connected after y seconds". But that can be done down the line, and probably tied to #2716 (and also there's no graceful termination for 100 players going to zero).
Hmm. I think it depends on how we see the contract. IMO, assuming we weren't in the middle of a fleet scale-down (which I agree I didn't discuss, but I think it can be covered), there is no contract difference today between giving a gameserver the same pod to reuse (by magically transitioning to …) and ….

Regardless of talks of contracts, though, this feature is still opt-in. It requires the …
This setting is designed to allow for a bit of time to call …
This is an advanced setting and I only expect to need to change it in specific circumstances - one aspect of the API design I'm still kind of toying with is how to move this setting to a different message to make it clearer you probably don't need to care.
Right, but in your scenario we are taking away the option of immediate, actual shutdown - which in many circumstances may be the thing that the end user needs to do (a Fleet update is just one example). I'd rather give the end user the explicit option than make the documentation a series of if/else statements for a singular function depending on which promise you make when and where. It's far easier for an end user to reason about, and also easier for us to document, maintain and test. (Maybe …
How do we force people to call …?
I was thinking about this some more on my dog walk this morning 😁 I think I know why this feels so complicated to me, and I think it's because we've conflated two things that I don't think need to be joined together:
And I think for the sake of this design, we actually only need to care about No. 1, and more discussion of No. 2 should probably open back up in #2781. This would be my suggestion - to keep things simple, as an end user what I care about is (ignoring implementation details):
None of this actually cares about whether the GameServer is being reused or not -- that's entirely up to the end user to decide and manage. They know when they have been disrupted (most likely a SIGTERM signal), and they have configured how long they have left. The onus is on them to manage it correctly themselves. I think the idea of …
… pdb (#2807)

This PR, under the `LifecycleContract` feature gate suggested in #2794:

* Adds a scale resource to `GameServer` by adding an `immutableReplicas` field, which has a `default`, `min` and `max` of 1. Having a `scale` subresource lets us define a `PodDisruptionBudget` that can be set to `maxUnavailable: 0%` [1].
* Adds a PDB per namespace with label selector `agones.dev/safe-to-evict: "false"`, which nothing yet adds.
* Adds a mechanism to get feature gate values in Helm by using: `{{- $featureGates := include "agones.featureGates" . | fromYaml }}`
* Cleanup / documentation of feature gate mechanisms

After this PR, it's possible to define a fleet with the label and have all `GameServer` pods protected by a `PodDisruptionBudget`, e.g.:

```
$ kubectl scale fleet/fleet-example --replicas=5
fleet.agones.dev/fleet-example scaled

$ kubectl describe pdb
Name:             agones-gameserver-safe-to-evict-false
Namespace:        default
Max unavailable:  0%
Selector:         agones.dev/safe-to-evict=false
Status:
    Allowed disruptions:  0
    Current:              4
    Desired:              5
    Total:                5
Events:                   <none>
```

Additionally, because min/max/default are 1, Kubernetes enforces the immutability for us:

```
$ kubectl scale gs/fleet-example-k6dfs-6m5nq --replicas=1
gameserver.agones.dev/fleet-example-k6dfs-6m5nq scaled

$ kubectl scale gs/fleet-example-k6dfs-6m5nq --replicas=2
The GameServer "fleet-example-k6dfs-6m5nq" is invalid: spec.immutableReplicas: Invalid value: 2: spec.immutableReplicas in body should be less than or equal to 1

$ kubectl scale gs/fleet-example-k6dfs-6m5nq --replicas=0
The GameServer "fleet-example-k6dfs-6m5nq" is invalid: spec.immutableReplicas: Invalid value: 0: spec.immutableReplicas in body should be greater than or equal to 1
```

The only artifact of this addition is a new field in the Spec/Status named `immutableReplicas`, in the Kubernetes object. This field is not present in the in-memory representation for `GameServer`, nor is it present in `etcd` (by defaulting rules). The field is visible on `describe` or `get -oyaml`, but is otherwise ignored.

[1] https://kubernetes.io/docs/tasks/run-application/configure-pdb/#identify-an-application-to-protect
@markmandel Updated the proposal and stripped it back to the original intent: a single flag to control whether the game server can be gracefully terminated or needs to run to completion. I'll re-open #2781 and iterate there on discussion of pod reuse. For posterity, the old proposal is in the first comment, last modified 2022/11/18.
Updated the proposal again after I realized my understanding of pod preemption was wrong. See #2793 (comment).
Updated the feature gate section to discuss the behavior change and document how we'll turn off the PDB as necessary.
Given that only recently has someone asked us to allow the cluster autoscaler to evict running GameServers - I'm wondering if we should optimise for disabling that feature, and only allow it if people opt in? Right now, it doesn't seem possible to have graceful termination, but not let the cluster autoscaler scale down instances - is that by design?
That's what the default of …
If the user specifies the pod annotation …

Outside of voluntary disruption by the autoscaler, cluster upgrade and preemption are the primary other sources. So I think you're saying, is there a way to block the autoscaler but allow graceful termination in other circumstances? If

…

looks really weird, but it says "game server supports graceful termination, but I just don't want Cluster Autoscaler to evict it". It might be a reasonable configuration for e.g. allowing upgrades but not scaledown.
One gotcha I'll call out working through the code: Currently, we only apply …
Updated the proposal after our meeting today. It is now a 3-way enum instead of true/false and more easily captures the concept of "upgrades only". PTAL.
Per googleforgames#2794, rename `LifecycleContract` to `SafeToEvict`.
This LGTM! Only question I had as a follow up: If I set …
I think this is a good case for a webhook warning. We currently don't use the webhook warning response, but we could. That said, there are limited circumstances when the default …
What is a "warning" from a webhook? 😄 I don't know this one! We could also add it at a later date if we find people are footgunning themselves. SGTM! (I still wish I could work out a better name than ….)
Oooh, I'll give you one extra idea for potential configuration:

```
apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  name: "xonotic"
spec:
  eviction:
    safe: Always
  ports:
    - name: default
      containerPort: 26000
  template:
    spec:
      containers:
        - name: xonotic
          image: us-docker.pkg.dev/agones-images/examples:0.9
```

Not sold on "safety" as the sub ….

Edit: changed …
Check out https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#response (and ^F for warning) - it's relatively new.
I like …
I can roll with that.
Updated the proposal to use the …
What other sub-fields do you imagine being under eviction in the future?
I'm not clear exactly what we need to future-proof either, but there is a general philosophy to use an object/message even when you think all you need is a single field (it's sort of the multi-variate form of "use an enum instead of a bool"). One thing for sure is that it gives us a place to offer an alternate representation, like if we (much later) decide the enum was a bad idea, it may be simple enough to also allow the list, i.e. now:

…

then:

…

and offer magic translation between them.
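For illustration only, a hypothetical sketch of the two representations being contrasted - the list form and its `allowed` field name are invented here, not part of the proposal:

```
# now: the proposed enum form
eviction:
  safe: OnUpgrade
---
# then: a hypothetical future list form
eviction:
  allowed:
    - Upgrade
```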
I feel like if we wanted to add a generic field - a terminationGracePeriodSeconds-style time, or session length - this would be a good spot for it to go. Top level singular fields always worry me, because there is no space to expand if you want to add extra options down the line.
Thanks for explaining.
LGTM
…mes#2849)

* SDKGracefulTermination: Promote to beta

  Follows checklist in features.go. Closes googleforgames#2831

* Rename `LifecycleContract` feature gate to `SafeToEvict`

  Per googleforgames#2794, rename `LifecycleContract` to `SafeToEvict`.
…ok (#2857)

* SafeToEvict feature: Implement the `eviction.safe` API logic

  Towards #2794: This commit implements the defaulting/validation for the eviction.safe field, under the SafeToEvict feature gate. In particular, it enshrines the defaulting logic for the somewhat complicated backwards compatibility case (safe-to-evict=true is set but SafeToEvict is not). Along the way, refactor a very repetitive unit test table.

* Ran gen-api-docs

* Implement SafeToEvict enforcement of safe-to-evict annotations/label

  This commit adds the interpretation of GameServer.Status.Eviction:

  * Following the pattern in #2840, moves the enforcement of cloudproduct specific things up to the controller, adding a new cloudproduct SetEviction hook to handle it.
  * As a result, when the SafeToEvict feature flag is on, we disable annotation in gameserver.go and test that we don't set anything (as a change detection test).
  * Adds a generic and GKE Autopilot version of how to set policy from the eviction.safe status. In Autopilot, we skip using safe-to-evict=false.
This is implemented, see https://agones.dev/site/docs/advanced/controlling-disruption/
Overview
In Agones today, you need intimate knowledge of sources of voluntary pod disruption on your cluster to understand how to configure the `GameServer` specification best for your game server. For example:

* Graceful termination (`terminationGracePeriodSeconds`) is allowed by all sources of voluntary disruption, but by default, Cluster Autoscaler only permits 10m of graceful termination (and GKE does not allow this to be tuned). Additionally, pod graceful termination requires cooperation of the pod, as it needs to intercept SIGTERM (or it must define a preStop hook).
* The `safe-to-evict=false` pod annotation influences the behavior of Cluster Autoscaler, but no other sources of voluntary disruption - and this flag is blocked on GKE Autopilot, as mentioned in Support Agones on GKE Autopilot #2777.
* A `PodDisruptionBudget` (see GameServer: Implement (immutable) scale subresource, add pdb #2807) as a way to express "do not disrupt" for game server pods. This is roughly the equivalent of the `safe-to-evict=false` pod annotation, but works for anything that uses the standard eviction libraries (including GKE node upgrades, for up to an hour per node). However, `PodDisruptionBudget` may be violated by pod preemption.

As long as you understand the complexity, you can today tune Agones to what you need. But some of the complexity changes between cloud products (GKE Autopilot is more opinionated than GKE Standard).
Here we propose a field `eviction` within the `GameServer` spec that controls whether your game server understands graceful termination, and whether it should be disrupted for autoscaling events, upgrades, or allowed to run to completion without disruption.

eviction.safe
We propose field `eviction` in `GameServer` with a single field `safe`, a string-typed enum with options `Always`, `Never`, `OnUpgrade`, like:
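As a sketch, using the field placement proposed here (the name, port and image are illustrative):

```
apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  name: example-gameserver
spec:
  eviction:
    safe: OnUpgrade   # one of Always, Never (default), OnUpgrade
  ports:
    - name: default
      containerPort: 26000
  template:
    spec:
      containers:
        - name: example
          image: example/game-server:0.1
```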
Each option uses a slightly different permutation of:
* the `safe-to-evict` annotation to block Cluster Autoscaler based eviction
* the `agones.dev/safe-to-evict` label selector to select the PDB from GameServer: Implement (immutable) scale subresource, add pdb #2807, which blocks Cluster Autoscaler and (for a limited time) disruption from node upgrades.

As a quick reference:
| `eviction.safe` | `safe-to-evict` pod annotation | `agones.dev/safe-to-evict` label |
| --- | --- | --- |
| `Never` (default) | `false` | `false` (matches PDB) |
| `OnUpgrade` | `false` | `true` (does not match PDB) |
| `Always` | `true` | `true` (does not match PDB) |

safe: Never (the default)

Agones ensures the game server pod runs to completion by:

* setting the `safe-to-evict=false` pod annotation, and
* setting the `agones.dev/safe-to-evict: "false"` label, which matches the `PodDisruptionBudget` from #2807.
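Concretely, this maps to pod metadata along these lines (assuming the standard Cluster Autoscaler annotation key):

```
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  labels:
    agones.dev/safe-to-evict: "false"   # matches the PodDisruptionBudget from #2807
```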
safe: OnUpgrade
Agones blocks Cluster Autoscaler based evictions, but allows eviction from upgrades. On certain providers (GKE, for example), the graceful termination support for upgrades is much longer than is supported by Cluster Autoscaler (1h vs 10m), so supporting eviction for upgrades alone may work well.

The game server binary must be prepared to receive `SIGTERM` for evictions. If the default grace period of 30s is insufficient, you must tune `terminationGracePeriodSeconds` in the pod spec as well. For broad support across cloud providers, the game server must exit within 1h after receiving `SIGTERM`.
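A sketch of that tuning, assuming the field placement proposed above (the grace period and image are illustrative):

```
spec:
  eviction:
    safe: OnUpgrade
  template:
    spec:
      # Allow up to an hour for sessions to drain after SIGTERM, staying within
      # the upgrade window described above.
      terminationGracePeriodSeconds: 3600
      containers:
        - name: example
          image: example/game-server:0.1
```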
safe: Always
Allow evictions by both Cluster Autoscaler and upgrades. As with `OnUpgrade`, the game server binary must be prepared to receive `SIGTERM`. If the default grace period of 30s is insufficient, you must tune `terminationGracePeriodSeconds` in the pod spec as well.

By default, Cluster Autoscaler supports only 10m of graceful termination. For broad support across cloud providers, the game server must exit within 10m after receiving `SIGTERM` if running on a cluster where Cluster Autoscaler down-scaling is enabled (or running on a product that implicitly does, like GKE Autopilot). As with `OnUpgrade`, one hour of graceful termination is broadly supported if Autoscaler is not enabled (and in fact, if Autoscaler is not enabled, `safe: Always` is the same as `safe: OnUpgrade`).

Backward Compatibility with safe-to-evict=true pod annotation; Existing overrides in template

We propose that, using a similar approach to #2754, for any case where Agones modifies the pod, we honor the provided `GameServer.template` in preference to anything Agones would set automatically - for example, if you set the `agones.dev/safe-to-evict` label in `template`, Agones assumes you know better. We will warn when such conflicts occur to make it obvious, though.

That said, one case we need to be particularly sensitive of is the case in #2754 itself: `safe` defaults to `Never`. Mixing the `safe: Never` `GameServer` default with the `safe-to-evict=true` pod annotation doesn't make sense: the inflexible `PodDisruptionBudget` prevents Cluster Autoscaler scale-down, even if `safe-to-evict=true` would allow it for other cases.

Prior to this proposal, the `safe-to-evict` pod annotation is the only documented Agones disruption control. To maintain backwards compatibility, we match the original intent of the `safe-to-evict=true` pod annotation, which is to allow graceful termination: if the `safe-to-evict=true` pod annotation is present in the template, we enable `safe: Always` in the `GameServer`.
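For illustration, a template carrying the existing annotation (assuming the standard Cluster Autoscaler annotation key; container name and image are illustrative), which under this rule would be treated as `safe: Always`:

```
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
        - name: example
          image: example/game-server:0.1
```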
We do not make a similar inference for `terminationGracePeriodSeconds` (despite it also being in our documentation) as it is unnecessary.

With this one piece of backward compatibility resolved, it's safe to enable the `safe: Never` default without breaking existing workloads.

Feature gate: new SafeToEvict replaces LifecycleContract

We propose a new feature gate `SafeToEvict` (which will replace `LifecycleContract`). Graduation to Beta should be relatively rapid as the feature has broad backward compatibility. The only situation where backward compatibility is violated is if the user is negatively impacted by a PDB on the game server pods, e.g. the user is relying on the default `safe-to-evict=false` pod annotation, but is also assuming that node drain (e.g. via cluster upgrade) will evict. Given that our existing documentation doesn't allow for node drains, we think this is a safe case for backwards incompatibility. To be conservative about it, though, when we graduate `SafeToEvict` to `Beta`, we will flag it as a potentially breaking change in Release Notes to call attention to it. To allow this situation, the user simply needs to set `eviction.safe: OnUpgrade`.
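For illustration, assuming the Helm feature gate mechanism referenced in #2807, enabling the gate would be a values override along these lines (exact key and syntax per the chart's `agones.featureGates` value):

```
agones:
  featureGates: "SafeToEvict=true"
```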
Why not default to OnUpgrade then?

Despite the possibility of introducing a behavior change, we still think `Never` is the right default. `OnUpgrade` requires the game server to understand `SIGTERM`, and typically to terminate within an hour.

Dependency on SDKGracefulTermination feature gate

Anything but `safe: Never` relies on the `SDKGracefulTermination` feature gate. If `SDKGracefulTermination` is disabled but the `GameServer` has `safe: Always` or `safe: OnUpgrade` (explicitly, or implicitly via backwards compatibility), we will warn but allow it.

In #2831, we proposed graduating `SDKGracefulTermination` to Beta, so this situation should only occur if `SDKGracefulTermination` is explicitly disabled and `SafeToEvict` is explicitly enabled, which seems like an odd choice.