
Explicit flag for "can be disrupted": eviction.safe #2794

Closed
zmerlynn opened this issue Nov 7, 2022 · 28 comments
Labels: kind/design (Proposal discussing new features / fixes and how they should be implemented), kind/feature (New features for Agones)

Comments

@zmerlynn
Collaborator

zmerlynn commented Nov 7, 2022

Overview

In Agones today, you need intimate knowledge of the sources of voluntary pod disruption on your cluster to configure the GameServer specification best for your game server - for example, whether Cluster Autoscaler scale-down may evict the pod, and how node upgrades and preemption drain it.

As long as you understand the complexity, you can today tune Agones to what you need. But some of the complexity changes between cloud products (GKE Autopilot is more opinionated than GKE Standard).

Here we propose an eviction field within the GameServer spec that declares whether your game server understands graceful termination, and controls whether it may be disrupted by autoscaling events and upgrades, or should be allowed to run to completion without disruption.

eviction.safe

We propose a field eviction in GameServer with a single sub-field safe, a string-typed enum with the options Always, Never, and OnUpgrade, like:

eviction:
  safe: Always

Each option uses a slightly different permutation of the cluster-autoscaler.kubernetes.io/safe-to-evict pod annotation and the agones.dev/safe-to-evict pod label (the latter controls whether the pod matches the restrictive PodDisruptionBudget).

As a quick reference:

| eviction.safe setting | safe-to-evict pod annotation | agones.dev/safe-to-evict label |
| --- | --- | --- |
| Never (default) | false | false (matches PDB) |
| OnUpgrade | false | true (does not match PDB) |
| Always | true | true (does not match PDB) |

safe: Never (the default)

Agones ensures the game server pod runs to completion by setting the safe-to-evict=false pod annotation (blocking Cluster Autoscaler eviction) and the agones.dev/safe-to-evict: "false" label, which matches the restrictive per-namespace PodDisruptionBudget (maxUnavailable: 0%) and so blocks evictions from upgrades and other voluntary node drains.
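A minimal sketch of the resulting pod metadata for safe: Never (names as in the table above; shown for illustration only):

```yaml
# Sketch: pod metadata Agones sets for `safe: Never` (the default).
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"   # blocks Cluster Autoscaler eviction
  labels:
    agones.dev/safe-to-evict: "false"                         # matches the restrictive PodDisruptionBudget (maxUnavailable: 0%)
```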

safe: OnUpgrade

Agones blocks Cluster Autoscaler-based evictions, but allows eviction from upgrades. On certain providers (GKE, for example), the graceful termination support for upgrades is much longer than is supported by Cluster Autoscaler (1h vs 10m), so supporting eviction for upgrades alone may work well.

The game server binary must be prepared to receive SIGTERM for evictions. If the default grace period of 30s is insufficient, you must tune terminationGracePeriodSeconds in the pod spec as well. For broad support across cloud providers, the game server must exit within 1h after receiving SIGTERM.
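As a sketch of the tuning described above (the container name and image are placeholders, and the grace period value is illustrative):

```yaml
# Sketch: a GameServer that tolerates upgrade evictions and needs up to 30m to drain.
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: example-gameserver
spec:
  eviction:
    safe: OnUpgrade
  template:
    spec:
      terminationGracePeriodSeconds: 1800   # raised from the 30s default
      containers:
      - name: game-server                   # placeholder container name
        image: example/game-server:latest   # placeholder image
```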

safe: Always

Allow evictions by both Cluster Autoscaler and upgrades. As with OnUpgrade, the game server binary must be prepared to receive SIGTERM. If the default grace period of 30s is insufficient, you must tune terminationGracePeriodSeconds in the pod spec as well.

By default, Cluster Autoscaler supports only 10m of graceful termination. For broad support across cloud providers, the game server must exit within 10m after receiving SIGTERM if running on a cluster where Cluster Autoscaler down-scaling is enabled (or on a product that implicitly enables it, like GKE Autopilot). As with OnUpgrade, one hour of graceful termination is broadly supported if the Autoscaler is not enabled (and in fact, if the Autoscaler is not enabled, safe: Always behaves the same as safe: OnUpgrade).

Backward Compatibility with safe-to-evict=true pod annotation; Existing overrides in template

We propose that, using a similar approach to #2754, for any case where Agones modifies the pod, we honor the provided GameServer.template in preference to anything Agones would set automatically - for example, if you set the agones.dev/safe-to-evict label in the template, Agones assumes you know better. We will warn when such conflicts occur, though, to make them obvious.

That said, one case we need to be particularly sensitive to is the case in #2754 itself: safe defaults to Never. Mixing the safe: Never GameServer default with the safe-to-evict=true pod annotation doesn't make sense: the inflexible PodDisruptionBudget prevents Cluster Autoscaler scale-down, even if safe-to-evict=true would allow it in other cases.

Prior to this proposal, the safe-to-evict pod annotation was the only documented Agones disruption control. To maintain backwards compatibility, we match the original intent of the safe-to-evict=true pod annotation, which is to allow graceful termination: if the safe-to-evict=true pod annotation is present in the template, we enable safe: Always in the GameServer.
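For example (a sketch of the backward-compatibility case), a pre-existing spec like the following would be defaulted to eviction.safe: Always:

```yaml
# Sketch: a legacy template carrying the documented annotation; Agones infers `safe: Always`.
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```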

We do not make a similar inference for terminationGracePeriodSeconds (despite it also being in our documentation) as it is unnecessary.

With this one piece of backward compatibility resolved, it’s safe to enable the safe: Never default without breaking existing workloads.

Feature gate: new SafeToEvict replaces LifecycleContract.

We propose a new feature gate SafeToEvict (which will replace LifecycleContract). Graduation to Beta should be relatively rapid as the feature has broad backward compatibility. The only situation where backward compatibility is violated is if the user is negatively impacted by a PDB on the game server pods, e.g. the user is relying on the default safe-to-evict=false pod annotation, but is also assuming that node drain (e.g. via cluster upgrade) will evict. Given that our existing documentation doesn't allow for node drains, we think this is a safe case for backwards incompatibility. To be conservative, though, when we graduate SafeToEvict to Beta, we will flag it as a potentially breaking change in the Release Notes to call attention to it. To allow node drains in this situation, the user simply needs to set eviction.safe: OnUpgrade.

Why not default to OnUpgrade then?

Despite the possibility of introducing a behavior change, we still think Never is the right default: OnUpgrade requires the game server to understand SIGTERM and, typically, to terminate within an hour.

Dependency on SDKGracefulTermination feature gate

Anything but safe: Never relies on the SDKGracefulTermination feature gate. If SDKGracefulTermination is disabled but the GameServer has safe: Always or safe: OnUpgrade (explicitly, or implicitly via backwards compatibility), we will warn but allow it.

In #2831, we proposed graduating SDKGracefulTermination to Beta, so this situation should only occur if SDKGracefulTermination is explicitly disabled and SafeToEvict is explicitly enabled, which seems like an odd choice.
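For completeness, that combination would have to be configured explicitly, e.g. via the Helm values used to toggle Agones feature gates (a sketch):

```yaml
# Sketch: explicitly disabling SDKGracefulTermination while enabling SafeToEvict,
# which triggers the warning described above.
agones:
  featureGates: "SDKGracefulTermination=false&SafeToEvict=true"
```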

@zmerlynn zmerlynn added the kind/feature New features for Agones label Nov 7, 2022
@roberthbailey
Member

In general, a disruptionTolerant game server will always require some time period for cleanup, e.g. a "run-to-completion" time for the pod (whether that is 10 seconds or 4 hours).

There is a big difference in how the system can treat the pod depending on what that cleanup time is, e.g. a game server that can clean up in 10 seconds could run on a spot VM whereas one that needs more than 10 minutes cannot be marked as safeToEvict (unless the cluster autoscaler is configured with custom flags).
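(For reference, the 10-minute ceiling referred to here is the open-source Cluster Autoscaler's --max-graceful-termination-sec flag, which defaults to 600 seconds. A sketch of raising it on a self-managed autoscaler deployment, where the flags are under your control:)

```yaml
# Sketch: raising the Cluster Autoscaler's eviction grace ceiling on a self-managed deployment.
# Managed products (e.g. GKE Autopilot) generally do not expose this flag.
spec:
  containers:
  - name: cluster-autoscaler
    command:
    - ./cluster-autoscaler
    - --max-graceful-termination-sec=3600   # default is 600 (10 minutes)
```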

Are you proposing that the disruptionTolerant flag contain this time? And if so, which policy or policies might we set based on the value?

@zmerlynn
Collaborator Author

zmerlynn commented Nov 9, 2022

Maybe bad idea: use terminationGracePeriodSeconds but at the game server level? Then we could differentiate based on cloud product how to handle the particular setting best (and whether it's even supported on that cloud product) and change policies to match. So e.g. if terminationGracePeriodSeconds was <10m, we would assume it's evictable by normal means. With 10m<x<1h we might do something else (given the CA termination threshold), etc.

Basically, if the game server had a way to express "I would like to stay running for X time", I feel like we could have policies flow from that rather than having people manually discover the right policies for their particular workload / product. The name is more of a bikeshed discussion. WDYT?

@roberthbailey
Member

One wrinkle we need to consider about the "I would like to stay running for X time" case is that a game server configuration probably wants to express that time as time once the server has been allocated (e.g. ready servers should be able to be reaped by the system since they have no active players on them). So it would be something like, "once I'm allocated and have active players, I need to keep running for at least 30 minutes to let them finish their session before I can be disrupted."

Right now this can be achieved by setting the graceful termination period to 30 minutes, but that only handles some types of voluntary disruptions and not others (e.g. the autoscaler evictions that override the graceful termination value and set it to 10 minutes).

Getting this right likely requires some cooperation from the game server itself though, e.g. to exit quickly if it is not allocated even if it has a long graceful termination period set or to exit voluntarily after a session finishes before the time it wants to stay running expires.

@zmerlynn
Collaborator Author

zmerlynn commented Nov 14, 2022

You're right that I was very optimistic about how easy this was to plug into the existing lifecycle. It seems like we would probably need to build a field that has some indication of whether the game server itself knows anything about graceful termination, and maybe build off there.

One thing I note is that the SDK seems to have no control over the Shutdown lifecycle. If it did, we could actually track some of this on the SDK sidecar instead of the game server. For example, even if a game server had no native support for graceful termination but did know about reusing, if the metadata expressed e.g. minimumRuntimeAfterAllocated, we could manage the lifecycle from the SDK sidecar by having it disallow re-use after there was a SIGTERM. (Presumably one approach today would just be to return an error on Ready()?)

Let me spend a little more time with this and try to figure out something reasonable.

@zmerlynn zmerlynn self-assigned this Nov 14, 2022
@zmerlynn zmerlynn changed the title Explicit flag for "can be disrupted" Explicit flag for "can be disrupted": Lifecycle Contracts Nov 18, 2022
@zmerlynn zmerlynn added the kind/design Proposal discussing new features / fixes and how they should be implemented label Nov 18, 2022
@zmerlynn
Collaborator Author

@markmandel @roberthbailey Full proposal in top comment, PTAL!

@markmandel
Member

As long as you understand the complexity, you can today tune Agones to what you need. But some of the complexity changes between cloud providers

Also on-prem. All this stuff is tuneable for custom installations -- so we'd also need to account for that. So if a user has tuned their gke autoscaler to be longer than 10m, how do we let this system know? Helm config? 😄

Implementation-wise, in-container reuse works by the game server calling SDK.Ready(), setting the GameServer back to Ready, whereas between-container reuse works by the game server calling SDK.Shutdown(), exiting the container, then the next container instance calling SDK.Ready

🤔 I feel a bit.... icky? about making Shutdown() do different things, depending on policy here. Feels like a lot of magic. I'd rather we were explicit.

This also takes away the ability for the GameServer to explicitly shut itself down, if it actually determines it should shut down properly rather than restart. Which ties into "GameServers in allocated state can respond to fleet updating" #2682, in which on an update a GameServer would likely choose to shut down before its full session length is reached.

Maybe we add an SDK.Restart()? To be explicit about when we want to indicate that a process is going to restart? (Really, it just moves the GameServer back to Scheduled, and the rest can be taken care of by the process.)

If we do that, I don't think we need a betweenSessionGraceSeconds ? Not quite sure why we need this actually? I'm assuming there is a controller that will be keeping an eye on how long GameServers have been running and what their allotted reusePodForSeconds is - and if they could never fulfill that contract, shutting them down? (also we could triple check this at allocation time as well).

Not sure we can find a solution for this one (or should, for this design), but something to think about, having had conversations around "shut this down only after n sessions have occurred, or when there are no players connected after y seconds". But that can be done down the line, and is probably tied to #2716 (and also there's no graceful termination for 100 players going to zero).

@zmerlynn
Collaborator Author

zmerlynn commented Nov 20, 2022

🤔 I feel a bit.... icky? about making Shutdown() do different things, depending on policy here. Feels like a lot of magic. I'd rather we were explicit.

Hmm. I think it depends on how we see the contract. IMO, assuming we weren't in the middle of a fleet scale-down (which I agree I didn't discuss, but I think it can be covered), there is no contract difference today between giving a gameserver the same pod to reuse (by magically transitioning to Scheduled) vs today, where we shut down the pod. The interactions above that are all really Agones control plane concern - one way to think about it is just to change the language: "between-container pod reuse" can also be thought of as "fast GameServer rescheduling". It's just a control plane feature, as there should be no functional difference to the game server binary. Does that make sense?

Regardless of talk of contracts, though, this feature is still opt-in. It requires the ExitAfterShutdown promise - anyone reading the docs who is uncomfortable with pod reuse can either not provide the promise, or can explicitly set reusePodForSeconds: 0. With default (empty message) settings, we will not reuse the pod.

If we do that, I don't think we need a betweenSessionGraceSeconds? Not quite sure why we need this actually? I'm assuming there is a controller that will be keeping an eye on how long GameServers have been running and what their allotted reusePodForSeconds is - and if they could never fulfill that contract, shutting them down? (also we could triple check this at allocation time as well).

This setting is designed to allow for a bit of time to call Ready() after the container restarts, and block re-use if we don't have that time. Imagine:

  • we had reusePodForSeconds: 3600 (1h)
  • and sessionSeconds: 600 (10m)
  • but the game server binary takes about 3m to reinitialize (sounds absurd, but maybe it's a lot of data)
  • and the previous call to Shutdown happens at 48m into the pod's life

In that scenario, your game server would spend 2m reinitializing, only to be shut down at 50m with initialization still remaining to do. Now you've wasted 2m/50m (4%) of the pod's lifetime.

This is an advanced setting and I only expect to need to change it in specific circumstances - one aspect of the API design I'm still kind of toying with is how to move this setting to a different message to make it clearer you probably don't need to care.

@markmandel
Member

Hmm. I think it depends on how we see the contract. IMO, assuming we weren't in the middle of a fleet scale-down (which I agree I didn't discuss, but I think it can be covered), there is no contract difference today between giving a gameserver the same pod to reuse (by magically transitioning to Scheduled) vs today, where we shut down the pod

Right, but in your scenario we are taking away the option of immediate, actual shutdown - which in many circumstances may be the thing that the end user needs to do (a Fleet update is just one example).

I'd rather give the end user the explicit option than make the documentation a series of if/else statements for a single function depending on which promise you make when and where. It's far easier for an end user to reason about, and also easier for us to document, maintain and test. (Maybe SDK.ReUse()? SDK.Recycle()? SDK.Scheduled() is better?)

This setting is designed to allow for a bit of time to call Ready() after the container restarts, and block re-use if we don't have that time.

How do we force people to call SDK.Ready() within that window? I don't think we can, so the setting is relatively moot - and we definitely shouldn't be doing it for them.

@markmandel
Member

I was thinking about this some more on my dog walk this morning 😁 I think I know why this feels so complicated to me, and I think it's because we've conflated two things that I don't think need to be joined together:

  1. GameServer Disruption
  2. GameServer reuse (pod, container, process)

And I think for the sake of this design, we actually only need to care about No. 1, and more discussion of No. 2 should probably open back up in #2781.

This would be my suggestion - to keep things simple, as an end user what I care about is (ignoring implementation details):

  1. Configure if my GameServer is disruptable or not. If it's not, it doesn't get disrupted by anything (what we have now)
  2. If my GameServer can be disrupted, being able to configure and potentially validate the time I have in my process to finalise things for the platform I'm running on before the GameServer is forced to shut down.
  3. A signal that lets my gameserver process know as soon as it is being disrupted.
  4. Documentation that explains different disruption scenarios and the limits therein (probably on a per-provider and use-case basis) of when disruption can occur, or where one might want it to.

None of this actually cares about whether the GameServer is being reused or not -- that's entirely up to the end user to decide and manage. They know when they have been disrupted (most likely via a SIGTERM signal), and they have configured how long they have left. The onus is on them to manage it correctly themselves.

I think the idea of promises and reusePodForSeconds etc. is trying to hand-hold way more than is actually necessary - yes, we likely want to abstract away the part of "how long do you have after disruption", but in this model Agones doesn't even need to know how long a session could be. In fact this approach is better, as any long-lived server for lobbies or MMO-style games doesn't have a finite time attached to it.

roberthbailey pushed a commit that referenced this issue Nov 29, 2022
… pdb (#2807)

This PR, under the `LifecycleContract` feature gate suggested in #2794

* Adds a scale resource to `GameServer` by adding an
`immutableReplicas` field, which has a `default`, `min` and `max` of 1.
Having a `scale` subresource lets us define a `PodDisruptionBudget`
that can be set to `maxUnavailable: 0%` [1].

* Adds a PDB per namespace with label selector
`agones.dev/safe-to-evict: "false"`, which nothing yet adds.

* Adds a mechanism to get feature gate values in Helm by using:
  {{- $featureGates := include "agones.featureGates" . | fromYaml }}

* Cleanup / documentation of feature gate mechanisms

After this PR, it's possible to define a fleet with the label and have
all `GameServer` pods protected by a `PodDisruptionBudget`, e.g.:

```
$ kubectl scale fleet/fleet-example --replicas=5
fleet.agones.dev/fleet-example scaled
$ kubectl describe pdb
Name:             agones-gameserver-safe-to-evict-false
Namespace:        default
Max unavailable:  0%
Selector:         agones.dev/safe-to-evict=false
Status:
    Allowed disruptions:  0
    Current:              4
    Desired:              5
    Total:                5
Events:                   <none>
```

Additionally, because min/max/default are 1, Kubernetes enforces the
immutability for us:

```
$ kubectl scale gs/fleet-example-k6dfs-6m5nq --replicas=1
gameserver.agones.dev/fleet-example-k6dfs-6m5nq scaled
$ kubectl scale gs/fleet-example-k6dfs-6m5nq --replicas=2
The GameServer "fleet-example-k6dfs-6m5nq" is invalid: spec.immutableReplicas: Invalid value: 2: spec.immutableReplicas in body should be less than or equal to 1
$ kubectl scale gs/fleet-example-k6dfs-6m5nq --replicas=0
The GameServer "fleet-example-k6dfs-6m5nq" is invalid: spec.immutableReplicas: Invalid value: 0: spec.immutableReplicas in body should be greater than or equal to 1
```

The only artifact of this addition is a new field in the Spec/Status
named `immutableReplicas`, in the Kubernetes object. This field is not
present in the in-memory representation for `GameServer`, nor is it
present in `etcd` (by defaulting rules). The field is visible on
`describe` or `get -oyaml`, but is otherwise ignored.

[1] https://kubernetes.io/docs/tasks/run-application/configure-pdb/#identify-an-application-to-protect
@zmerlynn zmerlynn changed the title Explicit flag for "can be disrupted": Lifecycle Contracts Explicit flag for "can be disrupted": safeToEvict Nov 30, 2022
@zmerlynn
Collaborator Author

@markmandel Updated the proposal and stripped it back to the original intent: a single flag to control whether the game server can be gracefully terminated or needs to run to completion. I'll re-open #2781 and iterate there on discussion of pod reuse. For posterity, the old proposal is in the first comment, last modified 2022/11/18.

@zmerlynn
Collaborator Author

zmerlynn commented Dec 2, 2022

Updated the proposal again after I realized my understanding of pod preemption was wrong. See #2793 (comment).

@zmerlynn
Collaborator Author

zmerlynn commented Dec 5, 2022

Updated the feature gate section to discuss the behavior change and document how we'll turn off the PDB as necessary.

@markmandel
Member

  • Note: For broad support across cloud providers, the game server must exit within 10m after receiving SIGTERM if running on a cluster where Cluster Autoscaler down-scaling is enabled (or running on a product that implicitly does, like GKE Autopilot). One hour of graceful termination is broadly supported if Autoscaler is not enabled.

Given that only recently has someone asked us to allow the cluster autoscaler to evict running GameServers - I'm wondering if we should optimise for disabling that feature, and only allow it if people opt in?

Right now, it doesn't seem possible to have graceful termination but not let the cluster autoscaler scale down instances - is that by design?

@zmerlynn
Collaborator Author

zmerlynn commented Dec 6, 2022

Given that only recently has someone asked us to allow the cluster autoscaler to evict running GameServers - I'm wondering if we should optimise for disabling that feature, and only allow it if people opt in?

That's what the default of SafeToEvict: "false" does, if I understand your question?

Right now, it doesn't seem possible to have graceful termination but not let the cluster autoscaler scale down instances - is that by design?

If the user specifies the pod annotation safe-to-evict=true, we infer they want SafeToEvict: "true", yes, and that they don't want an overly restrictive PDB (both the PDB and safe-to-evict=false would block the autoscaler).

Outside of voluntary disruption by the autoscaler, cluster upgrade and preemption are the primary other sources. So I think you're saying, is there a way to block the autoscaler but allow graceful termination in other circumstances? If SafeToEvict is "true" we mark safe-to-evict=true and don't use a restrictive PDB, but we also have a design maxim that says the user is allowed to override our "automatic" policy. So the following:

spec:
  safeToEvict: "true"
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

looks really weird, but it says "game server supports graceful termination, but I just don't want Cluster Autoscaler to evict it". It might be a reasonable configuration for e.g. allowing upgrades but not scaledown.

@zmerlynn
Collaborator Author

zmerlynn commented Dec 7, 2022

One gotcha I'll call out working through the code: currently, we only apply safe-to-evict=false when the scheduling profile is Packed. I'm going to propose dropping this distinction for combinatoric-explosion reasons - I don't think the scheduling policy has a drastic effect here, and the annotation does nothing if the autoscaler is not enabled. The net effect is that even clusters with a scheduling policy of Distributed would end up, with default settings, with an annotation that is probably irrelevant and a restrictive PDB. The latter is probably correct, since if they can't support graceful termination, we need to potentially block upgrades.

@zmerlynn
Collaborator Author

zmerlynn commented Dec 8, 2022

Updated the proposal after our meeting today. It is now a 3-way enum instead of true/false and more easily captures the concept of "upgrades only". PTAL.

zmerlynn added a commit to zmerlynn/agones that referenced this issue Dec 8, 2022
Per googleforgames#2794, rename `LifecycleContract` to `SafeToEvict`.
zmerlynn added a commit that referenced this issue Dec 8, 2022
* SDKGracefulTermination: Promote to beta

Follows checklist in features.go.

Closes #2831

* Rename `LifecycleContract` feature gate to `SafeToEvict`

Per #2794, rename `LifecycleContract` to `SafeToEvict`.
@markmandel
Member

This LGTM!

Only question I had as a follow up:

If I set UpgradesOnly or Always - Should this mechanism validate that I have a terminationGracePeriodSeconds on at least one of my Pods?

@zmerlynn
Collaborator Author

zmerlynn commented Dec 8, 2022

This LGTM!

huzzah

If I set UpgradesOnly or Always - Should this mechanism validate that I have a terminationGracePeriodSeconds on at least one of my Pods?

I think this is a good case for a webhook warning. We currently don't use the webhook warning response, but we could. That said, there are limited circumstances when the default terminationGracePeriodSeconds of 30s may be applicable - e.g. sufficient checkpointing, or maybe it's a relay server that can afford to be bounced because it knows how to send a client redirect or something. So it's a little nannyish, too.

@markmandel
Member

webhook warning.

What is a "warning" from a webhook? 😄 I don't know this one!

We could also add it at a later date if we find people are footgunning themselves.

SGTM!

(I still wish I could work out a better name than safeToEvict, but I'm fresh out of better ideas 😄)

@markmandel
Member

markmandel commented Dec 8, 2022

Oooh, I'll give you one extra idea for potential configuration:

apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  name: "xonotic"
spec:
  eviction:
    safe: Always
  ports:
    - name: default
      containerPort: 26000
  template:
    spec:
      containers:
      - name: xonotic
        image: us-docker.pkg.dev/agones-images/examples:0.9

Not sold on "safety" as the sub key - but this feels better to me, and also allows us to expand the eviction configuration over time. WDYT?

Edit: changed safety to safe - I like that better.

@zmerlynn
Collaborator Author

zmerlynn commented Dec 8, 2022

What is a "warning" from a webhook? 😄 I don't know this one!

Check out https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#response (and ^F for "warning") - it's relatively new. kubectl has good support for warnings, so they appear on the command line. The Autopilot policy enforcement actually uses them extensively to warn about changes it made.
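A minimal sketch of what such a response could carry (shown as YAML for readability; the warning text is hypothetical):

```yaml
# Sketch: an AdmissionReview response that admits the object but surfaces a warning,
# which kubectl prints on the client.
apiVersion: admission.k8s.io/v1
kind: AdmissionReview
response:
  uid: "<uid copied from the request>"
  allowed: true
  warnings:
  - "eviction.safe is set but terminationGracePeriodSeconds is not; the 30s default may be too short"
```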

@zmerlynn
Collaborator Author

zmerlynn commented Dec 8, 2022

I like

eviction:
  safe: Blah

I can roll with that.

@zmerlynn
Collaborator Author

zmerlynn commented Dec 9, 2022

Updated the proposal to use the eviction envelope. Implementation in #2857.

@roberthbailey
Member

@markmandel

...but this feels better to me, and also allows us to expand the eviction configuration over time. WDYT?

What other sub-fields do you imagine being under eviction in the future?

@zmerlynn
Collaborator Author

zmerlynn commented Dec 9, 2022

What other sub-fields do you imagine being under eviction in the future?

I'm not clear exactly what we need to future-proof either, but there is a general philosophy to use an object/message even when you think all you need is a single field (it's sort of the multi-variate form of "use an enum instead of a bool").

One thing for sure is that it gives us a place to offer an alternate representation, like if we (much later) decide the enum was a bad idea, it may be simple enough to also allow the list, i.e.

now:

eviction:
  safe: Always

then:

eviction:
  safeFor:
    - upgrades
    - autoscaler

and offer magic translation between them.

@markmandel
Member

I feel like if we wanted to add a generic terminationGracePeriodSeconds-style time or a session length, this would be a good spot for it to go.

Top level singular fields always worry me, because there is no space to expand if you want to add extra options down the line.

@roberthbailey
Member

Thanks for explaining.

eviction:
  safe: Blah

LGTM

chiayi pushed a commit to chiayi/agones that referenced this issue Dec 13, 2022
…mes#2849)

* SDKGracefulTermination: Promote to beta

Follows checklist in features.go.

Closes googleforgames#2831

* Rename `LifecycleContract` feature gate to `SafeToEvict`

Per googleforgames#2794, rename `LifecycleContract` to `SafeToEvict`.
roberthbailey pushed a commit that referenced this issue Jan 5, 2023
…ok (#2857)

* SafeToEvict feature: Implement the `eviction.safe` API logic

Towards #2794: This commit implements the defaulting/validation for
the eviction.safe field, under the SafeToEvict feature gate. In
particular, it enshrines the defaulting logic for the somewhat
complicated backwards compatibility case (safe-to-evict=true is set
but SafeToEvict is not).

Along the way, refactor a very repetitive unit test table.

* Ran gen-api-docs

* Implement SafeToEvict enforcement of safe-to-evict annotations/label

This commit adds the interpretation of GameServer.Status.Eviction:

* Following the pattern in #2840, moves the enforcement of
cloudproduct specific things up to the controller, adding a new cloudproduct
SetEviction hook to handle it.
  * As a result, when the SafeToEvict feature flag is on, we disable annotation
    in gameserver.go and test that we don't set anything (as a change detection test).

* Adds a generic and GKE Autopilot version of how to set policy from the
eviction.safe status. In Autopilot, we skip using safe-to-evict=false.
@zmerlynn zmerlynn changed the title Explicit flag for "can be disrupted": safeToEvict Explicit flag for "can be disrupted": eviction.safe Jan 24, 2023
@zmerlynn
Collaborator Author

zmerlynn commented May 9, 2023

This is implemented, see https://agones.dev/site/docs/advanced/controlling-disruption/

@zmerlynn zmerlynn closed this as completed May 9, 2023