Moving cluster to a new node pool doesn't recreate all fleets #398

Closed
KamiMay opened this issue Oct 24, 2018 · 18 comments · Fixed by #1279
Labels: area/user-experience, help wanted, kind/bug
Milestone: 1.4.0

KamiMay commented Oct 24, 2018

I noticed something weird today. I needed to swap node pools in GKE, so I created a new node pool and deleted the old one. I expected all instances from the old node pool to recover on the new one after some time. However, in my case I could only see 1 of the 3 servers on the Workloads page in the GCP console. I checked the fleets to confirm the minimum availability, which should be 1 of each kind = 3, and kubectl describe fleets reported that all 3 servers were online and available. Yet when I tried to connect to a server that was listed in the fleet but missing from Workloads, the connection failed; I could only connect to the one that appeared in Workloads. I had to delete the fleets and recreate them before everything appeared and worked correctly again.

markmandel (Member) commented:

I have a strong feeling this is because, if a Pod gets deleted, the backing GameServer is left in a zombie state (i.e. not deleted along with it).

We should implement functionality so that if a Pod gets removed, the owning GameServer is deleted too; see the sketch below. That should solve this issue.
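
A minimal sketch of what that could look like, assuming a client-go Pod informer and the Agones typed client of that era (the function and identifiers below are illustrative, not the actual Agones controller code):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"

	"agones.dev/agones/pkg/client/clientset/versioned"
	"github.com/sirupsen/logrus"
)

// registerPodDeleteHandler deletes the owning GameServer whenever its
// backing Pod is deleted, so no zombie GameServers are left behind.
// Illustrative sketch only; names and wiring are assumptions.
func registerPodDeleteHandler(factory informers.SharedInformerFactory, agonesClient versioned.Interface) {
	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			pod, ok := obj.(*corev1.Pod)
			if !ok {
				// Missed deletions are delivered as tombstones; unwrap them.
				tombstone, isTomb := obj.(cache.DeletedFinalStateUnknown)
				if !isTomb {
					return
				}
				if pod, ok = tombstone.Obj.(*corev1.Pod); !ok {
					return
				}
			}
			// Only act on Pods controlled by a GameServer.
			owner := metav1.GetControllerOf(pod)
			if owner == nil || owner.Kind != "GameServer" {
				return
			}
			err := agonesClient.StableV1alpha1().GameServers(pod.Namespace).Delete(owner.Name, nil)
			if err != nil && !k8serrors.IsNotFound(err) {
				logrus.WithError(err).Warn("could not delete owning GameServer")
			}
		},
	})
}
```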

markmandel added the kind/bug, area/user-experience, and help wanted labels Oct 24, 2018
markmandel (Member) commented:

Actually, I'm not sure what this is: I tested deleting the backing Pod of a GameServer, and the GameServer gets deleted. More investigation required!

KamiMay (Author) commented Nov 15, 2018

I think the best way to reproduce it is to follow the steps I took: the migration recreated one of the fleets but not the other two, so there may be a random factor involved. I'd suggest trying it with a few different fleets. It seems to be random; sometimes it happens, sometimes it doesn't.

KamiMay (Author) commented Nov 16, 2018

I will try to reproduce this issue today and provide all the relevant details along the way.

KamiMay (Author) commented Nov 16, 2018

Before migrating to a new node pool:

  • I have 3 instances running; all of them are 1 vCPU / 3.75 GB RAM.
  • I have 3 fleets running, each with 1 replica, plus the agones-controller and a matchmaker.
  • It looks like this on GCloud:
    (screenshot omitted)
  • At this point, when I run kubectl describe fleets I get the following:
Name:         deathmatch-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=deathmatch
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"deathmatch"},"name":"deathmatch-server","namespace":"d...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:14Z
  Generation:          1
  Resource Version:    9493539
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/deathmatch-server
  UID:                 b7fd3272-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  deathmatch
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     deathmatch-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  6m    fleet-controller  Created GameServerSet deathmatch-server-ktjv8
 
Name:         endless-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=endless
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"endless"},"name":"endless-server","namespace":"default...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:24:08Z
  Generation:          1
  Resource Version:    9493228
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/endless-server
  UID:                 6cf44e7c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  endless
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     endless-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  8m    fleet-controller  Created GameServerSet endless-server-vjb45
  
Name:         royale-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=royale
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"royale"},"name":"royale-server","namespace":"default"}...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:18Z
  Generation:          1
  Resource Version:    9493571
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/royale-server
  UID:                 baad690c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  royale
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     royale-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  6m    fleet-controller  Created GameServerSet royale-server-28ft9
  • I added a new node pool to the cluster with the same spec as the old node pool.
  • Deleted the old node pool.
  • After deletion it looks like this on GCloud:
    (screenshot omitted)
  • After the deletion completed and the migration happened, I only observe two fleets recreated; however, kubectl describe fleets insists all of them are online:
Name:         deathmatch-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=deathmatch
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"deathmatch"},"name":"deathmatch-server","namespace":"d...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:14Z
  Generation:          1
  Resource Version:    9495633
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/deathmatch-server
  UID:                 b7fd3272-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  deathmatch
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     deathmatch-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  14m   fleet-controller  Created GameServerSet deathmatch-server-ktjv8
  
Name:         endless-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=endless
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"endless"},"name":"endless-server","namespace":"default...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:24:08Z
  Generation:          1
  Resource Version:    9493228
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/endless-server
  UID:                 6cf44e7c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  endless
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     endless-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  16m   fleet-controller  Created GameServerSet endless-server-vjb45
  
Name:         royale-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=royale
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"royale"},"name":"royale-server","namespace":"default"}...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:18Z
  Generation:          1
  Resource Version:    9493571
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/royale-server
  UID:                 baad690c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  royale
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     royale-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  14m   fleet-controller  Created GameServerSet royale-server-28ft9

aLekSer (Collaborator) commented Jan 18, 2019

I was able to reproduce the issue on GKE, though only on the second attempt. First I switched from a 4-node pool to a new 3-node pool and all Pods remained the same; on the second attempt I switched to another new 3-node pool, deleted the old one, and now the outputs of kubectl get pods and kubectl get gs differ.
Also note that after switching to the new node pool I can allocate a server but cannot connect to the GameServer using nc -u. It seems that the IP and port still contain information from the previous node pool.

:build$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
fleet-example-s99fw-6vvp9-n25rv   2/2       Running   0          10m
fleet-example-s99fw-95n9f-z57ph   2/2       Running   0          10m
simple-udp3-5ng8z-xvw87-shkkj     2/2       Running   0          1h
simple-udp32-8mps7-bx8rr-5mwtd    2/2       Running   0          1h
simple-udp32-8mps7-j8ffc-fdpzs    2/2       Running   0          1h
simple-udp322-phvk2-5t6x4-fmh7q   2/2       Running   0          1h
:build$ kubectl get gs
NAME                        STATE     ADDRESS           PORT      NODE                                     AGE
fleet-example-s99fw-6vvp9   Ready     35.247.112.202    7704      gke-test-cluster-pool-2-ab46da87-6qxw    11m
fleet-example-s99fw-95n9f   Ready     35.247.112.202    7938      gke-test-cluster-pool-2-ab46da87-6qxw    10m
simple-udp3-5ng8z-4nztc     Ready     35.247.88.114     7026      gke-test-cluster-default-4b096bd5-vt6d   1h
simple-udp3-5ng8z-9k2fp     Ready     35.247.88.114     7769      gke-test-cluster-default-4b096bd5-vt6d   1h
simple-udp3-5ng8z-9m82d     Ready     104.196.235.107   7794      gke-test-cluster-pool-1-f618dd8c-mkw0    1h
simple-udp3-5ng8z-mwwrd     Ready     35.247.88.114     7283      gke-test-cluster-default-4b096bd5-vt6d   1h
simple-udp3-5ng8z-xvw87     Ready     35.247.88.114     7762      gke-test-cluster-pool-2-ab46da87-6qfz    1h
simple-udp32-8mps7-8ng9j    Ready     104.196.235.107   7143      gke-test-cluster-pool-1-f618dd8c-mkw0    1h
simple-udp32-8mps7-bx8rr    Ready     35.247.112.202    7832      gke-test-cluster-pool-2-ab46da87-6qxw    1h
simple-udp32-8mps7-gc22g    Ready     35.247.88.114     7840      gke-test-cluster-default-4b096bd5-vt6d   1h
simple-udp32-8mps7-j8ffc    Ready     35.247.88.114     7516      gke-test-cluster-pool-2-ab46da87-6qfz    1h
simple-udp32-8mps7-p49xz    Ready     35.247.7.156      7010      gke-test-cluster-pool-1-f618dd8c-5d0t    1h
simple-udp322-phvk2-5t6x4   Ready     35.247.88.114     7631      gke-test-cluster-pool-2-ab46da87-6qfz    1h

No events on the simple-udp3 fleet, and it still shows 5 current replicas:

:build$ kubectl describe fleet simple-udp3                                                                                                                                  
Name:         simple-udp3
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{},"name":"simple-udp3","namespace":"default"},"spec":{"replicas":5...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Creation Timestamp:  2019-01-18T09:45:43Z
  Generation:          1
  Resource Version:    176561
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/simple-udp3
  UID:                 d28f4018-1b05-11e9-b6e3-42010a8a002f
Spec:
  Replicas:    5
  Scheduling:  Packed
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
    Spec:
      Health:
      Ports:
        Container Port:  7654
        Name:            default
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:  gcr.io/agones-images/udp-server:0.5
            Name:   simple-udp
            Resources:
Status:
  Allocated Replicas:  0
  Ready Replicas:      5
  Replicas:            5
Events:                <none>

However, the Pod list contains only one record for this fleet, not five:

simple-udp3-5ng8z-xvw87-shkkj     2/2       Running   0          1h

aLekSer (Collaborator) commented Jan 21, 2019

As you can see above in the output of kubectl get gs, the NODE field shows that the Pods belong to different node pools: gke-test-cluster-default, gke-test-cluster-pool-2, and gke-test-cluster-pool-1. However, only pool-2 was running at that moment.
I also noticed that the ADDRESS and PORT for some GameServers did not change after the node pool was deleted.

markmandel (Member) commented:

So it seems that if you delete a node pool, the Pods still exist inside Kubernetes?

I'm starting to think this might be a GKE bug!

aLekSer (Collaborator) commented Jan 23, 2019

I found the following error message on the agones-controller:

2019-01-23 05:38:06.539 UTC-8 error creating gameserver for gameserverset fleet-example-x24zx: Internal error occurred: failed calling admission webhook "mutations.stable.agones.dev": Post https://agones-controller-service.agones-system.svc:443/mutate?timeout=30s: no endpoints available for service "agones-controller-service"

It appears on calls to gameserversets.(*Controller).syncMoreGameServers() and (*Controller).syncGameServerSetState().

Also note that after switching node pools, the new nodes have to be added to the game-server-firewall firewall rule manually.

markmandel (Member) commented:

Ooh, I wonder if it's because the Agones controller is being taken down, which means the webhook can't be fired; that may not be something we can actually fix? 😕
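
For context, the mutating webhook is registered to point at the controller's own Service (which is exactly what the error above shows), so while the controller Pods are gone the Service has no endpoints and the API server cannot call the webhook. Schematically (a sketch using the admissionregistration/v1beta1 types of that era; all values are illustrative):

```go
package main

import (
	admissionv1beta1 "k8s.io/api/admissionregistration/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var mutatePath = "/mutate"

// Sketch of how the mutating webhook targets the controller's own Service.
// While the controller has no running Pods, the Service has no endpoints,
// and the API server rejects GameServer creation with the error above.
var webhookConfig = admissionv1beta1.MutatingWebhookConfiguration{
	ObjectMeta: metav1.ObjectMeta{Name: "mutations.stable.agones.dev"},
	Webhooks: []admissionv1beta1.Webhook{{
		Name: "mutations.stable.agones.dev",
		ClientConfig: admissionv1beta1.WebhookClientConfig{
			Service: &admissionv1beta1.ServiceReference{
				Namespace: "agones-system",
				Name:      "agones-controller-service",
				Path:      &mutatePath,
			},
		},
	}},
}
```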

markmandel (Member) commented:

Two thoughts for next steps:

  1. Add some logging here:
     https://github.com/GoogleCloudPlatform/agones/blob/master/pkg/gameservers/controller.go#L158
     and see if the Pod deletion event gets fired when you switch node pools. I'm wondering if it doesn't, and that's what is causing the issue. (A sketch of such logging follows below.)
  2. Set up the Agones controller to run on its own node pool, then switch out the node pool for the game servers, and see if the issue happens there.

I'm wondering if the controller not being removed along with the node pool solves the issue (at least partially), or at least provides a documentable workaround.
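
For reference, the logging in step 1 might look roughly like this (a hedged sketch; the handler wiring and log fields are assumptions, not the actual controller.go code):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"

	"github.com/sirupsen/logrus"
)

// podDeleteLogger logs every Pod deletion event the informer delivers,
// including the node the Pod was scheduled on, so we can verify whether
// deletion events actually fire during a node pool swap.
var podDeleteLogger = cache.ResourceEventHandlerFuncs{
	DeleteFunc: func(obj interface{}) {
		pod, ok := obj.(*corev1.Pod)
		if !ok {
			// Missed deletions arrive as tombstones; unwrap before logging.
			tombstone, isTomb := obj.(cache.DeletedFinalStateUnknown)
			if !isTomb {
				logrus.Info("Pod deletion event with unrecognised payload")
				return
			}
			if pod, ok = tombstone.Obj.(*corev1.Pod); !ok {
				return
			}
		}
		logrus.WithField("pod", pod.ObjectMeta.Name).
			WithField("node", pod.Spec.NodeName).
			Info("Pod deletion event received")
	},
}
```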

aLekSer (Collaborator) commented Feb 26, 2019

I will try to reproduce this on the latest master with these 2 steps.

aLekSer (Collaborator) commented Feb 26, 2019

In the gameservers controller code:

if oldPod.Spec.NodeName != newPod.Spec.NodeName {

This condition only fires when a Pod is first created and scheduled; it fired twice, once on the first pool and once on the second. Note that oldPod.Spec.NodeName is always empty.
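
That matches how Kubernetes scheduling works: a Pod's Spec.NodeName is empty at creation and immutable once the scheduler binds the Pod, so the only transition an update can ever show is from "" to the assigned node, once per Pod. Roughly (a sketch for illustration, not the exact Agones handler):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"

	"github.com/sirupsen/logrus"
)

// podScheduledHandler sketches the handler around that condition. Since a
// Pod is created with an empty Spec.NodeName and the field is immutable
// once set, the update where the NodeNames differ happens exactly once per
// Pod; hence it fired once for each pool's Pods.
var podScheduledHandler = cache.ResourceEventHandlerFuncs{
	UpdateFunc: func(oldObj, newObj interface{}) {
		oldPod := oldObj.(*corev1.Pod)
		newPod := newObj.(*corev1.Pod)
		if oldPod.Spec.NodeName != newPod.Spec.NodeName {
			logrus.WithField("pod", newPod.ObjectMeta.Name).
				WithField("oldNode", oldPod.Spec.NodeName). // always "" on first schedule
				WithField("newNode", newPod.Spec.NodeName).
				Info("Pod was scheduled; owning GameServer address needs syncing")
		}
	},
}
```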

After the node pool switch all GameServers restarted, but the game-server-firewall setting is missing, so nc -u does not work against a GameServer after the node switch.
Another issue is that during the process the connection via kubectl was refused:

kubectl get gs
No resources found.
The connection to the server 35.197.87.248 was refused - did you specify the right host or port?

However, nc -u does work after the node pool switch when run from the node itself:

gcloud ssh ...
toolbox
nc -u 127.0.0.1 7424

markmandel (Member) commented:

> After node pool switch all GameServers restarted, but the game-server-firewall setting is missing, so nc -u does not work against a GameServer after the node switch.

A new node pool will need to be given the firewall tag; it won't be included automatically, so I don't think that part is a bug.

So apart from that, does it work?

aLekSer (Collaborator) commented Feb 27, 2019

@markmandel
With the "agones-system" and "default" node pools separated, fleets restart correctly: if we add a new node pool and then delete "default", all GameServers get restarted on the new nodes.
I think this bug does not cover restarting the "agones-system" node pool.

roberthbailey (Member) commented:

@aLekSer, based on your last update, it sounds like Mark's guess above is likely correct: the problem occurs when the Agones controller is down.

What I don't understand is why it wouldn't fix itself once the controller came back up. With a level-triggered system (see thockin's nice presentation here), it shouldn't be an issue if a single "event" is missed; the controller should look at the current state when it comes up and make it match the desired state.
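
To illustrate the level-triggered point: in client-go, a SharedInformer created with a non-zero resync period periodically re-delivers every object in its cache to the update handlers, so a controller that reconciles from observed state converges even if individual watch events were missed while it was down (a generic client-go sketch, not Agones-specific code):

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newResyncingPodInformer builds a Pod informer with a 30s resync period.
// On every resync the informer replays its whole cache through UpdateFunc,
// so a controller that compares desired vs. observed state on each event
// converges even if individual watch events were dropped.
func newResyncingPodInformer(kubeClient kubernetes.Interface, handler cache.ResourceEventHandler) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(kubeClient, 30*time.Second)
	informer := factory.Core().V1().Pods().Informer()
	informer.AddEventHandler(handler)
	return informer
}
```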

aLekSer (Collaborator) commented May 23, 2019

@roberthbailey I'm not quite sure about the root cause yet.

markmandel (Member) commented:

Now that #1008 is written, I think we can close this, as it gives advice on how to perform upgrades that mitigate this issue (which seems to mostly be a race condition).

The advice to set up separate node pools in production also seems to resolve it.

markmandel added a commit that referenced this issue Feb 7, 2020
* Fix for Pod deletion during unavailable controller

If a Pod gets deleted, especially during GameServer Ready or Allocated
state, and the controller is either crashed, missing or unable to access
master, when the controller comes back up, the GameServer is left in a
zombie state in which it could be Allocated, but there is no Pod process
to back it.

Ideally, scenarios like this shouldn't happen, but it is possible,
depending on user interaction with Kubernetes, so we should cover the
scenario, as it requires manual intervention to fix otherwise.

This PR implements a controller that periodically checks GameServers to
ensure they have backing Pods, such that if this happens the GameServer
is marked as Unhealthy, and a Fleet can eventually return to a healed,
stable state, and not require manual intervention.

Closes #1170
Closes #398 (especially combined with fix for #1245)
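
A minimal sketch of the approach this commit message describes, assuming the Agones v1 API and a pre-context clientset of that era (identifiers and wiring are illustrative; the real implementation lives in pkg/gameservers):

```go
package main

import (
	"time"

	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	"agones.dev/agones/pkg/client/clientset/versioned"
	"github.com/sirupsen/logrus"
)

// sweepMissingPods periodically lists GameServers and marks any whose
// backing Pod no longer exists as Unhealthy, so the GameServerSet and
// Fleet controllers replace them without manual intervention.
// Sketch only; names and wiring are assumptions, not the merged code.
func sweepMissingPods(kubeClient kubernetes.Interface, agonesClient versioned.Interface, namespace string) {
	for range time.Tick(30 * time.Second) {
		list, err := agonesClient.AgonesV1().GameServers(namespace).List(metav1.ListOptions{})
		if err != nil {
			logrus.WithError(err).Warn("could not list GameServers")
			continue
		}
		for i := range list.Items {
			gs := &list.Items[i]
			if gs.DeletionTimestamp != nil || gs.Status.State == agonesv1.GameServerStateUnhealthy {
				continue
			}
			// A GameServer's backing Pod shares its name and namespace.
			_, err := kubeClient.CoreV1().Pods(gs.Namespace).Get(gs.Name, metav1.GetOptions{})
			if k8serrors.IsNotFound(err) {
				gs.Status.State = agonesv1.GameServerStateUnhealthy
				if _, err := agonesClient.AgonesV1().GameServers(gs.Namespace).Update(gs); err != nil {
					logrus.WithError(err).Warn("could not mark GameServer Unhealthy")
				}
			}
		}
	}
}
```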
markmandel added this to the 1.4.0 milestone Feb 7, 2020
ilkercelikyilmaz pushed a commit to ilkercelikyilmaz/agones that referenced this issue Oct 23, 2020