Moving cluster to a new node pool doesn't recreate all fleets #398
Comments
I have a strong feeling this is because if a pod gets deleted, the backing GameServer is left in a zombie state (i.e. not deleted along with it). We should implement functionality so that if a Pod gets removed, the owning GameServer is deleted too. This should solve this issue.
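For illustration, one way such a cascade could be wired up is to inspect the deleted Pod's controller owner reference and delete the owning GameServer. A minimal Go sketch, assuming a hypothetical `deleteOwner` callback rather than the real Agones clientset:

```go
// Package gswatch is a minimal sketch (not the actual Agones code) of
// reacting to a Pod deletion by deleting the GameServer that owns it.
package gswatch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// deleteOwner is a hypothetical callback that deletes a GameServer by
// namespace and name; a real controller would wrap the Agones clientset here.
type deleteOwner func(namespace, name string) error

// onPodDeleted checks whether the deleted Pod was controlled by a GameServer
// and, if so, asks for that GameServer to be deleted as well.
func onPodDeleted(pod *corev1.Pod, del deleteOwner) error {
	ref := metav1.GetControllerOf(pod)
	if ref == nil || ref.Kind != "GameServer" {
		// Not owned by a GameServer; nothing to do.
		return nil
	}
	if err := del(pod.Namespace, ref.Name); err != nil {
		return fmt.Errorf("deleting GameServer %s/%s: %w", pod.Namespace, ref.Name, err)
	}
	return nil
}
```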
Actually - I'm not sure what this is - I tested deleting the backing Pod from a GameServer, and the GameServer gets deleted. More investigation required!
I think the best way to recreate it is to follow what I have done, because it recreated one of the fleets but not the other two - so there might be a random factor in there. I would suggest trying it with a few different fleets. It seems to be random: sometimes it happens, sometimes it doesn't.
I will try to reproduce this issue today and provide all the relevant details along the way.
I was able to reproduce the issue on GKE, but only on the second attempt. At first I switched from a 4-node node pool to a new 3-node pool and all pods remained the same; on the second attempt I switched to a new 3-node pool and deleted the old one, and now the output of
No events in the simple-udp3 fleet, and 5 Current Replicas:
However, in the pods list there exists only one record for this fleet, not five:
As you can see above from the output of
So it seems that if you delete a node pool, the Pods still exist inside Kubernetes? I'm starting to think this might be a GKE bug!
Found these error messages on the agones-controller: Also note that after switching node pools, the new nodes need to be added to the gameserver firewall.
Ooh, I wonder if it's because the agones controller is being taken down - and that means the webhook can't be fired - which may not be something we can actually fix? 😕
Two thoughts for next steps:
1. Add some logging here: and see if the Pod deletion event gets fired when you switch node pools. I'm wondering if it doesn't, and that's what is causing the issue (a rough sketch of such a logging hook is below).
2. I'm wondering whether the controller not being removed along with the node pool solves the issue (at least partially) - or at least provides a documentable workaround.
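A rough sketch of what that logging hook could look like with a client-go Pod informer (illustrative only; it assumes a shared informer is already wired up and is not the actual Agones handler):

```go
// Package podwatch sketches adding a log line to a Pod informer's delete
// handler, to confirm whether deletion events are seen during a node pool
// switch. Names here are illustrative, not the Agones controller's own.
package podwatch

import (
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// addPodDeleteLogging registers a delete handler that logs every Pod deletion
// the informer observes, including the node the Pod was scheduled on.
func addPodDeleteLogging(podInformer cache.SharedIndexInformer) {
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			pod, ok := obj.(*corev1.Pod)
			if !ok {
				// The informer may hand us a tombstone if the final delete was missed.
				tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					return
				}
				pod, ok = tombstone.Obj.(*corev1.Pod)
				if !ok {
					return
				}
			}
			log.Printf("pod deleted: %s/%s (node %s)", pod.Namespace, pod.Name, pod.Spec.NodeName)
		},
	})
}
```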
I will reproduce this on the latest master with these two steps.
In the gameservers controller code:
This condition only fires on Pod creation; it fired twice - once on the first pool, and once on the second. After the node pool switch all game servers restarted, but
However
A new node pool will need to be told to have the firewall tag; it won't be included automatically - so I don't think that part is a bug. Apart from that, does it work?
@markmandel
@aLekSer - based on your last update, it sounds like Mark's guess above is likely correct: the problem occurs when the agones controller is down. What I don't understand is why it wouldn't fix itself once the controller came back up. With a level-triggered system (see thockin's nice presentation here), it shouldn't be an issue if a single "event" is missed; the controller should look at the current state when it comes up and make it match the desired state.
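For context, a toy sketch of what level-triggered reconciliation means here: the controller derives its actions from a comparison of observed and desired state on each pass, so a missed event should be healed the next time it runs. The `fleetState` and `actions` types are illustrative placeholders, not Agones APIs:

```go
// Package reconcile sketches a level-triggered reconcile step: rather than
// reacting only to individual events, the controller periodically compares
// observed state to desired state and corrects the difference.
package reconcile

// fleetState is a simplified view of a fleet for the purpose of this sketch.
type fleetState struct {
	DesiredReplicas int
	ReadyReplicas   int
}

// actions are the operations the reconciler can request; a real controller
// would create or delete GameServer objects through the API server instead.
type actions interface {
	CreateGameServers(n int) error
	DeleteGameServers(n int) error
}

// reconcileFleet makes the observed replica count converge on the desired
// count. Because it works from current state, a missed delete "event" is
// healed on the next pass once the controller is running again.
func reconcileFleet(s fleetState, a actions) error {
	diff := s.DesiredReplicas - s.ReadyReplicas
	switch {
	case diff > 0:
		return a.CreateGameServers(diff)
	case diff < 0:
		return a.DeleteGameServers(-diff)
	default:
		return nil
	}
}
```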
@roberthbailey Not quite sure about the root cause for now.
Now that #1008 is written, I think we can close this, as we give advice on how to perform upgrades that mitigates this issue (which seems to mostly be a race condition). The advice to set up separate node pools in production also seems to resolve it.
* Fix for Pod deletion during unavailable controller: If a Pod gets deleted, especially during the GameServer Ready or Allocated state, and the controller is crashed, missing, or unable to access the master, then when the controller comes back up the GameServer is left in a zombie state in which it could be Allocated but has no Pod process backing it. Ideally, scenarios like this shouldn't happen, but they are possible depending on user interaction with Kubernetes, so we should cover the scenario, as it otherwise requires manual intervention to fix. This PR implements a controller that periodically checks GameServers to ensure they have backing Pods, so that if this happens the GameServer is marked as Unhealthy and a Fleet can eventually return to a healed, stable state without manual intervention. There is no e2e test, as I couldn't work out a viable way to break the Agones controller and then bring it back reliably. Closes googleforgames#1170 Closes googleforgames#398 (especially combined with the fix for googleforgames#1245)
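The periodic check described in that fix could look roughly like the following - a minimal sketch only, using hypothetical `store` and `gameServer` types in place of the real Agones clientset and listers, not the merged implementation:

```go
// Package healthcheck sketches a periodic sweep that marks GameServers
// without a backing Pod as Unhealthy so the Fleet controller can replace them.
package healthcheck

// gameServer is a simplified view of a GameServer for this sketch.
type gameServer struct {
	Namespace string
	Name      string
	// Ready or Allocated servers are expected to have a backing Pod.
	State string
}

// store stands in for the real clientset and listers.
type store interface {
	// ListGameServers returns the GameServers to inspect.
	ListGameServers() ([]gameServer, error)
	// HasBackingPod reports whether a Pod backing the GameServer exists.
	HasBackingPod(namespace, name string) (bool, error)
	// MarkUnhealthy moves the GameServer to the Unhealthy state.
	MarkUnhealthy(namespace, name string) error
}

// sweep is intended to run on a timer; any Ready or Allocated GameServer
// whose Pod has vanished (for example, deleted while the controller was down)
// is marked Unhealthy so the Fleet can heal itself.
func sweep(s store) error {
	gameServers, err := s.ListGameServers()
	if err != nil {
		return err
	}
	for _, gs := range gameServers {
		if gs.State != "Ready" && gs.State != "Allocated" {
			continue
		}
		ok, err := s.HasBackingPod(gs.Namespace, gs.Name)
		if err != nil {
			return err
		}
		if !ok {
			if err := s.MarkUnhealthy(gs.Namespace, gs.Name); err != nil {
				return err
			}
		}
	}
	return nil
}
```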
I've noticed something weird today. I needed to swap node pools in GKE, so I created a new node pool and deleted the old one. I expected all instances in the old node pool to recover in the new one after some time. However, in my particular case I could only see 1 of the 3 servers on the `workloads` page in GCloud. So I checked the fleets to see if they had min availability, which was 1 of each kind = 3. `kubectl describe fleets` indicated that 3 servers were online and available; however, when I tried to connect to one that was listed but not in `workloads`, it failed to connect. I was able to connect to the one appearing in `workloads`, but not the others. I had to delete the fleets and recreate them for them to appear and work correctly again.