Agones controller down when enabling CustomFasSyncInterval on an existing cluster #2675

Closed
RY-2718 opened this issue Jul 20, 2022 · 7 comments
Labels: kind/bug
Milestone: 1.25.0

RY-2718 commented Jul 20, 2022

What happened:
When we enabled the CustomFasSyncInterval feature gate by upgrading the Agones components in a cluster where a FleetAutoscaler had already been deployed, the Agones controller did not start up and entered a CrashLoopBackOff state.

❯ kubectl logs -n agones-system agones-controller-58796bf457-5csh7
{"filename":"/home/agones/logs/agones-controller-20220719_013917.log","message":"logging to file","numbackups":99,"severity":"info","source":"main","time":"2022-07-19T01:39:17.41748582Z"}
{"logLevel":"info","message":"Setting LogLevel configuration","severity":"info","source":"main","time":"2022-07-19T01:39:17.417534057Z"}
{"ctlConf":{"MinPort":7000,"MaxPort":8000,"SidecarImage":"gcr.io/agones-images/agones-sdk:1.24.0","SidecarCPURequest":"30m","SidecarCPULimit":"0","SidecarMemoryRequest":"0","SidecarMemoryLimit":"0","SdkServiceAccount":"agones-sdk","AlwaysPullSidecar":false,"PrometheusMetrics":true,"Stackdriver":false,"StackdriverLabels":"","KeyFile":"/home/agones/certs/server.key","CertFile":"/home/agones/certs/server.crt","KubeConfig":"","GCPProjectID":"","NumWorkers":100,"APIServerSustainedQPS":400,"APIServerBurstQPS":500,"LogDir":"/home/agones/logs","LogLevel":"info","LogSizeLimitMB":10000,"AllocationBatchWaitTime":500000000},"featureGates":"CustomFasSyncInterval=true\u0026Example=true\u0026NodeExternalDNS=true\u0026PlayerAllocationFilter=false\u0026PlayerTracking=false\u0026SDKGracefulTermination=false\u0026StateAllocationFilter=false","message":"starting gameServer operator...","severity":"info","source":"main","time":"2022-07-19T01:39:17.417708326Z","version":"1.24.0"}
W0719 01:39:17.417807       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"message":"Starting http server...","severity":"info","source":"main","time":"2022-07-19T01:39:17.422540667Z"}
{"message":"https server started","server":{"Mux":{}},"severity":"info","source":"*https.Server","time":"2022-07-19T01:39:17.422833153Z"}
E0719 01:39:17.478637       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 166 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1867360, 0x2b65560})
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe})
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1867360, 0x2b65560})
        /usr/local/go/src/runtime/panic.go:1038 +0x215
agones.dev/agones/pkg/fleetautoscalers.(*Controller).addFasThread(0xc0003010e0, 0xc0002eb380, 0x1)
        /go/src/agones.dev/agones/pkg/fleetautoscalers/controller.go:415 +0x6d
agones.dev/agones/pkg/fleetautoscalers.NewController.func1({0x1a8b7c0, 0xc0002eb380})
        /go/src/agones.dev/agones/pkg/fleetautoscalers/controller.go:124 +0x6e
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/controller.go:231
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:777 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f94b1b4ded8)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000495f38, {0x1d238c0, 0xc0005802a0}, 0x1, 0xc0004d7380)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002f7b00)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x88
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x13f2c4d]

goroutine 166 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe})
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x1867360, 0x2b65560})
        /usr/local/go/src/runtime/panic.go:1038 +0x215
agones.dev/agones/pkg/fleetautoscalers.(*Controller).addFasThread(0xc0003010e0, 0xc0002eb380, 0x1)
        /go/src/agones.dev/agones/pkg/fleetautoscalers/controller.go:415 +0x6d
agones.dev/agones/pkg/fleetautoscalers.NewController.func1({0x1a8b7c0, 0xc0002eb380})
        /go/src/agones.dev/agones/pkg/fleetautoscalers/controller.go:124 +0x6e
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/controller.go:231
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:777 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f94b1b4ded8)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000495f38, {0x1d238c0, 0xc0005802a0}, 0x1, 0xc0004d7380)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002f7b00)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x88
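
The panic originates in addFasThread at pkg/fleetautoscalers/controller.go:415. Our reading of the trace (an assumption, not confirmed against the source) is that the controller dereferences the FleetAutoscaler's spec.sync, which is nil on a resource created while CustomFasSyncInterval was disabled. A minimal Go sketch of that failure mode and a defensive default, using simplified hypothetical type names rather than the actual Agones types:

// Illustrative sketch only, not the Agones source: the type and field names
// below are simplified assumptions.
package main

import (
	"fmt"
	"time"
)

type FixedIntervalSync struct {
	Seconds int32
}

type FleetAutoscalerSync struct {
	FixedInterval FixedIntervalSync
}

type FleetAutoscalerSpec struct {
	// Sync is nil on FleetAutoscalers created before the feature gate was enabled.
	Sync *FleetAutoscalerSync
}

// syncInterval returns the autoscaler's sync period, falling back to a
// hypothetical 30s default instead of panicking when Sync is unset.
func syncInterval(spec FleetAutoscalerSpec) time.Duration {
	if spec.Sync == nil {
		return 30 * time.Second
	}
	return time.Duration(spec.Sync.FixedInterval.Seconds) * time.Second
}

func main() {
	// A spec with no Sync, as on a pre-existing FleetAutoscaler: prints 30s rather than crashing.
	fmt.Println(syncInterval(FleetAutoscalerSpec{}))
}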

What you expected to happen:
The Agones controller starts up successfully after we enable the CustomFasSyncInterval feature gate by upgrading the Agones components, even if a FleetAutoscaler has already been deployed in the cluster.

How to reproduce it (as minimally and precisely as possible):
We can reproduce this issue using our local minikube cluster. Follow the instructions below:

# install Agones without CustomFasSyncInterval
minikube start -p sync-interval-test --driver virtualbox --kubernetes-version 1.22.10
minikube profile sync-interval-test
helm repo add agones https://agones.dev/chart/stable
helm upgrade --install agones --version v1.23.0 \
	--namespace agones-system \
	--create-namespace agones/agones

# install fleet and fleetAutoscaler
sleep 20 # wait for Agones controller to become ready
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/release-1.23.0/examples/simple-game-server/fleet.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/release-1.23.0/examples/simple-game-server/fleetautoscaler.yaml

# upgrade Agones to enable CustomFasSyncInterval
helm upgrade agones agones/agones -n agones-system --set agones.featureGates="CustomFasSyncInterval=true"

# Agones Controller will get stuck in CrashLoopBackOff
watch kubectl get pods -n agones-system

Anything else we need to know?:
We can avoid this issue by setting the default value for the spec.sync field on the existing FleetAutoscaler before upgrading Agones, as in the commands below.

# disable CustomFasSyncInterval
helm upgrade agones --version v1.23.0 \
	--namespace agones-system agones/agones \
	--set agones.featureGates="CustomFasSyncInterval=false

sleep 20 # wait for Agones controller to become ready

# update FleetAutoscaler before upgrading Agones
kubectl patch fleetautoscaler simple-game-server-autoscaler \
  --type json -p '[{"op": "add", "path": "/spec/sync", "value": {"type": "FixedInterval", "fixedInterval": {"seconds": 30}}}]'


# enable CustomFasSyncInterval
helm upgrade agones --version v1.23.0 \
  --namespace agones-system agones/agones \
  --set agones.featureGates="CustomFasSyncInterval=true

# Agones Controller will run
watch kubectl get pods -n agones-system

Environment:

  • Agones version: v1.23.0
  • Kubernetes version (use kubectl version): v1.22.10
  • Cloud provider or hardware configuration: minikube / MacBook Pro (intel CPU)
  • Install method (yaml/helm): helm
RY-2718 added the kind/bug label Jul 20, 2022
markmandel commented Jul 20, 2022

Thanks for the bug report!

I'm trying to find it, but I think we covered this in another issue (or maybe it was Slack) - but short answer, I think this will be a wontfix - or at the very least, we won't implement a live cluster upgrade (it's a complexity nightmare; there are too many upgrade paths to test).

That being said, we should likely update our docs here:
https://agones.dev/site/docs/installation/upgrading/#upgrading-agones-single-cluster:~:text=Delete%20the%20current%20set%20of%20Fleets%20and/or%20GameServers%20in%20your%20cluster.

It should be explicit that you need to delete all Agones custom resources (which includes FleetAutoscalers) before upgrading within a cluster, and that this also applies to Feature Flag changes.

We do run a pre-hook on delete to delete everything:
https://github.com/googleforgames/agones/blob/c9efc7def67a0306053a96fb9f0bfee9b57bd411/install/helm/agones/templates/hooks/pre_delete_hook.yaml -- maybe we should do the same on upgrade?

RY-2718 commented Jul 21, 2022

Thank you for your response!
As it sounds difficult to fix this issue and a live cluster upgrade does not seem to be the recommended approach, we will create a new cluster, install the new version of Agones, and then migrate our workloads according to the docs.

markmandel (Member) commented:

> We do run a pre-hook on delete to delete everything:
> https://github.com/googleforgames/agones/blob/c9efc7def67a0306053a96fb9f0bfee9b57bd411/install/helm/agones/templates/hooks/pre_delete_hook.yaml -- maybe we should do the same on upgrade?

@roberthbailey what do you think of the above idea?

markmandel added a commit to markmandel/agones that referenced this issue Jul 22, 2022
Noting that we have received reports that in-place upgrades with feature
gate changes can cause issues when existing resources are left in place,
and therefore adding a note in the docs that:

a) Feature Gate changes are upgrades.
b) Delete all resources (including FleetAutoscalers).
c) Even on Helm upgrades, delete the Agones resources.

Work on googleforgames#2675
roberthbailey pushed a commit that referenced this issue Jul 22, 2022

roberthbailey (Member) commented:

While I agree that we don't currently support live cluster upgrades, I don't want to take the stance that we won't ever support them. It's difficult and requires a lot more testing than we currently do, but it would be really nice to get to the point where we can support it in the future, especially as the project's feature velocity decreases as the system matures; we want to make upgrades easier so that people pick up bug fixes and security patches.

I'm concerned that adding an upgrade hook that deletes things could end up being disruptive (e.g. deleting allocated game servers) and unexpected. As one example, we tell people to helm upgrade to set the server TLS cert on the allocator service, and right now that doesn't affect any workloads that might be running (all it does is update one component).

markmandel (Member) commented:

> As one example, we tell people to helm upgrade to set the server TLS cert on the allocator service, and right now that doesn't affect any workloads that might be running (all it does is update one component).

That's an excellent point. In which case I rescind my suggestion to add the deletion operation to the upgrade hook.

markmandel (Member) commented:

In which case - since #2678 is merged, shall we close this?

roberthbailey (Member) commented:

> shall we close this?

Sounds good to me.

SaitejaTamma added this to the 1.25.0 milestone Jul 26, 2022