Agones controller down when enabling CustomFasSyncInterval on an existing cluster #2675

Closed
RY-2718 opened this issue Jul 20, 2022 · 7 comments
Labels: kind/bug
Milestone: 1.25.0

RY-2718 commented Jul 20, 2022

What happened:
When we enabled the CustomFasSyncInterval feature gate by upgrading the Agones components in a cluster where a FleetAutoscaler had already been deployed, the Agones controller did not start up and entered a CrashLoopBackOff state.

❯ kubectl logs -n agones-system agones-controller-58796bf457-5csh7
{"filename":"/home/agones/logs/agones-controller-20220719_013917.log","message":"logging to file","numbackups":99,"severity":"info","source":"main","time":"2022-07-19T01:39:17.41748582Z"}
{"logLevel":"info","message":"Setting LogLevel configuration","severity":"info","source":"main","time":"2022-07-19T01:39:17.417534057Z"}
{"ctlConf":{"MinPort":7000,"MaxPort":8000,"SidecarImage":"gcr.io/agones-images/agones-sdk:1.24.0","SidecarCPURequest":"30m","SidecarCPULimit":"0","SidecarMemoryRequest":"0","SidecarMemoryLimit":"0","SdkServiceAccount":"agones-sdk","AlwaysPullSidecar":false,"PrometheusMetrics":true,"Stackdriver":false,"StackdriverLabels":"","KeyFile":"/home/agones/certs/server.key","CertFile":"/home/agones/certs/server.crt","KubeConfig":"","GCPProjectID":"","NumWorkers":100,"APIServerSustainedQPS":400,"APIServerBurstQPS":500,"LogDir":"/home/agones/logs","LogLevel":"info","LogSizeLimitMB":10000,"AllocationBatchWaitTime":500000000},"featureGates":"CustomFasSyncInterval=true\u0026Example=true\u0026NodeExternalDNS=true\u0026PlayerAllocationFilter=false\u0026PlayerTracking=false\u0026SDKGracefulTermination=false\u0026StateAllocationFilter=false","message":"starting gameServer operator...","severity":"info","source":"main","time":"2022-07-19T01:39:17.417708326Z","version":"1.24.0"}
W0719 01:39:17.417807       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"message":"Starting http server...","severity":"info","source":"main","time":"2022-07-19T01:39:17.422540667Z"}
{"message":"https server started","server":{"Mux":{}},"severity":"info","source":"*https.Server","time":"2022-07-19T01:39:17.422833153Z"}
E0719 01:39:17.478637       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 166 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1867360, 0x2b65560})
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe})
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1867360, 0x2b65560})
        /usr/local/go/src/runtime/panic.go:1038 +0x215
agones.dev/agones/pkg/fleetautoscalers.(*Controller).addFasThread(0xc0003010e0, 0xc0002eb380, 0x1)
        /go/src/agones.dev/agones/pkg/fleetautoscalers/controller.go:415 +0x6d
agones.dev/agones/pkg/fleetautoscalers.NewController.func1({0x1a8b7c0, 0xc0002eb380})
        /go/src/agones.dev/agones/pkg/fleetautoscalers/controller.go:124 +0x6e
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/controller.go:231
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:777 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f94b1b4ded8)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000495f38, {0x1d238c0, 0xc0005802a0}, 0x1, 0xc0004d7380)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002f7b00)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x88
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x13f2c4d]

goroutine 166 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe})
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x1867360, 0x2b65560})
        /usr/local/go/src/runtime/panic.go:1038 +0x215
agones.dev/agones/pkg/fleetautoscalers.(*Controller).addFasThread(0xc0003010e0, 0xc0002eb380, 0x1)
        /go/src/agones.dev/agones/pkg/fleetautoscalers/controller.go:415 +0x6d
agones.dev/agones/pkg/fleetautoscalers.NewController.func1({0x1a8b7c0, 0xc0002eb380})
        /go/src/agones.dev/agones/pkg/fleetautoscalers/controller.go:124 +0x6e
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/controller.go:231
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:777 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f94b1b4ded8)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000495f38, {0x1d238c0, 0xc0005802a0}, 0x1, 0xc0004d7380)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002f7b00)
        /go/src/agones.dev/agones/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x88
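
The panic originates in addFasThread at pkg/fleetautoscalers/controller.go:415. Our reading of the trace (an assumption, not confirmed against the source) is that the controller dereferences the FleetAutoscaler's spec.sync, which is nil on a resource created while CustomFasSyncInterval was disabled. A minimal Go sketch of that failure mode and a defensive default, using simplified hypothetical type names rather than the actual Agones types:

// Illustrative sketch only, not the Agones source: the type and field names
// below are simplified assumptions.
package main

import (
	"fmt"
	"time"
)

type FixedIntervalSync struct {
	Seconds int32
}

type FleetAutoscalerSync struct {
	FixedInterval FixedIntervalSync
}

type FleetAutoscalerSpec struct {
	// Sync is nil on FleetAutoscalers created before the feature gate was enabled.
	Sync *FleetAutoscalerSync
}

// syncInterval returns the autoscaler's sync period, falling back to a
// hypothetical 30s default instead of panicking when Sync is unset.
func syncInterval(spec FleetAutoscalerSpec) time.Duration {
	if spec.Sync == nil {
		return 30 * time.Second
	}
	return time.Duration(spec.Sync.FixedInterval.Seconds) * time.Second
}

func main() {
	// A spec with no Sync, as on a pre-existing FleetAutoscaler: prints 30s rather than crashing.
	fmt.Println(syncInterval(FleetAutoscalerSpec{}))
}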

What you expected to happen:
The Agones controller starts up successfully after we enable the CustomFasSyncInterval feature gate by upgrading the Agones components, even if a FleetAutoscaler has already been deployed in the cluster.

How to reproduce it (as minimally and precisely as possible):
We can reproduce this issue using our local minikube cluster. Follow the instructions below:

# install Agones without CustomFasSyncInterval
minikube start -p sync-interval-test --driver virtualbox --kubernetes-version 1.22.10
minikube profile sync-interval-test
helm repo add agones https://agones.dev/chart/stable
helm upgrade --install agones --version v1.23.0 \
	--namespace agones-system \
	--create-namespace agones/agones

# install fleet and fleetAutoscaler
sleep 20 # wait for Agones controller to become ready
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/release-1.23.0/examples/simple-game-server/fleet.yaml
kubectl apply -f https://raw.githubusercontent.com/googleforgames/agones/release-1.23.0/examples/simple-game-server/fleetautoscaler.yaml

# upgrade Agones to enable CustomFasSyncInterval
helm upgrade agones agones/agones -n agones-system --set agones.featureGates="CustomFasSyncInterval=true"

# Agones Controller will get stuck in CrashLoopBackOff
watch kubectl get pods -n agones-system

Anything else we need to know?:
We can avoid this issue by setting the default value for the spec.sync field on the existing FleetAutoscaler before upgrading Agones, as in the commands below.

# disable CustomFasSyncInterval
helm upgrade agones --version v1.23.0 \
	--namespace agones-system agones/agones \
	--set agones.featureGates="CustomFasSyncInterval=false

sleep 20 # wait for Agones controller to become ready

# update FleetAutoscaler before upgrading Agones
kubectl patch fleetautoscaler simple-game-server-autoscaler \
  --type json -p '[{"op": "add", "path": "/spec/sync", "value": {"type": "FixedInterval", "fixedInterval": {"seconds": 30}}}]'


# enable CustomFasSyncInterval
helm upgrade agones --version v1.23.0 \
  --namespace agones-system agones/agones \
  --set agones.featureGates="CustomFasSyncInterval=true

# Agones Controller will run
watch kubectl get pods -n agones-system

Environment:

  • Agones version: v1.23.0
  • Kubernetes version (use kubectl version): v1.22.10
  • Cloud provider or hardware configuration: minikube / MacBook Pro (intel CPU)
  • Install method (yaml/helm): helm
RY-2718 added the kind/bug label Jul 20, 2022
markmandel commented Jul 20, 2022

Thanks for the bug report!

I'm trying to find it, but I think we covered this in another issue (or maybe it was Slack) - but short answer, I think this will be a wontfix - or at the very least, we won't implement a live cluster upgrade (it's a complexity nightmare; there are too many upgrade paths to test).

That being said, we should likely update our docs here:
https://agones.dev/site/docs/installation/upgrading/#upgrading-agones-single-cluster:~:text=Delete%20the%20current%20set%20of%20Fleets%20and/or%20GameServers%20in%20your%20cluster.

It should be explicit that you need to delete all Agones custom resources (which includes FleetAutoscalers) before upgrading within a cluster, and that this also applies to Feature Flag changes.

We do run a pre-hook on delete to delete everything:
https://github.com/googleforgames/agones/blob/c9efc7def67a0306053a96fb9f0bfee9b57bd411/install/helm/agones/templates/hooks/pre_delete_hook.yaml -- maybe we should do the same on upgrade?

RY-2718 commented Jul 21, 2022

Thank you for your response!
As it sounds difficult to fix this issue and a live cluster upgrade does not seem to be the recommended approach, we will create a new cluster, install the new version of Agones, and then migrate our workloads according to the docs.

markmandel (Member) commented:

> We do run a pre-hook on delete to delete everything:
> https://github.com/googleforgames/agones/blob/c9efc7def67a0306053a96fb9f0bfee9b57bd411/install/helm/agones/templates/hooks/pre_delete_hook.yaml -- maybe we should do the same on upgrade?

@roberthbailey what do you think of the above idea?

markmandel added a commit to markmandel/agones that referenced this issue Jul 22, 2022
Noting that we have received reports that in-place upgrades with feature
gate changes can cause issues when existing resources are left in place,
and therefore adding a note in the docs that:

a) Feature Gate changes are upgrades.
b) Delete all resources (including FleetAutoscalers).
c) Even on Helm upgrades, delete the Agones resources.

Work on googleforgames#2675
roberthbailey pushed a commit that referenced this issue Jul 22, 2022

roberthbailey (Member) commented:

While I agree that we don't currently support live cluster upgrades, I don't want to take the stance that we won't ever support them. It's difficult and requires a lot more testing than we currently do, but it would be really nice to get to the point where we can support it in the future, especially as the project's feature velocity decreases as the system matures; we want to make upgrades easier so that people pick up bug fixes and security patches.

I'm concerned that adding an upgrade hook that deletes things could end up being disruptive (e.g. deleting allocated game servers) and unexpected. As one example, we tell people to helm upgrade to set the server TLS cert on the allocator service, and right now that doesn't affect any workloads that might be running (all it does is update one component).

markmandel (Member) commented:

> As one example, we tell people to helm upgrade to set the server TLS cert on the allocator service, and right now that doesn't affect any workloads that might be running (all it does is update one component).

That's an excellent point. In which case I rescind my suggestion to add the deletion operation to the upgrade hook.

markmandel (Member) commented:

In which case - since #2678 is merged, shall we close this?

roberthbailey (Member) commented:

> shall we close this?

Sounds good to me.

SaitejaTamma added this to the 1.25.0 milestone Jul 26, 2022