Nil Reference/massive log spam in Controller [1.13] #2086

highlyunavailable · 2021-04-30T23:56:09Z

What happened:

The controller started to produce this error over and over as fast as it could log (1.64M entries in 20 minutes):

2021/04/30 23:46:45 http2: panic serving 172.31.39.241:52498: runtime error: invalid memory address or nil pointer dereference
goroutine 1780371 [running]:
net/http.(*http2serverConn).runHandler.func1(0xc0007a2288, 0xc0017dbf8e, 0xc000ce7980)
        /usr/local/go/src/net/http/h2_bundle.go:5714 +0x16b
panic(0x1767040, 0x2903ca0)
        /usr/local/go/src/runtime/panic.go:969 +0x166
strings.Count(0x0, 0x2, 0x1982a76, 0x1, 0xc000f2e6e0)
        /usr/local/go/src/strings/strings.go:84 +0xf4
k8s.io/apimachinery/pkg/runtime/schema.ParseGroupVersion(0x0, 0x2, 0x8cb061, 0x174f460, 0xc0003faae0, 0xc000f2e768, 0xc0003ce0b8, 0x1857d01)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/runtime/schema/group_version.go:227 +0x91
k8s.io/apimachinery/pkg/runtime/schema.FromAPIVersionAndKind(0x0, 0x2, 0x1613c13, 0xf, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/runtime/schema/group_version.go:310 +0x5a
k8s.io/apimachinery/pkg/apis/meta/v1.(*TypeMeta).GroupVersionKind(0xc001040af0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/apis/meta/v1/meta.go:128 +0x5e
k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).doEncode(0xc0017ccf00, 0x1c702c0, 0xc001040af0, 0x7f064da69478, 0xc0007a2288, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning/versioning.go:217 +0x186
k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).Encode(0xc0017ccf00, 0x1c702c0, 0xc001040af0, 0x7f064da69478, 0xc0007a2288, 0xc00043f840, 0xc0000aea80)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning/versioning.go:184 +0x178
agones.dev/agones/pkg/util/apiserver.(*APIServer).addSerializedHandler.func1(0x1ca1d00, 0xc0007a2288, 0xc0014cf500, 0x45fea0, 0xc0007c02d0)
        /go/src/agones.dev/agones/pkg/util/apiserver/apiserver.go:171 +0x37a
agones.dev/agones/pkg/util/https.ErrorHTTPHandler.func1(0x1ca1d00, 0xc0007a2288, 0xc0014cf500)
        /go/src/agones.dev/agones/pkg/util/https/https.go:52 +0x50
net/http.HandlerFunc.ServeHTTP(0xc000e40ce0, 0x1ca1d00, 0xc0007a2288, 0xc0014cf500)
        /usr/local/go/src/net/http/server.go:2041 +0x44
net/http.(*ServeMux).ServeHTTP(0xc000244440, 0x1ca1d00, 0xc0007a2288, 0xc0014cf500)
        /usr/local/go/src/net/http/server.go:2416 +0x1a5
net/http.serverHandler.ServeHTTP(0xc000372000, 0x1ca1d00, 0xc0007a2288, 0xc0014cf500)
        /usr/local/go/src/net/http/server.go:2836 +0xa3
net/http.initALPNRequest.ServeHTTP(0x1ca5d80, 0xc000aaf8c0, 0xc000699500, 0xc000372000, 0x1ca1d00, 0xc0007a2288, 0xc0014cf500)
        /usr/local/go/src/net/http/server.go:3410 +0x8d
net/http.(*http2serverConn).runHandler(0xc000ce7980, 0xc0007a2288, 0xc0014cf500, 0xc000df4700)
        /usr/local/go/src/net/http/h2_bundle.go:5721 +0x8b
created by net/http.(*http2serverConn).processHeaders
        /usr/local/go/src/net/http/h2_bundle.go:5455 +0x4e1

What you expected to happen:

Errors to be cleanly handled or the process to just crash so it can be restarted using normal backoff/alerting of crash (Deleting the pod resolved the issue).

How to reproduce it (as minimally and precisely as possible):

Unknown.

Anything else we need to know?: This appears to be a simple Nil check issue but it's unknown as to why it failed.

Environment:

Agones version: 1.13
Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-eks-7737de", GitCommit:"7737de131e58a68dda49cdd0ad821b4cb3665ae8", GitTreeState:"clean", BuildDate:"2021-03-10T21:33:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration: aws
Install method (yaml/helm): helm
Troubleshooting guide log(s): No available log (it's in datadog) but it just started with no prior messages.
Others:

The text was updated successfully, but these errors were encountered:

markmandel · 2021-05-01T00:47:23Z

🤔 is this an error in the http2 implementation of Go? Or is this an error in Agones?

It looks like the former? But I'm not sure.

markmandel · 2021-05-01T00:50:31Z

https://github.com/golang/go/blob/go1.14.12/src/net/http/h2_bundle.go#L5714 < that looks like the relevent line.

I'm not 100% sure what causes it though? Did something odd happen to the K8s control plane? (I'm assuming that IP is the K8s control plane?)

highlyunavailable · 2021-05-01T01:08:10Z

This spams millions of log entries in minutes. That’s an Agones problem as the library user. We pay per million log entries so that sucks.
Agones is pretty solidly in the middle of that call stack. It feels like the handling should be more or less robust to mitigate this through backoff or crash.

Not sure what happened, and that IP could have been the control plane but it’s EKS. I moved the pod to a different node a few minutes before so it could have been related to that, but it was working post move for a few minutes til this happened.

I’m happy to take this up somewhere else but I feel that I’m going to be told it’s someone else’s problem in that stack no matter who I open an issue with.

markmandel · 2021-05-01T01:34:22Z

I think I might have found it:

kubernetes/kubernetes#82497

Looks like a fix dropped as of 16 days ago. Seems like this has been here for a very long time, but you are the first person who has ever reported the issue! Never ever seen this before.

highlyunavailable · 2021-05-01T01:37:47Z

Holy moly! Well that makes me feel special. 🤣 I have mitigation in place now if it happens again but I guess the fix is kind of in process and it seems fairly rare so I won’t worry about it as much now. Thanks for investigating.

markmandel · 2021-05-01T01:48:27Z

🤔 do we need to put a mutex around the .Encode(...) portion of https://github.com/googleforgames/agones/blob/main/pkg/util/apiserver/apiserver.go#L171 ?

Let's keep an eye on things. I'd like to avoid putting a mutex in place, but if it's not thread safe, we might have to.

highlyunavailable · 2021-05-01T02:58:00Z

What about cloning the object? If the issue is "encoding the same object from multiple goroutines" then two different objects would completely alleviate the issue as well. k8sruntime.Object has a DeepCopy func on it that could probably do just that, and in fact it's a variant of what they propose in that other issue:

We can probably fix this globally by slightly tweaking UnsafeConversion to copy the top most object (where kind is set) on its way down the conversion stack.

We could do the same, insofar as we copy the object, encode the copy, until such time as the fix gets rolled into Agones via package updates.

highlyunavailable · 2021-05-01T18:12:54Z

Alright, I was able to super easily reproduce it in tests. I took TestAPIServerAddAPIResourceDiscovery and made a "parallel" copy. Running it without adding a mutex around encode or deep copying produces this result (you may have to run it a few times and clear the test cache between runs, it's a race condition so it's flaky):

Running tool: /usr/local/go/bin/go test -timeout 30s -run ^(TestAPIServerAddAPIResourceCRDHandler|TestAPIServerAddAPIResourceDiscovery|TestAPIServerAddAPIResourceDiscoveryParallel|TestSplitNameSpaceResource)$ agones.dev/agones/pkg/util/apiserver

{"body":"","headers":{"Accept-Encoding":["gzip"],"User-Agent":["Go-http-client/1.1"]},"host":"127.0.0.1:43235","message":"404","method":"GET","requestURI":"/apis/allocation.agones.dev/v1/namespaces/default/gameserverallZZZZions","severity":"warning","source":"*apiserver.APIServer","time":"2021-05-01T11:09:09.3230646-07:00","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/apis/allocation.agones.dev/v1/namespaces/default/gameserverallZZZZions","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
--- FAIL: TestAPIServerAddAPIResourceDiscoveryParallel (0.00s)
    --- FAIL: TestAPIServerAddAPIResourceDiscoveryParallel/Parallel_Tests (0.00s)
        --- FAIL: TestAPIServerAddAPIResourceDiscoveryParallel/Parallel_Tests/No_Accept_Header (0.00s)
            apiserver_test.go:200:
                        Error Trace:    apiserver_test.go:200
                        Error:          Not equal:
                                        expected: "v1"
                                        actual  : ""

                                        Diff:
                                        --- Expected
                                        +++ Actual
                                        @@ -1 +1 @@
                                        -v1
                                        +
                        Test:           TestAPIServerAddAPIResourceDiscoveryParallel/Parallel_Tests/No_Accept_Header
            apiserver_test.go:201:
                        Error Trace:    apiserver_test.go:201
                        Error:          Not equal:
                                        expected: "APIResourceList"
                                        actual  : ""

                                        Diff:
                                        --- Expected
                                        +++ Actual
                                        @@ -1 +1 @@
                                        -APIResourceList
                                        +
                        Test:           TestAPIServerAddAPIResourceDiscoveryParallel/Parallel_Tests/No_Accept_Header
FAIL
FAIL    agones.dev/agones/pkg/util/apiserver    0.014s

Adding a mutex or deep copying always makes the test pass:

Running tool: /usr/local/go/bin/go test -timeout 30s -run ^(TestAPIServerAddAPIResourceCRDHandler|TestAPIServerAddAPIResourceDiscovery|TestAPIServerAddAPIResourceDiscoveryParallel|TestSplitNameSpaceResource)$ agones.dev/agones/pkg/util/apiserver

ok  	agones.dev/agones/pkg/util/apiserver	0.014s

This works around a race condition present in the encoder for Kubernetes objects, as detailed in kubernetes/kubernetes#82497 Fixes googleforgames#2086

This works around a race condition present in the encoder for Kubernetes objects, as detailed in kubernetes/kubernetes#82497 Uses the fix from kubernetes/kubernetes#101123 Fixes googleforgames#2086

highlyunavailable · 2021-05-01T19:34:42Z

Fix is submitted, see PR. Turns out a shallow copy is plenty and everything is happy from just doing that. It makes sense since the actual inner objects aren't being tweaked, just the outer one, which avoids a. a full deep copy and b. avoids locking.

This works around a race condition present in the encoder for Kubernetes objects, as detailed in kubernetes/kubernetes#82497 Uses the fix from kubernetes/kubernetes#101123 Fixes googleforgames#2086

…2089) This works around a race condition present in the encoder for Kubernetes objects, as detailed in kubernetes/kubernetes#82497 Uses the fix from kubernetes/kubernetes#101123 Fixes #2086

highlyunavailable added the kind/bug These are bugs. label Apr 30, 2021

highlyunavailable changed the title ~~Crash/massive log spam in Controller [1.13]~~ Nil Reference/massive log spam in Controller [1.13] Apr 30, 2021

highlyunavailable mentioned this issue May 1, 2021

Clone Kubernetes objects in API Server before encoding them #2089

Merged

roberthbailey closed this as completed in #2089 May 3, 2021

sawagh added this to the 1.15.0 milestone Jun 1, 2021

sawagh added the area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc label Jun 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nil Reference/massive log spam in Controller [1.13] #2086

Nil Reference/massive log spam in Controller [1.13] #2086

highlyunavailable commented Apr 30, 2021 •

edited

Loading

markmandel commented May 1, 2021

markmandel commented May 1, 2021

highlyunavailable commented May 1, 2021 •

edited

Loading

markmandel commented May 1, 2021

highlyunavailable commented May 1, 2021

markmandel commented May 1, 2021

highlyunavailable commented May 1, 2021 •

edited

Loading

highlyunavailable commented May 1, 2021 •

edited

Loading

highlyunavailable commented May 1, 2021

Nil Reference/massive log spam in Controller [1.13] #2086

Nil Reference/massive log spam in Controller [1.13] #2086

Comments

highlyunavailable commented Apr 30, 2021 • edited Loading

markmandel commented May 1, 2021

markmandel commented May 1, 2021

highlyunavailable commented May 1, 2021 • edited Loading

markmandel commented May 1, 2021

highlyunavailable commented May 1, 2021

markmandel commented May 1, 2021

highlyunavailable commented May 1, 2021 • edited Loading

highlyunavailable commented May 1, 2021 • edited Loading

highlyunavailable commented May 1, 2021

highlyunavailable commented Apr 30, 2021 •

edited

Loading

highlyunavailable commented May 1, 2021 •

edited

Loading

highlyunavailable commented May 1, 2021 •

edited

Loading

highlyunavailable commented May 1, 2021 •

edited

Loading