Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nil Reference/massive log spam in Controller [1.13] #2086

Closed
highlyunavailable opened this issue Apr 30, 2021 · 9 comments · Fixed by #2089
Closed

Nil Reference/massive log spam in Controller [1.13] #2086

highlyunavailable opened this issue Apr 30, 2021 · 9 comments · Fixed by #2089
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs.
Milestone

Comments

@highlyunavailable
Copy link
Contributor

highlyunavailable commented Apr 30, 2021

What happened:

The controller started to produce this error over and over as fast as it could log (1.64M entries in 20 minutes):

2021/04/30 23:46:45 http2: panic serving 172.31.39.241:52498: runtime error: invalid memory address or nil pointer dereference
goroutine 1780371 [running]:
net/http.(*http2serverConn).runHandler.func1(0xc0007a2288, 0xc0017dbf8e, 0xc000ce7980)
        /usr/local/go/src/net/http/h2_bundle.go:5714 +0x16b
panic(0x1767040, 0x2903ca0)
        /usr/local/go/src/runtime/panic.go:969 +0x166
strings.Count(0x0, 0x2, 0x1982a76, 0x1, 0xc000f2e6e0)
        /usr/local/go/src/strings/strings.go:84 +0xf4
k8s.io/apimachinery/pkg/runtime/schema.ParseGroupVersion(0x0, 0x2, 0x8cb061, 0x174f460, 0xc0003faae0, 0xc000f2e768, 0xc0003ce0b8, 0x1857d01)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/runtime/schema/group_version.go:227 +0x91
k8s.io/apimachinery/pkg/runtime/schema.FromAPIVersionAndKind(0x0, 0x2, 0x1613c13, 0xf, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/runtime/schema/group_version.go:310 +0x5a
k8s.io/apimachinery/pkg/apis/meta/v1.(*TypeMeta).GroupVersionKind(0xc001040af0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/apis/meta/v1/meta.go:128 +0x5e
k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).doEncode(0xc0017ccf00, 0x1c702c0, 0xc001040af0, 0x7f064da69478, 0xc0007a2288, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning/versioning.go:217 +0x186
k8s.io/apimachinery/pkg/runtime/serializer/versioning.(*codec).Encode(0xc0017ccf00, 0x1c702c0, 0xc001040af0, 0x7f064da69478, 0xc0007a2288, 0xc00043f840, 0xc0000aea80)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/runtime/serializer/versioning/versioning.go:184 +0x178
agones.dev/agones/pkg/util/apiserver.(*APIServer).addSerializedHandler.func1(0x1ca1d00, 0xc0007a2288, 0xc0014cf500, 0x45fea0, 0xc0007c02d0)
        /go/src/agones.dev/agones/pkg/util/apiserver/apiserver.go:171 +0x37a
agones.dev/agones/pkg/util/https.ErrorHTTPHandler.func1(0x1ca1d00, 0xc0007a2288, 0xc0014cf500)
        /go/src/agones.dev/agones/pkg/util/https/https.go:52 +0x50
net/http.HandlerFunc.ServeHTTP(0xc000e40ce0, 0x1ca1d00, 0xc0007a2288, 0xc0014cf500)
        /usr/local/go/src/net/http/server.go:2041 +0x44
net/http.(*ServeMux).ServeHTTP(0xc000244440, 0x1ca1d00, 0xc0007a2288, 0xc0014cf500)
        /usr/local/go/src/net/http/server.go:2416 +0x1a5
net/http.serverHandler.ServeHTTP(0xc000372000, 0x1ca1d00, 0xc0007a2288, 0xc0014cf500)
        /usr/local/go/src/net/http/server.go:2836 +0xa3
net/http.initALPNRequest.ServeHTTP(0x1ca5d80, 0xc000aaf8c0, 0xc000699500, 0xc000372000, 0x1ca1d00, 0xc0007a2288, 0xc0014cf500)
        /usr/local/go/src/net/http/server.go:3410 +0x8d
net/http.(*http2serverConn).runHandler(0xc000ce7980, 0xc0007a2288, 0xc0014cf500, 0xc000df4700)
        /usr/local/go/src/net/http/h2_bundle.go:5721 +0x8b
created by net/http.(*http2serverConn).processHeaders
        /usr/local/go/src/net/http/h2_bundle.go:5455 +0x4e1

What you expected to happen:

Errors to be cleanly handled or the process to just crash so it can be restarted using normal backoff/alerting of crash (Deleting the pod resolved the issue).

How to reproduce it (as minimally and precisely as possible):

Unknown.

Anything else we need to know?: This appears to be a simple Nil check issue but it's unknown as to why it failed.

Environment:

  • Agones version: 1.13
  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-eks-7737de", GitCommit:"7737de131e58a68dda49cdd0ad821b4cb3665ae8", GitTreeState:"clean", BuildDate:"2021-03-10T21:33:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: aws
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s): No available log (it's in datadog) but it just started with no prior messages.
  • Others:
@highlyunavailable highlyunavailable added the kind/bug These are bugs. label Apr 30, 2021
@highlyunavailable highlyunavailable changed the title Crash/massive log spam in Controller [1.13] Nil Reference/massive log spam in Controller [1.13] Apr 30, 2021
@markmandel
Copy link
Member

🤔 is this an error in the http2 implementation of Go? Or is this an error in Agones?

It looks like the former? But I'm not sure.

@markmandel
Copy link
Member

https://github.com/golang/go/blob/go1.14.12/src/net/http/h2_bundle.go#L5714 < that looks like the relevent line.

I'm not 100% sure what causes it though? Did something odd happen to the K8s control plane? (I'm assuming that IP is the K8s control plane?)

@highlyunavailable
Copy link
Contributor Author

highlyunavailable commented May 1, 2021

  1. This spams millions of log entries in minutes. That’s an Agones problem as the library user. We pay per million log entries so that sucks.
  2. Agones is pretty solidly in the middle of that call stack. It feels like the handling should be more or less robust to mitigate this through backoff or crash.

Not sure what happened, and that IP could have been the control plane but it’s EKS. I moved the pod to a different node a few minutes before so it could have been related to that, but it was working post move for a few minutes til this happened.

I’m happy to take this up somewhere else but I feel that I’m going to be told it’s someone else’s problem in that stack no matter who I open an issue with.

@markmandel
Copy link
Member

I think I might have found it:

kubernetes/kubernetes#82497

Looks like a fix dropped as of 16 days ago. Seems like this has been here for a very long time, but you are the first person who has ever reported the issue! Never ever seen this before.

@highlyunavailable
Copy link
Contributor Author

Holy moly! Well that makes me feel special. 🤣 I have mitigation in place now if it happens again but I guess the fix is kind of in process and it seems fairly rare so I won’t worry about it as much now. Thanks for investigating.

@markmandel
Copy link
Member

🤔 do we need to put a mutex around the .Encode(...) portion of https://github.com/googleforgames/agones/blob/main/pkg/util/apiserver/apiserver.go#L171 ?

Let's keep an eye on things. I'd like to avoid putting a mutex in place, but if it's not thread safe, we might have to.

@highlyunavailable
Copy link
Contributor Author

highlyunavailable commented May 1, 2021

What about cloning the object? If the issue is "encoding the same object from multiple goroutines" then two different objects would completely alleviate the issue as well. k8sruntime.Object has a DeepCopy func on it that could probably do just that, and in fact it's a variant of what they propose in that other issue:

We can probably fix this globally by slightly tweaking UnsafeConversion to copy the top most object (where kind is set) on its way down the conversion stack.

We could do the same, insofar as we copy the object, encode the copy, until such time as the fix gets rolled into Agones via package updates.

@highlyunavailable
Copy link
Contributor Author

highlyunavailable commented May 1, 2021

Alright, I was able to super easily reproduce it in tests. I took TestAPIServerAddAPIResourceDiscovery and made a "parallel" copy. Running it without adding a mutex around encode or deep copying produces this result (you may have to run it a few times and clear the test cache between runs, it's a race condition so it's flaky):

Running tool: /usr/local/go/bin/go test -timeout 30s -run ^(TestAPIServerAddAPIResourceCRDHandler|TestAPIServerAddAPIResourceDiscovery|TestAPIServerAddAPIResourceDiscoveryParallel|TestSplitNameSpaceResource)$ agones.dev/agones/pkg/util/apiserver

{"body":"","headers":{"Accept-Encoding":["gzip"],"User-Agent":["Go-http-client/1.1"]},"host":"127.0.0.1:43235","message":"404","method":"GET","requestURI":"/apis/allocation.agones.dev/v1/namespaces/default/gameserverallZZZZions","severity":"warning","source":"*apiserver.APIServer","time":"2021-05-01T11:09:09.3230646-07:00","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/apis/allocation.agones.dev/v1/namespaces/default/gameserverallZZZZions","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
--- FAIL: TestAPIServerAddAPIResourceDiscoveryParallel (0.00s)
    --- FAIL: TestAPIServerAddAPIResourceDiscoveryParallel/Parallel_Tests (0.00s)
        --- FAIL: TestAPIServerAddAPIResourceDiscoveryParallel/Parallel_Tests/No_Accept_Header (0.00s)
            apiserver_test.go:200:
                        Error Trace:    apiserver_test.go:200
                        Error:          Not equal:
                                        expected: "v1"
                                        actual  : ""

                                        Diff:
                                        --- Expected
                                        +++ Actual
                                        @@ -1 +1 @@
                                        -v1
                                        +
                        Test:           TestAPIServerAddAPIResourceDiscoveryParallel/Parallel_Tests/No_Accept_Header
            apiserver_test.go:201:
                        Error Trace:    apiserver_test.go:201
                        Error:          Not equal:
                                        expected: "APIResourceList"
                                        actual  : ""

                                        Diff:
                                        --- Expected
                                        +++ Actual
                                        @@ -1 +1 @@
                                        -APIResourceList
                                        +
                        Test:           TestAPIServerAddAPIResourceDiscoveryParallel/Parallel_Tests/No_Accept_Header
FAIL
FAIL    agones.dev/agones/pkg/util/apiserver    0.014s

Adding a mutex or deep copying always makes the test pass:

Running tool: /usr/local/go/bin/go test -timeout 30s -run ^(TestAPIServerAddAPIResourceCRDHandler|TestAPIServerAddAPIResourceDiscovery|TestAPIServerAddAPIResourceDiscoveryParallel|TestSplitNameSpaceResource)$ agones.dev/agones/pkg/util/apiserver

ok  	agones.dev/agones/pkg/util/apiserver	0.014s

highlyunavailable added a commit to highlyunavailable/agones that referenced this issue May 1, 2021
This works around a race condition present in the encoder for
Kubernetes objects, as detailed in
kubernetes/kubernetes#82497

Fixes googleforgames#2086
highlyunavailable added a commit to highlyunavailable/agones that referenced this issue May 1, 2021
This works around a race condition present in the encoder for
Kubernetes objects, as detailed in
kubernetes/kubernetes#82497

Fixes googleforgames#2086
highlyunavailable added a commit to highlyunavailable/agones that referenced this issue May 1, 2021
This works around a race condition present in the encoder for
Kubernetes objects, as detailed in
kubernetes/kubernetes#82497

Fixes googleforgames#2086
highlyunavailable added a commit to highlyunavailable/agones that referenced this issue May 1, 2021
This works around a race condition present in the encoder for
Kubernetes objects, as detailed in
kubernetes/kubernetes#82497

Fixes googleforgames#2086
highlyunavailable added a commit to highlyunavailable/agones that referenced this issue May 1, 2021
This works around a race condition present in the encoder for
Kubernetes objects, as detailed in
kubernetes/kubernetes#82497

Uses the fix from kubernetes/kubernetes#101123

Fixes googleforgames#2086
highlyunavailable added a commit to highlyunavailable/agones that referenced this issue May 1, 2021
This works around a race condition present in the encoder for
Kubernetes objects, as detailed in
kubernetes/kubernetes#82497

Uses the fix from kubernetes/kubernetes#101123

Fixes googleforgames#2086
@highlyunavailable
Copy link
Contributor Author

Fix is submitted, see PR. Turns out a shallow copy is plenty and everything is happy from just doing that. It makes sense since the actual inner objects aren't being tweaked, just the outer one, which avoids a. a full deep copy and b. avoids locking.

highlyunavailable added a commit to highlyunavailable/agones that referenced this issue May 3, 2021
This works around a race condition present in the encoder for
Kubernetes objects, as detailed in
kubernetes/kubernetes#82497

Uses the fix from kubernetes/kubernetes#101123

Fixes googleforgames#2086
roberthbailey pushed a commit that referenced this issue May 3, 2021
…2089)

This works around a race condition present in the encoder for
Kubernetes objects, as detailed in
kubernetes/kubernetes#82497

Uses the fix from kubernetes/kubernetes#101123

Fixes #2086
@sawagh sawagh added this to the 1.15.0 milestone Jun 1, 2021
@sawagh sawagh added the area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc label Jun 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants