Nil Reference/massive log spam in Controller [1.13] #2086
🤔 Is this an error in the http2 implementation of Go, or an error in Agones? It looks like the former, but I'm not sure.
https://github.com/golang/go/blob/go1.14.12/src/net/http/h2_bundle.go#L5714 < that looks like the relevant line. I'm not 100% sure what causes it, though. Did something odd happen to the K8s control plane? (I'm assuming that IP is the K8s control plane?)
Not sure what happened, and that IP could have been the control plane, but it’s EKS. I moved the pod to a different node a few minutes before, so it could have been related to that, but it was working post-move for a few minutes until this happened. I’m happy to take this up somewhere else, but I feel I’m going to be told it’s someone else’s problem in that stack no matter who I open an issue with.
I think I might have found it: looks like a fix dropped as of 16 days ago. Seems like this has been here for a very long time, but you are the first person who has ever reported the issue! Never ever seen this before.
Holy moly! Well that makes me feel special. 🤣 I have mitigation in place now if it happens again, but I guess the fix is kind of in process and it seems fairly rare, so I won’t worry about it as much now. Thanks for investigating.
🤔 Do we need to put a mutex around the encode? Let's keep an eye on things. I'd like to avoid putting a mutex in place, but if it's not thread-safe, we might have to.
What about cloning the object? If the issue is "encoding the same object from multiple goroutines", then two different objects would completely alleviate the issue as well.
We could do the same: copy the object and encode the copy, until the upstream fix gets rolled into Agones via package updates.
Alright, I was able to reproduce it super easily in tests. I took … Adding a mutex or deep copying always makes the test pass.
Fix is submitted, see PR. Turns out a shallow copy is plenty, and everything is happy from just doing that. It makes sense, since the actual inner objects aren't being tweaked, just the outer one, which avoids both a full deep copy and locking.
This works around a race condition present in the encoder for Kubernetes objects, as detailed in kubernetes/kubernetes#82497. Uses the fix from kubernetes/kubernetes#101123. Fixes #2086 (#2089)
What happened:
The controller started to produce this error over and over, as fast as it could log (1.64M entries in 20 minutes).
What you expected to happen:
Errors to be cleanly handled, or the process to crash so it can be restarted with normal backoff/alerting. (Deleting the pod resolved the issue.)
How to reproduce it (as minimally and precisely as possible):
Unknown.
Anything else we need to know?:
This appears to be a simple nil-check issue, but it's unknown why it failed.
Environment:
Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-eks-7737de", GitCommit:"7737de131e58a68dda49cdd0ad821b4cb3665ae8", GitTreeState:"clean", BuildDate:"2021-03-10T21:33:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}