Watching Secrets fails on startup of controller #1310
I'm also facing a similar kind of issue, though not exactly the same. After upgrading controller-runtime to 0.15.0 and bumping the k8s dependencies to 0.27.2, I started seeing the error below in my operator:
My operator doesn't watch the Secret resource, and it worked fine for a couple of years without the secret list permission. But somehow, after the dependency bump, it started failing with a secret list permission error, and the error messages above keep repeating in the logs.
The controller-runtime client automatically starts a watch on any resource type that is fetched through the client's cache-backed reads.
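As an aside for others hitting the permission error: in controller-runtime 0.15+ the manager's client can be told not to cache specific types, so reads for them go straight to the API server instead of starting an informer (which is what needs the list/watch permissions). A minimal sketch, assuming the v0.15 `client.CacheOptions` API:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Disable caching for Secrets: client Get/List calls on Secrets will
	// hit the API server directly instead of starting an informer, so no
	// list/watch RBAC on Secrets is required.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{&corev1.Secret{}},
			},
		},
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // register controllers and call mgr.Start(...) as usual
}
```

Note this trades the cache for an API round-trip on every Secret read, so it only makes sense for types the controller reads rarely.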
The issue has gotten slightly better, and the ListAndWatch time is no longer high enough to cause restarts. Since the number of ConfigMaps and Secrets has increased and we haven't made any relevant changes (mostly dependency updates), I strongly suspect that the issue is related to the performance of the cluster and not client-go.
I have a controller that is built with Kubebuilder and uses client-go for watching Secrets and ConfigMaps. It has been in use for multiple years. However, it has recently started to fail on startup due to an error when watching Secrets. The controller is open source and can be found here: https://github.com/Bankdata/styra-controller.
The error I get on startup is this:
```
E1016 12:18:42.569808 1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Secret: failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 45; INTERNAL_ERROR; received from peer
I1016 12:19:42.466321 1 trace.go:236] Trace[467428356]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (16-Oct-2023 12:18:44.131) (total time: 58334ms):
```
In this log it times out after 58334ms and restarts the controller. I have tried removing all logic from the `handler.EnqueueRequestsFromMapFunc` function and simply making it return an empty list of `reconcile.Request`, but it still times out. Eventually, after a couple of restarts, it succeeds in watching Secrets within the allowed amount of time and starts successfully.
The controller also watches ConfigMaps, but that doesn't seem to cause any problems. I have tried making it watch only ConfigMaps, and that works fine without timeouts.
The cluster (running OpenShift) contains 3495 Secrets and 3276 ConfigMaps, but I don't know their sizes.
I cannot reproduce it in a local kind cluster.
I have found similar issues on GitHub suggesting to give the controller more resources, or that there are too many Secrets in the cluster, or that the Kubernetes API is slow. However, the controller does not lack resources, and a `kubectl get secrets -A` takes ~10 seconds. Also, the team hosting the cluster cannot see anything in their logs in the time ranges where we receive the errors.

Personally, I think it is the cluster (Kubernetes API) that is performing poorly, and I will try to get more information from the team hosting the cluster.
Updates
I increased the log level of client-go which made it log the following two lines:
And it is the functionality between these two log lines that takes the 60 seconds: the list-and-watch functionality. The same log lines are produced for ConfigMaps, but there it only takes 11 seconds. So there seems to be a difference between ConfigMaps and Secrets on OpenShift.
When trying to reproduce this locally in a kind cluster, I deployed 85k Secrets with a total size of approx. 1GB. The 1GB is based on the RAM usage of the container with the controller (it has to index all the Secrets in RAM). In the OpenShift cluster the RAM usage is approx. 1.2GB, so I am fairly confident that my local kind cluster has more Secrets, with a total size bigger than the Secrets in OpenShift.
In the local kind cluster, the time between the two log lines above was only 4 seconds. I feel this shows that the problem does not lie in client-go. Also, `kubectl get secrets -A` takes 45 seconds locally and only 8 seconds in OpenShift, so there doesn't seem to be a relation between the evaluation time of that command and the performance of the list-and-watch functionality.