
Watching Secrets fails on startup of controller #1310

Closed
bdjgs opened this issue Oct 24, 2023 · 3 comments

Comments

@bdjgs

bdjgs commented Oct 24, 2023

I have a controller that is built with Kubebuilder and uses client-go for watching Secrets and ConfigMaps. It has been in use for multiple years. However, it has recently started to fail on startup due to an error while watching Secrets. The controller is open source and can be found here: https://github.com/Bankdata/styra-controller.
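For context, this is roughly how such watches are registered with controller-runtime's builder. This is only a minimal sketch assuming the v0.15+ builder API, with hypothetical type and mapper names rather than the actual styra-controller code:

```go
import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/handler"
)

// ExampleReconciler, examplev1.Example, and the two mapper methods are
// hypothetical names used for illustration only.
func (r *ExampleReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&examplev1.Example{}). // hypothetical primary resource
		// Each Secret/ConfigMap event is mapped back to reconcile requests.
		Watches(&corev1.Secret{}, handler.EnqueueRequestsFromMapFunc(r.secretToRequests)).
		Watches(&corev1.ConfigMap{}, handler.EnqueueRequestsFromMapFunc(r.configMapToRequests)).
		Complete(r)
}
```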

The error I get on startup is this:

E1016 12:18:42.569808 1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Secret: failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 45; INTERNAL_ERROR; received from peer
I1016 12:19:42.466321 1 trace.go:236] Trace[467428356]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (16-Oct-2023 12:18:44.131) (total time: 58334ms):

In this log, the list times out after 58334ms and the controller restarts. I have tried removing all logic from the handler.EnqueueRequestsFromMapFunc function so that it simply returns an empty list of reconcile.Request, but it still times out.
Eventually, after a couple of restarts, it succeeds in watching Secrets within the allowed time and starts successfully.
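The diagnostic change amounts to something like the following no-op mapper; a minimal sketch assuming the controller-runtime v0.15+ handler.MapFunc signature, not the exact code:

```go
import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// noopSecretMapper (hypothetical name) rules out the mapping logic as the
// cause: every Secret event maps to zero reconcile requests, yet the
// startup list still times out.
func noopSecretMapper(ctx context.Context, obj client.Object) []reconcile.Request {
	return []reconcile.Request{}
}
```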

The controller also watches ConfigMaps, but that doesn't seem to cause any problems. I have tried making it watch only ConfigMaps, and that works fine without timeouts.

The cluster (running OpenShift) contains 3495 Secrets and 3276 ConfigMaps, but I don't know their sizes.

I cannot reproduce it in a local kind cluster.

I have found similar issues on GitHub suggesting to give the controller more resources, or that there are too many Secrets in the cluster, or that the Kubernetes API is slow. However, the controller does not lack resources, and a kubectl get secrets -A takes ~10 seconds. Also, the team hosting the cluster cannot see anything in their logs in the time ranges where we receive the errors.

Personally, I think it is the cluster (Kubernetes API) that is performing poorly, and I will try to get more information from the team hosting the cluster.

Updates
I increased the log level of client-go, which made it log the following two lines:

I1109 11:04:41.667607       1 reflector.go:325] Listing and watching *v1.Secret from pkg/cache/internal/informers.go:208
I1109 11:05:42.682687       1 reflector.go:351] Caches populated for *v1.Secret from pkg/cache/internal/informers.go:208

It is the work between these two log lines, i.e. the list-and-watch functionality, that takes the roughly 60 seconds. The same log lines are produced for ConfigMaps, but there it only takes 11 seconds, so there seems to be a difference between ConfigMaps and Secrets on OpenShift.
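For anyone wanting to reproduce this logging: client-go logs through klog, so the verbosity can be raised from the controller's main package. A minimal sketch, assuming a plain klog setup and that verbosity 4 is enough to surface these reflector lines:

```go
import (
	"flag"

	"k8s.io/klog/v2"
)

func init() {
	// client-go logs through klog; registering its flags and raising -v
	// exposes the reflector trace lines quoted above.
	// Assumption: verbosity 4 is high enough to surface them.
	klog.InitFlags(flag.CommandLine)
	_ = flag.CommandLine.Set("v", "4")
}
```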

In trying to reproduce this locally in a kind cluster, I deployed 85k Secrets with a total size of approx. 1GB. The 1GB is based on the RAM usage of the container with the controller (it has to index all the Secrets in RAM). In the OpenShift cluster the RAM usage is approx. 1.2GB, so I am fairly confident that my local kind cluster has more Secrets, with a bigger total size, than the OpenShift cluster.
In the local kind cluster the time between the two log lines above was only 4 seconds, which suggests the problem does not lie in client-go. Also, kubectl get secrets -A takes 45 seconds locally and only 8 seconds in OpenShift, so there doesn't seem to be a relation between the runtime of that command and the performance of the list-and-watch functionality.
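A minimal sketch of how Secrets at this scale can be bulk-created with client-go. The namespace, naming scheme, count, and payload size here are illustrative assumptions, not the exact script used:

```go
package main

import (
	"context"
	"crypto/rand"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (pointing at the kind cluster).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// ~12 KiB per Secret * 85k Secrets ≈ 1 GiB total (assumed sizing).
	payload := make([]byte, 12*1024)
	_, _ = rand.Read(payload)

	for i := 0; i < 85000; i++ {
		s := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("load-test-%d", i),
				Namespace: "default",
			},
			Data: map[string][]byte{"blob": payload},
		}
		if _, err := cs.CoreV1().Secrets("default").Create(context.TODO(), s, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```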

@abhishekdwivedi3060

abhishekdwivedi3060 commented Oct 26, 2023

I'm also facing a similar kind of issue, though not exactly the same. After upgrading controller-runtime to 0.15.0 and bumping the k8s dependencies to 0.27.2, I started seeing the below error in my operator:

W1026 18:01:13.769684       1 reflector.go:533] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1.Secret: secrets is forbidden: User "system:serviceaccount:operators:aerospike-operator-controller-manager" cannot list resource "secrets" in API group "" at the cluster scope

My operator doesn't watch the Secret resource, and it worked fine for a couple of years without the secrets list permission. But somehow, after the dependency bump, it started giving a secrets list permission error, and the above error messages keep appearing in the logs.
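If listing Secrets is actually expected (see the next comment for why the cached client needs it), the missing permission could be granted with a kubebuilder RBAC marker; a sketch assuming the operator generates its ClusterRole from markers:

```go
// Grants cluster-scoped get/list/watch on Secrets so the cached client's
// informer can start. Only appropriate if Secrets should really be cached.
//+kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch
```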

@akutz
Member

akutz commented Feb 1, 2024

The controller-runtime client automatically starts a watch on any type of resource that is fetched using the client's Get method. See kubernetes-sigs/controller-runtime#1222 for more information.
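A common way to avoid that implicit cluster-wide Secret informer, instead of granting list/watch, is to exclude Secrets from the cached client so reads go straight to the API server. A minimal sketch against the controller-runtime v0.15+ manager options, not necessarily the right trade-off for every operator:

```go
import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Excluding Secrets from the cached client means a client.Get on a
		// Secret no longer starts a cluster-wide informer, so no list/watch
		// RBAC is needed; the trade-off is that every Get hits the API server.
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{&corev1.Secret{}},
			},
		},
	})
}
```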

@bdjgs
Author

bdjgs commented Apr 26, 2024

The issue has gotten slightly better, and the ListAndWatch time is no longer high enough to cause restarts. Since the number of ConfigMaps and Secrets has increased and we haven't made any relevant changes (mostly dependency updates), I strongly suspect that the issue is related to the performance of the cluster and not to client-go.

@bdjgs bdjgs closed this as completed Apr 26, 2024