
Watching Secrets fails on startup of controller #1310

Closed
bdjgs opened this issue Oct 24, 2023 · 3 comments

Comments

@bdjgs

bdjgs commented Oct 24, 2023

I have a controller that is built with Kubebuilder and uses client-go for watching Secrets and ConfigMaps. It has been in use for multiple years. However, it has recently started to fail on startup due to an error while watching Secrets. The controller is open source and can be found here: https://github.com/Bankdata/styra-controller.
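For context, this is roughly how such watches are registered with controller-runtime's builder. This is only a minimal sketch assuming the v0.15+ builder API, with hypothetical type and mapper names rather than the actual styra-controller code:

```go
import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/handler"
)

// ExampleReconciler, examplev1.Example, and the two mapper methods are
// hypothetical names used for illustration only.
func (r *ExampleReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&examplev1.Example{}). // hypothetical primary resource
		// Each Secret/ConfigMap event is mapped back to reconcile requests.
		Watches(&corev1.Secret{}, handler.EnqueueRequestsFromMapFunc(r.secretToRequests)).
		Watches(&corev1.ConfigMap{}, handler.EnqueueRequestsFromMapFunc(r.configMapToRequests)).
		Complete(r)
}
```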

The error I get on startup is this:

E1016 12:18:42.569808 1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Secret: failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 45; INTERNAL_ERROR; received from peer
I1016 12:19:42.466321 1 trace.go:236] Trace[467428356]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (16-Oct-2023 12:18:44.131) (total time: 58334ms):

In this log, the list times out after 58334ms and the controller restarts. I have tried removing all logic from the handler.EnqueueRequestsFromMapFunc function so that it simply returns an empty list of reconcile.Request, but it still times out.
Eventually, after a couple of restarts, it succeeds in watching Secrets within the allowed time and starts successfully.
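The diagnostic change amounts to something like the following no-op mapper; a minimal sketch assuming the controller-runtime v0.15+ handler.MapFunc signature, not the exact code:

```go
import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// noopSecretMapper (hypothetical name) rules out the mapping logic as the
// cause: every Secret event maps to zero reconcile requests, yet the
// startup list still times out.
func noopSecretMapper(ctx context.Context, obj client.Object) []reconcile.Request {
	return []reconcile.Request{}
}
```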

The controller also watches ConfigMaps, but that doesn't seem to cause any problems. I have tried making it watch only ConfigMaps, and that works fine without timeouts.

The cluster (running OpenShift) contains 3495 Secrets and 3276 ConfigMaps, but I don't know their sizes.

I cannot reproduce it in a local kind cluster.

I have found similar issues on GitHub suggesting to give the controller more resources, or that there are too many Secrets in the cluster, or that the Kubernetes API is slow. However, the controller does not lack resources, and a kubectl get secrets -A takes ~10 seconds. Also, the team hosting the cluster cannot see anything in their logs in the time ranges where we receive the errors.

Personally, I think it is the cluster (Kubernetes API) that is performing poorly, and I will try to get more information from the team hosting the cluster.

Updates
I increased the log level of client-go, which made it log the following two lines:

I1109 11:04:41.667607       1 reflector.go:325] Listing and watching *v1.Secret from pkg/cache/internal/informers.go:208
I1109 11:05:42.682687       1 reflector.go:351] Caches populated for *v1.Secret from pkg/cache/internal/informers.go:208

It is the work between these two log lines, i.e. the list-and-watch functionality, that takes the roughly 60 seconds. The same log lines are produced for ConfigMaps, but there it only takes 11 seconds, so there seems to be a difference between ConfigMaps and Secrets on OpenShift.
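For anyone wanting to reproduce this logging: client-go logs through klog, so the verbosity can be raised from the controller's main package. A minimal sketch, assuming a plain klog setup and that verbosity 4 is enough to surface these reflector lines:

```go
import (
	"flag"

	"k8s.io/klog/v2"
)

func init() {
	// client-go logs through klog; registering its flags and raising -v
	// exposes the reflector trace lines quoted above.
	// Assumption: verbosity 4 is high enough to surface them.
	klog.InitFlags(flag.CommandLine)
	_ = flag.CommandLine.Set("v", "4")
}
```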

In trying to reproduce this locally in a kind cluster, I deployed 85k Secrets with a total size of approx. 1GB. The 1GB is based on the RAM usage of the container with the controller (it has to index all the Secrets in RAM). In the OpenShift cluster the RAM usage is approx. 1.2GB, so I am fairly confident that my local kind cluster has more Secrets, with a bigger total size, than the OpenShift cluster.
In the local kind cluster the time between the two log lines above was only 4 seconds, which suggests the problem does not lie in client-go. Also, kubectl get secrets -A takes 45 seconds locally and only 8 seconds in OpenShift, so there doesn't seem to be a relation between the runtime of that command and the performance of the list-and-watch functionality.
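A minimal sketch of how Secrets at this scale can be bulk-created with client-go. The namespace, naming scheme, count, and payload size here are illustrative assumptions, not the exact script used:

```go
package main

import (
	"context"
	"crypto/rand"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (pointing at the kind cluster).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// ~12 KiB per Secret * 85k Secrets ≈ 1 GiB total (assumed sizing).
	payload := make([]byte, 12*1024)
	_, _ = rand.Read(payload)

	for i := 0; i < 85000; i++ {
		s := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("load-test-%d", i),
				Namespace: "default",
			},
			Data: map[string][]byte{"blob": payload},
		}
		if _, err := cs.CoreV1().Secrets("default").Create(context.TODO(), s, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```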

@abhishekdwivedi3060

abhishekdwivedi3060 commented Oct 26, 2023

I'm also facing a similar kind of issue, though not exactly the same. After upgrading controller-runtime to 0.15.0 and bumping the k8s dependencies to 0.27.2, I started seeing the below error in my operator:

W1026 18:01:13.769684       1 reflector.go:533] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1.Secret: secrets is forbidden: User "system:serviceaccount:operators:aerospike-operator-controller-manager" cannot list resource "secrets" in API group "" at the cluster scope

My operator doesn't watch the Secret resource, and it worked fine for a couple of years without the secrets list permission. But somehow, after the dependency bump, it started giving a secrets list permission error, and the above error messages keep appearing in the logs.
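If listing Secrets is actually expected (see the next comment for why the cached client needs it), the missing permission could be granted with a kubebuilder RBAC marker; a sketch assuming the operator generates its ClusterRole from markers:

```go
// Grants cluster-scoped get/list/watch on Secrets so the cached client's
// informer can start. Only appropriate if Secrets should really be cached.
//+kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch
```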

@akutz
Member

akutz commented Feb 1, 2024

The controller-runtime client automatically starts a watch on any type of resource that is fetched using the client's Get method. See kubernetes-sigs/controller-runtime#1222 for more information.
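A common way to avoid that implicit cluster-wide Secret informer, instead of granting list/watch, is to exclude Secrets from the cached client so reads go straight to the API server. A minimal sketch against the controller-runtime v0.15+ manager options, not necessarily the right trade-off for every operator:

```go
import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Excluding Secrets from the cached client means a client.Get on a
		// Secret no longer starts a cluster-wide informer, so no list/watch
		// RBAC is needed; the trade-off is that every Get hits the API server.
		Client: client.Options{
			Cache: &client.CacheOptions{
				DisableFor: []client.Object{&corev1.Secret{}},
			},
		},
	})
}
```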

@bdjgs
Author

bdjgs commented Apr 26, 2024

The issue has gotten slightly better, and the ListAndWatch time is no longer high enough to cause restarts. Since the number of ConfigMaps and Secrets has increased and we haven't made any relevant changes (mostly dependency updates), I strongly suspect that the issue is related to the performance of the cluster and not to client-go.

@bdjgs bdjgs closed this as completed Apr 26, 2024