Watches are not working for some controller after informers_map.go:204: watch of *v1alpha1.SFService ended with: too old resource version: 79387199 (79398464)
#869
Comments
@vivekzhere the log message is generally nothing to worry about and expected to occasionally appear: kubernetes/kubernetes#22024 (comment)
That sounds like a bug. Does your reconciler still get requests and block when processing them, or does it not even get them?
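One quick way to check this is to log at the top of `Reconcile`. A minimal sketch, assuming the pre-0.5 signature used by controller-runtime 0.4.0 (the version mentioned below); the `SFServiceReconciler` type is hypothetical:

```go
package controllers

import (
	"log"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// SFServiceReconciler is a hypothetical reconciler used only to show
// where a "request received" log could go.
type SFServiceReconciler struct{}

// Reconcile uses the pre-v0.5 signature (no context argument),
// matching controller-runtime 0.4.0.
func (r *SFServiceReconciler) Reconcile(req reconcile.Request) (reconcile.Result, error) {
	// If this line never shows up for one CR type, that controller is not
	// receiving events at all; if it shows up and then nothing follows,
	// the reconcile is blocking.
	log.Printf("reconcile request received: %s", req.NamespacedName)

	// ... actual reconciliation logic ...
	return reconcile.Result{}, nil
}
```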
I don't follow, are you referring to leader election with the …? Could you list the sequence in which the instances get started/stopped and get the leader lease?
No. The reconciler for this one controller is not getting any requests at all.
Yes. I am referring to leader election with the …
There are two instances of the operator running (…). PS: We have now identified that the issue happens only during upgrades that require recreating these VMs; in those cases it happens every time. We do not see the issue during upgrades that only replace the operator binaries and do not recreate the VMs.
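For context, leader election in controller-runtime 0.4.0 is typically enabled on the manager roughly as in this sketch; the lock name and namespace below are placeholders, not the operator's actual values:

```go
package main

import (
	"log"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/manager/signals"
)

func main() {
	// With two operator replicas, only the instance currently holding the
	// leader lease runs its controllers; the other blocks waiting to acquire it.
	mgr, err := manager.New(config.GetConfigOrDie(), manager.Options{
		LeaderElection:          true,
		LeaderElectionID:        "my-operator-lock",      // hypothetical lock name
		LeaderElectionNamespace: "my-operator-namespace", // hypothetical namespace
	})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}

	// Controllers would be registered with mgr here.

	// In 0.4.0 Start takes a stop channel and blocks until it is closed.
	if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
		log.Fatalf("manager exited: %v", err)
	}
}
```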
Okay, so are you sure this is an issue with controller-runtime then?
@alvaroaleman We are not really sure this is a controller-runtime issue anymore. We were doing … Elaborating a little more on our setup: we have two API servers also deployed along with the operator. On the VM on which …
/triage support
We have four controllers for four different CRs in our operator. Once in a while we see logs like the one in the title.
These come for three of the CRs; for one CR this log does not appear, and after that the controller for that resource stops processing any requests. Only a restart of the operator fixes it.
The CR whose watch is failing has a huge number of resources (~30K), while the other resources number fewer than 10K each.
We also have two instances of the operator running with leader election. We see a pattern where the issue happens when the standby becomes the leader, goes down a few minutes later (for an update), and the other instance becomes the leader again. In the first switchover the log appears for all four resources, but in the second switchover only three resources are watched.
We are using controller-runtime 0.4.0.
(kubernetes/kubernetes#22024)
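For illustration, four controllers for four CR types are typically wired to a single manager roughly as in the sketch below; the import path and the single shared reconciler are simplifications, not the operator's actual code:

```go
package main

import (
	"k8s.io/apimachinery/pkg/runtime"

	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Hypothetical import path for the operator's API types.
	v1alpha1 "example.com/operator/api/v1alpha1"
)

// setupControllers registers one controller, and therefore one informer
// watch, per CR type with the shared manager. If the watch for one type
// dies and is not re-established, only that controller stops receiving
// reconcile requests while the other three keep working.
func setupControllers(mgr manager.Manager, r reconcile.Reconciler) error {
	// In a real operator each CR type would normally get its own reconciler;
	// a single one is used here only to keep the sketch short.
	for _, obj := range []runtime.Object{
		&v1alpha1.SFService{},
		// ... the three other CR types watched by the operator ...
	} {
		if err := builder.ControllerManagedBy(mgr).For(obj).Complete(r); err != nil {
			return err
		}
	}
	return nil
}
```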