
Figure out if MHC informers should be stopped if the cluster goes away #2577

Closed
vincepri opened this issue Mar 6, 2020 · 7 comments
Labels: kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt.)

@vincepri (Member) commented Mar 6, 2020

In the MachineHealthCheck code we run informers in goroutines

go nodeInformer.Run(ctx.Done())

to asynchronously track changes to workload cluster nodes.

If a cluster gets deleted, the informer is never stopped. We need a spike to understand the long-term implications of not closing informers and whether we should take action.

One example that comes to mind: say we have a long-running management cluster that creates and deletes lots of clusters (for whatever reason); not closing the informers would effectively leak goroutines.

Another point to investigate is what happens if a control plane becomes unavailable and we can't connect to it. Do we have ways to recover? Should we give up on the informer after a certain number of failures?
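
For illustration only, here's a minimal sketch (not the actual MHC code; the tracker type and its fields are hypothetical) of how the node informer could be run against a per-cluster cancellable context so it can be stopped again when the Cluster goes away:

package mhcsketch

import (
	"context"
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// nodeInformerTracker keeps one cancel function per workload cluster so the
// informer goroutine started below can be shut down again later.
type nodeInformerTracker struct {
	mu      sync.Mutex
	cancels map[types.NamespacedName]context.CancelFunc
}

func newNodeInformerTracker() *nodeInformerTracker {
	return &nodeInformerTracker{cancels: map[types.NamespacedName]context.CancelFunc{}}
}

// startNodeInformer starts watching Nodes in the given workload cluster.
func (t *nodeInformerTracker) startNodeInformer(ctx context.Context, cluster types.NamespacedName, cs kubernetes.Interface) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.cancels[cluster]; ok {
		return // already watching this cluster's nodes
	}
	informerCtx, cancel := context.WithCancel(ctx)
	t.cancels[cluster] = cancel

	factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	// Deriving the stop channel from a per-cluster context, rather than only
	// from the manager's context, is what makes the informer stoppable.
	go nodeInformer.Run(informerCtx.Done())
}

// stopNodeInformer is what a Cluster-deletion handler would call.
func (t *nodeInformerTracker) stopNodeInformer(cluster types.NamespacedName) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if cancel, ok := t.cancels[cluster]; ok {
		cancel()
		delete(t.cancels, cluster)
	}
}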

/kind cleanup
/milestone Next
/cc @JoelSpeed

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Mar 6, 2020
@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Mar 6, 2020
@JoelSpeed (Contributor)

/assign

I'll take a look at this

@JoelSpeed (Contributor)

We are definitely going to need to do something about this; I ran into it while testing some other MHC work. Once I'd deleted the cluster, I started getting log lines like:

I0311 09:19:46.752805       1 reflector.go:185] Listing and watching *v1.Node from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105
E0311 09:19:46.758862       1 reflector.go:153] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to list *v1.Node: Get https://jspeed-capi-1-apiserver-1102670450.us-east-1.elb.amazonaws.com:6443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp: lookup jspeed-capi-1-apiserver-1102670450.us-east-1.elb.amazonaws.com on 10.96.0.10:53: no such host

Approximately every second, indefinitely.

Another point to investigate is what happens if a control plane becomes unavailable and we can't connect to it. Do we have ways to recover? Should we give up on the informer after a certain number of failures?

Given that the above error message was looping, I suspect it would reconnect, but I can try to simulate this scenario too.

I'm not sure what the best approach to a stopping mechanism would be:

  • Use an informer to watch for cluster deletion events - heavy, could miss the events?
  • Start a second goroutine which attempts to fetch the cluster every <period> and, if it gets a 404, closes the informer? (See the sketch after this list.)
  • Other?
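
To make the second option a bit more concrete, a rough sketch (purely hypothetical, assuming a controller-runtime client and the v1alpha3 Cluster type) could look something like:

package mhcsketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// watchForClusterDeletion polls the Cluster object every period and, once it
// comes back NotFound, calls stopInformer to close that cluster's node informer.
func watchForClusterDeletion(ctx context.Context, c client.Client, key client.ObjectKey, period time.Duration, stopInformer context.CancelFunc) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cluster := &clusterv1.Cluster{}
			if err := c.Get(ctx, key, cluster); apierrors.IsNotFound(err) {
				stopInformer() // cluster is gone, shut the informer down
				return
			}
		}
	}
}

Note this only covers deletion (404); an unreachable-but-still-existing control plane would need separate handling.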

@ncdc (Contributor) commented Mar 11, 2020

Given that we resync every 10 minutes, even if we miss a cluster deletion event (which is extremely unlikely), we would be able to address the issue at the next resync.

We are already watching Clusters in the MHC controller, so we have that part covered. We need to add code in Reconcile somewhere that stops the cache and deletes the entry if the workload cluster is not reachable.

For an intermittently available workload cluster, the next time we reconcile an MHC for it, watchClusterNodes would try to recreate the cache.

Does that sound workable?
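
As a loose illustration of that flow (all of the type and helper names below are hypothetical, not the actual controller code):

package mhcsketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// mhcReconcilerSketch stands in for the real MachineHealthCheck reconciler.
type mhcReconcilerSketch struct {
	nodeInformerCancels map[client.ObjectKey]context.CancelFunc
}

// getWorkloadClient is a placeholder for however the controller obtains a
// client for the workload cluster; it fails when the cluster is unreachable.
func (r *mhcReconcilerSketch) getWorkloadClient(ctx context.Context, key client.ObjectKey) (client.Client, error) {
	return nil, nil
}

// reconcileWorkloadCache sketches the idea above: if the workload cluster is
// unreachable or gone, stop its node informer and delete the cache entry so a
// later reconcile (via watchClusterNodes) can recreate it.
func (r *mhcReconcilerSketch) reconcileWorkloadCache(ctx context.Context, clusterKey client.ObjectKey) error {
	remoteClient, err := r.getWorkloadClient(ctx, clusterKey)
	if err != nil {
		if cancel, ok := r.nodeInformerCancels[clusterKey]; ok {
			cancel() // stop the informer goroutine for this cluster
			delete(r.nodeInformerCancels, clusterKey)
		}
		return err
	}
	_ = remoteClient
	// Normal MHC reconciliation against the workload cluster continues here.
	return nil
}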

@JoelSpeed (Contributor)

We are already watching Clusters in the MHC controller, so we have that part covered. We need to add code in Reconcile somewhere that stops the cache and deletes the entry if the workload cluster is not reachable.

My concern with adding something in Reconcile is that we might not reconcile an MHC for a given cluster if it's been deleted, right? I'm thinking: the Cluster gets deleted, GC deletes the MHCs for that cluster, the event triggers a reconcile, but all of those MHCs are gone, so we no longer know which cluster they were for and can't stop the right cache.

One idea I just had to solve that problem would be to add a finalizer (though I'm hesitant to do that): when an MHC is deleted, list all MHCs by cluster, check if any are left for the same cluster, and if not, stop the cache. That still doesn't solve the intermittent connectivity problem, though. (A rough sketch of this is below.)
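
For what it's worth, a very rough sketch of the cleanup step that would sit behind such a finalizer (hypothetical names; assuming the v1alpha3 MachineHealthCheck type and a stopNodeInformer callback):

package mhcsketch

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// stopCacheIfLastMHC checks whether any other MachineHealthCheck in the same
// namespace still targets the cluster of the MHC being deleted; if none do,
// it stops that cluster's node informer cache.
func stopCacheIfLastMHC(ctx context.Context, c client.Client, mhc *clusterv1.MachineHealthCheck, stopNodeInformer func(client.ObjectKey)) error {
	mhcList := &clusterv1.MachineHealthCheckList{}
	if err := c.List(ctx, mhcList, client.InNamespace(mhc.Namespace)); err != nil {
		return err
	}
	for i := range mhcList.Items {
		other := &mhcList.Items[i]
		if other.Name != mhc.Name && other.Spec.ClusterName == mhc.Spec.ClusterName {
			return nil // another MHC still needs this cluster's node informer
		}
	}
	// No other MHCs reference this cluster, so its cache can be torn down.
	stopNodeInformer(client.ObjectKey{Namespace: mhc.Namespace, Name: mhc.Spec.ClusterName})
	return nil
}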

I also think it might be worth trying to keep this logic within the watchClusterNodes section of the code. Whatever we decide to do here, I think we should try to keep it reasonably generic for the sake of #2414.

@vincepri (Member, Author)

/milestone v0.3.4

Should be fixed in #2880

@k8s-ci-robot k8s-ci-robot modified the milestones: Next, v0.3.4 Apr 20, 2020
@vincepri (Member, Author)

Merging this with #2414

/close

@k8s-ci-robot (Contributor)

@vincepri: Closing this issue.

In response to this:

Merging this with #2414

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
