
Figure out if MHC informers should be stopped if the cluster goes away #2577

Closed
vincepri opened this issue Mar 6, 2020 · 7 comments
Labels: kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt.)

@vincepri (Member) commented Mar 6, 2020

In the MachineHealthCheck code we run informers in goroutines

go nodeInformer.Run(ctx.Done())

to asynchronously track changes to workload cluster nodes.

If a cluster gets deleted, the informer is never stopped. We need a spike to understand the long-term implications of not closing informers and whether we should take action.

One example that comes to mind: say we have a long-running management cluster that creates and deletes lots of clusters (for whatever reason); not closing the informers would effectively leak goroutines.

Another point to investigate is what happens if a control plane becomes unavailable and we can't connect to it. Do we have ways to recover? Should we give up on the informer after a certain number of failures?
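
For illustration only, here's a minimal sketch (not the actual MHC code; the tracker type and its fields are hypothetical) of how the node informer could be run against a per-cluster cancellable context so it can be stopped again when the Cluster goes away:

package mhcsketch

import (
	"context"
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// nodeInformerTracker keeps one cancel function per workload cluster so the
// informer goroutine started below can be shut down again later.
type nodeInformerTracker struct {
	mu      sync.Mutex
	cancels map[types.NamespacedName]context.CancelFunc
}

func newNodeInformerTracker() *nodeInformerTracker {
	return &nodeInformerTracker{cancels: map[types.NamespacedName]context.CancelFunc{}}
}

// startNodeInformer starts watching Nodes in the given workload cluster.
func (t *nodeInformerTracker) startNodeInformer(ctx context.Context, cluster types.NamespacedName, cs kubernetes.Interface) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.cancels[cluster]; ok {
		return // already watching this cluster's nodes
	}
	informerCtx, cancel := context.WithCancel(ctx)
	t.cancels[cluster] = cancel

	factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	// Deriving the stop channel from a per-cluster context, rather than only
	// from the manager's context, is what makes the informer stoppable.
	go nodeInformer.Run(informerCtx.Done())
}

// stopNodeInformer is what a Cluster-deletion handler would call.
func (t *nodeInformerTracker) stopNodeInformer(cluster types.NamespacedName) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if cancel, ok := t.cancels[cluster]; ok {
		cancel()
		delete(t.cancels, cluster)
	}
}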

/kind cleanup
/milestone Next
/cc @JoelSpeed

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Mar 6, 2020
@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Mar 6, 2020
@JoelSpeed (Contributor)

/assign

I'll take a look at this

@JoelSpeed (Contributor)

We are definitely going to need to do something about this; I ran into it while testing some other MHC work. Once I'd deleted the cluster, I started getting log lines like:

I0311 09:19:46.752805       1 reflector.go:185] Listing and watching *v1.Node from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105
E0311 09:19:46.758862       1 reflector.go:153] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to list *v1.Node: Get https://jspeed-capi-1-apiserver-1102670450.us-east-1.elb.amazonaws.com:6443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp: lookup jspeed-capi-1-apiserver-1102670450.us-east-1.elb.amazonaws.com on 10.96.0.10:53: no such host

Approximately every second, indefinitely.

Another point to investigate is what happens if a control plane becomes unavailable and we can't connect to it. Do we have ways to recover? Should we give up on the informer after a certain number of failures?

Given that the above error message was looping, I suspect it would reconnect, but I can try to simulate this scenario too.

I'm not sure what the best approach to a stopping mechanism would be:

  • Use an informer to watch for cluster deletion events - heavy, could miss the events?
  • Start a second goroutine which attempts to fetch the cluster every <period> and, if it gets a 404, closes the informer? (See the sketch after this list.)
  • Other?
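
To make the second option a bit more concrete, a rough sketch (purely hypothetical, assuming a controller-runtime client and the v1alpha3 Cluster type) could look something like:

package mhcsketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// watchForClusterDeletion polls the Cluster object every period and, once it
// comes back NotFound, calls stopInformer to close that cluster's node informer.
func watchForClusterDeletion(ctx context.Context, c client.Client, key client.ObjectKey, period time.Duration, stopInformer context.CancelFunc) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cluster := &clusterv1.Cluster{}
			if err := c.Get(ctx, key, cluster); apierrors.IsNotFound(err) {
				stopInformer() // cluster is gone, shut the informer down
				return
			}
		}
	}
}

Note this only covers deletion (404); an unreachable-but-still-existing control plane would need separate handling.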

@ncdc (Contributor) commented Mar 11, 2020

Given that we resync every 10 minutes, even if we miss a cluster deletion event (which is extremely unlikely), we would be able to address the issue at the next resync.

We are already watching Clusters in the MHC controller, so we have that part covered. We need to add code in Reconcile somewhere that stops the cache and deletes the entry if the workload cluster is not reachable.

For an intermittently available workload cluster, the next time we reconcile an MHC for it, watchClusterNodes would try to recreate the cache.

Does that sound workable?
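
As a loose illustration of that flow (all of the type and helper names below are hypothetical, not the actual controller code):

package mhcsketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// mhcReconcilerSketch stands in for the real MachineHealthCheck reconciler.
type mhcReconcilerSketch struct {
	nodeInformerCancels map[client.ObjectKey]context.CancelFunc
}

// getWorkloadClient is a placeholder for however the controller obtains a
// client for the workload cluster; it fails when the cluster is unreachable.
func (r *mhcReconcilerSketch) getWorkloadClient(ctx context.Context, key client.ObjectKey) (client.Client, error) {
	return nil, nil
}

// reconcileWorkloadCache sketches the idea above: if the workload cluster is
// unreachable or gone, stop its node informer and delete the cache entry so a
// later reconcile (via watchClusterNodes) can recreate it.
func (r *mhcReconcilerSketch) reconcileWorkloadCache(ctx context.Context, clusterKey client.ObjectKey) error {
	remoteClient, err := r.getWorkloadClient(ctx, clusterKey)
	if err != nil {
		if cancel, ok := r.nodeInformerCancels[clusterKey]; ok {
			cancel() // stop the informer goroutine for this cluster
			delete(r.nodeInformerCancels, clusterKey)
		}
		return err
	}
	_ = remoteClient
	// Normal MHC reconciliation against the workload cluster continues here.
	return nil
}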

@JoelSpeed (Contributor)

We are already watching Clusters in the MHC controller, so we have that part covered. We need to add code in Reconcile somewhere that stops the cache and deletes the entry if the workload cluster is not reachable.

My concern with adding something in Reconcile is that we might not reconcile an MHC for a given cluster if it's been deleted, right? I'm thinking: the Cluster gets deleted, GC deletes the MHCs for that cluster, the event triggers a reconcile, but all of those MHCs are gone, so we no longer know which cluster they were for and can't stop the right cache.

One idea I just had to solve that problem would be to add a finalizer (though I'm hesitant to do that): when an MHC is deleted, list all MHCs by cluster, check if any are left for the same cluster, and if not, stop the cache. That still doesn't solve the intermittent connectivity problem, though. (A rough sketch of this is below.)
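
For what it's worth, a very rough sketch of the cleanup step that would sit behind such a finalizer (hypothetical names; assuming the v1alpha3 MachineHealthCheck type and a stopNodeInformer callback):

package mhcsketch

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// stopCacheIfLastMHC checks whether any other MachineHealthCheck in the same
// namespace still targets the cluster of the MHC being deleted; if none do,
// it stops that cluster's node informer cache.
func stopCacheIfLastMHC(ctx context.Context, c client.Client, mhc *clusterv1.MachineHealthCheck, stopNodeInformer func(client.ObjectKey)) error {
	mhcList := &clusterv1.MachineHealthCheckList{}
	if err := c.List(ctx, mhcList, client.InNamespace(mhc.Namespace)); err != nil {
		return err
	}
	for i := range mhcList.Items {
		other := &mhcList.Items[i]
		if other.Name != mhc.Name && other.Spec.ClusterName == mhc.Spec.ClusterName {
			return nil // another MHC still needs this cluster's node informer
		}
	}
	// No other MHCs reference this cluster, so its cache can be torn down.
	stopNodeInformer(client.ObjectKey{Namespace: mhc.Namespace, Name: mhc.Spec.ClusterName})
	return nil
}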

I also think it might be worth trying to keep this logic within the watchClusterNodes section of the code. Whatever we decide to do here, I think we should try to keep it reasonably generic for the sake of #2414.

@vincepri (Member, Author)

/milestone v0.3.4

Should be fixed in #2880

@k8s-ci-robot k8s-ci-robot modified the milestones: Next, v0.3.4 Apr 20, 2020
@vincepri (Member, Author)

Merging this with #2414

/close

@k8s-ci-robot (Contributor)

@vincepri: Closing this issue.

In response to this:

Merging this with #2414

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
