Figure out if MHC informers should be stopped if the cluster goes away #2577
Comments
/assign
I'll take a look into this
We are definitely going to need to do something about this; I ran into it while testing some MHC stuff. Once I'd deleted the cluster, I started getting error log lines approximately every second, indefinitely.
Given that the error message was looping, I suspect it would reconnect, but I can try to simulate that scenario too. I'm not sure what the best approach to a stopping mechanism would be.
Given that we resync every 10 minutes, even if we miss a cluster deletion event (which is extremely unlikely), we would be able to address the issue at the next resync. We are already watching Clusters in the MHC controller, so we have that part covered. We need to add code somewhere in Reconcile that stops the cache and deletes the entry if the workload cluster is not reachable. For an intermittently available workload cluster, the cache would simply be recreated the next time we reconcile an MHC for it. Does that sound workable?
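A minimal sketch of that idea, assuming the controller keeps a map of per-cluster caches; the cacheTracker and clusterCache types and the stop channel here are hypothetical stand-ins, not the actual cluster-api bookkeeping:

```go
package controllers

import (
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterCache is a hypothetical entry tracking one workload cluster's
// informer; closing stop shuts the informer down so its goroutines can exit.
type clusterCache struct {
	stop chan struct{}
}

// cacheTracker is a hypothetical holder for the per-cluster caches.
type cacheTracker struct {
	lock          sync.Mutex
	clusterCaches map[client.ObjectKey]*clusterCache
}

// stopCache drops the cached informer for a workload cluster that has been
// deleted or is unreachable. The next successful reconcile of an MHC for that
// cluster would recreate the cache, which also covers the
// intermittent-connectivity case.
func (t *cacheTracker) stopCache(cluster client.ObjectKey) {
	t.lock.Lock()
	defer t.lock.Unlock()
	if cc, ok := t.clusterCaches[cluster]; ok {
		close(cc.stop)
		delete(t.clusterCaches, cluster)
	}
}
```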
My concern with adding something in Reconcile is that we might not reconcile an MHC for a given cluster if it's been deleted, right? I'm thinking: the Cluster gets deleted, GC deletes the MHCs for that cluster, the event triggers a reconcile, but all of those MHCs are gone, so we no longer know which cluster they were for and can't stop the right cache, I don't think.
One idea I just had to solve that problem would be to add a finalizer (though I'm hesitant to do that) which, when an MHC is deleted, would list all MHCs by cluster, check if any are left for the same cluster, and, if not, stop the cache. This doesn't solve the intermittent connectivity problem though. I also think it might be worth trying to keep this logic within the
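A hedged sketch of that finalizer idea, reusing the hypothetical cacheTracker from the sketch above; listing by the cluster-name label and the overall flow are assumptions about how this could look, not what was eventually merged:

```go
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// MachineHealthCheckReconciler is a cut-down stand-in for the real
// reconciler: just a client plus the hypothetical cacheTracker above.
type MachineHealthCheckReconciler struct {
	Client  client.Client
	tracker *cacheTracker
}

// maybeStopCacheOnMHCDelete would run while handling an MHC that has a
// deletion timestamp and still carries the finalizer: if no other MHC
// targets the same cluster, stop that cluster's cache before removing
// the finalizer.
func (r *MachineHealthCheckReconciler) maybeStopCacheOnMHCDelete(ctx context.Context, mhc *clusterv1.MachineHealthCheck) error {
	mhcList := &clusterv1.MachineHealthCheckList{}
	if err := r.Client.List(ctx, mhcList,
		client.InNamespace(mhc.Namespace),
		client.MatchingLabels{clusterv1.ClusterLabelName: mhc.Spec.ClusterName},
	); err != nil {
		return err
	}

	for i := range mhcList.Items {
		if mhcList.Items[i].Name != mhc.Name {
			// Another MHC still watches this cluster; keep the cache alive.
			return nil
		}
	}

	// This was the last MHC for the cluster, so its informer can go away.
	r.tracker.stopCache(client.ObjectKey{Namespace: mhc.Namespace, Name: mhc.Spec.ClusterName})
	return nil
}
```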
/milestone v0.3.4
Should be fixed in #2880
Merging this with #2414
/close
@vincepri: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
In the MachineHealthCheck code we run informers for workload clusters, and those informers run their own goroutines:
cluster-api/controllers/machinehealthcheck_controller.go, line 416 at commit 619071a
If a cluster gets deleted, the informer is never stopped. We need a spike to understand the long-term implications of not closing informers and whether we should take action.
One example that comes to mind: say we have a long-running management cluster that creates and deletes lots of clusters (for whatever reason); not closing the informers would effectively leak goroutines.
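To make the leak concrete, here is an illustration with a plain client-go shared informer (the MHC controller's actual wiring differs, but the shape is the same): the informer goroutines only exit when the stop channel is closed, so an informer for a deleted cluster that nobody stops just keeps running.

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// watchWorkloadCluster starts a Node informer against a workload cluster and
// returns the channel that must be closed to shut it down again.
func watchWorkloadCluster(cfg *rest.Config) (chan struct{}, error) {
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}

	factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
	_ = factory.Core().V1().Nodes().Informer() // register the Node informer

	stop := make(chan struct{})
	factory.Start(stop) // spawns goroutines that run until stop is closed

	// If the cluster is deleted and nobody ever calls close(stop), these
	// goroutines keep retrying the now-unreachable API server forever,
	// which is the leak described in this issue.
	return stop, nil
}
```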
Another point to investigate is what happens if a control plane becomes unavailable and we can't connect to it. Do we have ways to recover? Should we give up on the informer after a certain number of failures?
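One possible shape for the "give up after a certain number of failures" idea, purely as an illustration; the probe function, interval, and threshold are all assumptions:

```go
package main

import "time"

// giveUpAfterFailures probes the workload cluster's API server periodically
// and closes the informer's stop channel after maxFailures consecutive
// failures, so a permanently unreachable control plane does not keep
// goroutines alive forever.
func giveUpAfterFailures(probe func() error, stop chan struct{}, maxFailures int) {
	failures := 0
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		if err := probe(); err != nil {
			failures++
			if failures >= maxFailures {
				close(stop) // give up: shut the informer down
				return
			}
			continue
		}
		failures = 0 // any successful contact resets the count
	}
}
```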
/kind cleanup
/milestone Next
/cc @JoelSpeed