✨ Add remote cluster cache manager #2880
Conversation
This definitely brings up some interesting questions...
/milestone v0.3.4
I could see us adding goroutine(s) that periodically attempt to do a live query against each workload cluster's apiserver, maybe?
I'd leave it as a client responsibility to decide what to do when connectivity cannot be established to a workload cluster. In some cases, we might want a controller to not error if we can't establish contact; in other cases, we might want to return an error and go into a backoff loop. Given that all use cases aren't clear yet, I'd leave it to clients to decide and bring up requirements in the future.
I think it should be this cluster cache manager's responsibility to evict the workload cache if connectivity is down. Otherwise, clients (controllers, really) will be unaware they're not getting updated data.
What if the loss of connectivity is temporary (e.g. upgrades)? Would we have the ability to recover?
We could code it to retry up to $conditions (n times, some timeout, etc.) before evicting. After that, it would require either an event (create/update/delete) or a full (management cluster cache) resync to get it recreated.
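As a rough illustration of that retry-then-evict idea, here is a minimal sketch (not code from this PR); `healthCheck` and `evictCache` are placeholder hooks, not this package's API:

```go
// Hypothetical sketch only: retry a health check against the workload
// cluster's apiserver for up to a timeout before evicting its cache.
package remote

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	healthCheckInterval = 10 * time.Second
	healthCheckTimeout  = 5 * time.Minute
)

// retryThenEvict polls healthCheck until it succeeds or the timeout expires;
// on timeout it evicts the cluster's cache so controllers stop reading stale data.
func retryThenEvict(healthCheck func() error, evictCache func()) {
	err := wait.PollImmediate(healthCheckInterval, healthCheckTimeout, func() (bool, error) {
		// A failed check means "not done yet": keep polling until the timeout.
		return healthCheck() == nil, nil
	})
	if err != nil {
		// Connectivity never recovered within the timeout, so drop the cache.
		evictCache()
	}
}
```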
That sounds good, I do think we need to think about recovering from connectivity errors
@JoelSpeed how would the watch fail?
@ncdc When calling `Watch` […]
@JoelSpeed ah, that makes sense. Thanks for clarifying! My suggestion was in a different section of the code, though: a goroutine owned by the cluster cache manager/tracker/reconciler. The retries would be hidden from the other controllers. So a controller would have to be reconciling something and call `Watch` […]
@ncdc Ack, yeah, sorry, I follow; I don't think I explained it very well, but I think we are on the same page. I meant that after an eviction, as you've said, the controller would need to have some event to trigger the reconcile, and then Watch would fail. So the client controller wouldn't know that the remote cluster is broken until it is next reconciling and calls Watch
Do we want to try and tackle the connection loss in this PR or should that be considered a follow-up?
I like the idea of having PRs be as complete as possible, but will defer to @vincepri if he feels differently.
Connection loss seems an integral part of this package, +1 to what Andy said
Ack, I can try having a go at the suggestion from @ncdc. Does anyone have any opinions about how long the timeout should be (1m, 5m?), how long the polling interval should be (10s, 30s?), whether these should be user configurable (flags to the manager?), and a "safe" but lightweight way to check connectivity to the cluster (poll […])?
@JoelSpeed This is what CAPA is currently using to health check: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/9913ac8dbb96ed9bcffa6eba75b886cd4ef8c20b/pkg/cloud/services/elb/loadbalancer.go#L263-L269. We can work off of that.
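For reference, a lightweight connectivity check along the lines being discussed could look something like the sketch below: a raw GET against the workload cluster's `/healthz` endpoint via client-go. The function name and wiring are illustrative assumptions, not code from this PR:

```go
// Hypothetical sketch of a lightweight connectivity check for a workload cluster.
package remote

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// healthCheckCluster returns nil if the workload cluster's apiserver responds
// to /healthz with "ok".
func healthCheckCluster(ctx context.Context, cfg *rest.Config) error {
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	body, err := cs.Discovery().RESTClient().Get().AbsPath("/healthz").DoRaw(ctx)
	if err != nil {
		return err
	}
	if string(body) != "ok" {
		return fmt.Errorf("unexpected /healthz response: %q", string(body))
	}
	return nil
}
```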
@JoelSpeed any updates on this PR?
@vincepri Sorry, I got stuck on the health checking; it was more challenging to work out what to do and how to test it than I thought it would be. I've pushed some health checking code, though I'm not 100% sure about it, PTAL
LGTM, thanks!
(force-pushed from 859329b to 4b227a4)
That was quick, thanks @ncdc! Squashed
/lgtm
/test pull-cluster-api-test
/hold
Something is flaky in the new tests: […]
Ack, will investigate. I didn't run the tests much after fixing the races, so I may have introduced a problem.
/lgtm cancel
I've spent the last 90 mins investigating this to work out where this could have flaked. For this test to fail, something had to call […]. If it was the health checker, that would imply the health check failed, but it's health checking the testenv API, so that shouldn't have happened and I would expect it to have manifested before this point. If it was the reconciler, that would imply the cluster didn't exist when we called reconcile; again, I would have thought this would have come up before, but I've added an extra check to make sure the cluster exists in the cache before we call that. I played around with timings for this and added a big sleep at the beginning and was unable to reproduce the flake, so I think this should fix it. I'd suggest we run the tests a few times on CI and see if it flakes again before squash/merge.
Investigating this too, this might be of help: https://gist.github.com/benmoss/a83b50cbc38d7b1c99a1a6387b43d0ad
Edit: never mind, that was based on an older checkout of this PR
I ran the tests 20 times and they didn't fail; there's a compilation error making it fail right now, though: […]
Have you rebased everything on master? CI does that for us, just want to make sure your last commit is also rebased
Oops, I messed up my commit staging, will fix up, rebase on latest and re-push
(force-pushed from f5c9f5d to f617947)
/test pull-cluster-api-test
/test pull-cluster-api-test
Just want to run this a few times to see if it flakes again in CI
LGTM @JoelSpeed, need to squash
Squash away!
Add a remote cluster cache manager that owns the full lifecycle of caches against workload clusters. Controllers that need to watch resources in workload clusters should use this functionality. Signed-off-by: Andy Goldstein <[email protected]>
(force-pushed from f617947 to 090ebe6)
/lgtm
What this PR does / why we need it:
Add a remote cluster cache manager that owns the full lifecycle of
caches against workload clusters. Controllers that need to watch
resources in workload clusters should use this functionality.
This is the foundation for #2414 and #2577.
It does not currently check for connectivity failures to workload clusters. It only stops the workload cluster caches if there is an error retrieving the Cluster from the management cluster. If desired, I could add something that does a minimal "ping" to each workload cluster periodically, removing any caches that have connectivity issues. Or we could do that in a follow-up.
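If we went the periodic "ping" route, a minimal sketch of the background loop could look like the following; the `listClusters`/`ping`/`evict` hooks are placeholders standing in for this package's internals, not its real API:

```go
// A minimal sketch, not part of this PR, of a periodic connectivity-check loop.
package remote

import (
	"context"
	"time"
)

// clusterKey identifies a workload cluster by namespace/name.
type clusterKey struct {
	Namespace, Name string
}

// healthCheckLoop pings every tracked workload cluster on each tick and evicts
// the cache of any cluster whose apiserver cannot be reached. It returns when
// ctx is cancelled.
func healthCheckLoop(
	ctx context.Context,
	interval time.Duration,
	listClusters func() []clusterKey,
	ping func(ctx context.Context, key clusterKey) error,
	evict func(key clusterKey),
) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, key := range listClusters() {
				if err := ping(ctx, key); err != nil {
					evict(key)
				}
			}
		}
	}
}
```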
I have not updated MachineHealthCheck to use this. I could do that in this PR, or a follow-up.
I have not added anything for KubeadmControlPlane to use this. I could do that in this PR, or a follow-up (cc @detiber).
/priority important-soon
cc @vincepri @ncdc
--
This replaces #2835. I've taken the code that @ncdc wrote, addressed some feedback from the previous PR, and added a test suite for this code. I have not done anything regarding this comment https://github.com/kubernetes-sigs/cluster-api/pull/2835/files#r401783111 as I was not sure how to interpret the conversation.
With the code as is, it is not actually being used anywhere. To be used, we need to create a `ClusterCacheTracker` and call `NewClusterCacheReconciler` within `main.go`, pass the `ClusterCacheTracker` into controllers that will need access to remote caches (e.g. MHC), and then, from within the controller, call `ClusterCacheTracker.Watch` when reconciling their object. Tracking which of the clusters have already been watched and deduplication is handled within the `ClusterCacheTracker`, so no tracking is needed within individual controllers.

The deduplication will not work properly, however, with a different set of predicates for each `Watch`. I'm not sure if that's a major problem or not; open to suggestions for rewrites if there's a good way to fix that, if we think it's important.
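To make the intended wiring concrete, here is a hypothetical controller-side sketch. The `watchInput` struct, the `tracker` interface, and the `Watch` signature below are illustrative stand-ins, not the API added in this PR; only the overall flow (tracker injected into the controller, `Watch` called from `Reconcile`) mirrors the description above:

```go
// Hypothetical controller-side sketch; see the remote package in this PR for the real API.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
)

// watchInput is a placeholder for whatever the tracker's Watch method accepts.
type watchInput struct {
	Cluster      types.NamespacedName // which workload cluster to watch in
	Kind         client.Object        // resource kind to watch, e.g. &corev1.Node{}
	EventHandler handler.EventHandler // maps remote events back to reconcile requests
}

// tracker is a placeholder for the ClusterCacheTracker added in this PR.
type tracker interface {
	Watch(ctx context.Context, input watchInput) error
}

// MachineHealthCheckReconciler shows how a controller would receive the tracker.
type MachineHealthCheckReconciler struct {
	Client  client.Client
	Tracker tracker
}

func (r *MachineHealthCheckReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the MachineHealthCheck and resolve the Cluster it targets ...
	clusterKey := types.NamespacedName{Namespace: req.Namespace, Name: "example-cluster"} // placeholder

	// Ask the tracker to ensure a watch on Nodes in the workload cluster.
	// The tracker deduplicates watches, so calling this on every reconcile is safe.
	if err := r.Tracker.Watch(ctx, watchInput{
		Cluster:      clusterKey,
		Kind:         &corev1.Node{},
		EventHandler: &handler.EnqueueRequestForObject{},
	}); err != nil {
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}
```

Because deduplication lives in the tracker, the controller can call `Watch` unconditionally on every reconcile without tracking any state itself.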