
[WIP]✨ Add remote cluster cache manager #2835

Closed
wants to merge 2 commits

Conversation


@ncdc ncdc commented Mar 31, 2020

What this PR does / why we need it:
Add a remote cluster cache manager that owns the full lifecycle of
caches against workload clusters. Controllers that need to watch
resources in workload clusters should use this functionality.

This is the foundation for #2414 and #2577.

It does not currently check for connectivity failures to workload clusters. It only stops the workload cluster caches if there is an error retrieving the Cluster from the management cluster. If desired, I could add something that does a minimal "ping" to each workload cluster periodically, removing any caches that have connectivity issues. Or we could do that in a follow-up.

I have not updated MachineHealthCheck to use this. I could do that in this PR, or a follow-up.

I have not added anything for KubeadmControlPlane to use this. I could do that in this PR, or a follow-up (cc @detiber).
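For illustration, here is a rough sketch of how a consuming controller might use this, written against the controller-runtime APIs of that era. All names and signatures here (RemoteCacheAccessor, ClusterCache, NodeWatchingReconciler) are assumptions for the example, not the final API; see the review discussion below about the accessor and a possible Watch-based interface.

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// RemoteCacheAccessor stands in for the cache manager added in this PR;
// the method name and signature are assumed for this sketch.
type RemoteCacheAccessor interface {
	ClusterCache(ctx context.Context, cluster client.ObjectKey) (cache.Cache, error)
}

// NodeWatchingReconciler is a hypothetical consumer that needs to watch
// Nodes in a workload cluster.
type NodeWatchingReconciler struct {
	RemoteCaches RemoteCacheAccessor
	controller   controller.Controller
}

func (r *NodeWatchingReconciler) watchWorkloadNodes(ctx context.Context, cluster client.ObjectKey) error {
	// Get (or lazily create) the managed cache for this workload cluster.
	remoteCache, err := r.RemoteCaches.ClusterCache(ctx, cluster)
	if err != nil {
		return err
	}
	// Watch Nodes in the workload cluster through the managed cache.
	return r.controller.Watch(
		source.NewKindWithCache(&corev1.Node{}, remoteCache),
		&handler.EnqueueRequestForObject{},
	)
}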

/priority important-soon

cc @vincepri @JoelSpeed

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 31, 2020
@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ncdc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 31, 2020
Add a remote cluster cache manager that owns the full lifecycle of
caches against workload clusters. Controllers that need to watch
resources in workload clusters should use this functionality.

Signed-off-by: Andy Goldstein <[email protected]>
@ncdc force-pushed the workload-cluster-cache branch from e0d7394 to 819479b on March 31, 2020 at 21:33

ncdc commented Mar 31, 2020

I just realized I'm not plumbing a scheme into cache.Options in the cache.New call. Do you think this should be a single scheme for the entire manager, shared by all cluster caches, or something you can specify per cluster?


vincepri commented Apr 1, 2020

I'd probably just use mgr.GetScheme
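For example, a sketch of plumbing a single shared scheme into the remote cache, assuming config and mgr come from the surrounding code:

remoteCache, err := cache.New(config, cache.Options{
	// Reuse the management manager's scheme so all cluster caches decode
	// objects consistently.
	Scheme: mgr.GetScheme(),
})
if err != nil {
	return nil, errors.Wrap(err, "error creating cache for remote cluster")
}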

@@ -0,0 +1,203 @@
/*
Member:

Maybe we rename this to cluster_cache_controller?

Contributor Author (ncdc):

See my comment below about manager vs reconciler. I'd prefer to keep this as-is, or rename to cluster_cache_manager.go. But again, if you feel strongly, I can rename it.

}

// ClusterCacheManager manages client caches for workload clusters.
type ClusterCacheManager struct {
Member:

Suggested change:
- type ClusterCacheManager struct {
+ type ClusterCacheReconciler struct {

Contributor Author (ncdc):

I'd prefer to call this a manager. I know it's watching Clusters and reacting to them, but that's only as a maintenance task to stop and remove stale caches. I wouldn't consider this struct to be doing the same sort of reconciliation that the others are. But if you feel strongly enough, I can make the change.

Contributor (JoelSpeed):

I'm with @ncdc on this. I think this is more of a management task than a full-on controller; it's not changing the state of the world anywhere. I'd personally stick with ClusterCacheManager.

Member:

Manager is an overloaded word and we're potentially introducing a new naming convention here; how about ClusterCacheTracker? We already have ObjectTracker in this package and it'd go hand in hand.

This isn't blocking; if you all want to use Manager, that's fine. It can be changed later if we find a better naming solution.

Member:

+1 to both avoiding Reconciler and Manager to avoid confusion.

Also, +1 to Tracker.

Contributor Author (ncdc):

I'm splitting the "tracker" part from the "reconciler" part (2 different structs). Will be in next push.

Comment on lines +75 to +76
// For testing
newRESTConfig func(ctx context.Context, cluster client.ObjectKey) (*rest.Config, error)
Member:

Non-blocking: could we use the kubeconfig secret to create a client to envtest instead of mocking it?

Contributor Author (ncdc):

Do you mean build up a dummy secret in the unit test?

Contributor (JoelSpeed):

Do you mean build up a dummy secret in the unit test?

This is what I did for the MHC tests and it seemed to work

By("Creating the remote Cluster kubeconfig")
Expect(kubeconfig.CreateEnvTestSecret(k8sClient, cfg, testCluster)).To(Succeed())

I think it's preferable to do this, but it could definitely be a follow-up PR.

}
m.clusterCaches[cluster] = cc

// Start the cache!!!
Member:

😄 I love the enthusiasm


var cluster clusterv1.Cluster

err := m.managementClusterClient.Get(ctx, req.NamespacedName, &cluster)
Member:

Couldn't we have client.Client as part of the struct like elsewhere and use mgr.GetClient() to get it during init?

Contributor Author (ncdc):

Yes, I can make that change
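A minimal sketch of that change, assuming the existing managementClusterClient field is filled in from the manager at construction time; the constructor shown here is hypothetical:

type ClusterCacheManager struct {
	// managementClusterClient reads Clusters from the management cluster,
	// populated from mgr.GetClient() at construction time.
	managementClusterClient client.Client
	// ...
}

func NewClusterCacheManager(mgr ctrl.Manager) *ClusterCacheManager {
	return &ClusterCacheManager{
		managementClusterClient: mgr.GetClient(),
	}
}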

@JoelSpeed JoelSpeed left a comment:

Had a read through and the following things came to mind:

  • When this is used in controllers, they will also have to track which clusters they have called watch for, if I've understood this correctly?
  • Given the above, what happens if a cluster is deleted and then recreated without the manager restarting? Is this something we are expecting to happen/support?
  • How will this be consumed by other controllers? Are they expected to have a ClusterCacheManager as a member field for fetching the caches? Should there be a ClusterCacheGetter interface, consisting of just the ClusterCache method, that controllers expect and that is given to them when SetupWithManager is called?



Comment on lines +185 to +188
if err == nil {
log.V(4).Info("Cluster still exists")
return reconcile.Result{}, nil
}
Contributor:

Might it be worth checking whether the cluster's deletion timestamp is set and stopping the cache earlier? My concern is that the manager may miss the delete event, so Reconcile may never be called for a cluster as it is finally removed from the API. Checking the deletion timestamp on Cluster objects while they still exist gives the manager a better chance of catching a cluster being deleted, if that makes sense?

Contributor Author (ncdc):

If the manager is running, it will not miss the deletion event. One of two situations will happen:

  1. The deletion occurs and the delete event comes straight through
  2. The watch drops and/or history is compacted, the Cluster is deleted in the interim, then the watch is reestablished. In this case, the shared informer sends a simulated delete event (a DeletedFinalStateUnknown tombstone, because the final state of the object at the time of deletion is unknown)

However, I can add code to check for deletion timestamps too, if you think it'll be beneficial.

Contributor:

I was unaware that the informer would send a simulated delete event, TIL!

In that case, I'm content with it as is, thanks for the explanation
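For reference, this is what unwrapping that simulated delete event looks like inside a raw client-go ResourceEventHandlerFuncs; it is illustration only, not part of this PR, and toolscache here refers to k8s.io/client-go/tools/cache:

DeleteFunc: func(obj interface{}) {
	if tombstone, ok := obj.(toolscache.DeletedFinalStateUnknown); ok {
		// The watch dropped and the object was deleted in the interim;
		// the last known state is carried in the tombstone.
		obj = tombstone.Obj
	}
	if cluster, ok := obj.(*clusterv1.Cluster); ok {
		// handle the deleted Cluster
		_ = cluster
	}
},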


ncdc commented Apr 1, 2020

@JoelSpeed

When this is used in controllers, they will also have to track which clusters they have called watch for, if I've understood this correctly?

That's a good point. Maybe controller-runtime dedupes? If not, I'll have to extend this in some way to account for that.

Given the above, what happens if a cluster is deleted and then recreated without the manager restarting? Is this something we are expecting to happen/support?

If the cluster is deleted, the Reconcile function should see that, and stop the cache for it. When the cluster is recreated with the same name, a new cache would be created for it. But I think I should probably extend the map key to include the uid of the cluster, in addition to dealing with deduping.

How will this be consumed by other controllers? Are they expected to have a ClusterCacheManager as a member field for fetching the caches? Should there be a ClusterCacheGetter interface, consisting of just the ClusterCache method, that controllers expect and that is given to them when SetupWithManager is called?

The ClusterCacheManager is meant to be a singleton, constructed in main.go. The other controllers will have a field for it (or an interface, as you suggested), filled in when they're created in main.go.
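A sketch of what that wiring in main.go could look like; NewClusterCacheManager, WorkloadNodeReconciler, and the ClusterCaches field are illustrative names, not code from this PR:

// main.go (sketch): construct the singleton once and hand it to each
// controller that needs workload cluster caches.
clusterCacheManager, err := remote.NewClusterCacheManager(mgr)
if err != nil {
	setupLog.Error(err, "unable to create cluster cache manager")
	os.Exit(1)
}

if err := (&WorkloadNodeReconciler{
	Client: mgr.GetClient(),
	// This field could also be typed as a narrow ClusterCacheGetter
	// interface exposing just the ClusterCache method.
	ClusterCaches: clusterCacheManager,
}).SetupWithManager(mgr); err != nil {
	setupLog.Error(err, "unable to create controller", "controller", "WorkloadNode")
	os.Exit(1)
}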


ncdc commented Apr 1, 2020

Oh, I am also considering splitting the ClusterCacheManager into 2 structs - 1 to manage the caches, and a 2nd to be the controller/reconciler that watches for deleted Clusters (separation of concerns). Would that be useful?

@JoelSpeed:

Oh, I am also considering splitting the ClusterCacheManager into 2 structs - 1 to manage the caches, and a 2nd to be the controller/reconciler that watches for deleted Clusters (separation of concerns). Would that be useful?

If it's likely to get more complicated (e.g. due to deduplication of watches), then I think this would be beneficial.


return nil, errors.Wrap(err, "error creating REST client config for remote cluster")
}

remoteCache, err := cache.New(config, cache.Options{})
Member:

Would it make sense to expose at least a subset of cache.Options here? It might be good to at least allow for setting of Resync.

Contributor Author (ncdc):

Will do
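For example, a sketch of exposing the resync period through the manager's options; the m.scheme and m.cacheResync field names are illustrative:

remoteCache, err := cache.New(config, cache.Options{
	Scheme: m.scheme,
	// Resync is a *time.Duration supplied when the manager is constructed;
	// nil keeps the cache default.
	Resync: m.cacheResync,
})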

var cluster clusterv1.Cluster

err := m.managementClusterClient.Get(ctx, req.NamespacedName, &cluster)
if err == nil {
Member:

Should this also attempt to guard against intermittent errors here as well and only continue on to the cache deletion if apierrors.IsNotFound(err)?

Contributor Author (ncdc):

I'd have to do some (re)digging, but given that the client is doing cached reads, I think the only way you can get a non-404 error is if the client's cache hasn't been started yet.
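For reference, the guard being discussed would look roughly like this (a sketch against the code above):

err := m.managementClusterClient.Get(ctx, req.NamespacedName, &cluster)
if err == nil {
	log.V(4).Info("Cluster still exists")
	return reconcile.Result{}, nil
}
if !apierrors.IsNotFound(err) {
	// Intermittent error talking to the management cluster: requeue instead
	// of tearing down the workload cluster cache.
	return reconcile.Result{}, err
}
// Not found: fall through and stop/remove this cluster's cache.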

Comment on lines 197 to 200
log.V(4).Info("Stopping cluster cache")
c.Stop()

delete(m.clusterCaches, req.NamespacedName)
Member:

Should there be a method on ClusterCacheManager that wraps this with a lock? It seems a bit weird that we are locking on insertion and reading but not deletion.

Contributor Author (ncdc):

Yeah I realized this last night. I have it fixed locally and it'll be in the next push.
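Something along these lines, for illustration (the method and lock field names are assumptions):

// deleteClusterCache stops and removes a cluster's cache under the same lock
// used for insertion and reads.
func (m *ClusterCacheManager) deleteClusterCache(cluster client.ObjectKey) {
	m.clusterCachesLock.Lock()
	defer m.clusterCachesLock.Unlock()

	if c, exists := m.clusterCaches[cluster]; exists {
		c.Stop()
		delete(m.clusterCaches, cluster)
	}
}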

@@ -0,0 +1,203 @@
/*
Member:

Would it make sense to move the bulk of the implementation here to a library so that it would be easier to consume outside of the core controller manager? I wouldn't expect any providers (especially out of tree) to import anything under controllers

Contributor Author (ncdc):

This is actually a library. I've (privately) questioned why the remote package is in controllers/remote and not somewhere else. I'd say we should file an issue to move remote out of controllers for 0.4.x. WDYT?

Member:

I initially put it under there because it was code to be used within CAPI controllers; it expanded over time, though.


ncdc commented Apr 1, 2020

I'm thinking about adjusting things around so the only exported function on ClusterCacheTracker is:

func (m *ClusterCacheTracker) Watch(
	ctx context.Context,
	watcher Watcher,
	cluster client.ObjectKey,
	kind runtime.Object,
	cacheOptions cache.Options,
	eventHandler handler.EventHandler,
	predicates ...predicate.Predicate,
) error

(Watcher is a slimmed down interface for a Controller.)

(although I'll probably change the cluster ObjectKey to a namespace/name/uid struct.)

This is not pretty (lots of params). But it'll dedupe on (watcher, cluster, kind, eventHandler). WDYT?


vincepri commented Apr 1, 2020

I'm fine having a bunch of parameters if there is a need for them; maybe do a WatchInput struct?

@JoelSpeed:

#2835 (comment)

This is basically the conclusion I came to last week so +1 from me, also +1 to the suggestion of WatchInput if there are lots of params
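A sketch of the WatchInput variant, grouping the parameters from the signature above into a single struct (the namespace/name/uid key is still open per the earlier comment):

// WatchInput collects the parameters for ClusterCacheTracker.Watch.
type WatchInput struct {
	Watcher      Watcher
	Cluster      client.ObjectKey
	Kind         runtime.Object
	CacheOptions cache.Options
	EventHandler handler.EventHandler
	Predicates   []predicate.Predicate
}

func (m *ClusterCacheTracker) Watch(ctx context.Context, input WatchInput) error {
	// same dedupe on (watcher, cluster, kind, eventHandler) as described above
	// ...
	return nil
}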

Signed-off-by: Andy Goldstein <[email protected]>

ncdc commented Apr 7, 2020

I just pushed up a WIP commit with the code to add the Watch() support. I unfortunately won't have any significant amount of time to carry this forward, but @JoelSpeed has graciously offered to take it over. Thanks Joel!

@k8s-ci-robot:

@ncdc: The following tests failed, say /retest to rerun all failed tests:

Test name: pull-cluster-api-make | Commit: 121d00e | Rerun command: /test pull-cluster-api-make
Test name: pull-cluster-api-test | Commit: 121d00e | Rerun command: /test pull-cluster-api-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@JoelSpeed:

I've opened a PR to replace this one #2880


vincepri commented Apr 8, 2020

Thanks @JoelSpeed and @ncdc ! I'll go ahead and close this one

/close

@k8s-ci-robot:

@vincepri: Closed this PR.

In response to this:

Thanks @JoelSpeed and @ncdc ! I'll go ahead and close this one

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ncdc ncdc deleted the workload-cluster-cache branch October 1, 2020 15:10