fix: remediator missing custom resource events #1441

Merged

Conversation

@karlkfi (Contributor) commented Sep 26, 2024

Prior to this change, the remediator watches were only started for
new custom resources after the apply attempt had fully completed.
This left a window after the object was applied during which the
remediator could miss events made by third parties. Normally this
would be fine, because the remediator would revert any change after
the watch was started. But if a DELETE event was missed, the object
wouldn't be recreated until the next apply attempt.

This change adds a CRD Controller to the remediator that watches CRDs
and executes any registered handlers when the CRD is established,
unestablished, or deleted. The remediator now registers CRD handlers
for each resource type it watches, starting watchers as soon as
possible, without waiting for the next apply attempt.
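
A minimal sketch of this handler-registration idea follows (the names
crdController, SetHandler, and crdStatus are invented for illustration;
this is not the PR's actual code):

```go
// Hypothetical sketch: the remediator registers a callback per GroupKind, and
// a CRD reconciler invokes it whenever the CRD's Established condition
// changes. Deleted CRDs are elided; the real controller would also map the
// request name ("<plural>.<group>") back to a GroupKind on NotFound.
package crdwatch

import (
	"context"
	"sync"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// crdStatus is the simplified state a handler is notified about.
type crdStatus string

const (
	crdEstablished   crdStatus = "Established"
	crdUnestablished crdStatus = "Unestablished"
)

// crdController fans CRD status changes out to per-GroupKind handlers.
type crdController struct {
	client   client.Client
	mux      sync.RWMutex
	handlers map[schema.GroupKind]func(crdStatus)
}

// SetHandler registers (or replaces) the handler for a GroupKind so the
// remediator can start its watcher as soon as the CRD becomes established,
// without waiting for the next apply attempt.
func (c *crdController) SetHandler(gk schema.GroupKind, fn func(crdStatus)) {
	c.mux.Lock()
	defer c.mux.Unlock()
	if c.handlers == nil {
		c.handlers = map[schema.GroupKind]func(crdStatus){}
	}
	c.handlers[gk] = fn
}

// Reconcile reads the CRD, derives its status, and invokes the registered
// handler, if any.
func (c *crdController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	crd := &apiextensionsv1.CustomResourceDefinition{}
	if err := c.client.Get(ctx, req.NamespacedName, crd); err != nil {
		if apierrors.IsNotFound(err) {
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}

	gk := schema.GroupKind{Group: crd.Spec.Group, Kind: crd.Spec.Names.Kind}
	status := crdUnestablished
	for _, cond := range crd.Status.Conditions {
		if cond.Type == apiextensionsv1.Established && cond.Status == apiextensionsv1.ConditionTrue {
			status = crdEstablished
		}
	}

	c.mux.RLock()
	fn, ok := c.handlers[gk]
	c.mux.RUnlock()
	if ok {
		fn(status)
	}
	return ctrl.Result{}, nil
}
```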

This change also adds a ClusterRole and ClusterRoleBinding specifically
for RepoSync reconcilers, to allow watching of CRDs. RootSync
reconcilers now also watch CRDs, but they already have a ClusterRole
and ClusterRoleBinding that permit it.

Fixes: b/355532135

Extracted:

@karlkfi (Contributor, Author) commented Oct 3, 2024

/retest

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Oct 3, 2024
@karlkfi karlkfi force-pushed the karl-crd-watcher branch 8 times, most recently from 85b0b0a to d3bf55d on October 4, 2024 22:15
@karlkfi karlkfi requested review from sdowell and tiffanny29631 and removed request for Camila-B October 4, 2024 22:16
@karlkfi karlkfi changed the title from "[WIP] fix: remediator missing custom resource events" to "fix: remediator missing custom resource events" Oct 4, 2024
@karlkfi karlkfi changed the title from "fix: remediator missing custom resource events" to "[WIP] fix: remediator missing custom resource events" Oct 7, 2024
@karlkfi karlkfi force-pushed the karl-crd-watcher branch 2 times, most recently from 22cc5cd to eb6b50f on October 7, 2024 22:55
@karlkfi karlkfi changed the title from "[WIP] fix: remediator missing custom resource events" to "fix: remediator missing custom resource events" Oct 7, 2024
@karlkfi karlkfi force-pushed the karl-crd-watcher branch 2 times, most recently from b6090a4 to fe7bb01 on October 7, 2024 23:03
Review comments on pkg/reconcilermanager/controllers/crd_controller.go (outdated, resolved)
@@ -416,6 +418,14 @@ func (r *RootSyncReconciler) deleteManagedObjects(ctx context.Context, reconcile

// Register RootSync controller with reconciler-manager.
func (r *RootSyncReconciler) Register(mgr controllerruntime.Manager, watchFleetMembership bool) error {
	r.lock.Lock()
	defer r.lock.Unlock()
Contributor

Do we need the lock? The Register function is only invoked once and won't be re-registered.

karlkfi (Contributor, Author)

The lock protects the controller from a read/write race condition. But the way we're calling Register today, it probably won't ever be called in parallel.

karlkfi (Contributor, Author)

We don't technically need it today; it's just good defensive practice to protect variables that could be used in parallel.

Locking is cheap. Lock contention is what's expensive.
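
A minimal sketch of this defensive-locking pattern, with invented names (reconcilerSketch; a placeholder ConfigMap watch and no-op reconcile func), not the real RootSyncReconciler:

```go
// Sketch of an idempotent, mutex-guarded Register: the lock is essentially
// never contended today, but it keeps the read-modify-write safe if Register
// were ever called from more than one goroutine.
package controllers

import (
	"context"
	"sync"

	corev1 "k8s.io/api/core/v1"
	controllerruntime "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type reconcilerSketch struct {
	lock       sync.Mutex            // guards controller
	controller controller.Controller // written by Register, read by later calls
}

// Register builds the controller once; repeat calls are no-ops.
func (r *reconcilerSketch) Register(mgr controllerruntime.Manager) error {
	r.lock.Lock()
	defer r.lock.Unlock()
	if r.controller != nil {
		return nil // already registered
	}
	c, err := controllerruntime.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // placeholder watched type for the sketch
		Build(reconcile.Func(func(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
			return reconcile.Result{}, nil // no-op reconcile for illustration
		}))
	if err != nil {
		return err
	}
	r.controller = c
	return nil
}
```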

@karlkfi (Contributor, Author) commented Oct 9, 2024

For posterity, since inline comment threads can get lost...

Unfortunately, neither the DynamicRESTMapper in controller-runtime nor the DeferredDiscoveryRESTMapper/CachedDiscoveryClient in client-go implement auto-invalidation of resources, unless aggregated discovery is enabled on the server. While aggregated discovery is beta in 1.27+ and thus enabled by default, that doesn't guarantee that it will be enabled on non-GKE clusters.

  1. DynamicRESTMapper only recently added support for aggregated discovery, but it's in a newer controller-runtime than we're using today. While it does handle auto-discovery when aggregated discovery is disabled, it does not handle auto-invalidation.
  2. CachedDiscoveryClient does support aggregated discovery, but when it's disabled it doesn't handle auto-discovery or auto-invalidation. It just has a cache TTL.

So I've added a new ReplaceOnResetRESTMapper that can be used to replace the RESTMapper when Reset is called. Then I added some code to the watch.Manager to handle calling Reset on the mapper when a CRD is established or unestablished, if the mapper still knows about the deleted resource or doesn't know about a new resource. This handles auto-discovery and auto-invalidation of resources in the RESTMapper, but it's relatively inefficient.

Hopefully at some point in the future we can make aggregated discovery a requirement and use a simplified DiscoveryClient to handle both auto-discovery and auto-invalidation.
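
A minimal sketch of the replace-on-reset idea (the names replaceOnResetRESTMapper, rebuild, and needsReset are invented for illustration; this is not the actual Config Sync code):

```go
// Sketch: a RESTMapper wrapper whose delegate is rebuilt from scratch when
// Reset is called, giving the caller explicit control over invalidation
// instead of relying on a cache TTL.
package restmapper

import (
	"sync"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

type replaceOnResetRESTMapper struct {
	mux      sync.RWMutex
	delegate meta.RESTMapper
	rebuild  func() (meta.RESTMapper, error) // e.g. rebuild from fresh discovery
}

// Reset discards the current delegate and replaces it with a new one.
func (m *replaceOnResetRESTMapper) Reset() error {
	newMapper, err := m.rebuild()
	if err != nil {
		return err
	}
	m.mux.Lock()
	m.delegate = newMapper
	m.mux.Unlock()
	return nil
}

// RESTMapping delegates to the current mapper; the remaining meta.RESTMapper
// methods would delegate the same way.
func (m *replaceOnResetRESTMapper) RESTMapping(gk schema.GroupKind, versions ...string) (*meta.RESTMapping, error) {
	m.mux.RLock()
	defer m.mux.RUnlock()
	return m.delegate.RESTMapping(gk, versions...)
}

// needsReset sketches the check a watch-manager-style caller would do when a
// CRD becomes established or unestablished: reset only when the mapper's view
// disagrees with the event it just observed.
func needsReset(mapper meta.RESTMapper, gk schema.GroupKind, established bool) bool {
	_, err := mapper.RESTMapping(gk)
	if meta.IsNoMatchError(err) {
		return established // mapper doesn't know the resource yet
	}
	return err == nil && !established // mapper still knows a removed resource
}
```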

@nan-yu (Contributor) left a comment

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nan-yu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 89f6973 into GoogleContainerTools:main Oct 9, 2024
6 checks passed
@karlkfi karlkfi deleted the karl-crd-watcher branch October 9, 2024 22:47