fix: remediator missing custom resource events #1441

Merged

Conversation

@karlkfi (Contributor) commented Sep 26, 2024

Prior to this change, the remediator watches were only started for
new custom resources after the apply attempt had fully completed.
This left a window after the object was applied during which the
remediator could miss events made by third parties. Normally this
would be fine, because the remediator would revert any change after
the watch was started. But if a DELETE event was missed, the object
wouldn't be recreated until the next apply attempt.

This change adds a CRD Controller to the remediator that watches CRDs
and executes any registered handlers when the CRD is established,
unestablished, or deleted. The remediator now registers CRD handlers
for each resource type it watches, starting watchers as soon as
possible, without waiting for the next apply attempt.
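
A minimal sketch of this handler-registration idea follows (the names
crdController, SetHandler, and crdStatus are invented for illustration;
this is not the PR's actual code):

```go
// Hypothetical sketch: the remediator registers a callback per GroupKind, and
// a CRD reconciler invokes it whenever the CRD's Established condition
// changes. Deleted CRDs are elided; the real controller would also map the
// request name ("<plural>.<group>") back to a GroupKind on NotFound.
package crdwatch

import (
	"context"
	"sync"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// crdStatus is the simplified state a handler is notified about.
type crdStatus string

const (
	crdEstablished   crdStatus = "Established"
	crdUnestablished crdStatus = "Unestablished"
)

// crdController fans CRD status changes out to per-GroupKind handlers.
type crdController struct {
	client   client.Client
	mux      sync.RWMutex
	handlers map[schema.GroupKind]func(crdStatus)
}

// SetHandler registers (or replaces) the handler for a GroupKind so the
// remediator can start its watcher as soon as the CRD becomes established,
// without waiting for the next apply attempt.
func (c *crdController) SetHandler(gk schema.GroupKind, fn func(crdStatus)) {
	c.mux.Lock()
	defer c.mux.Unlock()
	if c.handlers == nil {
		c.handlers = map[schema.GroupKind]func(crdStatus){}
	}
	c.handlers[gk] = fn
}

// Reconcile reads the CRD, derives its status, and invokes the registered
// handler, if any.
func (c *crdController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	crd := &apiextensionsv1.CustomResourceDefinition{}
	if err := c.client.Get(ctx, req.NamespacedName, crd); err != nil {
		if apierrors.IsNotFound(err) {
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}

	gk := schema.GroupKind{Group: crd.Spec.Group, Kind: crd.Spec.Names.Kind}
	status := crdUnestablished
	for _, cond := range crd.Status.Conditions {
		if cond.Type == apiextensionsv1.Established && cond.Status == apiextensionsv1.ConditionTrue {
			status = crdEstablished
		}
	}

	c.mux.RLock()
	fn, ok := c.handlers[gk]
	c.mux.RUnlock()
	if ok {
		fn(status)
	}
	return ctrl.Result{}, nil
}
```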

This change also adds a ClusterRole and ClusterRoleBinding specifically
for RepoSync reconcilers, to allow watching of CRDs. RootSync
reconcilers now also watch CRDs, but they already have a ClusterRole
and ClusterRoleBinding that permit it.

Fixes: b/355532135

Extracted:

@karlkfi (Contributor, Author) commented Oct 3, 2024

/retest

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Oct 3, 2024
@karlkfi karlkfi force-pushed the karl-crd-watcher branch 8 times, most recently from 85b0b0a to d3bf55d on October 4, 2024 22:15
@karlkfi karlkfi requested review from sdowell and tiffanny29631 and removed request for Camila-B October 4, 2024 22:16
@karlkfi karlkfi changed the title from "[WIP] fix: remediator missing custom resource events" to "fix: remediator missing custom resource events" Oct 4, 2024
@karlkfi karlkfi changed the title from "fix: remediator missing custom resource events" to "[WIP] fix: remediator missing custom resource events" Oct 7, 2024
@karlkfi karlkfi force-pushed the karl-crd-watcher branch 2 times, most recently from 22cc5cd to eb6b50f on October 7, 2024 22:55
@karlkfi karlkfi changed the title from "[WIP] fix: remediator missing custom resource events" to "fix: remediator missing custom resource events" Oct 7, 2024
@karlkfi karlkfi force-pushed the karl-crd-watcher branch 2 times, most recently from b6090a4 to fe7bb01 on October 7, 2024 23:03
Review comments on pkg/reconcilermanager/controllers/crd_controller.go (outdated, resolved)
@@ -416,6 +418,14 @@ func (r *RootSyncReconciler) deleteManagedObjects(ctx context.Context, reconcile

// Register RootSync controller with reconciler-manager.
func (r *RootSyncReconciler) Register(mgr controllerruntime.Manager, watchFleetMembership bool) error {
	r.lock.Lock()
	defer r.lock.Unlock()
Contributor

Do we need the lock? The Register function is only invoked once and won't be re-registered.

karlkfi (Contributor, Author)

The lock protects the controller from a read/write race condition. But the way we're calling Register today, it probably won't ever be called in parallel.

karlkfi (Contributor, Author)

We don't technically need it today; it's just good defensive practice to protect variables that could be used in parallel.

Locking is cheap. Lock contention is what's expensive.
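
A minimal sketch of this defensive-locking pattern, with invented names (reconcilerSketch; a placeholder ConfigMap watch and no-op reconcile func), not the real RootSyncReconciler:

```go
// Sketch of an idempotent, mutex-guarded Register: the lock is essentially
// never contended today, but it keeps the read-modify-write safe if Register
// were ever called from more than one goroutine.
package controllers

import (
	"context"
	"sync"

	corev1 "k8s.io/api/core/v1"
	controllerruntime "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type reconcilerSketch struct {
	lock       sync.Mutex            // guards controller
	controller controller.Controller // written by Register, read by later calls
}

// Register builds the controller once; repeat calls are no-ops.
func (r *reconcilerSketch) Register(mgr controllerruntime.Manager) error {
	r.lock.Lock()
	defer r.lock.Unlock()
	if r.controller != nil {
		return nil // already registered
	}
	c, err := controllerruntime.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // placeholder watched type for the sketch
		Build(reconcile.Func(func(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
			return reconcile.Result{}, nil // no-op reconcile for illustration
		}))
	if err != nil {
		return err
	}
	r.controller = c
	return nil
}
```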

@karlkfi (Contributor, Author) commented Oct 9, 2024

For posterity, since inline comment threads can get lost...

Unfortunately, neither the DynamicRESTMapper in controller-runtime nor the DeferredDiscoveryRESTMapper/CachedDiscoveryClient in client-go implement auto-invalidation of resources, unless aggregated discovery is enabled on the server. While aggregated discovery is beta in 1.27+ and thus enabled by default, that doesn't guarantee that it will be enabled on non-GKE clusters.

  1. DynamicRESTMapper only recently added support for aggregated discovery, but it's in a newer controller-runtime than we're using today. While it does handle auto-discovery when aggregated discovery is disabled, it does not handle auto-invalidation.
  2. CachedDiscoveryClient does support aggregated discovery, but when it's disabled it doesn't handle auto-discovery or auto-invalidation. It just has a cache TTL.

So I've added a new ReplaceOnResetRESTMapper that can be used to replace the RESTMapper when Reset is called. Then I added some code to the watch.Manager to handle calling Reset on the mapper when a CRD is established or unestablished, if the mapper still knows about the deleted resource or doesn't know about a new resource. This handles auto-discovery and auto-invalidation of resources in the RESTMapper, but it's relatively inefficient.

Hopefully at some point in the future we can make aggregated discovery a requirement and use a simplified DiscoveryClient to handle both auto-discovery and auto-invalidation.
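
A minimal sketch of the replace-on-reset idea (the names replaceOnResetRESTMapper, rebuild, and needsReset are invented for illustration; this is not the actual Config Sync code):

```go
// Sketch: a RESTMapper wrapper whose delegate is rebuilt from scratch when
// Reset is called, giving the caller explicit control over invalidation
// instead of relying on a cache TTL.
package restmapper

import (
	"sync"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

type replaceOnResetRESTMapper struct {
	mux      sync.RWMutex
	delegate meta.RESTMapper
	rebuild  func() (meta.RESTMapper, error) // e.g. rebuild from fresh discovery
}

// Reset discards the current delegate and replaces it with a new one.
func (m *replaceOnResetRESTMapper) Reset() error {
	newMapper, err := m.rebuild()
	if err != nil {
		return err
	}
	m.mux.Lock()
	m.delegate = newMapper
	m.mux.Unlock()
	return nil
}

// RESTMapping delegates to the current mapper; the remaining meta.RESTMapper
// methods would delegate the same way.
func (m *replaceOnResetRESTMapper) RESTMapping(gk schema.GroupKind, versions ...string) (*meta.RESTMapping, error) {
	m.mux.RLock()
	defer m.mux.RUnlock()
	return m.delegate.RESTMapping(gk, versions...)
}

// needsReset sketches the check a watch-manager-style caller would do when a
// CRD becomes established or unestablished: reset only when the mapper's view
// disagrees with the event it just observed.
func needsReset(mapper meta.RESTMapper, gk schema.GroupKind, established bool) bool {
	_, err := mapper.RESTMapping(gk)
	if meta.IsNoMatchError(err) {
		return established // mapper doesn't know the resource yet
	}
	return err == nil && !established // mapper still knows a removed resource
}
```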

@nan-yu (Contributor) left a comment

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nan-yu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 89f6973 into GoogleContainerTools:main Oct 9, 2024
6 checks passed
@karlkfi karlkfi deleted the karl-crd-watcher branch October 9, 2024 22:47