
Bug 1743345: clean up service account, cluster roles, and cluster role bindings after CSV deletion #970

Merged

Conversation


@jpeeler jpeeler commented Jul 29, 2019

The approach I have here works, but it takes a while since the resync interval has to pass. I was trying to avoid creating a new queue for requeueing, though that may not have worked anyway, since it would really need to be requeued after CSV deletion. The commented-out section did not work due to CSV requeueing and in-progress CSV processing.

(The above is now outdated.)

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 29, 2019
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 29, 2019
Member

@njhale njhale left a comment


Looks good! Left some feedback.

```go
a.requeueOwnerCSVs(metaObj)
switch metaObj.(type) {
case *corev1.ServiceAccount:
	if err := a.opClient.DeleteServiceAccount(metaObj.GetNamespace(), metaObj.GetName(), &metav1.DeleteOptions{}); err != nil {
```
Member

nit:

```go
if syncError = a.opClient.DeleteServiceAccount(...
```

```go
case *corev1.ServiceAccount:
	if err := a.opClient.DeleteServiceAccount(metaObj.GetNamespace(), metaObj.GetName(), &metav1.DeleteOptions{}); err != nil {
		logger.WithError(err).Warn("cannot delete service account")
		syncError = err
```
Member

nit: You may want to put a `break` (or an `else`) here, or you'll get the "Deleted" log output even when the deletion fails.
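The fix the nit describes can be sketched stdlib-only as follows; `client` and `syncDelete` here are hypothetical stand-ins for the operator client and sync handler, not OLM's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for the operator client; only deletion matters here.
type client struct{ fail bool }

func (c client) DeleteServiceAccount(ns, name string) error {
	if c.fail {
		return errors.New("delete failed")
	}
	return nil
}

// syncDelete mirrors the suggestion: return early on error so the
// "Deleted" log line is only emitted when the deletion actually succeeded.
func syncDelete(c client, ns, name string) (syncError error) {
	if syncError = c.DeleteServiceAccount(ns, name); syncError != nil {
		fmt.Println("cannot delete service account:", syncError)
		return // early return prevents the success log below
	}
	fmt.Printf("Deleted %s/%s\n", ns, name)
	return nil
}

func main() {
	_ = syncDelete(client{fail: false}, "ns", "sa") // logs the deletion
	_ = syncDelete(client{fail: true}, "ns", "sa")  // logs only the warning
}
```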

```go
		"name":      metaObj.GetName(),
		"namespace": metaObj.GetNamespace(),
		"self":      metaObj.GetSelfLink(),
	})

	// Requeues objects that can't have ownerrefs (cluster -> namespace, cross-namespace)
	if ownerutil.IsOwnedByKindLabel(metaObj, v1alpha1.ClusterServiceVersionKind) {
```
Member

There seems to be some duplicated work. I wonder if we can rearrange things to make it a little cleaner. Does it make sense to move some of this to the requeueOwnerCSVs method; particularly the existence check and GC enqueuing?

Author

Most of the code here was already present. The goal was to avoid requeueing CSVs in a deletion scenario. Maybe I can circle back to this later.

```
@@ -105,6 +105,13 @@ func (q *QueueInformer) metricHandlers() *cache.ResourceEventHandlerFuncs {
	}
}

func NewQueue(ctx context.Context, options ...Option) (*QueueInformer, error) {
```
Member

@njhale njhale Aug 6, 2019


This constructor name makes a lot more sense than NewQueueInformer for the use case, which I think is to have a QueueInformer that doesn't have an informer or indexer behind it. If that's the case, then I think what you want to do is remove the non-nil informer/indexer test in the queueInformerConfig.validate() method, or create a new validate method that ignores that check. Additionally, you would need to detect a nil indexer and attempt to use the resource nested in the event, rather than getting it from the indexer.

Author

@jpeeler jpeeler Aug 8, 2019


I ended up creating a ResourceEvent (with type updated) and putting that on the queue.
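A minimal sketch of that resolution, with hypothetical `resourceEvent` and `queue` types standing in for the real ones (OLM uses client-go's workqueue): since no indexer backs the queue, the event carries the resource itself rather than a cache key.

```go
package main

import "fmt"

// Hypothetical event type: with no indexer behind the queue, the event
// must embed the resource directly instead of referencing a cache key.
type eventType string

const eventUpdated eventType = "updated"

type resourceEvent struct {
	kind     eventType
	resource interface{} // the object to GC, carried inline
}

// Minimal FIFO stand-in for a workqueue.
type queue struct{ items []resourceEvent }

func (q *queue) Add(e resourceEvent) { q.items = append(q.items, e) }
func (q *queue) Len() int            { return len(q.items) }
func (q *queue) Pop() resourceEvent {
	e := q.items[0]
	q.items = q.items[1:]
	return e
}

func main() {
	var q queue
	// Enqueue the resource directly; the processing loop reads it from
	// the event instead of looking it up in an indexer.
	q.Add(resourceEvent{kind: eventUpdated, resource: "my-service-account"})
	e := q.Pop()
	fmt.Println(e.kind, e.resource)
}
```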

```go
	ctx,
	queueinformer.WithLogger(op.logger),
	queueinformer.WithQueue(objGCQueue),
	queueinformer.WithSyncer(queueinformer.LegacySyncHandler(op.syncGCObject).ToSyncer()),
```
Member

Is using the legacy adapter more convenient than implementing a new Syncer? If that's the case then we may want to remove that interface going forward.

Author

Since I'm familiar with the "old" way, I was going to implement that first.

```go
} else {
	switch metaObj.(type) {
	case *corev1.ServiceAccount, *rbacv1.ClusterRole, *rbacv1.ClusterRoleBinding:
		a.objGCQueueSet.Requeue(ns, metaObj.GetName())
```
Member

If the goal is to send a ResourceEvent, this will not work as intended since it constructs a key manually. It may be helpful to add a RequeueEvent method to the ResourceQueue type.
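A `RequeueEvent` method along the lines the reviewer suggests could look like this sketch, assuming a hypothetical namespace-keyed `resourceQueueSet` (the real `ResourceQueue` type wraps client-go workqueues):

```go
package main

import "fmt"

// Hypothetical event placed on the queue; carries the resource directly.
type resourceEvent struct {
	resource interface{}
}

// Minimal FIFO stand-in for a workqueue.
type queue struct{ items []resourceEvent }

func (q *queue) Add(e resourceEvent) { q.items = append(q.items, e) }

// resourceQueueSet holds one queue per namespace, as the existing
// Requeue(ns, name) does, but RequeueEvent enqueues a full event
// rather than constructing a "ns/name" string key manually.
type resourceQueueSet struct {
	queueSet map[string]*queue
}

func (r *resourceQueueSet) RequeueEvent(namespace string, e resourceEvent) error {
	q, ok := r.queueSet[namespace]
	if !ok {
		return fmt.Errorf("no queue found for namespace %s", namespace)
	}
	q.Add(e)
	return nil
}

func main() {
	r := &resourceQueueSet{queueSet: map[string]*queue{"ns": {}}}
	if err := r.RequeueEvent("ns", resourceEvent{resource: "sa"}); err != nil {
		panic(err)
	}
	fmt.Println(len(r.queueSet["ns"].items))
}
```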

@jpeeler jpeeler force-pushed the cleanup-roles-sa branch 2 times, most recently from fc43c40 to 1a59a97 Compare August 14, 2019 21:07
Author

@jpeeler jpeeler left a comment


The commit message is worth reading; I didn't repeat it here.

I counted at least six different places CSVs are requeued (seven if you count operator group requeues, which also requeue CSVs). This is why the deletion has to be handled in a queue rather than done directly in the CSV deletion handler.

```go
	defer r.mutex.RUnlock()

	if queue, ok := r.queueSet[namespace]; ok {
		queue.AddRateLimited(resourceEvent)
```
Author

Really what I wanted to do here was `queue.AddAfter(resourceEvent, 10*time.Second)`, but I think what's here works well enough. I figured a timeout wouldn't be allowed, but it would avoid some potential requeueing due to lingering CSVs.

@jpeeler jpeeler changed the title WIP: clean up service account, cluster roles, and cluster role bindings after CSV deletion clean up service account, cluster roles, and cluster role bindings after CSV deletion Aug 14, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 14, 2019

jpeeler commented Aug 15, 2019

/test e2e-aws-olm

Member

@ecordell ecordell left a comment


This looks really good!

I had some questions, mostly about code organization, but I like this approach and the testing for it.

Resolved review threads:
pkg/controller/operators/olm/operator.go (two threads)
pkg/controller/operators/olm/operatorgroup.go
pkg/controller/registry/resolver/rbac.go
pkg/lib/queueinformer/config.go (outdated)
@shawn-hurley
Member

Please add a bug to the title that this is fixing

Jeff Peeler added 2 commits August 19, 2019 12:43
This ensures proper resource deletion is done upon CSV deletion. Since
this touches a lot of different places, here's a summary of changes
made:

The RBAC has been modified to be owned by CSV instead of the operator
group. An operator group may remain after a CSV is deleted, but the
associated resources shouldn't. Similarly, created service accounts were
missing an owner reference to the CSV.

Due to the large amount of CSV requeueing and potential in-progress
handling of a CSV, RBAC couldn't be deleted in
handleClusterServiceVersionDeletion (because sometimes the RBAC would be
recreated by another CSV sync). Instead, a new queue was created for
GC-ing resources. Deletes are performed in the sync loop specifically so
that it can return an error (for example, when the CSV is not yet
deleted) and be scheduled to try again later. The requeueing code has
been changed to not requeue when the CSV is not in the cache, so as not
to delay the new GC sync loop.

The new queue does not utilize an informer or indexer, so the event and
the resource are placed directly on the queue rather than relying on the
indexer to retrieve by key in the processing loop (processNextWorkItem).
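The retry-on-error flow the commit message describes can be sketched as follows; `syncGCObject` and the `cache` type here are simplified, hypothetical stand-ins for the real handler and CSV cache:

```go
package main

import (
	"errors"
	"fmt"
)

// cache stands in for the CSV lister/cache: CSV name -> still present.
type cache map[string]bool

// syncGCObject refuses to delete while the owning CSV is still in the
// cache; returning an error makes the queue retry the event later
// (rate-limited), which is exactly why deletes live in the sync loop.
func syncGCObject(csvCache cache, ownerCSV, resource string) error {
	if csvCache[ownerCSV] {
		return errors.New("owner CSV still exists, retrying later")
	}
	fmt.Println("deleting", resource)
	return nil
}

func main() {
	c := cache{"my-csv": true}
	fmt.Println(syncGCObject(c, "my-csv", "my-clusterrole") != nil) // retry path
	delete(c, "my-csv")
	fmt.Println(syncGCObject(c, "my-csv", "my-clusterrole") == nil) // deletion path
}
```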
@jpeeler jpeeler changed the title clean up service account, cluster roles, and cluster role bindings after CSV deletion Bug 1729385: clean up service account, cluster roles, and cluster role bindings after CSV deletion Aug 19, 2019
@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 19, 2019
@openshift-ci-robot
Collaborator

@jpeeler: This pull request references Bugzilla bug 1729385, which is invalid:

  • expected the bug to target the "4.2.0" release, but it targets "4.1.z" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1729385: clean up service account, cluster roles, and cluster role bindings after CSV deletion

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jpeeler jpeeler changed the title Bug 1729385: clean up service account, cluster roles, and cluster role bindings after CSV deletion Bug 1743345: clean up service account, cluster roles, and cluster role bindings after CSV deletion Aug 19, 2019
@openshift-ci-robot
Collaborator

@jpeeler: This pull request references Bugzilla bug 1743345, which is invalid:

  • expected dependent Bugzilla bug 1729385 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is POST instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1743345: clean up service account, cluster roles, and cluster role bindings after CSV deletion



jpeeler commented Aug 19, 2019

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Aug 19, 2019
@openshift-ci-robot
Collaborator

@jpeeler: This pull request references Bugzilla bug 1743345, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh


@ecordell
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 20, 2019
@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, jpeeler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


jpeeler commented Aug 20, 2019

/test e2e-aws-olm

3 similar comments

jpeeler commented Aug 21, 2019

/test e2e-aws-olm


jpeeler commented Aug 21, 2019

/test e2e-aws-olm


jpeeler commented Aug 21, 2019

/test e2e-aws-olm


jpeeler commented Aug 21, 2019

/retest


jpeeler commented Aug 21, 2019

/test unit

@openshift-merge-robot openshift-merge-robot merged commit 7d6665d into operator-framework:master Aug 22, 2019
@openshift-ci-robot
Collaborator

@jpeeler: All pull requests linked via external trackers have merged. Bugzilla bug 1743345 has been moved to the MODIFIED state.

In response to this:

Bug 1743345: clean up service account, cluster roles, and cluster role bindings after CSV deletion


@ecordell
Member

/cherry-pick release-4.1

@openshift-cherrypick-robot

@ecordell: #970 failed to apply on top of branch "release-4.1":

```
Using index info to reconstruct a base tree...
M	test/e2e/installplan_e2e_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/installplan_e2e_test.go
Applying: fix(olm): clean up resources on CSV deletion
Using index info to reconstruct a base tree...
M	pkg/controller/operators/olm/operator.go
M	pkg/controller/operators/olm/operatorgroup.go
A	pkg/lib/queueinformer/config.go
M	pkg/lib/queueinformer/queueinformer.go
M	pkg/lib/queueinformer/queueinformer_operator.go
M	pkg/lib/queueinformer/resourcequeue.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/lib/queueinformer/resourcequeue.go
Auto-merging pkg/lib/queueinformer/queueinformer_operator.go
CONFLICT (content): Merge conflict in pkg/lib/queueinformer/queueinformer_operator.go
Auto-merging pkg/lib/queueinformer/queueinformer.go
CONFLICT (content): Merge conflict in pkg/lib/queueinformer/queueinformer.go
CONFLICT (modify/delete): pkg/lib/queueinformer/config.go deleted in HEAD and modified in fix(olm): clean up resources on CSV deletion. Version fix(olm): clean up resources on CSV deletion of pkg/lib/queueinformer/config.go left in tree.
Auto-merging pkg/controller/operators/olm/operatorgroup.go
CONFLICT (content): Merge conflict in pkg/controller/operators/olm/operatorgroup.go
Auto-merging pkg/controller/operators/olm/operator.go
CONFLICT (content): Merge conflict in pkg/controller/operators/olm/operator.go
error: Failed to merge in the changes.
Patch failed at 0002 fix(olm): clean up resources on CSV deletion
```

In response to this:

/cherry-pick release-4.1

