AUTH-543: Add optional operand deletion condition #1902

liouk · 2024-12-03T09:54:22Z

There are cases (as demonstrated in OCPBUGS-44937) where we want to gracefully delete the workload controller's operand workload while keeping the operator status available (instead of unavailable or degraded).

This PR adds an optional way of specifying a deletion condition which will trigger the deletion of the operand gracefully, keeping the operator's status as Available=True. To avoid breaking changes, I've added new setup functions to be used for wiring this particular case.

This is needed in the scope of openshift/cluster-authentication-operator#740.

openshift-ci-robot · 2024-12-03T09:54:29Z

@liouk: This pull request references Jira Issue OCPBUGS-44937, which is invalid:

expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2024-12-03T14:11:10Z

@liouk: This pull request explicitly references no jira issue.

everettraven · 2024-12-03T17:32:21Z

pkg/operator/apiserver/controller/workload/workload.go

+		}
+	}()
+
+	if _, err := c.deploymentLister.Deployments(c.targetNamespace).Get(workloadName); err != nil && !apierrors.IsNotFound(err) {


Is the Get() call necessary before attempting the Delete() call? What does the extra call buy us?

The Get() happens on the lister (cache), so if the deployment has already been deleted, it saves us an extra API call to Delete(). This matters because the check runs on every sync.

Ah, I totally missed that the Get() would be on the cache - that makes sense :). Thanks!
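For readers following the thread, here is a minimal sketch of the pattern discussed above: consult the local cache first and only issue the delete when the object is still listed. The `cache` and `apiClient` types and the `deleteIfCached` helper are hypothetical stand-ins written for this illustration, not the client-go lister or typed client used in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound API error (hypothetical).
var errNotFound = errors.New("not found")

// cache is a hypothetical stand-in for an informer-backed lister: reads are local.
type cache map[string]bool

func (c cache) Get(name string) error {
	if !c[name] {
		return errNotFound
	}
	return nil
}

// apiClient is a hypothetical stand-in for the typed client; each Delete is a round trip.
type apiClient struct{ deleteCalls int }

func (a *apiClient) Delete(name string) error {
	a.deleteCalls++
	return nil
}

// deleteIfCached issues Delete only when the cache still lists the workload.
func deleteIfCached(c cache, api *apiClient, name string) error {
	err := c.Get(name)
	if errors.Is(err, errNotFound) {
		return nil // already gone according to the cache: skip the API call
	}
	if err != nil {
		return err
	}
	return api.Delete(name)
}

func main() {
	api := &apiClient{}
	_ = deleteIfCached(cache{}, api, "apiserver")                  // not cached, no call
	_ = deleteIfCached(cache{"apiserver": true}, api, "apiserver") // cached, one call
	fmt.Println("delete calls:", api.deleteCalls)
}
```

Since a cache can lag behind the API server, a real implementation would presumably also tolerate a NotFound error returned by the delete itself.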

everettraven · 2024-12-03T17:46:06Z

pkg/operator/apiserver/controller/workload/workload.go

Would it be beneficial to add some unit tests for the new deletion behavior?

The unit tests would end up using mocks for most of the stuff that the deletion does; but on second thought it might be beneficial to test the operator status, so I'll add some 👍

It would be good to import and use this change in one of the consumers (e.g. cluster-authentication-operator) and show that the CI is passing there.

everettraven · 2024-12-04T13:17:38Z

pkg/operator/apiserver/controller/workload/workload.go

+		deploymentAvailableCondition = deploymentAvailableCondition.
+			WithStatus(operatorv1.ConditionFalse).
+			WithReason("DeletionError")


Should we include a message in all conditions? I'm not familiar with how these conditions have been constructed in the past, but I have seen logs that say not setting the message in a condition will eventually be fatal:

W1203 04:13:07.671272 1 dynamic_operator_client.go:355] .status.conditions["APIServerDeploymentAvailable"].message is missing; this will eventually be fatal

This was pulled from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.18-ocp-e2e-ovn-remote-libvirt-s390x/1863791759423705088/artifacts/ocp-e2e-ovn-remote-libvirt-s390x/gather-extra/artifacts/pods/openshift-apiserver-operator_openshift-apiserver-operator-dfc988b89-rsh2n_openshift-apiserver-operator.log

Even though it would be a repetitive message in this case, with ^ in mind maybe it is worth it to future-proof this implementation as much as we can for when that does get flipped to fatal?
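One way to guard against the missing-message warning is to build every condition through a helper that never leaves the message empty. The sketch below assumes a local `condition` struct and a hypothetical `newCondition` helper; neither is the apply-configuration type used in the PR, and the fallback message format is only an illustration.

```go
package main

import "fmt"

// condition is a local stand-in for an operator condition (hypothetical type).
type condition struct {
	Type, Status, Reason, Message string
}

// newCondition fills in a fallback message so that the field is never empty.
func newCondition(condType, status, reason, message string) condition {
	if message == "" {
		message = fmt.Sprintf("%s is %s: %s", condType, status, reason)
	}
	return condition{Type: condType, Status: status, Reason: reason, Message: message}
}

func main() {
	c := newCondition("APIServerDeploymentAvailable", "False", "DeletionError", "")
	fmt.Printf("%+v\n", c)
}
```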

atiratree · 2024-12-04T12:18:57Z

pkg/operator/apiserver/controller/workload/workload.go

+//
+// the "deletionConditionFn" will be used to check whether the workload specified by the
+// returned name which is part of targetNamespace must be deleted
+func NewControllerWithDeletion(instanceName, operatorNamespace, targetNamespace, targetOperandVersion, operandNamePrefix, conditionsPrefix string,


Is there sufficient benefit in creating a second constructor for this? We could just pass the deletion option as nil in the normal use case.

Still applies, I think we do not need any change in the constructor. Let's drive such logic through the delegate.

atiratree · 2024-12-04T12:54:28Z

pkg/operator/apiserver/controller/workload/workload.go

+		if conditionMet, workload, err := c.deletionConditionFn(); err != nil {
+			return err
+		} else if conditionMet {
+			return c.deleteWorkload(ctx, workload)


Shouldn't the delegate be responsible for the lifecycle of the Deployment? I am not sure if it makes sense to fragment the logic.

This still applies, why do we have to manage the deletion in here?

atiratree · 2024-12-04T12:56:09Z

pkg/operator/apiserver/controller/workload/workload.go

+			return c.deleteWorkload(ctx, workload)
+		}
+	}
+
 	if fulfilled, err := c.delegate.PreconditionFulfilled(ctx); err != nil {


Wouldn't it be better to let the delegate communicate everything, as we do with the preconditions?
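To make the suggestion concrete, here is a rough sketch of a controller that leaves the deletion decision to the delegate and only classifies what comes back. `PreconditionFulfilled` mirrors the call visible in the diff above; the `Sync` signature, the `deployment` type, the `syncOnce` helper, and the state strings are all assumptions made for this illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// deployment is a local stand-in for appsv1.Deployment.
type deployment struct{ Name string }

// delegate is a hypothetical interface; the Sync signature is assumed for this sketch.
type delegate interface {
	PreconditionFulfilled(ctx context.Context) (bool, error)
	// Sync returns a nil workload when the delegate has removed it on purpose.
	Sync(ctx context.Context) (*deployment, []error)
}

// syncOnce lets the delegate decide everything and only classifies the result.
func syncOnce(ctx context.Context, d delegate) (string, error) {
	fulfilled, err := d.PreconditionFulfilled(ctx)
	if err != nil {
		return "", err
	}
	if !fulfilled {
		return "PreconditionNotFulfilled", nil
	}
	workload, errs := d.Sync(ctx)
	if len(errs) > 0 {
		return "", errors.Join(errs...)
	}
	if workload == nil {
		return "WorkloadRemoved", nil
	}
	return "WorkloadPresent", nil
}

// fakeDelegate is a test double: it removes the workload when remove is true.
type fakeDelegate struct{ remove bool }

func (f fakeDelegate) PreconditionFulfilled(context.Context) (bool, error) { return true, nil }

func (f fakeDelegate) Sync(context.Context) (*deployment, []error) {
	if f.remove {
		return nil, nil
	}
	return &deployment{Name: "apiserver"}, nil
}

func main() {
	for _, d := range []fakeDelegate{{remove: false}, {remove: true}} {
		state, err := syncOnce(context.Background(), d)
		fmt.Println(state, err)
	}
}
```

With this shape, no second constructor or extra controller-set method is needed: a delegate that never deletes simply always returns a workload.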

atiratree · 2024-12-04T12:58:30Z

pkg/operator/apiserver/controllerset/apiservercontrollerset.go

@@ -295,6 +295,46 @@ func (cs *APIServerControllerSet) WithWorkloadController(
 	return cs
 }

+func (cs *APIServerControllerSet) WithWorkloadControllerWithDeletion(


If we let the delegate control this, we do not need another method.

atiratree · 2024-12-04T13:04:09Z

pkg/operator/apiserver/controller/workload/workload.go

@@ -356,6 +424,55 @@ func (c *Controller) updateOperatorStatus(ctx context.Context, previousStatus *o
 	return nil
 }

+func (c *Controller) deleteWorkload(ctx context.Context, workloadName string) (err error) {


Do we need to split it from the updateOperatorStatus reconciliation logic? It would be nicer to keep the concern in the same place and consider all 4 conditions.

I do not have a problem with calling a util function in updateOperatorStatus if needed though.

still applies, if we need to manage the status/conditions during scale down of the workload, it would be better to do it in a single place (updateOperatorStatus)

atiratree · 2024-12-04T13:14:02Z

pkg/operator/apiserver/controller/workload/workload.go

+	if _, getErr := c.deploymentLister.Deployments(c.targetNamespace).Get(workloadName); getErr != nil && !apierrors.IsNotFound(getErr) {
+		deploymentAvailableCondition = deploymentAvailableCondition.
+			WithStatus(operatorv1.ConditionFalse).
+			WithReason("DeletionError")


nit: this isn't a deletion error

atiratree · 2024-12-04T13:40:17Z

pkg/operator/apiserver/controller/workload/workload.go

+	deploymentAvailableCondition = deploymentAvailableCondition.
+		WithStatus(operatorv1.ConditionTrue).
+		WithReason("AsExpected")
+	workloadDegradedCondition = workloadDegradedCondition.
+		WithStatus(operatorv1.ConditionFalse)


What is the value of preserving these conditions (Available=True, Degraded=False) if no workload exists for them?

Or are we interested in measuring the availability of OIDC provider? Is it possible? And even then this probably isn't the right place to indicate it, right?

atiratree · 2024-12-04T13:42:42Z

pkg/operator/apiserver/controller/workload/workload.go

It would be good to import and use this change in one of the consumers (e.g. cluster-authentication-operator) and show that the CI is passing there.

openshift-ci-robot · 2024-12-05T13:07:18Z

@liouk: This pull request references AUTH-543 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.19." or "openshift-4.19.", but it targets "openshift-4.18" instead.

liouk · 2024-12-05T13:07:21Z

/jira refresh

openshift-ci-robot · 2024-12-05T13:07:25Z

@liouk: This pull request references AUTH-543 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.19." or "openshift-4.19.", but it targets "openshift-4.18" instead.

openshift-ci · 2024-12-05T17:59:18Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liouk
Once this PR has been reviewed and has the lgtm label, please assign jsafrane for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/operator/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

atiratree · 2024-12-05T20:09:06Z

pkg/operator/apiserver/controller/workload/workload.go

-	deploymentDegradedCondition := applyoperatorv1.OperatorCondition().
-		WithType(fmt.Sprintf("%sDeploymentDegraded", c.conditionsPrefix))
+	if removeConditions {
+		jsonPatch := v1helpers.RemoveConditionsJSONPatch(previousStatus, []string{typeAvailable, typeDegraded, typeProgressing, typeWorkloadDegraded})


It seems we cannot use server-side apply (SSA) for removing conditions, but I am not sure if a patch is better than v1helpers.UpdateStatus here. It would be good to get an additional opinion on this.

The advantage of the patch here in my opinion is that we're only adding to the patch the specific conditions that we want to remove, so it's more concise -- we won't have to manage the whole status object to perform the update, so maybe it's less prone to mistakes.

What would you think @bertinatto?
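As a rough illustration of why a patch can stay concise, the sketch below builds an RFC 6902 JSON Patch that touches only the conditions to be removed. It is not the implementation of v1helpers.RemoveConditionsJSONPatch; the `removeConditionsPatch` helper, its parameters, and the added `test` operations (which make the patch fail if the list changed in the meantime) are assumptions for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value any    `json:"value,omitempty"`
}

// removeConditionsPatch emits test+remove pairs for every listed condition type.
// existingTypes is the ordered list of condition types currently in the status.
func removeConditionsPatch(existingTypes []string, remove map[string]bool) ([]byte, error) {
	ops := []patchOp{}
	// Walk backwards so that a removal never shifts the index of a later one.
	for i := len(existingTypes) - 1; i >= 0; i-- {
		condType := existingTypes[i]
		if !remove[condType] {
			continue
		}
		path := fmt.Sprintf("/status/conditions/%d", i)
		ops = append(ops,
			patchOp{Op: "test", Path: path + "/type", Value: condType},
			patchOp{Op: "remove", Path: path},
		)
	}
	return json.Marshal(ops)
}

func main() {
	existing := []string{"APIServerDeploymentAvailable", "Other", "APIServerDeploymentDegraded"}
	remove := map[string]bool{
		"APIServerDeploymentAvailable": true,
		"APIServerDeploymentDegraded":  true,
	}
	patch, err := removeConditionsPatch(existing, remove)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

The rest of the status object never appears in the request, which is the property argued for above; the trade-off is that index-based paths depend on the ordering seen at read time.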

atiratree · 2024-12-05T20:12:21Z

pkg/operator/apiserver/controller/workload/workload.go


-	deploymentDegradedCondition := applyoperatorv1.OperatorCondition().
-		WithType(fmt.Sprintf("%sDeploymentDegraded", c.conditionsPrefix))
+	if removeConditions {


We should check if workload == nil to make sure the delegate has deleted (and not recreated) the workload.

atiratree · 2024-12-05T20:13:58Z

pkg/operator/apiserver/controller/workload/workload.go


-	deploymentDegradedCondition := applyoperatorv1.OperatorCondition().
-		WithType(fmt.Sprintf("%sDeploymentDegraded", c.conditionsPrefix))
+	if removeConditions {


Even though the workload should be gone and we should not get any errors, we should at least log them if they are passed on by the delegate.

atiratree · 2024-12-05T20:16:59Z

pkg/operator/apiserver/controller/workload/workload.go

+	deploymentAvailableCondition := applyoperatorv1.OperatorCondition().WithType(typeAvailable)
+	workloadDegradedCondition := applyoperatorv1.OperatorCondition().WithType(typeWorkloadDegraded)
+	deploymentDegradedCondition := applyoperatorv1.OperatorCondition().WithType(typeDegraded)
+	deploymentProgressingCondition := applyoperatorv1.OperatorCondition().WithType(typeProgressing)



It might make more sense to still consider preconditions even when the workload will be deleted later. Thoughts?

Given that the delete would happen during the delegate's sync, we shouldn't normally reach the point of removing the conditions if preconditions are failing. But you are right, we should safe-guard against this -- I'll add a check before removing conditions.
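The guards raised in this review (preconditions still hold, the delegate returned no workload, and any delegate errors get logged) could be combined into a single check before the conditions are removed. This is only a sketch: `canRemoveConditions`, the `deployment` stand-in, and the choice to block removal when errors are present are assumptions, not the code in the PR.

```go
package main

import (
	"errors"
	"log"
)

// deployment is a local stand-in for appsv1.Deployment.
type deployment struct{ Name string }

// errExample is a sample delegate error used only by main below.
var errExample = errors.New("failed to delete deployment")

// canRemoveConditions reports whether it is safe to drop the workload conditions.
// Errors are always logged; in this sketch they also block the removal.
func canRemoveConditions(preconditionsReady bool, workload *deployment, errs []error) bool {
	for _, err := range errs {
		log.Printf("delegate reported an error while removing the workload: %v", err)
	}
	return preconditionsReady && workload == nil && len(errs) == 0
}

func main() {
	log.Println(canRemoveConditions(true, nil, nil))
	log.Println(canRemoveConditions(true, &deployment{Name: "apiserver"}, nil))
	log.Println(canRemoveConditions(false, nil, nil))
	log.Println(canRemoveConditions(true, nil, []error{errExample}))
}
```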

atiratree · 2024-12-05T20:18:54Z

pkg/operator/apiserver/controller/workload/workload.go

+		if conditionMet, workload, err := c.deletionConditionFn(); err != nil {
+			return err
+		} else if conditionMet {
+			return c.deleteWorkload(ctx, workload)


This still applies, why do we have to manage the deletion in here?

atiratree · 2024-12-05T20:20:03Z

pkg/operator/apiserver/controller/workload/workload.go

+//
+// the "deletionConditionFn" will be used to check whether the workload specified by the
+// returned name which is part of targetNamespace must be deleted
+func NewControllerWithDeletion(instanceName, operatorNamespace, targetNamespace, targetOperandVersion, operandNamePrefix, conditionsPrefix string,


Still applies, I think we do not need any change in the constructor. Let's drive such logic through the delegate.

atiratree · 2024-12-05T20:23:04Z

pkg/operator/apiserver/controller/workload/workload.go

@@ -356,6 +424,55 @@ func (c *Controller) updateOperatorStatus(ctx context.Context, previousStatus *o
 	return nil
 }

+func (c *Controller) deleteWorkload(ctx context.Context, workloadName string) (err error) {


still applies, if we need to manage the status/conditions during scale down of the workload, it would be better to do it in a single place (updateOperatorStatus)

openshift-ci · 2024-12-10T11:11:23Z

@liouk: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

liouk · 2024-12-10T13:27:11Z

Proof PR: openshift/cluster-authentication-operator#747

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 3, 2024

openshift-ci bot requested review from p0lyn0mial and tkashem December 3, 2024 09:55

liouk force-pushed the workload-deletion-condition branch 3 times, most recently from 2daf3c8 to ee16aa0 Compare December 3, 2024 12:21

liouk changed the title ~~WIP: OCPBUGS-44937: Add optional operand deletion condition~~ OCPBUGS-44937: Add optional operand deletion condition Dec 3, 2024

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 3, 2024

liouk changed the title ~~OCPBUGS-44937: Add optional operand deletion condition~~ NO-JIRA: Add optional operand deletion condition Dec 3, 2024

openshift-ci-robot removed jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 3, 2024

liouk force-pushed the workload-deletion-condition branch from ee16aa0 to a9ce87e Compare December 3, 2024 14:48

everettraven reviewed Dec 3, 2024

liouk force-pushed the workload-deletion-condition branch from a9ce87e to 520c3d2 Compare December 4, 2024 11:41

everettraven reviewed Dec 4, 2024

atiratree reviewed Dec 4, 2024

liouk changed the title ~~NO-JIRA: Add optional operand deletion condition~~ AUTH-543: Add optional operand deletion condition Dec 5, 2024

liouk force-pushed the workload-deletion-condition branch from 520c3d2 to 8367a93 Compare December 5, 2024 17:58

atiratree reviewed Dec 5, 2024

workload: add deletion condition func to decide whether to delete a w…

f07c37b

…orkload

liouk force-pushed the workload-deletion-condition branch from 8367a93 to f07c37b Compare December 10, 2024 10:52

liouk added a commit to liouk/cluster-authentication-operator that referenced this pull request Dec 10, 2024

vendor: pull in openshift/library-go#1902

9f4c1b2

liouk mentioned this pull request Dec 10, 2024

DO-NOT-MERGE: Prove library-go#1902 openshift/cluster-authentication-operator#747

Open
