AUTH-543: Add optional operand deletion condition #1902

Open
liouk wants to merge 1 commit into base: master

Member

@liouk liouk commented Dec 3, 2024

There are cases (as demonstrated in OCPBUGS-44937) where we want to gracefully delete the operand workload of the workload controller while keeping the operator status available (instead of unavailable or degraded).

This PR adds an optional way of specifying a deletion condition which will trigger the deletion of the operand gracefully, keeping the operator's status as Available=True. To avoid breaking changes, I've added new setup functions to be used for wiring this particular case.

This is needed in the scope of openshift/cluster-authentication-operator#740.

@openshift-ci-robot openshift-ci-robot added the jira/severity-critical, jira/valid-reference, and jira/invalid-bug labels Dec 3, 2024
@openshift-ci-robot

@liouk: This pull request references Jira Issue OCPBUGS-44937, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Dec 3, 2024
@openshift-ci openshift-ci bot requested review from p0lyn0mial and tkashem December 3, 2024 09:55
@liouk liouk force-pushed the workload-deletion-condition branch 3 times, most recently from 2daf3c8 to ee16aa0 December 3, 2024 12:21
@liouk liouk changed the title from "WIP: OCPBUGS-44937: Add optional operand deletion condition" to "OCPBUGS-44937: Add optional operand deletion condition" Dec 3, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Dec 3, 2024
@liouk liouk changed the title from "OCPBUGS-44937: Add optional operand deletion condition" to "NO-JIRA: Add optional operand deletion condition" Dec 3, 2024
@openshift-ci-robot openshift-ci-robot removed the jira/severity-critical and jira/invalid-bug labels Dec 3, 2024
@openshift-ci-robot

@liouk: This pull request explicitly references no jira issue.

@liouk liouk force-pushed the workload-deletion-condition branch from ee16aa0 to a9ce87e December 3, 2024 14:48
}
}()

if _, err := c.deploymentLister.Deployments(c.targetNamespace).Get(workloadName); err != nil && !apierrors.IsNotFound(err) {

Is the Get() call necessary before attempting the Delete() call? What does the extra call buy us?

Member Author

@liouk liouk Dec 4, 2024

The Get() happens on the lister (cache), which means that if the deployment has already been deleted, it'll save us an extra API call to Delete(). This will be happening on every sync.

Ah, I totally missed that the Get() would be on the cache - that makes sense :). Thanks!
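
The caching argument in this thread can be illustrated with a minimal, self-contained sketch. It assumes hypothetical `deploymentLister` and `deploymentDeleter` interfaces, a hypothetical `errNotFound` sentinel, and a fake in-memory backend; none of these names come from the PR, and the real code uses the deployment lister and `apierrors.IsNotFound` instead.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// deploymentLister is a hypothetical stand-in for a cache-backed lister.
type deploymentLister interface {
	Get(name string) error
}

// deploymentDeleter is a hypothetical stand-in for the API client.
type deploymentDeleter interface {
	Delete(name string) error
}

// deleteIfCached consults the cache first and only issues a Delete call
// when the cache still knows about the object.
func deleteIfCached(l deploymentLister, d deploymentDeleter, name string) error {
	err := l.Get(name)
	if errors.Is(err, errNotFound) {
		return nil // already gone according to the cache; skip the API call
	}
	if err != nil {
		return err
	}
	// A NotFound on Delete means someone else removed it first; not an error.
	if err := d.Delete(name); err != nil && !errors.Is(err, errNotFound) {
		return err
	}
	return nil
}

// fakeCluster plays both the cache and the API server for this sketch.
type fakeCluster struct {
	objects     map[string]bool
	deleteCalls int
}

func (f *fakeCluster) Get(name string) error {
	if !f.objects[name] {
		return errNotFound
	}
	return nil
}

func (f *fakeCluster) Delete(name string) error {
	f.deleteCalls++
	if !f.objects[name] {
		return errNotFound
	}
	delete(f.objects, name)
	return nil
}

func main() {
	c := &fakeCluster{objects: map[string]bool{"example-workload": true}}
	// Simulate three sync loops; only the first should reach Delete.
	for i := 0; i < 3; i++ {
		if err := deleteIfCached(c, c, "example-workload"); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Println("delete calls:", c.deleteCalls)
}
```

In this sketch the fake cache is updated immediately; a real informer cache lags behind the API server, which is why the Delete path still tolerates a NotFound result.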

Would it be beneficial to add some unit tests for the new deletion behavior?

Member Author

The unit tests would end up using mocks for most of the stuff that the deletion does; but on second thought it might be beneficial to test the operator status, so I'll add some 👍

Member

It would be good to import and use this change in one of the consumers (e.g. cluster-authentication-operator) and show that the CI is passing there.

@liouk liouk force-pushed the workload-deletion-condition branch from a9ce87e to 520c3d2 December 4, 2024 11:41
Comment on lines 443 to 452
deploymentAvailableCondition = deploymentAvailableCondition.
WithStatus(operatorv1.ConditionFalse).
WithReason("DeletionError")
@everettraven everettraven Dec 4, 2024

Should we include a message in all conditions? I'm not familiar with how these conditions have been constructed in the past, but I have seen logs that say not setting the message in a condition will eventually be fatal:

W1203 04:13:07.671272       1 dynamic_operator_client.go:355] .status.conditions["APIServerDeploymentAvailable"].message is missing; this will eventually be fatal

This was pulled from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.18-ocp-e2e-ovn-remote-libvirt-s390x/1863791759423705088/artifacts/ocp-e2e-ovn-remote-libvirt-s390x/gather-extra/artifacts/pods/openshift-apiserver-operator_openshift-apiserver-operator-dfc988b89-rsh2n_openshift-apiserver-operator.log

Even though it would be a repetitive message in this case, with ^ in mind maybe it is worth it to future-proof this implementation as much as we can for when that does get flipped to fatal?
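
One way to guard against conditions without a message is a small check before the status is applied. The sketch below is only an illustration under stated assumptions: the `condition` struct is a simplified, hypothetical stand-in for the operator condition type, and the condition names and message text are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// condition is a simplified, hypothetical stand-in for an operator condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// conditionsMissingMessage returns the types of all conditions whose
// message is empty or whitespace only.
func conditionsMissingMessage(conds []condition) []string {
	var missing []string
	for _, c := range conds {
		if strings.TrimSpace(c.Message) == "" {
			missing = append(missing, c.Type)
		}
	}
	return missing
}

func main() {
	conds := []condition{
		// No message set: this is the case the warning in the log is about.
		{Type: "ExampleDeploymentAvailable", Status: "False", Reason: "DeletionError"},
		{Type: "ExampleDeploymentDegraded", Status: "False", Reason: "AsExpected", Message: "deployment is healthy"},
	}
	for _, t := range conditionsMissingMessage(conds) {
		fmt.Printf("condition %q has no message\n", t)
	}
}
```

A check like this could run in a unit test over every condition the controller builds, so a missing message is caught before the warning becomes fatal.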

//
// the "deletionConditionFn" will be used to check whether the workload specified by the
// returned name which is part of targetNamespace must be deleted
func NewControllerWithDeletion(instanceName, operatorNamespace, targetNamespace, targetOperandVersion, operandNamePrefix, conditionsPrefix string,
Member

Is there sufficient benefit of creating a 2nd constructor for this? We could just pass the deletion option as nil in the normal use case.

Member

Still applies, I think we do not need any change in the constructor. Let's drive such logic through the delegate.
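
The "pass nil in the normal use case" suggestion from this thread can be sketched as follows. This is a minimal illustration, not the PR's code: `controller`, `newController`, and `sync` are hypothetical names, and only the three-value shape of the deletion function mirrors the `c.deletionConditionFn()` call shown in the diff.

```go
package main

import "fmt"

// deletionConditionFn mirrors the (conditionMet, workload, err) shape of the
// call in the diff; the type name itself is an assumption for this sketch.
type deletionConditionFn func() (conditionMet bool, workloadName string, err error)

// controller is a hypothetical, stripped-down controller.
type controller struct {
	deletionCondition deletionConditionFn // nil means "never delete"
}

// newController is the single constructor; callers that do not need
// deletion simply pass nil.
func newController(deletionCondition deletionConditionFn) *controller {
	return &controller{deletionCondition: deletionCondition}
}

// sync reports which action it would take instead of touching a cluster.
func (c *controller) sync() (string, error) {
	if c.deletionCondition != nil {
		met, name, err := c.deletionCondition()
		if err != nil {
			return "", err
		}
		if met {
			return "delete " + name, nil
		}
	}
	return "reconcile", nil
}

func main() {
	plain := newController(nil)
	withDeletion := newController(func() (bool, string, error) {
		return true, "example-workload", nil
	})

	a, _ := plain.sync()
	b, _ := withDeletion.sync()
	fmt.Println(a)
	fmt.Println(b)
}
```

The follow-up comment in this thread goes further and prefers leaving the constructor unchanged and driving the logic through the delegate; the sketch only covers the nil-option alternative to a second constructor.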

if conditionMet, workload, err := c.deletionConditionFn(); err != nil {
return err
} else if conditionMet {
return c.deleteWorkload(ctx, workload)
Member

Shouldn't the delegate be responsible for the lifecycle of the Deployment? I am not sure if it makes sense to fragment the logic.

Member

This still applies, why do we have to manage the deletion in here?

return c.deleteWorkload(ctx, workload)
}
}

if fulfilled, err := c.delegate.PreconditionFulfilled(ctx); err != nil {
Member

Wouldn't it be better to let the delegate communicate everything, as we do with the preconditions?

@@ -295,6 +295,46 @@ func (cs *APIServerControllerSet) WithWorkloadController(
return cs
}

func (cs *APIServerControllerSet) WithWorkloadControllerWithDeletion(
Member

If we let the delegate control this, we do not need another method.

@@ -356,6 +424,55 @@ func (c *Controller) updateOperatorStatus(ctx context.Context, previousStatus *o
return nil
}

func (c *Controller) deleteWorkload(ctx context.Context, workloadName string) (err error) {
Member

Do we need to split it from the updateOperatorStatus reconciliation logic? It would be nicer to keep the concern in the same place and consider all 4 conditions.

I do not have a problem with calling a util function in updateOperatorStatus if needed though.

Member

still applies, if we need to manage the status/conditions during scale down of the workload, it would be better to do it in a single place (updateOperatorStatus)

if _, getErr := c.deploymentLister.Deployments(c.targetNamespace).Get(workloadName); getErr != nil && !apierrors.IsNotFound(getErr) {
deploymentAvailableCondition = deploymentAvailableCondition.
WithStatus(operatorv1.ConditionFalse).
WithReason("DeletionError")
Member

nit: this isn't a deletion error

Comment on lines 467 to 478
deploymentAvailableCondition = deploymentAvailableCondition.
WithStatus(operatorv1.ConditionTrue).
WithReason("AsExpected")
workloadDegradedCondition = workloadDegradedCondition.
WithStatus(operatorv1.ConditionFalse)
Member

What is the value of preserving these conditions (Available=True, Degraded=False) if no workload exists for them?

Or are we interested in measuring the availability of OIDC provider? Is it possible? And even then this probably isn't the right place to indicate it, right?

@liouk liouk changed the title from "NO-JIRA: Add optional operand deletion condition" to "AUTH-543: Add optional operand deletion condition" Dec 5, 2024
@openshift-ci-robot

openshift-ci-robot commented Dec 5, 2024

@liouk: This pull request references AUTH-543 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.19." or "openshift-4.19.", but it targets "openshift-4.18" instead.

@liouk
Member Author

liouk commented Dec 5, 2024

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Dec 5, 2024

@liouk: This pull request references AUTH-543 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.19." or "openshift-4.19.", but it targets "openshift-4.18" instead.

@liouk liouk force-pushed the workload-deletion-condition branch from 520c3d2 to 8367a93 December 5, 2024 17:58
Contributor

openshift-ci bot commented Dec 5, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liouk
Once this PR has been reviewed and has the lgtm label, please assign jsafrane for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

deploymentDegradedCondition := applyoperatorv1.OperatorCondition().
WithType(fmt.Sprintf("%sDeploymentDegraded", c.conditionsPrefix))
if removeConditions {
jsonPatch := v1helpers.RemoveConditionsJSONPatch(previousStatus, []string{typeAvailable, typeDegraded, typeProgressing, typeWorkloadDegraded})
Member

It seems we cannot use SSA for removing conditions, but I am not sure whether a patch is better than v1helpers.UpdateStatus here. It would be good to get an additional opinion on this.

Member Author

The advantage of the patch here in my opinion is that we're only adding to the patch the specific conditions that we want to remove, so it's more concise -- we won't have to manage the whole status object to perform the update, so maybe it's less prone to mistakes.

What would you think @bertinatto?
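
To make the "only the specific conditions" point concrete, here is a minimal sketch of building a JSON Patch (RFC 6902) that removes selected conditions by type. It is an assumption-laden illustration and not the implementation of `v1helpers.RemoveConditionsJSONPatch`: the `removeConditionsPatch` helper, the `patchOp` type, and the `/status/conditions` path are hypothetical choices for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single RFC 6902 operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value,omitempty"`
}

// removeConditionsPatch builds a patch that removes the given condition types.
// current lists the condition types in the order they appear in the status.
// Each removal is preceded by a "test" op so the patch fails instead of
// removing the wrong entry if the list changed in the meantime.
func removeConditionsPatch(current, remove []string) ([]byte, error) {
	toRemove := make(map[string]bool, len(remove))
	for _, t := range remove {
		toRemove[t] = true
	}

	ops := []patchOp{}
	// Walk backwards so earlier removals never shift the indices of
	// entries that are still to be processed.
	for i := len(current) - 1; i >= 0; i-- {
		if !toRemove[current[i]] {
			continue
		}
		ops = append(ops,
			patchOp{Op: "test", Path: fmt.Sprintf("/status/conditions/%d/type", i), Value: current[i]},
			patchOp{Op: "remove", Path: fmt.Sprintf("/status/conditions/%d", i)},
		)
	}
	return json.Marshal(ops)
}

func main() {
	current := []string{"ExampleAvailable", "UnrelatedCondition", "ExampleDegraded"}
	patch, err := removeConditionsPatch(current, []string{"ExampleAvailable", "ExampleDegraded"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(patch))
}
```

The patch touches only the listed entries and leaves every other condition alone, which is the conciseness argument made above; a full status update would instead have to carry the whole status object.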


deploymentDegradedCondition := applyoperatorv1.OperatorCondition().
WithType(fmt.Sprintf("%sDeploymentDegraded", c.conditionsPrefix))
if removeConditions {
Member

we should check if workload == nil to make sure the delegate has deleted (and not recreated) the workload


deploymentDegradedCondition := applyoperatorv1.OperatorCondition().
WithType(fmt.Sprintf("%sDeploymentDegraded", c.conditionsPrefix))
if removeConditions {
Member

Even though the workload should be gone and we should not get any errors, we should at least log them if they are passed on by the delegate.

deploymentAvailableCondition := applyoperatorv1.OperatorCondition().WithType(typeAvailable)
workloadDegradedCondition := applyoperatorv1.OperatorCondition().WithType(typeWorkloadDegraded)
deploymentDegradedCondition := applyoperatorv1.OperatorCondition().WithType(typeDegraded)
deploymentProgressingCondition := applyoperatorv1.OperatorCondition().WithType(typeProgressing)

Member

It might make more sense to still consider preconditions even when the workload will be deleted later. Thoughts?

Member Author

Given that the delete would happen during the delegate's sync, we shouldn't normally reach the point of removing the conditions if preconditions are failing. But you are right, we should safe-guard against this -- I'll add a check before removing conditions.
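
The safeguards discussed in these review threads (the workload must be nil, preconditions must hold, and any delegate errors should at least be logged) could be combined into one check before the conditions are removed. The sketch below is hypothetical: `workload`, `shouldRemoveConditions`, and its parameters are names chosen for illustration, not code from the PR.

```go
package main

import (
	"errors"
	"log"
)

// workload is a hypothetical stand-in for the deployment returned by the delegate.
type workload struct {
	Name string
}

// shouldRemoveConditions decides whether the controller's conditions may be
// removed. Errors from the delegate are logged but do not change the decision.
func shouldRemoveConditions(removeRequested, preconditionsReady bool, w *workload, errs []error) bool {
	for _, err := range errs {
		log.Printf("delegate reported an error although the workload was expected to be gone: %v", err)
	}
	// Only remove when the delegate asked for it, preconditions hold, and
	// the workload has really been deleted (and not recreated).
	return removeRequested && preconditionsReady && w == nil
}

func main() {
	log.Println(shouldRemoveConditions(true, true, nil, nil))
	log.Println(shouldRemoveConditions(true, true, &workload{Name: "example-workload"}, nil))
	log.Println(shouldRemoveConditions(true, false, nil, []error{errors.New("example error")}))
}
```

Keeping the decision in one small function makes it straightforward to cover in the unit tests requested earlier in the review.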

@liouk liouk force-pushed the workload-deletion-condition branch from 8367a93 to f07c37b December 10, 2024 10:52
Contributor

openshift-ci bot commented Dec 10, 2024

@liouk: all tests passed!

liouk added a commit to liouk/cluster-authentication-operator that referenced this pull request Dec 10, 2024
@liouk
Member Author

liouk commented Dec 10, 2024
