OrphanMitigation condition and different handling of retry timeout #1789

nilebox · 2018-03-05T03:44:56Z

Switching from the OrphanMitigationInProgress boolean flag to an OrphanMitigation condition is mostly a cosmetic change: it just lets us attach a human-readable reason and message to the boolean flag.

The more important change is that now the OrphanMitigation condition gets reset only after we have successfully completed the orphan mitigation.
I.e. even if we exceed the retry timeout, the OrphanMitigation condition remains set to True.
This will allow us to properly support retries in #1765.
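
A minimal sketch of the behavior described above, using simplified stand-in types rather than the real v1beta1 API. Clearing the current operation (for example after the retry timeout is exceeded) leaves the orphan mitigation marker in place, and only a successful mitigation removes it. The names `instanceStatus`, `clearCurrentOperation`, and `finishOrphanMitigation` are hypothetical.

```go
package main

import "fmt"

// instanceStatus is a hypothetical, reduced stand-in for the status of a
// ServiceInstance; it is not the real v1beta1.ServiceInstanceStatus.
type instanceStatus struct {
	CurrentOperation           string
	AsyncOpInProgress          bool
	OrphanMitigationInProgress bool
	// Conditions maps a condition type to a human-readable reason.
	Conditions map[string]string
}

const orphanMitigation = "OrphanMitigation"

// clearCurrentOperation resets per-operation bookkeeping. It deliberately
// leaves the orphan mitigation flag and condition untouched.
func clearCurrentOperation(s *instanceStatus) {
	s.CurrentOperation = ""
	s.AsyncOpInProgress = false
}

// finishOrphanMitigation is the only place in this sketch where the flag and
// the condition are reset, and it is meant to run only after success.
func finishOrphanMitigation(s *instanceStatus) {
	clearCurrentOperation(s)
	s.OrphanMitigationInProgress = false
	delete(s.Conditions, orphanMitigation)
}

func main() {
	s := &instanceStatus{
		CurrentOperation:           "Provision",
		OrphanMitigationInProgress: true,
		Conditions:                 map[string]string{orphanMitigation: "ProvisionCallFailed"},
	}
	clearCurrentOperation(s) // e.g. the retry timeout was exceeded
	fmt.Println("after timeout:", s.OrphanMitigationInProgress, len(s.Conditions))
	finishOrphanMitigation(s)
	fmt.Println("after success:", s.OrphanMitigationInProgress, len(s.Conditions))
}
```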

nilebox · 2018-03-05T03:50:21Z

pkg/controller/controller_instance.go

@@ -1238,7 +1300,6 @@ func clearServiceInstanceCurrentOperation(toUpdate *v1beta1.ServiceInstance) {
 	toUpdate.Status.CurrentOperation = ""
 	toUpdate.Status.OperationStartTime = nil
 	toUpdate.Status.AsyncOpInProgress = false
-	toUpdate.Status.OrphanMitigationInProgress = false


Important change 1: Don't reset the orphan mitigation flag automatically anymore. Only explicitly reset it after we have successfully finished orphan mitigation.

One thing to note is that there is no way to recover if the user changes the plan on the ServiceInstance while orphan mitigation is pending. The plan ref is cleared by the API server when the plan name is changed. The delete reconciliation process does not resolve plan refs (nor should it). We need some way of storing the plan that was sent in the unsuccessful provision request to make sure that we use that same plan in the orphan mitigation.

@staebler the quick fix could be a part of #1790:

1. instead of removing a condition completely, check if the plan has changed.

2. add a check for instance.Status.OrphanMitigationInProgress and reject plan changes.

In the long term it would probably be better to store the plan along with the "in progress properties" in the status (and not erase them until orphan mitigation has succeeded).

I don't understand (1).

For (2), that is not sufficient. There is still a window where a plan change could get in and ruin orphan mitigation.

Consider a ServiceInstance with an in-flight Provision request.

1. User changes plan on ServiceInstance. Orphan mitigation is not in progress, so plan change is accepted.

2. Provision request fails in a way that requires orphan mitigation.

3. Controller starts orphan mitigation.

4. Orphan mitigation reconciliation fails perpetually since there is no plan to use.

User changes plan on ServiceInstance. Orphan mitigation is not in progress, so plan change is accepted

Currently any spec updates (including plan changes) are rejected if there is an operation in progress.
While I think we should stop blocking parameter updates (see #1755 and #1790), we might still reject plan updates while there is an operation or orphan mitigation in progress (as this is a rare use case and hence not as annoying a UX as blocking parameter updates).
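
The validation rule proposed in this reply could look roughly like the sketch below: parameter updates pass, while plan changes are rejected as long as an operation or orphan mitigation is in progress. The `spec` and `status` types and the `validateUpdate` function are hypothetical simplifications, not the API server's actual validation code.

```go
package main

import (
	"errors"
	"fmt"
)

// spec and status are hypothetical, reduced views of a ServiceInstance.
type spec struct {
	PlanName   string
	Parameters string
}

type status struct {
	CurrentOperation           string
	OrphanMitigationInProgress bool
}

var errPlanLocked = errors.New("plan cannot change while an operation or orphan mitigation is in progress")

// validateUpdate rejects plan changes while the instance is busy, but lets
// every other spec change (such as parameters) through.
func validateUpdate(oldSpec, newSpec spec, st status) error {
	busy := st.CurrentOperation != "" || st.OrphanMitigationInProgress
	if busy && oldSpec.PlanName != newSpec.PlanName {
		return errPlanLocked
	}
	return nil
}

func main() {
	st := status{OrphanMitigationInProgress: true}
	err := validateUpdate(spec{PlanName: "small"}, spec{PlanName: "large"}, st)
	fmt.Println("plan change while mitigating:", err)
}
```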

Sorry, I see the problem now.
For normal deletion (after instance was successfully provisioned), we already read plan from ExternalProperties:
controller_instance.go#L1419-L1423
For orphan mitigation, I think we can read it from InProgressProperties, we just need to stop overwriting them with nil in controller_instance.go#L565
i.e. we can just pass existing InProgressProperties there:

instance, err = c.recordStartOfServiceInstanceOperation(instance, v1beta1.ServiceInstanceOperationDeprovision, instance.Status.InProgressProperties)

@staebler Would it solve the problem?

@staebler please review #1803 that should address this problem.
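
The idea discussed in this thread can be sketched as follows: when building a deprovision request, take the plan from the in-progress properties during orphan mitigation and from the external properties during normal deletion. The types and the `planForDeprovision` helper are hypothetical simplifications and do not reproduce the code from #1803.

```go
package main

import (
	"errors"
	"fmt"
)

// propertiesState is a hypothetical, reduced version of the properties
// snapshot stored in the instance status.
type propertiesState struct {
	PlanID string
}

type status struct {
	OrphanMitigationInProgress bool
	InProgressProperties       *propertiesState // sent in the last request
	ExternalProperties         *propertiesState // last state accepted by the broker
}

// planForDeprovision picks the plan to send in a deprovision request.
// During orphan mitigation it reads the plan of the failed request; for a
// normal deletion it reads the plan the broker already knows about.
func planForDeprovision(s status) (string, error) {
	src := s.ExternalProperties
	if s.OrphanMitigationInProgress {
		src = s.InProgressProperties
	}
	if src == nil {
		return "", errors.New("no stored properties to read the plan from")
	}
	return src.PlanID, nil
}

func main() {
	s := status{
		OrphanMitigationInProgress: true,
		InProgressProperties:       &propertiesState{PlanID: "plan-a"},
	}
	plan, err := planForDeprovision(s)
	fmt.Println(plan, err)
}
```

Returning an error when no snapshot exists makes the failure mode from the scenario above visible instead of sending a request without a plan.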

nilebox · 2018-03-05T03:50:54Z

pkg/controller/controller_instance.go


 	reason := successDeprovisionReason
 	msg := successDeprovisionMessage
 	if mitigatingOrphan {
+		removeServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionOrphanMitigation)


Important change 2: Explicitly reset the orphan mitigation condition (and flag) after we have successfully finished orphan mitigation.

staebler · 2018-03-05T20:04:00Z

pkg/apis/servicecatalog/types.go

@@ -519,6 +519,7 @@ type ServiceInstanceStatus struct {

 	// OrphanMitigationInProgress is set to true if there is an ongoing orphan
 	// mitigation operation against this ServiceInstance in progress.
+	// Deprecated: Use OrphanMitigation condition instead.


I don't think that we should deprecate OrphanMitigationInProgress. We should continue to use that field as the primary means of determining whether an orphan mitigation is in progress. The condition is added to store the reason/message and not to supersede the field.

For supporting arguments, see kubernetes/kubernetes#7856 (comment).

@staebler initially I planned to do the same but then I realized there is little value in that:

if OrphanMitigationInProgress is true, the OrphanMitigation condition is always set.

as soon as the OrphanMitigation condition is seen by the controller, OrphanMitigationInProgress will be set to true as well, and will be reset only in the case of a retry timeout (a relatively rare case).

For Async operations we already have AsyncOpInProgress.
Did I miss something?

@nilebox The kube preferred way is to use a field. I don't see a compelling reason here to go against the kube preferred way. The question then becomes whether there is value in having a condition for orphan mitigation. The value I see there is having a place to store the reason and message for why orphan mitigation is occurring. If that reason and message is generic, then there is no value in having an orphan mitigation condition. There is still value in not wiping out the reason and message from the Ready and Failed conditions with generic orphan mitigation reasons and messages.

@staebler ok. do we even need to reset the OrphanMitigationInProgress in case of exceeding retry limit then, or just keep the flag and condition always in sync?

@nilebox Definitely keep OrphanMitigationInProgress in sync with the condition. I am good with the change that you have as far as keeping OrphanMitigationInProgress set true when the retry limit is exceeded.

staebler · 2018-03-05T20:08:43Z

pkg/controller/controller_instance.go

-		readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, startingInstanceOrphanMitigationReason, startingInstanceOrphanMitigationMessage)
-		c.recorder.Event(instance, corev1.EventTypeWarning, readyCond.Reason, readyCond.Message)
-		setServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionReady, readyCond.Status, readyCond.Reason, readyCond.Message)
+		reason := startingInstanceOrphanMitigationReason


The reason and message should be more informative than this. It should come from the failed or ready condition. If we are only going to use a generic reason/message, then there is no point in using a condition at all.

staebler · 2018-03-05T20:19:34Z

pkg/controller/controller_instance.go

@@ -1238,7 +1300,6 @@ func clearServiceInstanceCurrentOperation(toUpdate *v1beta1.ServiceInstance) {
 	toUpdate.Status.CurrentOperation = ""
 	toUpdate.Status.OperationStartTime = nil
 	toUpdate.Status.AsyncOpInProgress = false
-	toUpdate.Status.OrphanMitigationInProgress = false


One thing to note is that there is no way to recover if the user changes the plan on the ServiceInstance while orphan mitigation is pending. The plan ref is cleared by the API server when the plan name is changed. The delete reconciliation process does not resolve plan refs (nor should it). We need some way of storing the plan that was sent in the unsuccessful provision request to make sure that we use that same plan in the orphan mitigation.

nilebox · 2018-03-06T00:08:36Z

pkg/controller/controller_instance.go

@@ -1539,9 +1608,19 @@ func (c *controller) processProvisionFailure(instance *v1beta1.ServiceInstance,
 	// requeue this resource.
 	var err error
 	if shouldMitigateOrphan {
+		// Copy failure reason/message to a new OrphanMitigation condition


@staebler please check below if that's what you had in mind for preserving reason/message in OrphanMitigation condition (as well as overwriting the Ready condition)

@nilebox Yep, that is what I had in mind.

As an aside, I am not sure why we are creating orphanMitigationCond and readyCond just to read the status, reason, and message from the conditions.
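
A rough sketch of what this thread agrees on: the failure reason and message are copied into the OrphanMitigation condition, while the Ready condition receives a generic text that later errors may overwrite. The `condition` type, the helpers, and the generic reason and message strings are placeholders, since the real constants are not shown in this thread.

```go
package main

import "fmt"

// condition is a hypothetical, reduced stand-in for a ServiceInstance condition.
type condition struct {
	Type, Status, Reason, Message string
}

const (
	condReady            = "Ready"
	condOrphanMitigation = "OrphanMitigation"
	// Placeholder texts, not the controller's real constants.
	genericReason  = "StartingOrphanMitigation"
	genericMessage = "starting orphan mitigation"
)

// setCondition replaces the condition of the same type, or appends it.
func setCondition(conds []condition, c condition) []condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// findCondition returns the condition with the given type, if present.
func findCondition(conds []condition, t string) (condition, bool) {
	for _, c := range conds {
		if c.Type == t {
			return c, true
		}
	}
	return condition{}, false
}

// startOrphanMitigation preserves the original failure in the
// OrphanMitigation condition and puts a generic text on Ready.
func startOrphanMitigation(conds []condition, failReason, failMessage string) []condition {
	conds = setCondition(conds, condition{Type: condOrphanMitigation, Status: "True", Reason: failReason, Message: failMessage})
	return setCondition(conds, condition{Type: condReady, Status: "False", Reason: genericReason, Message: genericMessage})
}

func main() {
	conds := startOrphanMitigation(nil, "ProvisionCallFailed", "broker returned an error")
	// A later error overwrites Ready but leaves OrphanMitigation untouched.
	conds = setCondition(conds, condition{Type: condReady, Status: "False", Reason: "DeprovisionCallFailed", Message: "retrying"})
	om, _ := findCondition(conds, condOrphanMitigation)
	fmt.Println(om.Reason, om.Message)
}
```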

nilebox · 2018-03-06T01:21:52Z

pkg/controller/controller_instance.go

@@ -496,6 +527,8 @@ func (c *controller) reconcileServiceInstanceUpdate(instance *v1beta1.ServiceIns
 func (c *controller) reconcileServiceInstanceDelete(instance *v1beta1.ServiceInstance) error {
 	// nothing to do...
 	if instance.DeletionTimestamp == nil && !instance.Status.OrphanMitigationInProgress {
+		// TODO nilebox: shouldn't we throw an error instead?


What do people think?

nil return is success, yes?

Don't know what error to return. Perhaps a log statement, for the record of it happening.

Currently it is success, yes, or rather "nothing to do".
But we should never end up in this condition being true IMO.

nilebox · 2018-03-09T12:23:26Z

@staebler addressed your comments and resolved conflicts, please review.

nilebox · 2018-03-19T08:03:17Z

@staebler labeling with LGTM1 since you approved the PR.

MHBauer

some thoughts

Do we need to adjust the comment of OrphanMitigationInProgress as deprecated?

We see the old bool, transition it to the new condition, and remove all processing based on the old bool, besides that necessary for the transition to the condition.

MHBauer · 2018-03-19T22:43:07Z

pkg/controller/controller_instance.go

@@ -278,6 +286,29 @@ func (c *controller) initObservedGeneration(instance *v1beta1.ServiceInstance) (
 	return false, nil
 }

+// initOrphanMitigationCondition implements OrphanMitigation condition initialization
+// based on OrphanMitigationInProgress field for status API migration.


this code is allowing update from old versions of controller before this change to new versions after, and uses the status field OrphanMitigationInProgress to do so, yes?
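
The migration asked about here might be sketched like this: an object written by an older controller has the flag set but no condition, so the condition is backfilled once. The `status` type and the `backfillOrphanMitigationCondition` function are hypothetical; the body of the real initOrphanMitigationCondition is not shown in this thread.

```go
package main

import "fmt"

// status is a hypothetical, reduced stand-in for the instance status.
type status struct {
	OrphanMitigationInProgress bool
	// Conditions maps a condition type to its status ("True"/"False").
	Conditions map[string]string
}

const condOrphanMitigation = "OrphanMitigation"

// backfillOrphanMitigationCondition adds the OrphanMitigation condition when
// only the legacy flag is set. It reports whether the status was modified,
// so the caller knows whether an update needs to be written back.
func backfillOrphanMitigationCondition(s *status) bool {
	if !s.OrphanMitigationInProgress {
		return false
	}
	if _, ok := s.Conditions[condOrphanMitigation]; ok {
		return false
	}
	if s.Conditions == nil {
		s.Conditions = map[string]string{}
	}
	s.Conditions[condOrphanMitigation] = "True"
	return true
}

func main() {
	old := &status{OrphanMitigationInProgress: true} // written by an older controller
	fmt.Println("modified:", backfillOrphanMitigationCondition(old))
	fmt.Println("modified again:", backfillOrphanMitigationCondition(old))
}
```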

MHBauer · 2018-03-19T22:46:07Z

pkg/controller/controller_instance.go

@@ -496,6 +527,8 @@ func (c *controller) reconcileServiceInstanceUpdate(instance *v1beta1.ServiceIns
 func (c *controller) reconcileServiceInstanceDelete(instance *v1beta1.ServiceInstance) error {
 	// nothing to do...
 	if instance.DeletionTimestamp == nil && !instance.Status.OrphanMitigationInProgress {
+		// TODO nilebox: shouldn't we throw an error instead?


nil return is success, yes?

Don't know what error to return. Perhaps a log statement, for the record of it happening.

MHBauer · 2018-03-19T22:52:57Z

pkg/controller/controller_instance.go

+		setServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionReady,
+			v1beta1.ConditionFalse,
+			startingInstanceOrphanMitigationReason,
+			startingInstanceOrphanMitigationMessage)

 		instance.Status.OperationStartTime = nil
 		instance.Status.AsyncOpInProgress = false


the next line here is instance.Status.OrphanMitigationInProgress = true.

should we be setting this at all after this goes in?

OrphanMitigationInProgress is just a flag that means "Orphan mitigation is required" (bad property name, but can't change it since it's part of the API), i.e. this flag doesn't say whether the orphan mitigation has actually started or not.

MHBauer · 2018-03-19T22:54:12Z

pkg/controller/controller_instance_test.go

@@ -3846,7 +3848,7 @@ func TestReconcileServiceInstanceOrphanMitigation(t *testing.T) {
 				},
 			},
 			async: true,
-			finishedOrphanMitigation:     true,
+			finishedOrphanMitigation:     false,


seems weird that all of these failure test cases had this boolean set to true.

🤷‍♂️ because there was a bug in the behavior. As written in the PR description: "The more important change is that now the OrphanMitigation condition gets reset only after we have successfully completed the orphan mitigation."

MHBauer · 2018-03-19T23:00:04Z

pkg/controller/controller_instance.go

+				// from the normal deletion
+				removeServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionOrphanMitigation)
+				instance.Status.OrphanMitigationInProgress = false
+			}


I don't understand.

we're in the delete flow, with a timestamp, and not in deprovisioning state,
the OM bool is set, ( ? so the source object is set from previous controller ? )

@MHBauer this is the case when instance.DeletionTimestamp != nil i.e. the instance is marked for deletion.
Having instance.Status.OrphanMitigationInProgress = true means that before the user deleted the instance, we had marked it for orphan mitigation but had not finished it yet.
Given that deletion has the highest priority over all other operations, keeping the orphan mitigation information is redundant at this point, so we clear it.

nilebox · 2018-03-20T01:43:37Z

Do we need to adjust the comment of OrphanMitigationInProgress as deprecated?

@MHBauer see @staebler's comments. I initially marked OrphanMitigationInProgress as deprecated, but he linked some Kubernetes issue where using a single boolean is preferable for "control logic".
In other words, the new OrphanMitigation condition is used mostly to show the original error reason/message to the user, and we continue using OrphanMitigationInProgress for all control logic.

nilebox · 2018-03-20T02:37:52Z

nil return is success, yes?

@MHBauer I just removed this check, given that the only place for invoking reconcileServiceInstanceDelete is

	case instance.ObjectMeta.DeletionTimestamp != nil || instance.Status.OrphanMitigationInProgress:
		return reconcileDelete

so this check is redundant.
Also we don't have such checks for reconcileServiceInstanceAdd and reconcileServiceInstanceUpdate.
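
To illustrate why the check is redundant, here is a minimal sketch of a dispatcher built around the `case` quoted above. Only the delete branch comes from this thread; the `instance` type, the `Provisioned` field, and the add/update split are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

type reconcileAction string

const (
	reconcileAdd    reconcileAction = "add"
	reconcileUpdate reconcileAction = "update"
	reconcileDelete reconcileAction = "delete"
)

// instance is a hypothetical, reduced stand-in for a ServiceInstance.
type instance struct {
	DeletionTimestamp          *time.Time
	OrphanMitigationInProgress bool
	Provisioned                bool // assumption: how add and update are told apart
}

// actionFor chooses the reconciliation path. The delete path is reachable
// only when the instance is marked for deletion or needs orphan mitigation,
// so the delete handler does not have to re-check that precondition.
func actionFor(i instance) reconcileAction {
	switch {
	case i.DeletionTimestamp != nil || i.OrphanMitigationInProgress:
		return reconcileDelete
	case i.Provisioned:
		return reconcileUpdate
	default:
		return reconcileAdd
	}
}

// withDeletion returns a copy of i that is marked for deletion now.
func withDeletion(i instance) instance {
	now := time.Now()
	i.DeletionTimestamp = &now
	return i
}

func main() {
	fmt.Println(actionFor(instance{OrphanMitigationInProgress: true}))
	fmt.Println(actionFor(withDeletion(instance{Provisioned: true})))
	fmt.Println(actionFor(instance{}))
}
```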

MHBauer

I don't like the thought of "add a new bool" every time we have a state.

I see the guidance from upstream, and I either don't agree with it, or I don't see the full picture.

If there were more of these flags, how could we be in more than one state?

What is the point of adding a condition, if we're not going to use it as state?

If we're past the point of strictly enforcing a state machine and state transitions, I don't even know where to go.

The code looks fine, I'm not sure what the final answer is.

nilebox · 2018-03-21T01:39:24Z

What is the point of adding a condition, if we're not going to use it as state?

As I said already, it's useful for communicating to the user the original error that led to orphan mitigation. The reason/message in the Ready condition will be overwritten with the next error occurring at the next iteration, but the orphan mitigation condition will remain untouched until we successfully mitigate.

As for whether we want to mark the OrphanMitigationInProgress flag as deprecated or keep it - you can argue with @staebler, but it doesn't change the behavior, so I think we can merge it as it is, and discuss a possible deprecation as a follow-up.

MHBauer

Ya, sure. I cannot think of anything I want changed.

LGTM

k8s-ci-robot added the cncf-cla: yes and size/L labels Mar 5, 2018

nilebox commented Mar 5, 2018

nilebox changed the title ~~WIP: OrphanMitigation condition and different handling of retry timeout~~ OrphanMitigation condition and different handling of retry timeout Mar 5, 2018

nilebox assigned kibbles-n-bytes and staebler Mar 5, 2018

staebler reviewed Mar 5, 2018

nilebox commented Mar 6, 2018

nilebox mentioned this pull request Mar 6, 2018

Pass correct plan ID in deprovision request (for both deleting and orphan mitigation) #1803

Merged

nilebox added the non-happy-path label Mar 7, 2018

staebler approved these changes Mar 14, 2018

nilebox added the LGTM1 label Mar 19, 2018

Nail Islamov added 6 commits March 19, 2018 19:34

Introduce OrphanMitigation condition and change the behavior of handling orphan mitigation with timeout

4b7ecb1

Fix tests

92326f9

Add test for OrphanMitigation condition data migration

9c3e55a

Undeprecated OrphanMitigationInProgress flag + fix behavior and tests

3bb5426

Tests cleanup

81d19ac

Fix review comments

ed1133a

MHBauer reviewed Mar 19, 2018

Remove redundant check in reconcileServiceInstanceDelete

a788aa3

MHBauer reviewed Mar 21, 2018

MHBauer approved these changes Mar 21, 2018

MHBauer added the LGTM2 label Mar 21, 2018

nilebox merged commit 6a59ada into kubernetes-retired:master Mar 21, 2018

kibbles-n-bytes mentioned this pull request Mar 23, 2018

4xx, 5xx and Connection timeout should be retriable (not terminal errors) #1765

Merged

cblecker unassigned kibbles-n-bytes and staebler Jun 4, 2019
