Skip to content
This repository has been archived by the owner on May 6, 2022. It is now read-only.

OrphanMitigation condition and different handling of retry timeout #1789

Merged
merged 7 commits into from
Mar 21, 2018
Merged

OrphanMitigation condition and different handling of retry timeout #1789

merged 7 commits into from
Mar 21, 2018

Conversation

nilebox
Copy link
Contributor

@nilebox nilebox commented Mar 5, 2018

Fixes #1771

Switching from OrphanMitigationInProgress boolean flag to OrphanMitigation condition is actually more of a cosmetic change, it just allows us to have a human-readable reason and message along with the boolean flag.

The more important change is that now the OrphanMitigation condition gets reset only after we have successfully completed the orphan mitigation.
I.e. even we exceed the retry timeout, the OrphanMitigation remains set to True.
This will allow us to properly support retries in #1765.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2018
@@ -1238,7 +1300,6 @@ func clearServiceInstanceCurrentOperation(toUpdate *v1beta1.ServiceInstance) {
toUpdate.Status.CurrentOperation = ""
toUpdate.Status.OperationStartTime = nil
toUpdate.Status.AsyncOpInProgress = false
toUpdate.Status.OrphanMitigationInProgress = false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important change 1: Don't reset the orphan mitigation flag automatically anymore. Only explicitly reset it after we have successfully finished orphan mitigation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to note is that there is no way to recover if the user changes the plan on the ServiceInstance while orphan mitigation is pending. The plan ref is cleared by the API server when the plan name is changed. The delete reconciliation process does not resolve plan refs (nor should it). We need some way of storing the plan that was sent in the unsuccessful provision request to make sure that we use that same plan in the orphan mitigation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@staebler the quick fix could be a part of #1790

  1. instead of removing a condition completely, check if the plan has changed.
  2. add a check for instance.Status.OrphanMitigationInProgress and reject plan changes.

in the long term it would probably better to store the plan along with "in progress properties" in the status (and don't erase them until orphan mitigation has succeeded).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand (1).

For (2), that is not sufficient. There is still a window where a plan change could get in and ruin orphan mitigation.

  1. Consider a ServiceInstance with an in-flight Provision request.
  2. User changes plan on ServiceInstance. Orphan mitigation is not in progress, so plan change is accepted.
  3. Provision request fails in a way that requires orphan mitigation.
  4. Controller starts orphan mitigation.
  5. Orphan mitigation reconciliation fails perpetually since there is no plan to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. User changes plan on ServiceInstance. Orphan mitigation is not in progress, so plan change is accepted

Currently any spec updates (inculding plan changes) are rejected if there is an operation in progress.
While I think we should stop blocking parameter updates (see #1755 and #1790), we might still reject plan updates while there is an operation or orphan mitigation in progress (as this is a rare usecase and hence not as annoying UX as blocking parameter updates).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I see the problem now.
For normal deletion (after instance was successfully provisioned), we already read plan from ExternalProperties:
controller_instance.go#L1419-L1423
For orphan mitigation, I think we can read it from InProgressProperties, just need stop overwriting them with nil in controller_instance.go#L565
i.e. we can just pass existing InProgressProperties there:

instance, err = c.recordStartOfServiceInstanceOperation(instance, v1beta1.ServiceInstanceOperationDeprovision, instance.Status.InProgressProperties)

@staebler Would it solve the problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@staebler please review #1803 that should address this problem.


reason := successDeprovisionReason
msg := successDeprovisionMessage
if mitigatingOrphan {
removeServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionOrphanMitigation)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important change 2: Explicitly reset the orphan mitigation condition (and flag) after we have successfully finished orphan mitigation.

@nilebox nilebox changed the title WIP: OrphanMitigation condition and different handling of retry timeout OrphanMitigation condition and different handling of retry timeout Mar 5, 2018
@@ -519,6 +519,7 @@ type ServiceInstanceStatus struct {

// OrphanMitigationInProgress is set to true if there is an ongoing orphan
// mitigation operation against this ServiceInstance in progress.
// Deprecated: Use OrphanMitigation condition instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we should deprecate OprhanMitigationInProgress. We should continue to use that field as the primary means of determining whether an orphan mitigation is in progress. The condition is added to store the reason/message and not to supersede the field.

For supporting arguments, see kubernetes/kubernetes#7856 (comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@staebler initially I planned to do the same but then I realized there is little value of that:

  • if OrphanMitigationInProgress is true, the OrphanMitigation condition is always set.
  • as soon as OrphanMitigation condition is seen by controller, OrphanMitigationInProgress will be set to true as well, and will be reset only in case if retry timeout (a relatively rare case).

For Async operations we already have AsyncOpInProgress.
Did I miss something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nilebox The kube preferred way is to use a field. I don't see a co.pelling reason here to go against the kube preferred way. The question then becomes whether there is value in having a condition for orphan mitigation. The value I see there is having a place to store the reason and message for why orphan mitigation is occurring. If that reason and message is generic, then there is no value in having an orphan mitigation condition. There is still value in not wiping out the reason and message from the Ready and Failed conditions with generic orphan mitigation reasons and messages.

Copy link
Contributor Author

@nilebox nilebox Mar 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@staebler ok. do we even need to reset the OrphanMitigationInProgress in case of exceeding retry limit then, or just keep the flag and condition always in sync?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nilebox Definitely keep OrphanMitigationInProgress in sync with the condition. I am good with the change that you have as far as keeping OrphanMitigationInProgress set true when the retry limit is exceeded.

readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, startingInstanceOrphanMitigationReason, startingInstanceOrphanMitigationMessage)
c.recorder.Event(instance, corev1.EventTypeWarning, readyCond.Reason, readyCond.Message)
setServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionReady, readyCond.Status, readyCond.Reason, readyCond.Message)
reason := startingInstanceOrphanMitigationReason
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason and message should be more informative than this. It should come from the failed or ready condition. If we are only going to use a generic reason/message, then there is no point in using a condition at all.

@@ -1238,7 +1300,6 @@ func clearServiceInstanceCurrentOperation(toUpdate *v1beta1.ServiceInstance) {
toUpdate.Status.CurrentOperation = ""
toUpdate.Status.OperationStartTime = nil
toUpdate.Status.AsyncOpInProgress = false
toUpdate.Status.OrphanMitigationInProgress = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to note is that there is no way to recover if the user changes the plan on the ServiceInstance while orphan mitigation is pending. The plan ref is cleared by the API server when the plan name is changed. The delete reconciliation process does not resolve plan refs (nor should it). We need some way of storing the plan that was sent in the unsuccessful provision request to make sure that we use that same plan in the orphan mitigation.

@@ -1539,9 +1608,19 @@ func (c *controller) processProvisionFailure(instance *v1beta1.ServiceInstance,
// requeue this resource.
var err error
if shouldMitigateOrphan {
// Copy failure reason/message to a new OrphanMitigation condition
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@staebler please check below if that's what you had in mind for preserving reason/message in OrphanMitigation condition (as well as overwriting the Ready condition)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nilebox Yep, that is what I had in mind.

As an aside, I am not sure why we are creating orphanMitigationCond and readyCond just to read the status, reason, and message from the conditions.

@@ -496,6 +527,8 @@ func (c *controller) reconcileServiceInstanceUpdate(instance *v1beta1.ServiceIns
func (c *controller) reconcileServiceInstanceDelete(instance *v1beta1.ServiceInstance) error {
// nothing to do...
if instance.DeletionTimestamp == nil && !instance.Status.OrphanMitigationInProgress {
// TODO nilebox: shouldn't we throw an error instead?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do people think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nil return is success, yes?

Don't know what error to return. Perhaps a log statement, for the record of it happening.

Copy link
Contributor Author

@nilebox nilebox Mar 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it is success, yes, or rather "nothing to do".
But we should never end up in this condition being true IMO.

@nilebox
Copy link
Contributor Author

nilebox commented Mar 9, 2018

@staebler addressed your comments and resolved conflicts, please review.

@nilebox nilebox added the LGTM1 label Mar 19, 2018
@nilebox
Copy link
Contributor Author

nilebox commented Mar 19, 2018

@staebler labeling with LGTM1since you approved the PR.

Copy link
Contributor

@MHBauer MHBauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some thoughts

Do we need to adjust the comment of OrphanMitigationInProgress as deprecated?

We see the old bool, transition it to the new condition, and remove all processing based on the old bool, besides that necessary for the transition to the condition.

@@ -278,6 +286,29 @@ func (c *controller) initObservedGeneration(instance *v1beta1.ServiceInstance) (
return false, nil
}

// initOrphanMitigationCondition implements OrphanMitigation condition initialization
// based on OrphanMitigationInProgress field for status API migration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is allowing update from old versions of controller before this change to new versions after, and uses the status field OrphanMitigationInProgress to do so, yes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

@@ -496,6 +527,8 @@ func (c *controller) reconcileServiceInstanceUpdate(instance *v1beta1.ServiceIns
func (c *controller) reconcileServiceInstanceDelete(instance *v1beta1.ServiceInstance) error {
// nothing to do...
if instance.DeletionTimestamp == nil && !instance.Status.OrphanMitigationInProgress {
// TODO nilebox: shouldn't we throw an error instead?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nil return is success, yes?

Don't know what error to return. Perhaps a log statement, for the record of it happening.

setServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionReady,
v1beta1.ConditionFalse,
startingInstanceOrphanMitigationReason,
startingInstanceOrphanMitigationMessage)

instance.Status.OperationStartTime = nil
instance.Status.AsyncOpInProgress = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the next line here is instance.Status.OrphanMitigationInProgress = true.

should we be setting this at all after this goes in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrphanMitigationInProgress is just a flag that means "Orphan mitigation is required" (bad property name, but can't change it since it's part of the API), i.e. this flag doesn't say whether the orphan mitigation has actually started or not.

@@ -3846,7 +3848,7 @@ func TestReconcileServiceInstanceOrphanMitigation(t *testing.T) {
},
},
async: true,
finishedOrphanMitigation: true,
finishedOrphanMitigation: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems weird that all of these failure test cases had this boolean set to true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷‍♂️ because there was a bug in the behavior. As written in the description to PR: "The more important change is that now the OrphanMitigation condition gets reset only after we have successfully completed the orphan mitigation."

// from the normal deletion
removeServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionOrphanMitigation)
instance.Status.OrphanMitigationInProgress = false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand.

we're in the delete flow, with a timestamp, and not in deprovisioning state,
the OM bool is set, ( ? so the source object is set from previous controller ? )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MHBauer this is the case when instance.DeletionTimestamp != nil i.e. the instance is marked for deletion.
Having instance.Status.OrphanMitigationInProgress = true means that before user deleted an instance, we have marked it for orphan mitigation but haven't finished yet.
Given that deletion has the highest priority over all other operations, keeping the orphan mitigation information is redundant at this point, so we clear it.

@nilebox
Copy link
Contributor Author

nilebox commented Mar 20, 2018

Do we need to adjust the comment of OrphanMitigationInProgress as deprecated?

@MHBauer see @staebler's comments. I initially marked OrphanMitigationInProgress as deprecated, but he linked some Kubernetes issue where using a single boolean is preferable for "control logic".
In other words, the new OrphanMitigation condition is used mostly to show the original error reason/message to the user, and we continue using OrphanMitigationInProgress for all control logic.

@nilebox
Copy link
Contributor Author

nilebox commented Mar 20, 2018

nil return is success, yes?

@MHBauer I just removed this check, given that the only place for invoking reconcileServiceInstanceDelete is

	case instance.ObjectMeta.DeletionTimestamp != nil || instance.Status.OrphanMitigationInProgress:
		return reconcileDelete

so this check is redundant.
Also we don't have such checks for reconcileServiceInstanceAdd and reconcileServiceInstanceUpdate.

Copy link
Contributor

@MHBauer MHBauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the thought of "add a new bool" everytime we have a state.

I see the guidance from upstream, and I either don't agree with it, or I don't see the full picture.

If there were more of these flags, how could we be in more than one state?

What is the point of adding a condition, if we're not going to use it as state?

If we're past the point of strictly enforcing a state machine and state transitions, I don't even know where to go.

The code looks fine, I'm not sure what the final answer is.

@nilebox
Copy link
Contributor Author

nilebox commented Mar 21, 2018

What is the point of adding a condition, if we're not going to use it as state?

As I said already it's useful for communicating to the user about the original error that lead to orphan mitigation. The reason/message in the Ready condition will be overwritten with the next error occuring at the next iteration, but orphan mitigation condition will remain untouched untill we successfully mitigate.

As of whether we want to mark the OrphanMitigationInProgress flag as deprecated or keep it - you can argue with @staebler, but it doesn't change the behavior, so I think we can merge it as it is, and discuss a possible deprecation as a follow-up.

Copy link
Contributor

@MHBauer MHBauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, sure. I cannot think of anything I want changed.

LGTM

@MHBauer MHBauer added the LGTM2 label Mar 21, 2018
@nilebox nilebox merged commit 6a59ada into kubernetes-retired:master Mar 21, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. LGTM1 LGTM2 non-happy-path size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Separate OrphanMitigation condition?
5 participants