-
Notifications
You must be signed in to change notification settings - Fork 382
OrphanMitigation condition and different handling of retry timeout #1789
OrphanMitigation condition and different handling of retry timeout #1789
Conversation
@@ -1238,7 +1300,6 @@ func clearServiceInstanceCurrentOperation(toUpdate *v1beta1.ServiceInstance) { | |||
toUpdate.Status.CurrentOperation = "" | |||
toUpdate.Status.OperationStartTime = nil | |||
toUpdate.Status.AsyncOpInProgress = false | |||
toUpdate.Status.OrphanMitigationInProgress = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important change 1: Don't reset the orphan mitigation flag automatically anymore. Only explicitly reset it after we have successfully finished orphan mitigation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to note is that there is no way to recover if the user changes the plan on the ServiceInstance while orphan mitigation is pending. The plan ref is cleared by the API server when the plan name is changed. The delete reconciliation process does not resolve plan refs (nor should it). We need some way of storing the plan that was sent in the unsuccessful provision request to make sure that we use that same plan in the orphan mitigation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@staebler the quick fix could be a part of #1790
- instead of removing a condition completely, check if the plan has changed.
- add a check for
instance.Status.OrphanMitigationInProgress
and reject plan changes.
in the long term it would probably better to store the plan along with "in progress properties" in the status (and don't erase them until orphan mitigation has succeeded).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand (1).
For (2), that is not sufficient. There is still a window where a plan change could get in and ruin orphan mitigation.
- Consider a ServiceInstance with an in-flight Provision request.
- User changes plan on ServiceInstance. Orphan mitigation is not in progress, so plan change is accepted.
- Provision request fails in a way that requires orphan mitigation.
- Controller starts orphan mitigation.
- Orphan mitigation reconciliation fails perpetually since there is no plan to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- User changes plan on ServiceInstance. Orphan mitigation is not in progress, so plan change is accepted
Currently any spec updates (inculding plan changes) are rejected if there is an operation in progress.
While I think we should stop blocking parameter updates (see #1755 and #1790), we might still reject plan updates while there is an operation or orphan mitigation in progress (as this is a rare usecase and hence not as annoying UX as blocking parameter updates).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I see the problem now.
For normal deletion (after instance was successfully provisioned), we already read plan from ExternalProperties
:
controller_instance.go#L1419-L1423
For orphan mitigation, I think we can read it from InProgressProperties
, just need stop overwriting them with nil
in controller_instance.go#L565
i.e. we can just pass existing InProgressProperties
there:
instance, err = c.recordStartOfServiceInstanceOperation(instance, v1beta1.ServiceInstanceOperationDeprovision, instance.Status.InProgressProperties)
@staebler Would it solve the problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
reason := successDeprovisionReason | ||
msg := successDeprovisionMessage | ||
if mitigatingOrphan { | ||
removeServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionOrphanMitigation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important change 2: Explicitly reset the orphan mitigation condition (and flag) after we have successfully finished orphan mitigation.
pkg/apis/servicecatalog/types.go
Outdated
@@ -519,6 +519,7 @@ type ServiceInstanceStatus struct { | |||
|
|||
// OrphanMitigationInProgress is set to true if there is an ongoing orphan | |||
// mitigation operation against this ServiceInstance in progress. | |||
// Deprecated: Use OrphanMitigation condition instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we should deprecate OprhanMitigationInProgress
. We should continue to use that field as the primary means of determining whether an orphan mitigation is in progress. The condition is added to store the reason/message and not to supersede the field.
For supporting arguments, see kubernetes/kubernetes#7856 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@staebler initially I planned to do the same but then I realized there is little value of that:
- if OrphanMitigationInProgress is true, the OrphanMitigation condition is always set.
- as soon as OrphanMitigation condition is seen by controller, OrphanMitigationInProgress will be set to true as well, and will be reset only in case if retry timeout (a relatively rare case).
For Async operations we already have AsyncOpInProgress.
Did I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nilebox The kube preferred way is to use a field. I don't see a co.pelling reason here to go against the kube preferred way. The question then becomes whether there is value in having a condition for orphan mitigation. The value I see there is having a place to store the reason and message for why orphan mitigation is occurring. If that reason and message is generic, then there is no value in having an orphan mitigation condition. There is still value in not wiping out the reason and message from the Ready and Failed conditions with generic orphan mitigation reasons and messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@staebler ok. do we even need to reset the OrphanMitigationInProgress
in case of exceeding retry limit then, or just keep the flag and condition always in sync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nilebox Definitely keep OrphanMitigationInProgress
in sync with the condition. I am good with the change that you have as far as keeping OrphanMitigationInProgress
set true when the retry limit is exceeded.
readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, startingInstanceOrphanMitigationReason, startingInstanceOrphanMitigationMessage) | ||
c.recorder.Event(instance, corev1.EventTypeWarning, readyCond.Reason, readyCond.Message) | ||
setServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionReady, readyCond.Status, readyCond.Reason, readyCond.Message) | ||
reason := startingInstanceOrphanMitigationReason |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason and message should be more informative than this. It should come from the failed or ready condition. If we are only going to use a generic reason/message, then there is no point in using a condition at all.
@@ -1238,7 +1300,6 @@ func clearServiceInstanceCurrentOperation(toUpdate *v1beta1.ServiceInstance) { | |||
toUpdate.Status.CurrentOperation = "" | |||
toUpdate.Status.OperationStartTime = nil | |||
toUpdate.Status.AsyncOpInProgress = false | |||
toUpdate.Status.OrphanMitigationInProgress = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to note is that there is no way to recover if the user changes the plan on the ServiceInstance while orphan mitigation is pending. The plan ref is cleared by the API server when the plan name is changed. The delete reconciliation process does not resolve plan refs (nor should it). We need some way of storing the plan that was sent in the unsuccessful provision request to make sure that we use that same plan in the orphan mitigation.
@@ -1539,9 +1608,19 @@ func (c *controller) processProvisionFailure(instance *v1beta1.ServiceInstance, | |||
// requeue this resource. | |||
var err error | |||
if shouldMitigateOrphan { | |||
// Copy failure reason/message to a new OrphanMitigation condition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@staebler please check below if that's what you had in mind for preserving reason/message in OrphanMitigation condition (as well as overwriting the Ready condition)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nilebox Yep, that is what I had in mind.
As an aside, I am not sure why we are creating orphanMitigationCond
and readyCond
just to read the status, reason, and message from the conditions.
@@ -496,6 +527,8 @@ func (c *controller) reconcileServiceInstanceUpdate(instance *v1beta1.ServiceIns | |||
func (c *controller) reconcileServiceInstanceDelete(instance *v1beta1.ServiceInstance) error { | |||
// nothing to do... | |||
if instance.DeletionTimestamp == nil && !instance.Status.OrphanMitigationInProgress { | |||
// TODO nilebox: shouldn't we throw an error instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do people think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nil return is success, yes?
Don't know what error to return. Perhaps a log statement, for the record of it happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it is success, yes, or rather "nothing to do".
But we should never end up in this condition being true IMO.
@staebler addressed your comments and resolved conflicts, please review. |
@staebler labeling with |
…ing orphan mitigation with timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some thoughts
Do we need to adjust the comment of OrphanMitigationInProgress as deprecated?
We see the old bool, transition it to the new condition, and remove all processing based on the old bool, besides that necessary for the transition to the condition.
@@ -278,6 +286,29 @@ func (c *controller) initObservedGeneration(instance *v1beta1.ServiceInstance) ( | |||
return false, nil | |||
} | |||
|
|||
// initOrphanMitigationCondition implements OrphanMitigation condition initialization | |||
// based on OrphanMitigationInProgress field for status API migration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code is allowing update from old versions of controller before this change to new versions after, and uses the status field OrphanMitigationInProgress to do so, yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
@@ -496,6 +527,8 @@ func (c *controller) reconcileServiceInstanceUpdate(instance *v1beta1.ServiceIns | |||
func (c *controller) reconcileServiceInstanceDelete(instance *v1beta1.ServiceInstance) error { | |||
// nothing to do... | |||
if instance.DeletionTimestamp == nil && !instance.Status.OrphanMitigationInProgress { | |||
// TODO nilebox: shouldn't we throw an error instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nil return is success, yes?
Don't know what error to return. Perhaps a log statement, for the record of it happening.
setServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionReady, | ||
v1beta1.ConditionFalse, | ||
startingInstanceOrphanMitigationReason, | ||
startingInstanceOrphanMitigationMessage) | ||
|
||
instance.Status.OperationStartTime = nil | ||
instance.Status.AsyncOpInProgress = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the next line here is instance.Status.OrphanMitigationInProgress = true
.
should we be setting this at all after this goes in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OrphanMitigationInProgress
is just a flag that means "Orphan mitigation is required" (bad property name, but can't change it since it's part of the API), i.e. this flag doesn't say whether the orphan mitigation has actually started or not.
@@ -3846,7 +3848,7 @@ func TestReconcileServiceInstanceOrphanMitigation(t *testing.T) { | |||
}, | |||
}, | |||
async: true, | |||
finishedOrphanMitigation: true, | |||
finishedOrphanMitigation: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems weird that all of these failure test cases had this boolean set to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷♂️ because there was a bug in the behavior. As written in the description to PR: "The more important change is that now the OrphanMitigation condition gets reset only after we have successfully completed the orphan mitigation."
// from the normal deletion | ||
removeServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionOrphanMitigation) | ||
instance.Status.OrphanMitigationInProgress = false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand.
we're in the delete flow, with a timestamp, and not in deprovisioning state,
the OM bool is set, ( ? so the source object is set from previous controller ? )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MHBauer this is the case when instance.DeletionTimestamp != nil
i.e. the instance is marked for deletion.
Having instance.Status.OrphanMitigationInProgress = true
means that before user deleted an instance, we have marked it for orphan mitigation but haven't finished yet.
Given that deletion has the highest priority over all other operations, keeping the orphan mitigation information is redundant at this point, so we clear it.
@MHBauer see @staebler's comments. I initially marked |
@MHBauer I just removed this check, given that the only place for invoking case instance.ObjectMeta.DeletionTimestamp != nil || instance.Status.OrphanMitigationInProgress:
return reconcileDelete so this check is redundant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the thought of "add a new bool" everytime we have a state.
I see the guidance from upstream, and I either don't agree with it, or I don't see the full picture.
If there were more of these flags, how could we be in more than one state?
What is the point of adding a condition, if we're not going to use it as state?
If we're past the point of strictly enforcing a state machine and state transitions, I don't even know where to go.
The code looks fine, I'm not sure what the final answer is.
As I said already it's useful for communicating to the user about the original error that lead to orphan mitigation. The reason/message in the Ready condition will be overwritten with the next error occuring at the next iteration, but orphan mitigation condition will remain untouched untill we successfully mitigate. As of whether we want to mark the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, sure. I cannot think of anything I want changed.
LGTM
Fixes #1771
Switching from
OrphanMitigationInProgress
boolean flag toOrphanMitigation
condition is actually more of a cosmetic change, it just allows us to have a human-readablereason
andmessage
along with the boolean flag.The more important change is that now the
OrphanMitigation
condition gets reset only after we have successfully completed the orphan mitigation.I.e. even we exceed the retry timeout, the
OrphanMitigation
remains set toTrue
.This will allow us to properly support retries in #1765.