Instances: 4xx, 5xx and Connection timeout should be retriable (not terminal errors) #1715
Comments
Agree this should not be a terminal error

…On Sun, Feb 4, 2018 at 7:44 PM Nail Islamov wrote:

If the OSB request to the broker times out, Service Catalog executes the
orphan mitigation and leaves the instance in the TerminalError status:

```yaml
message: 'readiness check failed: ErrorCallingProvision: Communication with
  the ClusterServiceBroker timed out; operation will not be retried: Put
  https://micros--platform.ap-southeast-2.dev.atl-paas.net/osb/v2/service_instances/e00dfeb3-a3ce-4ec2-b2bb-8f1232cc48cc?accepts_incomplete=true:
  net/http: request canceled (Client.Timeout exceeded while awaiting headers)'
reason: TerminalError
status: "True"
type: Error
```

It means that if the OSB broker was temporarily unavailable (or had some other
temporary issue leading to slow request processing), Service Catalog won't
retry provisioning. So to retry, the user needs to either delete the
instance and create it again, or mutate the spec (updateRequest++).
It's probably not the best UX. Shall we retry a certain number of times
before giving up?
By looking at the code, the current behavior is actually by design, added by @kibbles-n-bytes:

```go
// A timeout error is considered a terminal failure and we
// should initiate orphan mitigation.
if urlErr, ok := err.(*url.Error); ok && urlErr.Timeout() {
	msg := fmt.Sprintf("Communication with the ClusterServiceBroker timed out; operation will not be retried: %v", urlErr)
	readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, reason, msg)
	failedCond := newServiceInstanceFailedCondition(v1beta1.ConditionTrue, reason, msg)
	return c.processProvisionFailure(instance, readyCond, failedCond, true)
}

// All other errors should be retried, unless the
// reconciliation retry time limit has passed.
```

I see 2 options for resolving this issue:

@pmorie @kibbles-n-bytes what do you think is a better solution?
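For context, a minimal sketch of how the same branch could record the timeout as retriable instead of terminal; processProvisionFailureRetriable is a hypothetical helper, not an existing Service Catalog function, so this only illustrates the shape of the change:

```go
// Sketch only: a timeout still triggers orphan mitigation, but instead of
// setting a terminal Failed condition we record a transient error and let
// the controller requeue the instance with backoff.
if urlErr, ok := err.(*url.Error); ok && urlErr.Timeout() {
	msg := fmt.Sprintf("Communication with the ClusterServiceBroker timed out; operation will be retried: %v", urlErr)
	readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, reason, msg)
	// Last argument: orphan mitigation is still performed, but no Failed
	// condition is set, so the instance remains retriable.
	return c.processProvisionFailureRetriable(instance, readyCond, true)
}
```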
After scanning through the OSB spec again, it seems that Option 2 is the only one that is considered valid by the spec. The problem seems to be more serious than just connection timeouts, though. So it looks to me that we should start a whole new process of categorizing errors that are currently considered "terminal". Do we even need "forever terminal" errors at all, i.e. not retrying even after the spec has changed? /cc @ash2k @duglin @vaikas-google, I would like to hear your thoughts as well.
I took a look at the OSB provisioning errors, and marked what IMO should be changed in bold with an asterisk (*):

*The current behavior is "No", and should be changed to "Yes".
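As a rough illustration of the kind of categorization being discussed, here is a sketch of a classifier for broker HTTP responses; the shouldRetryProvision helper and the per-status decisions below are assumptions for discussion, not the OSB spec text or existing Service Catalog behavior:

```go
package errclass

import "net/http"

// shouldRetryProvision is a hypothetical classifier: given the HTTP status
// code of a failed provision response, it reports whether the operation
// should be retried and whether orphan mitigation is required first.
// The concrete per-status choices are illustrative only.
func shouldRetryProvision(statusCode int) (retriable bool, orphanMitigation bool) {
	switch {
	case statusCode == http.StatusBadRequest: // 400: malformed request; retrying won't help
		return false, false
	case statusCode == http.StatusConflict: // 409: instance already exists with different attributes
		return false, false
	case statusCode == http.StatusUnprocessableEntity: // 422: e.g. async required
		return false, false
	case statusCode >= 500: // 5xx: broker-side failure; retry after orphan mitigation
		return true, true
	case statusCode >= 400: // other 4xx: retry without orphan mitigation
		return true, false
	default:
		return false, false
	}
}
```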
Proposed changes in Service Catalog code to resolve this issue:

1. Move all non-terminal errors from
2. Always retry if the spec has changed (even if the
3. Don't set the

@pmorie @kibbles-n-bytes @jboyd01 @MHBauer @arschles please review the checklist above. Does it look reasonable to you?
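To illustrate point 2 above (retrying after the spec has changed), a minimal sketch of how that change could be detected; the field names follow the v1beta1 ServiceInstance API (metadata.generation vs. status.reconciledGeneration), but the helper itself is an assumption, not existing Service Catalog code:

```go
// specChangedSinceLastAttempt is a hypothetical helper: it reports whether
// the user has mutated the instance spec (which bumps metadata.generation)
// since the last generation the controller finished reconciling. If so, a
// previously recorded Failed condition should not block a fresh provision
// attempt.
func specChangedSinceLastAttempt(instance *v1beta1.ServiceInstance) bool {
	return instance.Generation != instance.Status.ReconciledGeneration
}
```

In the reconciler, a check like this could gate whether an existing Failed condition short-circuits provisioning or the operation is retried with the updated parameters.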
If the OSB request to the broker times out, Service Catalog executes the orphan mitigation and leaves the instance in the TerminalError status. It means that if the OSB broker was temporarily unavailable (or had some other temporary issue leading to slow request processing), Service Catalog won't retry provisioning. So to retry, the user needs to either delete the instance and create it again, or mutate the spec (updateRequest++). It's probably not the best UX. Shall we retry a certain number of times before giving up?
P.S. The scope of this issue is bigger, i.e. it also applies to 4xx and 5xx errors (with or without orphan mitigation required). The Kubernetes way of handling errors is to retry after failures (with exponential backoff).
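For reference, the retry-with-exponential-backoff pattern the P.S. alludes to is usually implemented with a rate-limited workqueue in controllers; this is a generic client-go sketch under that assumption, not Service Catalog's actual reconciler wiring:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-key exponential backoff: 5ms base delay, capped at 5 minutes.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 5*time.Minute)
	queue := workqueue.NewRateLimitingQueue(limiter)
	defer queue.ShutDown()

	key := "default/my-instance"

	// On a retriable provisioning error, re-enqueue the instance key with
	// backoff instead of marking the instance terminally failed.
	queue.AddRateLimited(key)

	// Once the operation eventually succeeds, reset the failure history so
	// the next error starts from the base delay again.
	queue.Forget(key)

	fmt.Println("requeued", key, "with exponential backoff")
}
```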