Instances: 4xx, 5xx and Connection timeout should be retriable (not terminal errors) #1715

nilebox · 2018-02-05T00:44:43Z

If the OSB request to the broker times out, Service Catalog executes the orphan mitigation and leaves the instance in the TerminalError status:

      message: 'readiness check failed: ErrorCallingProvision: Communication with
        the ClusterServiceBroker timed out; operation will not be retried: Put https://example.com/v2/service_instances/e00dfeb3-a3ce-4ec2-b2bb-8f1232cc48cc?accepts_incomplete=true:
        net/http: request canceled (Client.Timeout exceeded while awaiting headers)'
      reason: TerminalError
      status: "True"
      type: Error

This means that if the OSB broker was temporarily unavailable (or had some other temporary issue leading to slow request processing), Service Catalog won't retry provisioning. To retry, the user needs to either delete the instance and create it again, or mutate the spec (updateRequest++).

It's probably not the best UX. Shall we retry a certain number of times before giving up?

P.S. The scope of this issue is bigger, i.e. it also applies to 4xx and 5xx errors (with orphan mitigation required or without). The Kubernetes way of handling errors is to retry after failures (with exponential backoff).
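To make "retry with exponential backoff" concrete, here is a minimal sketch of how the wait between attempts could grow. It is an illustration only, not Service Catalog code: `backoffDelay`, `baseDelay`, and `maxDelay` are hypothetical names, and the values are arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical values, chosen for illustration only.
const (
	baseDelay = 1 * time.Second
	maxDelay  = 5 * time.Minute
)

// backoffDelay returns the wait before retry number attempt (0-based):
// baseDelay doubled once per previous failure, capped at maxDelay.
func backoffDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for attempt := 0; attempt < 10; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, backoffDelay(attempt))
	}
}
```

Kubernetes controllers usually do not hand-roll this; the rate-limited work queue in client-go applies a per-item exponential delay when an item is requeued after a failure.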


pmorie · 2018-02-05T01:24:55Z

Agree this should not be a terminal error


nilebox · 2018-02-06T00:33:32Z

Looking at the code, the current behavior is actually by design; it was added by @kibbles-n-bytes:

// A timeout error is considered a terminal failure and we
// should initiate orphan mitigation.
if urlErr, ok := err.(*url.Error); ok && urlErr.Timeout() {
	msg := fmt.Sprintf("Communication with the ClusterServiceBroker timed out; operation will not be retried: %v", urlErr)
	readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, reason, msg)
	failedCond := newServiceInstanceFailedCondition(v1beta1.ConditionTrue, reason, msg)
	return c.processProvisionFailure(instance, readyCond, failedCond, true)
}

// All other errors should be retried, unless the
// reconciliation retry time limit has passed.

I see 2 options for resolving this issue:

1. Just remove this block and treat the connection timeout the same way as the other errors, i.e. as retriable straight away and with no orphan mitigation.
2. Leave the orphan mitigation for connection timeout and implement a retry loop at the end of orphan mitigation.
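Option 2 could look roughly like the sketch below. Everything in it is hypothetical: the `broker` interface, `errTimeout`, `provisionWithMitigation`, and the `flakyBroker` test double are stand-ins, not Service Catalog or OSB client types. It also assumes the same instance ID may be reused once orphan mitigation has finished.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout stands in for a broker request timeout (hypothetical).
var errTimeout = errors.New("broker request timed out")

// broker is a hypothetical, minimal stand-in for an OSB client.
type broker interface {
	Provision(instanceID string) error
	Deprovision(instanceID string) error
}

// provisionWithMitigation tries to provision up to maxAttempts times.
// After a timeout the outcome on the broker side is unknown, so it
// deprovisions (orphan mitigation) before the next attempt.
func provisionWithMitigation(b broker, id string, maxAttempts int) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := b.Provision(id)
		if err == nil {
			return nil
		}
		lastErr = err
		if !errors.Is(err, errTimeout) {
			return err // not a timeout: out of scope for this sketch
		}
		if err := b.Deprovision(id); err != nil {
			return fmt.Errorf("orphan mitigation failed: %w", err)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

// flakyBroker times out for the first `failures` provision calls.
type flakyBroker struct {
	failures     int
	provisions   int
	deprovisions int
}

func (f *flakyBroker) Provision(string) error {
	f.provisions++
	if f.provisions <= f.failures {
		return errTimeout
	}
	return nil
}

func (f *flakyBroker) Deprovision(string) error {
	f.deprovisions++
	return nil
}

func main() {
	b := &flakyBroker{failures: 2}
	err := provisionWithMitigation(b, "instance-1", 5)
	fmt.Println(err, b.provisions, b.deprovisions)
}
```

A real controller would more likely requeue the instance with a backoff delay than loop in place; the loop here only keeps the example short.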

@pmorie @kibbles-n-bytes what do you think is a better solution?

nilebox · 2018-02-06T03:20:25Z

After scanning through the OSB spec again, it seems that Option 2 is the only one that is considered valid by the spec.

The problem seems to be more serious than just connection timeout though.
400 Bad Request is also considered a terminal error, for example. And it certainly doesn't make sense to retry after receiving this error (unlike request timeout). But the current implementation of Service Catalog won't retry after receiving 400 Bad Request even if the ServiceInstance spec gets updated (which might include correct parameters).

So it looks to me that we should start a whole new process of categorizing errors that are currently considered "terminal". Do we even need "forever terminal" errors at all, i.e. not retrying even after the spec has changed?

/cc @ash2k @duglin @vaikas-google I would like to hear your thoughts as well.

nilebox · 2018-02-06T04:13:57Z

I took a look at the OSB provisioning errors and marked what IMO should be changed with an asterisk (*):

| Error | Orphan mitigation required? | Retriable automatically (after orphan mitigation if required)? | Retriable after `ServiceInstance` spec updated? |
| --- | --- | --- | --- |
| Timeout | Yes | Yes* | Yes* |
| 4xx | No | No | Yes* |
| 5xx | Yes | Yes* | Yes* |

*The current behavior is "No", and should be changed to "Yes"
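The table above can be read as a small classification function. The sketch below encodes only the proposed behavior from the table; `errorPolicy` and `policyFor` are hypothetical names, and the inputs (a timeout flag plus an HTTP status code) are an assumption about how a caller would describe the failure.

```go
package main

import "fmt"

// errorPolicy mirrors one row of the table.
type errorPolicy struct {
	OrphanMitigation     bool // clean up on the broker side first
	RetryAutomatically   bool // retry without user action
	RetryAfterSpecUpdate bool // retry once the ServiceInstance spec changes
}

// policyFor returns the proposed handling for a failed provision request.
// The second result is false when the failure is not covered by the table.
func policyFor(timedOut bool, statusCode int) (errorPolicy, bool) {
	switch {
	case timedOut:
		return errorPolicy{OrphanMitigation: true, RetryAutomatically: true, RetryAfterSpecUpdate: true}, true
	case statusCode >= 500 && statusCode <= 599:
		return errorPolicy{OrphanMitigation: true, RetryAutomatically: true, RetryAfterSpecUpdate: true}, true
	case statusCode >= 400 && statusCode <= 499:
		return errorPolicy{OrphanMitigation: false, RetryAutomatically: false, RetryAfterSpecUpdate: true}, true
	default:
		return errorPolicy{}, false
	}
}

func main() {
	for _, code := range []int{400, 503} {
		p, _ := policyFor(false, code)
		fmt.Printf("%d: %+v\n", code, p)
	}
	p, _ := policyFor(true, 0)
	fmt.Printf("timeout: %+v\n", p)
}
```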

nilebox · 2018-02-20T03:28:15Z

Proposed changes in Service Catalog code to resolve this issue:

1. Move all non-terminal errors from Failed:True to Ready:False with Reason set (the only error that should then be kept in Failed:True is 400 Bad Request, which should be retried only after the spec has changed)
   - this allows retrying immediately without requiring a spec update
   - this also means that we will have cases where we need to perform orphan mitigation before retrying

2. Always retry if the spec has changed (even if the Failed:True condition is set), i.e. no more "terminal" errors, just temporary errors that require fixing the spec, see #1751
   - addresses the issue of retrying after receiving 400 Bad Request and fixing the spec

3. Don’t set the ReconciledGeneration after orphan mitigation (i.e. if DeletionTimestamp == nil), or switch to ObservedGeneration as described in #1747
   - this allows retrying after orphan mitigation has succeeded
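The bookkeeping implied by the three items above can be sketched as follows. This is a simplified model under stated assumptions, not the controller's actual code: `instanceState`, `needsReconcile`, and `recordFailure` are hypothetical, and the struct keeps only the fields needed to show the idea.

```go
package main

import "fmt"

// instanceState is a hypothetical, trimmed-down view of a ServiceInstance.
type instanceState struct {
	Generation           int64 // bumped on every spec change
	ReconciledGeneration int64 // last generation the controller is done with
	Failed               bool  // whether the Failed:True condition is set
}

// needsReconcile reports whether the controller should call the broker again.
// A spec change always triggers a retry, even if Failed is set (item 2).
func needsReconcile(s instanceState) bool {
	return s.Generation != s.ReconciledGeneration
}

// recordFailure updates the state after a failed provision attempt.
// needsSpecFix is true for errors such as 400 Bad Request, where retrying
// the same spec is pointless.
func recordFailure(s *instanceState, needsSpecFix bool) {
	s.Failed = needsSpecFix
	if needsSpecFix {
		// Stop retrying this generation; a spec update bumps Generation
		// and makes needsReconcile true again.
		s.ReconciledGeneration = s.Generation
	}
	// Otherwise (timeout, 5xx) ReconciledGeneration is left untouched,
	// also after orphan mitigation, so the instance is retried (items 1 and 3).
}

func main() {
	s := instanceState{Generation: 1}
	recordFailure(&s, false) // e.g. a timeout
	fmt.Println("retry after timeout:", needsReconcile(s))
	recordFailure(&s, true) // e.g. 400 Bad Request
	fmt.Println("retry after 400:", needsReconcile(s))
	s.Generation++ // the user fixes the spec
	fmt.Println("retry after spec update:", needsReconcile(s))
}
```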

@pmorie @kibbles-n-bytes @jboyd01 @MHBauer @arschles please review the checklist above. Does it look reasonable to you?

This was referenced Feb 12, 2018

We should retry Provisioning/Binding when the user corrects the spec #1672

Closed

Can instance_id be reused after orphan mitigation finished? openservicebrokerapi/servicebroker#447

Closed

nilebox changed the title ~~Connection timeout should not be a terminal error?~~ 4xx, 5xx and Connection timeout should be retriable (not terminal errors) Feb 20, 2018

nilebox mentioned this issue Feb 28, 2018

Separate OrphanMitigation condition? #1771

Closed

nilebox changed the title ~~4xx, 5xx and Connection timeout should be retriable (not terminal errors)~~ Instances: 4xx, 5xx and Connection timeout should be retriable (not terminal errors) Mar 4, 2018

nilebox mentioned this issue Mar 4, 2018

Bindings: 4xx, 5xx and Connection timeout should be retriable (not terminal errors) #1787

Closed

nilebox closed this as completed in #1765 Mar 24, 2018
