Skip to content
This repository has been archived by the owner on May 6, 2022. It is now read-only.

Instances: 4xx, 5xx and Connection timeout should be retriable (not terminal errors) #1715

Closed
nilebox opened this issue Feb 5, 2018 · 5 comments

Comments

@nilebox
Copy link
Contributor

nilebox commented Feb 5, 2018

If the OSB request to the broker times out, Service Catalog executes the orphan mitigation and leaves the instance in the TerminalError status:

      message: 'readiness check failed: ErrorCallingProvision: Communication with
        the ClusterServiceBroker timed out; operation will not be retried: Put https://example.com/v2/service_instances/e00dfeb3-a3ce-4ec2
-b2bb-8f1232cc48cc?accepts_incomplete=true:
        net/http: request canceled (Client.Timeout exceeded while awaiting headers)'
      reason: TerminalError
      status: "True"
      type: Error

It means that if OSB broker was temporarily unavailable (or had some other temporary issue leading to slow request processing), Service Catalog won't retry provisioning. So to retry, the user needs to whether delete the instance and create it again, or mutate the spec (updateRequest++).

It's probably not the best UX. Shall we retry a certain number of times before giving up?

P.S. The scope of this issue is bigger, i.e. it also applies to 4xx and 5xx errors (with orphan mitigation required or without). The Kubernetes way of handling errors is to retry after failures (with exponential backoff).

@pmorie
Copy link
Contributor

pmorie commented Feb 5, 2018 via email

@nilebox
Copy link
Contributor Author

nilebox commented Feb 6, 2018

By looking at the code the current behavior was actually is by-design added by @kibbles-n-bytes:

// A timeout error is considered a terminal failure and we
// should initiate orphan mitigation.
if urlErr, ok := err.(*url.Error); ok && urlErr.Timeout() {
	msg := fmt.Sprintf("Communication with the ClusterServiceBroker timed out; operation will not be retried: %v", urlErr)
	readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, reason, msg)
	failedCond := newServiceInstanceFailedCondition(v1beta1.ConditionTrue, reason, msg)
	return c.processProvisionFailure(instance, readyCond, failedCond, true)
}

// All other errors should be retried, unless the
// reconciliation retry time limit has passed.

I see 2 options for resolving this issue:

  1. just remove this block and treat the connection timeout the same way as the other errors, i.e. as retriable straight-away and no orphan mitigation
  2. leave the orphan mitigation for connection timeout and implement retry loop at the end of orphan mitigation.

@pmorie @kibbles-n-bytes what do you think is a better solution?

@nilebox
Copy link
Contributor Author

nilebox commented Feb 6, 2018

After scanning through the OSB spec again, it seems that Option 2 is the only one that is considered valid by the spec.

The problem seems to be more serious than just connection timeout though.
400 Bad Request is also considered a terminal error, for example. And it certainly doesn't make sense to retry after receiving this error (unlike request timeout). But the current implementation of Service Catalog won't retry after receiving 400 Bad Request even if the ServiceInstance spec gets updated (which might include correct parameters).

So it looks to me that we should start a whole new process of categorizing errors that are currently considered "terminal". Do we even need "forever terminal" errors at all, i.e. not retrying even after the spec has changed?

/cc @ash2k @duglin @vaikas-google I would like to hear your thoughts as well.

@nilebox
Copy link
Contributor Author

nilebox commented Feb 6, 2018

I took a look at the OSB provisioning errors, and what IMO should be changed, marked those in bold with asterisk (*):

Error Orphan mitigation required? Retriable automatically (after orphan mitigation if required)? Retriable after ServiceInstance spec updated?
Timeout Yes Yes* Yes*
4xx No No Yes*
5xx Yes Yes* Yes*

*The current behavior is "No", and should be changed to "Yes"

@nilebox nilebox changed the title Connection timeout should not be a terminal error? 4xx, 5xx and Connection timeout should be retriable (not terminal errors) Feb 20, 2018
@nilebox
Copy link
Contributor Author

nilebox commented Feb 20, 2018

Proposed changes in Service Catalog code to resolve this issue:

1. Move all non-terminal errors from Failed:True to Ready:False with Reason set (the only error that should be kept in Failed:True then is for 400 Bad Request which will require retry only after the spec has changed)

  • this allows to retry immediately without requiring to update the spec
  • this also means that we will have cases where we need to perform orphan mitigation before retrying

2. Always retry if the spec has changed (even if the Failed:True condition is set), i.e. no more "terminal" errors, just temporary errors that require fixing the spec, see #1751

  • addresses the issue for retrying after receiving 400 Bad Request and fixing the spec

3. Don’t set the ReconciledGeneration after orphan mitigation (i.e. if DeletionTimestamp == nil), or switch to ObservedGeneration as described in #1747

  • this allows to retry after orphan mitigation has succeeded

@pmorie @kibbles-n-bytes @jboyd01 @MHBauer @arschles please review the checklist above. Does it look reasonable to you?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants