This repository has been archived by the owner on May 6, 2022. It is now read-only.

delay between provisioning retries #2065

Merged
merged 15 commits into master from provisioning-backoff
Jul 24, 2018

Conversation

jboyd01
Contributor

@jboyd01 jboyd01 commented May 24, 2018

Fixes #2025 - at least it prevents the Controller from blasting Brokers with a constant provision retry/orphan mitigation loop.

In addition to #2025, see #2076 for additional details on problems with instance reconciliation. This PR is a near-term band-aid to prevent pain at the brokers.

A new in-memory list of instances with provision retry times provides a delay that prevents a tight provision/provision-failure/orphan-mitigation loop from pounding brokers without pause when provisioning ends with a non-terminal failure.

This short-term fix adds the retryQueue, where we introduce additional delay on top of what the reconciler loop was originally doing. Before doing a broker operation (Provision or Update) we check the backoffBeforeRetrying and retryTime maps to see if we should first delay. An entry in backoffBeforeRetrying indicates the operation previously failed and we need to get a new exponential backoff delay from the RateLimiter. The delay is calculated, a new time is set in retryTime, and the instance is put back into the main worker queue with an AddAfter(time), which causes the worker to reprocess it after the delay. On that next pass, we verify the delay has expired and then set the dirty bit in backoffBeforeRetrying in anticipation of another failure. If the operation is successful, the entries are cleared from the retryQueue; a failure runs through the process again. A background task periodically purges old entries from the retryQueue.

The main point above is that all the backoff processing happens just before we execute the broker operation. There are a lot of edge cases and post-error processing that otherwise make it difficult to instrument the backoff properly in all the necessary places. This is a best-effort band-aid, a temporary workaround until the overall reconciliation flow can be reworked properly (issue #2076).
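
(A minimal sketch of the gate described above, for readers who want the shape before reading the diff. The names here - shouldBackoff, instances, and the field layout - are illustrative approximations, not the exact identifiers in the PR:)

package controller

import (
    "sync"
    "time"

    "k8s.io/client-go/util/workqueue"
)

// Illustrative shapes only; the PR's actual struct and field names differ slightly.
type backoffEntry struct {
    calculatedRetryTime time.Time // earliest time the broker call should be retried
    dirty               bool      // true: the last attempt failed, a fresh delay must be calculated
}

type retryQueue struct {
    mutex       sync.Mutex
    instances   map[string]backoffEntry
    rateLimiter workqueue.RateLimiter
}

// shouldBackoff is called just before a Provision/Update broker call and
// returns true when the caller should requeue the key and skip the broker.
func shouldBackoff(rq *retryQueue, instanceQueue workqueue.RateLimitingInterface, key string) bool {
    rq.mutex.Lock()
    defer rq.mutex.Unlock()

    entry, found := rq.instances[key]
    if !found {
        return false // no recorded failure: talk to the broker immediately
    }
    now := time.Now()
    switch {
    case entry.dirty:
        // the last attempt failed: compute the next exponential delay and
        // push the key back onto the worker queue to be reprocessed later
        delay := rq.rateLimiter.When(key)
        entry.calculatedRetryTime = now.Add(delay)
        entry.dirty = false
        rq.instances[key] = entry
        instanceQueue.AddAfter(key, delay)
        return true
    case now.Before(entry.calculatedRetryTime):
        // requeued early (e.g. by an unrelated event): keep waiting
        instanceQueue.AddAfter(key, entry.calculatedRetryTime.Sub(now))
        return true
    default:
        // delay has elapsed: mark the entry dirty again in anticipation of
        // another failure; a success removes the entry and calls Forget
        entry.dirty = true
        rq.instances[key] = entry
        return false
    }
}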

We do not do additional backoff for Deprovisioning (regular or Orphan Mitigation) because the OM status handling is embedded within the delete, and it's difficult to be certain whether a delete is part of OM and to ensure we don't clear the overall backoff if there are failures and then a success within OM.

Binding reconciliation likely has the exact same issues.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 24, 2018
@jboyd01 jboyd01 force-pushed the provisioning-backoff branch 5 times, most recently from 093ee08 to 6a0f741 Compare May 30, 2018 12:34
@jboyd01
Contributor Author

jboyd01 commented May 30, 2018

@luksa, @kibbles-n-bytes could you review? The Travis error is in building the docsite, an unrelated failure that I'm attempting to address, but I believe this PR is ready to go.

@jboyd01 jboyd01 changed the title WIP - delay between provisioning retries delay between provisioning retries May 30, 2018
@@ -91,6 +92,17 @@ const (
clusterIdentifierKey string = "clusterid"
)

type retryQueue struct {
mutex sync.RWMutex // lock to be used with provisionRetryTime
Contributor

comment references provisionRetryTime instead of retryTime

@@ -60,6 +60,7 @@ const (
defaultLeaderElectionNamespace = "kube-system"
defaultReconciliationRetryDuration = 7 * 24 * time.Hour
defaultOperationPollingMaximumBackoffDuration = 20 * time.Minute
defaultProvisioningRetryInterval = 5 * time.Minute
Contributor

My gut feeling is that this is way too long. Especially when the error during provisioning is just a temporary glitch.

Contributor Author

It's way too long when you have a single failure, but when the broker is in a bad state and just sits there returning a retryable error it's pretty frequent. Optimally we'd do a set of quick retries and then fall back to every 5 or 10 minutes, but I don't think the effort for that is worth it as we need to rework this anyway (#2076). Suggestions on a better backoff? Once a minute? Note that this is overridable with the flag --provision-retry-interval.

@@ -99,8 +111,16 @@ func (c *controller) instanceAdd(obj interface{}) {
glog.Errorf("Couldn't get key for object %+v: %v", obj, err)
return
}
c.instanceQueue.AddRateLimited(key)
Contributor

I'm not 100% sure about this, but I think this is a bad idea. The instanceAdd() method gets called in instanceUpdate(), which gets called when the user changes the ServiceInstance object.

Now imagine what happens when the user initially posts a bad ServiceInstance, which causes the reconcile method to back off. Eventually, the backoff time delay is a few minutes. If the user then updates the ServiceInstance, the controller won't process the update immediately, but minutes later.

Also, I believe every call to AddRateLimited increases the delay. So, with this change, every update to the object increases the delay.

Contributor Author

I'm fine with reverting this so it isn't rate limited - I did it per #2025 (comment) - but I do want to note that even if we back off and retry the maximum number of times (15), all the retries complete in under 3 minutes. Check out the backoff times: https://github.com/kubernetes-incubator/service-catalog/blob/6a693a7661654faa2dffa2d50d44f8d676207b7c/pkg/controller/controller.go#L57
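
(For reference, a back-of-envelope check of the "under 3 minutes" claim, assuming a 5ms base delay doubling each attempt, which is consistent with the ~82s maximum quoted in the next comment; the numbers are illustrative, not taken from the linked file:)

package main

import (
    "fmt"
    "time"
)

func main() {
    base := 5 * time.Millisecond
    total := time.Duration(0)
    for i := 0; i < 15; i++ {
        total += base << uint(i) // 5ms, 10ms, 20ms, ..., ~82s
    }
    fmt.Println(total) // ~2m43s, i.e. "under 3 minutes"
}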

Contributor

I'm very confident this needs to be reverted. Even if the maximum backoff time is "just" 82s, that's too much if it means that's how long it takes the controller to start acting on a change made by the user to the ServiceInstance.

return
}
c.provisionRetryQueue.mutex.Lock()
c.provisionRetryQueue.retryTime[key] = time.Now().Add(c.provisionRetryQueue.retryInterval)
Contributor

Can you take a look at NewItemExponentialFailureRateLimiter() in github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/util/workqueue/default_rate_limiters.go ?

I think adding a nice exponential backoff instead of a constant delay should be fairly simple. The rate limiter has a When() method you can call to get the next time.

But I haven't really looked at the rate limiters closely. So I may be wrong :)
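
(For reference, a minimal standalone example of that rate limiter in use; the durations and key here are arbitrary:)

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    // Per-item exponential backoff: 1s, 2s, 4s, ... capped at 5 minutes.
    // Each item (e.g. an instance key) is tracked independently.
    rl := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 5*time.Minute)

    key := "test-ns/my-instance"
    for i := 0; i < 4; i++ {
        fmt.Println(rl.When(key)) // 1s, 2s, 4s, 8s
    }
    rl.Forget(key)            // reset after a success (or a terminal failure)
    fmt.Println(rl.When(key)) // back to 1s
}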

Contributor Author

An exponential backoff is ideal; it's what we were supposed to be doing here. But we constantly reset the notion of failure/success when we do status updates or complete the orphan mitigation, and we never quit after max retries; it just restarts all over.

To use exponential backoff we need to maintain more state, including the number of retries.

Ideally we would have a queue much like today with a very short backoff/retry for kube reconciliation, but we probably need another queue for remote broker operations with a larger backoff. I 100% agree that exponential backoff is the right fix, but it's going to take a large rework. The only goal of this PR is to stop us from pounding on the broker.

Contributor

No, I meant just create a new rate limiter instead of the retryTime map. The rate limiter keeps track of everything. All you need to do is call rateLimiter.When(instance) here and it'll keep track of the individual delays for each ServiceInstance.

Contributor Author

I see what you are saying Marko, great idea. I think I'll have to keep the retryMap in addition to adding the rateLimiter. The rateLimiter doesn't have support for checking whether an item is in the queue and getting the next execution time. But I really like the idea of exponential backoff.

@@ -339,6 +421,18 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
return nil
}

// don't DOS the broker. If we already did a provision attemp that ended with a non-terminal
Contributor

s/attemp/attempt/

Contributor Author

Thanks for the review @luksa, I've updated the typo and the incorrect comment. I'm open to suggestions on a better retry interval; I'd be OK with 60 seconds vs 5 minutes. In discussion with @pmorie we were thinking 20 minutes, but that really focuses on constant-failure cases.

@jboyd01 jboyd01 force-pushed the provisioning-backoff branch 2 times, most recently from e28d57f to 9aed723 Compare May 31, 2018 02:12
// with repeated requests.
retryTime map[string]time.Time
queueLastCleaned time.Time
rateLimter workqueue.RateLimiter
Contributor

missing 'i' in rateLimter

return
}
glog.Infof("Instance %v/%v: %+v", namespace, name, instance)
}
Contributor

I'm guessing you used this method during testing. It's not used anywhere.

Contributor Author

Thanks @luksa, I've reverted back to non rate limited add and corrected the other two issues.

@luksa
Contributor

luksa commented May 31, 2018

LGTM

I'd /lgtm but I'm not in the kubernetes-incubator org yet. Hopefully they'll add me today.

@jboyd01
Contributor Author

jboyd01 commented May 31, 2018

Hmm... It appears changing c.instanceQueue.AddRateLimited(key) back to c.instanceQueue.Add(key) in instanceAdd() breaks integration tests. Even greatly increasing the overall wait timeout still fails. Investigating.

@jboyd01
Contributor Author

jboyd01 commented Jun 1, 2018

@kibbles-n-bytes, @nilebox could you please review?

Contributor

@MHBauer MHBauer left a comment

Seems okay so far. I'd like to take another pass at it later. Needs some more comments, I think.

)

type retryQueue struct {
Contributor

this strikes me as the wrong place to define this, or it needs a better name, as it seems very provision+instance specific.

Contributor

granted I'm not familiar with the existing queues, but is there a reason to avoid using some version of "k8s.io/client-go/util/workqueue" ?

Contributor Author

This is very instance & provision oriented. This isn't meant to be long term - it's a long read, but #2076 discusses many of the existing issues we're seeing in here. The instance controller really needs a major refactoring, and I don't picture this new provisionRetryQueue code being part of it. This is purely to plug a hole specific to instance provisioning retry.

Contributor

okay, thanks

@@ -221,7 +222,8 @@ type controller struct {
// clusterIDLock protects access to clusterID between the
// monitor writing the value from the configmap, and any
// readers passing the clusterID to a broker.
clusterIDLock sync.RWMutex
clusterIDLock sync.RWMutex
provisionRetryQueue retryQueue
Contributor

new struct defined in controller_instance.

@@ -167,7 +167,8 @@ func NewController(
//DeleteFunc: controller.servicePlanDelete,
//})
}

controller.provisionRetryQueue.retryTime = make(map[string]time.Time)
controller.provisionRetryQueue.rateLimiter = workqueue.NewItemExponentialFailureRateLimiter(minBrokerProvisioningRetryDelay, maxBrokerProvisioningRetryDelay)
Contributor

the base controller.provisionRetryQueue is just the empty struct from basic object instantiation, right? Nothing extra we need to do before we start touching its pieces?

The OO/Java guy in me wants this done in a constructor of retryQueue, but we're in a controller constructor for an internal private object, so I think we're fine for now.

Contributor

I don't like the constants being defined in the other file.

Contributor Author

Right, I hear you. But this is very specific to the instance controller; I originally had it all contained there with initialization in an init method, but I think this is clearer.

Contributor

please no inits

// provision reattempt should be made to ensure we don't overwhelm broker
// with repeated requests.
retryTime map[string]time.Time
queueLastCleaned time.Time
Contributor

needs comment.

// retry time.
func (c *controller) purgeExpiredRetryEntries() {
now := time.Now()
if now.Before(c.provisionRetryQueue.queueLastCleaned.Add(maxBrokerProvisioningRetryDelay * 2)) {
Contributor

why * 2 ?

Contributor Author

I don't want to prematurely remove any entries, but I agree * 2 is arbitrary without comments. I'll update it.


// Ensure we only purge items that aren't being acted on by retries, this shouldn't
// have any work to do but we want to be certain this queue doesn't get overly large.
overDue := now.Add(-maxBrokerProvisioningRetryDelay)
Contributor

add negative seems weird.

Contributor Author

Yeah, but I have a duration and Time.Sub() takes another Time type. Adding a negative duration is the prescribed way AFAIK.
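
(The idiom in question, as a tiny standalone example; the duration value is arbitrary:)

package main

import (
    "fmt"
    "time"
)

func main() {
    maxDelay := 20 * time.Minute
    now := time.Now()
    // time.Time.Sub takes another Time, so to move a Time backwards by a
    // Duration the usual idiom is to Add the negated Duration.
    overDue := now.Add(-maxDelay)
    fmt.Println(now.Sub(overDue)) // 20m0s
}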

overDue := now.Add(-maxBrokerProvisioningRetryDelay)
for k := range c.provisionRetryQueue.retryTime {
if c.provisionRetryQueue.retryTime[k].Before(overDue) {
glog.V(5).Infof("removed %s from provisionRetry map which had retry time of %v", k, c.provisionRetryQueue.retryTime[k])
Contributor

is there a log message before this to let us know what time "now" is so that we can see how backed off this is?

c.provisionRetryQueue.rateLimiter.Forget(k)
}
}
glog.V(5).Infof("purged expired entries, provisionRetry queue length is %v", len(c.provisionRetryQueue.retryTime))
Contributor

before and after count? or how many were purged?

Contributor Author

My real concern was ensuring the map didn't grow - I'll add additional logging indicating how many remain in the map.

@@ -339,6 +433,18 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
return nil
}

// don't DOS the broker. If we already did a provision attempt that ended with a non-terminal
// error then we set a next retry time. Observe that.
Contributor

more context for "observe that".

Contributor Author

I'll update. @MHBauer thanks for the review and comments, I'll update accordingly.

Contributor

@MHBauer MHBauer left a comment

needs one tiny comment tweak.

// Ensure we only purge items that aren't being acted on by retries, this
// shouldn't have much work to do but we want to be certain this queue
// doesn't get overly large. Entries are removed one by one when deleted
// (not orphan mitigation)of successfully provisioned, this function ensures
Contributor

is "of" supposed to be "or"?

Contributor Author

You got it, thanks, fixed. And to your comment below - yes, as I updated this comment I realized I wasn't thinking of the happy path... I updated the provision success code to remove the entry from the map and verified the change.

// (not orphan mitigation)of successfully provisioned, this function ensures
// all others get purged eventually. Due to queues and potential delays,
// only remove entries that are at least maxBrokerProvisioningRetryDelay
// past next retry time to ensure entries are not prematurely removed
Contributor

Good comment. Thanks.

if now.Before(c.provisionRetryQueue.queueLastCleaned.Add(maxBrokerProvisioningRetryDelay * 2)) {

// run periodically ensuring we don't prematurely purge any entries
timeToPurge := c.provisionRetryQueue.queueLastCleaned.Add(maxBrokerProvisioningRetryDelay * 2)
Contributor

thanks for the variable name.

@@ -1685,7 +1809,7 @@ func (c *controller) processProvisionSuccess(instance *v1beta1.ServiceInstance,
if _, err := c.updateServiceInstanceStatus(instance); err != nil {
return err
}

c.removeInstanceFromRetryMap(instance)
Contributor

I think this is a new addition mentioned in the big new comment.

Contributor

I think there should also be a case when we hit a "terminal" error for a particular generation (e.g. 400 Bad Request) and won't retry until the spec is updated. We should invoke c.removeInstanceFromRetryMap(instance) in that case as well. Otherwise, when the spec is updated, we will still retry with the accumulated delay...

Contributor Author

Agreed, good catch.

Contributor

@nilebox nilebox left a comment

Overall looks fine; please check whether we need to purge the item from the retry queue in the processTerminalProvisionFailure and processTerminalUpdateServiceInstanceFailure methods, plus some nits.


// delayProvisionRetry returns a duration which should be observed before attempting to provision
func (c *controller) getDelayForProvisionRetry(instance *v1beta1.ServiceInstance) time.Duration {
c.purgeExpiredRetryEntries()
Contributor

get... in the method name suggests that this method should not have any side effects.
I would prefer to explicitly invoke purgeExpiredRetryEntries when needed, rather than doing it here hidden from the user.

Contributor Author

I created a new worker in the main controller that will periodically invoke the purge.

c.provisionRetryQueue.mutex.Lock()
c.provisionRetryQueue.retryTime[key] = time.Now().Add(duration)
glog.V(4).Infof("provisionRetry for %s after %v", key, duration)
c.provisionRetryQueue.mutex.Unlock()
Contributor

Do defer c.provisionRetryQueue.mutex.Unlock() right after acquiring the lock instead, to be safe?

Contributor

Agree with @nilebox that defer should be used here to ensure the lock is released.

Contributor Author

I thought it was short enough with no early returns, but I've updated it.

defer c.provisionRetryQueue.mutex.RUnlock()
now := time.Now()
if t := c.provisionRetryQueue.retryTime[key]; t.After(now) {
return t.Sub(now)
Contributor

There is still a small chance for this value to be negative (between t.After() and t.Sub() some time has passed).
To be safe, this whole block can be rewritten to:

t := c.provisionRetryQueue.retryTime[key]
delay := t.Sub(now)
if delay < 0 {
  return 0
}
return delay

Contributor Author

If I was using time.Now() I'd agree, but these variables are not going to change between checking and subtracting.

Contributor

ah true.

c.provisionRetryQueue.mutex.RLock()
defer c.provisionRetryQueue.mutex.RUnlock()
now := time.Now()
if t := c.provisionRetryQueue.retryTime[key]; t.After(now) {
Contributor

we should probably check for the key to be present in the map? i.e. if t, ok := ...; ok { ... }

Contributor

Ah, it will lead to t being the zero value in that case, fine.

purgedEntries := 0
for k := range c.provisionRetryQueue.retryTime {
if c.provisionRetryQueue.retryTime[k].Before(overDue) {
glog.V(5).Infof("removed %s from provisionRetry map which had retry time of %v", k, c.provisionRetryQueue.retryTime[k])
Contributor

I prefer to cache the value in a var if it's being used twice (c.provisionRetryQueue.retryTime[k]).

@@ -339,6 +443,18 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
return nil
}

// don't DOS the broker. If we already did a provision attempt that ended with a non-terminal
// error then we set a next retry time.
if delay := c.getDelayForProvisionRetry(instance); delay > 0 {
Contributor

What if the instance spec was updated? Have we purged the item from the provisionRetryQueue by this time?
We should reset the backoff when we start a new operation with the generation incremented.

Contributor Author

Good point. I've updated the map key to use the instance name with the generation appended. This ensures the backoff is reset if the instance is updated.
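
(Roughly along these lines; the exact key format in the PR may differ, and retryKey here is a hypothetical helper:)

package controller

import (
    "fmt"

    "github.com/kubernetes-incubator/service-catalog/pkg/apis/servicecatalog/v1beta1"
)

// retryKey keys the backoff map on namespace/name plus the spec generation,
// so an updated spec (new generation) misses the old entry and starts fresh.
func retryKey(instance *v1beta1.ServiceInstance) string {
    return fmt.Sprintf("%s/%s:%d", instance.Namespace, instance.Name, instance.Generation)
}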

Contributor

@pmorie pmorie left a comment

I agree with the comments left by @nilebox.

Contributor Author

@jboyd01 jboyd01 left a comment

@nilebox thanks very much for the review. I've reworked things a bit: the backoff is now generalized to cover both provision retries and update retries. The key in the map is now the instance name with the generation appended, to ensure the backoff is reset if there is an update.

Contributor

@nilebox nilebox left a comment

A few more changes requested.

@@ -1696,6 +1844,7 @@ func (c *controller) processTerminalProvisionFailure(instance *v1beta1.ServiceIn
if failedCond == nil {
return fmt.Errorf("failedCond must not be nil")
}
c.removeInstanceFromRetryMap(instance)
return c.processProvisionFailure(instance, readyCond, failedCond, shouldMitigateOrphan)
Contributor

Here and in other places: there is a chance that we will fail to update the status inside processProvisionFailure, but we have already removed the instance from the retry map, which means the operation will be retried without backoff.
Probably fine, given that the probability of this seems low and it won't break anything.

@@ -1721,6 +1870,10 @@ func (c *controller) processProvisionFailure(instance *v1beta1.ServiceInstance,
errorMessage = fmt.Errorf(readyCond.Message)
}

// assume a provision retry will happen, set a not-before time so we don't pound the Broker
// in a constant try to provision/fail/orphan mitigation/repeat loop.
c.setNextOperationRetryTime(instance)
Contributor

This is only needed for non-terminal errors. For terminal errors you have just removed the instance from the retry map in processTerminalProvisionFailure.
So the proper place to put this is in the "non-terminal" condition block below: https://github.com/kubernetes-incubator/service-catalog/blob/3837abf27fe5d9b090a7d530a81d6cebc634a79e/pkg/controller/controller_instance.go#L1747-L1752

Contributor Author

Agreed, thanks

@@ -1696,6 +1844,7 @@ func (c *controller) processTerminalProvisionFailure(instance *v1beta1.ServiceIn
if failedCond == nil {
return fmt.Errorf("failedCond must not be nil")
}
c.removeInstanceFromRetryMap(instance)
Contributor

We should probably remove the instance if !shouldMitigateOrphan. I guess we could just move this line inside processProvisionFailure under this condition:
https://github.com/kubernetes-incubator/service-catalog/blob/3837abf27fe5d9b090a7d530a81d6cebc634a79e/pkg/controller/controller_instance.go#L1752-L1755

Contributor

Which makes me wonder whether we could just invoke removeInstanceFromRetryMap inside the clearServiceInstanceCurrentOperation method and remove this invocation from all the other places?

Contributor Author

We should probably remove the instance if !shouldMitigateOrphan. I guess we could just move this line inside processProvisionFailure under this condition:

There are cases where we do retries (and want backoff) without doing orphan mitigation in between (retrying a failed update, retrying a provision after a 400 or 403). I'll do some additional review, but at this point I'm not convinced that doing the removeInstanceFromRetryMap within clearServiceInstanceCurrentOperation is the right place.

Contributor Author

clearServiceInstanceCurrentOperation is invoked a lot, and I believe I saw it invoked as part of the orphan mitigation in debug output. I am not comfortable doing the removeInstanceFromRetryMap in clearServiceInstanceCurrentOperation, as I think it would prematurely clear out the instance. @nilebox, if you feel strongly about this let me know and I'll spend more time verifying.

Contributor

@nilebox nilebox Jun 7, 2018

I believe I saw it invoked as part of the orphan mitigation

@jboyd01 We only invoke clearServiceInstanceAsyncOsbOperation in the case of orphan mitigation, instead of clearServiceInstanceCurrentOperation. Orphan mitigation is considered part of the original Service Catalog operation. clearServiceInstanceCurrentOperation should only be invoked when we have finished the operation and no backoff is needed anymore.

Please check, I think it would make the code cleaner.

Contributor

@jboyd01 You're right, we do invoke clearServiceInstanceCurrentOperation from the processDeprovisionSuccess method even if it was orphan mitigation. We also invoke it at the beginning of orphan mitigation from the recordStartOfServiceInstanceOperation method. So it's safer to invoke removeInstanceFromRetryMap explicitly like you do, agreed.

@jboyd01
Contributor Author

jboyd01 commented Jun 6, 2018

I found that processServiceInstanceOperationError() is invoked if orphan mitigation hits an error, yet we aren't doing any backoff for orphan mitigation retries... At this point it seems best not to include orphan mitigation in this scope, so I wanted to bump up the backoff time only for update errors here; otherwise we may incorrectly increase the backoff. This is very noticeable in integration tests where an update or provisioning has failures along with orphan mitigation deprovisioning. Is there a better way to check for update failure vs deprovision failure?

func (c *controller) processServiceInstanceOperationError(instance *v1beta1.ServiceInstance, readyCond *v1beta1.ServiceInstanceCondition) error {
    // If error is from an update, assume a retry will happen, set a not-before time so we
    // don't pound the Broker in a constant update/fail/update repeat loop.
    if strings.Contains(readyCond.Reason, "UpdateInstance") {
        c.setNextOperationRetryTime(instance)
    }

@nilebox
Contributor

nilebox commented Jun 7, 2018

@jboyd01 why don't you want to do backoff for orphan mitigation retries, sorry? Orphan mitigation is part of the original operation. If you don't do backoff for orphan mitigation, won't you end up with no backoff for the operation when you retry provisioning after a successful orphan mitigation? Isn't the whole point of this PR to keep the exponential backoff through provision -> provision fail -> orphan mitigation -> provision -> ...?

@jboyd01
Contributor Author

jboyd01 commented Jun 7, 2018

@nilebox The original issue is that provisioning a certain service keeps failing with a specific error. OM (orphan mitigation) worked fine, and the successful OM was part of the reason we didn't back off on the next provision. Given the way the original reconciliation & worker retry loops work, OM success was resetting the worker so there was no backoff on the retry to provision. The solution I had been aiming for was provision -> error -> OM -> backoff -> provision -> error -> OM -> backoff...

With your questioning, I see it would be pretty straightforward to use the same backoff value (if it's set) prior to doing OM (deprovision). Successful OM shouldn't reset the backoff though. If OM fails, it would make sense to increase the backoff prior to retrying OM. Once OM succeeds, the provision would fire immediately (it already spent the backoff prior to OM). I think this means the provision failure loop would look like this as long as OM doesn't get an error:

provision -> error -> backoff -> OM -> provision -> error -> backoff -> OM -> provision...

and if there is an error doing OM I think we add another backoff:

provision -> error -> backoff -> OM -> error -> backoff -> OM (success) -> provision -> error -> backoff -> OM -> provision...

again noting that OM (success) does not reset the backoff.

A bonus to this is that in addition to provision and update, we'd add deprovision & OM to the backoff retry.

WDYT?

@nilebox
Contributor

nilebox commented Jun 7, 2018

Not sure if I fully understand. I don't like waiting before performing OM.

Is your in-memory backoff flexible enough to set the backoff upon provision failure, but respect it only when we retry provisioning? That is what we really want in the end, I think.
i.e. in case of orphan mitigation:

provision -> error + update in-memory delay -> OM -> apply in-memory delay + provision -> ...

In other words, given that we consider OM to be part of the original operation, we don't want to apply your custom backoff there. Would it be possible?

Orphan mitigation could fail itself and should be retried with backoff delay as well... But in that case the normal rate limiting queue should take effect I think?

@nilebox
Contributor

nilebox commented Jun 7, 2018

To word it differently: we should only apply this custom backoff when we retry the operation.
Let me just give two examples:

  • For sync provisioning the operation we retry is: "provision -> error -> OM". Then we try provisioning again, and that's where delay should be applied.
  • For async updates the operation is "start updating -> poll last_operation -> failed". Then we try updating again, and that's where delay should be applied.

So we should be very clear about:

  • When do we set the delay? When a retriable error has occurred (so we 100% know we'll retry)
  • When do we apply the delay? In the beginning of retrying the operation (see above)
  • When do we reset the delay? When the operation succeeded, or finished with a terminal error.
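
(A minimal standalone model of those three rules, not the PR's actual code; the names and durations are illustrative:)

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

type retryTracker struct {
    rl    workqueue.RateLimiter
    until map[string]time.Time // instance key -> earliest next attempt
}

// Set the delay: a retriable error occurred, so we know a retry is coming.
func (t *retryTracker) setDelay(key string) {
    t.until[key] = time.Now().Add(t.rl.When(key))
}

// Apply the delay: at the start of the retried operation, before calling the
// broker; returns how long the caller should still wait (0 means go ahead).
func (t *retryTracker) remaining(key string) time.Duration {
    if d := time.Until(t.until[key]); d > 0 {
        return d
    }
    return 0
}

// Reset the delay: the operation succeeded or hit a terminal error.
func (t *retryTracker) reset(key string) {
    delete(t.until, key)
    t.rl.Forget(key)
}

func main() {
    t := &retryTracker{
        rl:    workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute),
        until: map[string]time.Time{},
    }
    t.setDelay("test-ns/my-instance")               // retriable provision failure
    fmt.Println(t.remaining("test-ns/my-instance")) // ~1s to wait before retrying
    t.reset("test-ns/my-instance")                  // success or terminal failure
}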

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2018
Contributor

@staebler staebler left a comment

Much nicer and easier to follow.

overDue := now.Add(-maxBrokerOperationRetryDelay)
purgedEntries := 0
for k := range c.instanceOperationRetryQueue.instances {
if due := c.instanceOperationRetryQueue.instances[k]; due.calculatedRetryTime.Before(overDue) {
Contributor

Get the value from the instance map as part of the for statement instead of finding the value using the key.

for k, v := range c.instanceOperationRetryQueue.instances {
   if v.calculatedRetryTime.Before(overDue) {

Contributor Author

Yes, thanks!

purgedEntries := 0
for k := range c.instanceOperationRetryQueue.instances {
if due := c.instanceOperationRetryQueue.instances[k]; due.calculatedRetryTime.Before(overDue) {
glog.V(5).Infof("removing %s from instanceOperationRetryQueue which had retry time of %v", k, due)
Contributor

Should due be due.calculatedRetryTime?

Contributor Author

absolutely, good catch

purgedEntries++
}
}
glog.V(5).Infof("purged %v expired entries, instanceOperationRetryQueue queue length is %v", purgedEntries, len(c.instanceOperationRetryQueue.instances))
Contributor

The log entry is not accurate since it is not really the queue length that is being shown.

Contributor Author

You are objecting to the text of the message, not the calculation of the number of entries, right?

Contributor

Yes, just the text of the message.

type backoffEntry struct {
generation int64
calculatedRetryTime time.Time // earliest time we should retry
dirty bool // true indicates new backoff should be calculated
Contributor

I find it cleaner to calculate the delay from start of request to start of request rather than from end of request to start of request. Asymptotically, it approaches the same delay in either case. I'm not convinced that it makes much difference operationally how the delay is calculated, so long as it is increasing. If the delay were calculated from start-to-start, then you would not need to keep track of a dirty bit. The calculatedRetryTime would be set directly in setRetryBackoffRequired.

Contributor Author

One issue I fought with here is that in one of the non-happy-path loops, you provision, fail, then have to go through orphan mitigation, which may fail multiple times, and then you do the retry. On short backoffs the previously calculated retry time added no delay because orphan mitigation took "so long".

I agree it's a bit convoluted, but this will be a short-term solution (measured in months) before it's replaced with a redesign for proper backoff/retry.

Contributor

OK. That makes sense.

@jboyd01
Contributor Author

jboyd01 commented Jul 23, 2018

The build failure in pull-service-catalog-xbuild is a pipeline problem - this builds fine and is ready for final review. Just need one more LGTM, please.

Contributor

@staebler staebler left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2018
@staebler staebler added the LGTM2 label Jul 23, 2018
@MHBauer
Contributor

MHBauer commented Jul 24, 2018

/retest

@jboyd01
Contributor Author

jboyd01 commented Jul 24, 2018

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jboyd01

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 24, 2018
Contributor

@nilebox nilebox left a comment

Just one nit, feel free to ignore.

glog.Errorf(pcb.Messagef("Couldn't create a key for object %+v: %v", instance, err))
return false
}
delay := time.Millisecond * 0
Contributor

You don't need to declare a zero delay var there, you could just do

delay := retryEntry.calculatedRetryTime.Sub(now) 

below (replace = with :=)

Contributor Author

Agreed @nilebox, I think this was left over after a refactor and I missed it. This was just merged; next time I'm in here I'll clean it up.

@k8s-ci-robot k8s-ci-robot merged commit f85e626 into kubernetes-retired:master Jul 24, 2018
c.instanceOperationRetryQueue.mutex.Lock()
defer c.instanceOperationRetryQueue.mutex.Unlock()
retryEntry, found := c.instanceOperationRetryQueue.instances[key]
if !found || retryEntry.generation != instance.Generation {
Contributor

@nilebox nilebox Aug 14, 2018

@jboyd01 I think we may need to also store and compare the instance's UID.
@joshk has reported that when he deletes an old instance and creates a new one under the same name, he gets a RetryBackoff event immediately. I suspect this can happen when the old instance had Generation = 1 (created and never changed), since the new instance will get the same generation.

Contributor Author

Interesting. Thanks for digging in on it @nilebox.

@nilebox
Contributor

nilebox commented Aug 15, 2018

@jboyd01 can we simply use the UID as the key for this rate-limiting map instead of namespace/name?

@jboyd01
Contributor Author

jboyd01 commented Aug 15, 2018

@nilebox Right, I was thinking the same thing. I'm also looking into purging the entry from the retry maps after successfully doing a non-orphan-mitigation deprovision.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. LGTM1 LGTM2 size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Successfully merging this pull request may close these issues.

Provide delay after performing orphan mitigation when Service Instance was failed
8 participants