This repository has been archived by the owner on May 6, 2022. It is now read-only.

delay between provisioning retries #2065

Merged
merged 15 commits into master from provisioning-backoff
Jul 24, 2018

Conversation

jboyd01
Contributor

@jboyd01 jboyd01 commented May 24, 2018

Fixes #2025 - at least it prevents the Controller from blasting Brokers with a constant provision retry/orphan mitigation loop.

In addition to #2025, see #2076 for additional details on problems with instance reconciliation. This PR is a near-term band-aid to prevent pain at the brokers.

A new in-memory list of instances with provision retry times provides a delay that prevents a tight provision/provision-failure/orphan-mitigation loop from pounding brokers without pause when provisioning ends with a non-terminal failure.

This short-term fix adds the retryQueue, where we introduce additional delay on top of what the reconciler loop was originally doing. Before doing a broker operation (Provision or Update) we check the backoffBeforeRetrying and retryTime maps to see if we should first delay. An entry in backoffBeforeRetrying indicates the operation previously failed and we need to get a new exponential backoff delay from the RateLimiter. The delay is calculated, a new time is set in retryTime, and the instance is put back into the main worker queue with an AddAfter(time), which causes the worker to reprocess it after the delay. On that next pass, we verify the delay has expired and then set the dirty bit in backoffBeforeRetrying in anticipation of another failure. If the operation is successful, the entries are cleared from the retryQueue; a failure runs through the process again. A background task periodically purges old entries from the retryQueue.

The main point above is that all the backoff processing happens just before we execute the broker operation. There are a lot of edge cases and post-error processing that otherwise make it difficult to instrument the backoff properly in all the necessary places. This is a best-effort band-aid, a temporary workaround until the overall reconciliation flow can be reworked properly (issue #2076).
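
(A minimal sketch of the gate described above, for readers who want the shape before reading the diff. The names here - shouldBackoff, instances, and the field layout - are illustrative approximations, not the exact identifiers in the PR:)

package controller

import (
    "sync"
    "time"

    "k8s.io/client-go/util/workqueue"
)

// Illustrative shapes only; the PR's actual struct and field names differ slightly.
type backoffEntry struct {
    calculatedRetryTime time.Time // earliest time the broker call should be retried
    dirty               bool      // true: the last attempt failed, a fresh delay must be calculated
}

type retryQueue struct {
    mutex       sync.Mutex
    instances   map[string]backoffEntry
    rateLimiter workqueue.RateLimiter
}

// shouldBackoff is called just before a Provision/Update broker call and
// returns true when the caller should requeue the key and skip the broker.
func shouldBackoff(rq *retryQueue, instanceQueue workqueue.RateLimitingInterface, key string) bool {
    rq.mutex.Lock()
    defer rq.mutex.Unlock()

    entry, found := rq.instances[key]
    if !found {
        return false // no recorded failure: talk to the broker immediately
    }
    now := time.Now()
    switch {
    case entry.dirty:
        // the last attempt failed: compute the next exponential delay and
        // push the key back onto the worker queue to be reprocessed later
        delay := rq.rateLimiter.When(key)
        entry.calculatedRetryTime = now.Add(delay)
        entry.dirty = false
        rq.instances[key] = entry
        instanceQueue.AddAfter(key, delay)
        return true
    case now.Before(entry.calculatedRetryTime):
        // requeued early (e.g. by an unrelated event): keep waiting
        instanceQueue.AddAfter(key, entry.calculatedRetryTime.Sub(now))
        return true
    default:
        // delay has elapsed: mark the entry dirty again in anticipation of
        // another failure; a success removes the entry and calls Forget
        entry.dirty = true
        rq.instances[key] = entry
        return false
    }
}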

We do not do additional backoff for Deprovisioning (regular or Orphan Mitigation) because the OM status handling is embedded within the delete, and it's difficult to be certain whether a delete is part of OM and to ensure we don't clear the overall backoff if there are failures and then a success within OM.

Binding reconciliation likely has the exact same issues.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 24, 2018
@jboyd01 jboyd01 force-pushed the provisioning-backoff branch 5 times, most recently from 093ee08 to 6a0f741 Compare May 30, 2018 12:34
@jboyd01
Contributor Author

jboyd01 commented May 30, 2018

@luksa, @kibbles-n-bytes could you review? The Travis error is in building the docsite, an unrelated failure that I'm attempting to address, but I believe this PR is ready to go.

@jboyd01 jboyd01 changed the title WIP - delay between provisioning retries delay between provisioning retries May 30, 2018
@@ -91,6 +92,17 @@ const (
clusterIdentifierKey string = "clusterid"
)

type retryQueue struct {
mutex sync.RWMutex // lock to be used with provisionRetryTime
Contributor

comment references provisionRetryTime instead of retryTime

@@ -60,6 +60,7 @@ const (
defaultLeaderElectionNamespace = "kube-system"
defaultReconciliationRetryDuration = 7 * 24 * time.Hour
defaultOperationPollingMaximumBackoffDuration = 20 * time.Minute
defaultProvisioningRetryInterval = 5 * time.Minute
Contributor

My gut feeling is that this is way too long. Especially when the error during provisioning is just a temporary glitch.

Contributor Author

It's way too long when you have a single failure, but when the broker is in a bad state and just sits there returning a retryable error it's pretty frequent. Optimally we'd do a set of quick retries and then fall back to every 5 or 10 minutes, but I don't think the effort for that is worth it as we need to rework this anyway (#2076). Suggestions on a better backoff? Once a minute? Note that this is overridable with the flag --provision-retry-interval.

@@ -99,8 +111,16 @@ func (c *controller) instanceAdd(obj interface{}) {
glog.Errorf("Couldn't get key for object %+v: %v", obj, err)
return
}
c.instanceQueue.AddRateLimited(key)
Contributor

I'm not 100% sure about this, but I think this is a bad idea. The instanceAdd() method gets called in instanceUpdate(), which gets called when the user changes the ServiceInstance object.

Now imagine what happens when the user initially posts a bad ServiceInstance, which causes the reconcile method to back off. Eventually, the backoff time delay is a few minutes. If the user then updates the ServiceInstance, the controller won't process the update immediately, but minutes later.

Also, I believe every call to AddRateLimited increases the delay. So, with this change, every update to the object increases the delay.

Contributor Author

I'm fine with reverting this so it isn't rate limited - I did it per #2025 (comment) - but I do want to note that even if we back off and retry the maximum number of times (15), all the retries complete in under 3 minutes. Check out the backoff times: https://github.com/kubernetes-incubator/service-catalog/blob/6a693a7661654faa2dffa2d50d44f8d676207b7c/pkg/controller/controller.go#L57
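
(For reference, a back-of-envelope check of the "under 3 minutes" claim, assuming a 5ms base delay doubling each attempt, which is consistent with the ~82s maximum quoted in the next comment; the numbers are illustrative, not taken from the linked file:)

package main

import (
    "fmt"
    "time"
)

func main() {
    base := 5 * time.Millisecond
    total := time.Duration(0)
    for i := 0; i < 15; i++ {
        total += base << uint(i) // 5ms, 10ms, 20ms, ..., ~82s
    }
    fmt.Println(total) // ~2m43s, i.e. "under 3 minutes"
}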

Contributor

I'm very confident this needs to be reverted. Even if the maximum backoff time is "just" 82s, that's too much if it means that's how long it takes the controller to start acting on a change made by the user to the ServiceInstance.

return
}
c.provisionRetryQueue.mutex.Lock()
c.provisionRetryQueue.retryTime[key] = time.Now().Add(c.provisionRetryQueue.retryInterval)
Contributor

Can you take a look at NewItemExponentialFailureRateLimiter() in github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/util/workqueue/default_rate_limiters.go ?

I think adding a nice exponential backoff instead of a constant delay should be fairly simple. The rate limiter has a When() method you can call to get the next time.

But I haven't really looked at the rate limiters closely. So I may be wrong :)
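
(For reference, a minimal standalone example of that rate limiter in use; the durations and key here are arbitrary:)

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    // Per-item exponential backoff: 1s, 2s, 4s, ... capped at 5 minutes.
    // Each item (e.g. an instance key) is tracked independently.
    rl := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 5*time.Minute)

    key := "test-ns/my-instance"
    for i := 0; i < 4; i++ {
        fmt.Println(rl.When(key)) // 1s, 2s, 4s, 8s
    }
    rl.Forget(key)            // reset after a success (or a terminal failure)
    fmt.Println(rl.When(key)) // back to 1s
}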

Contributor Author

An exponential backoff is ideal; it's what we were supposed to be doing here. But we constantly reset the notion of failure/success when we do status updates or complete the orphan mitigation, and we never quit after max retries; it just restarts all over.

To use exponential backoff we need to maintain more state, including the number of retries.

Ideally we would have a queue much like today with a very short backoff/retry for kube reconciliation, but we probably need another queue for remote broker operations with a larger backoff. I 100% agree that exponential backoff is the right fix, but it's going to take a large rework. The only goal of this PR is to stop us from pounding on the broker.

Contributor

No, I meant just create a new rate limiter instead of the retryTime map. The rate limiter keeps track of everything. All you need to do is call rateLimiter.When(instance) here and it'll keep track of the individual delays for each ServiceInstance.

Contributor Author

I see what you are saying Marko, great idea. I think I'll have to keep the retryMap in addition to adding the rateLimiter. The rateLimiter doesn't have support for checking whether an item is in the queue and getting the next execution time. But I really like the idea of exponential backoff.

@@ -339,6 +421,18 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
return nil
}

// don't DOS the broker. If we already did a provision attemp that ended with a non-terminal
Contributor

s/attemp/attempt/

Contributor Author

Thanks for the review @luksa, I've updated the typo and the incorrect comment. I'm open to suggestions on a better retry interval; I'd be OK with 60 seconds vs 5 minutes. In discussion with @pmorie we were thinking 20 minutes, but that really focuses on constant-failure cases.

@jboyd01 jboyd01 force-pushed the provisioning-backoff branch 2 times, most recently from e28d57f to 9aed723 Compare May 31, 2018 02:12
// with repeated requests.
retryTime map[string]time.Time
queueLastCleaned time.Time
rateLimter workqueue.RateLimiter
Contributor

missing 'i' in rateLimter

return
}
glog.Infof("Instance %v/%v: %+v", namespace, name, instance)
}
Contributor

I'm guessing you used this method during testing. It's not used anywhere.

Contributor Author

Thanks @luksa, I've reverted back to non rate limited add and corrected the other two issues.

@luksa
Contributor

luksa commented May 31, 2018

LGTM

I'd /lgtm but I'm not in the kubernetes-incubator org yet. Hopefully they'll add me today.

@jboyd01
Contributor Author

jboyd01 commented May 31, 2018

Hmm... It appears changing c.instanceQueue.AddRateLimited(key) back to c.instanceQueue.Add(key) in instanceAdd() breaks integration tests. Even greatly increasing the overall wait timeout still fails. Investigating.

@jboyd01
Contributor Author

jboyd01 commented Jun 1, 2018

@kibbles-n-bytes, @nilebox could you please review?

Contributor

@MHBauer MHBauer left a comment

Seems okay so far. I'd like to take another pass at it later. Needs some more comments, I think.

)

type retryQueue struct {
Contributor

this strikes me as the wrong place to define this, or it needs a better name, as it seems very provision+instance specific.

Contributor

granted I'm not familiar with the existing queues, but is there a reason to avoid using some version of "k8s.io/client-go/util/workqueue" ?

Contributor Author

This is very instance & provision oriented. This isn't meant to be long term - it's a long read, but #2076 discusses many of the existing issues we're seeing in here. The instance controller really needs a major refactoring, and I don't picture this new provisionRetryQueue code being part of it. This is purely to plug a hole specific to instance provisioning retry.

Contributor

okay, thanks

@@ -221,7 +222,8 @@ type controller struct {
// clusterIDLock protects access to clusterID between the
// monitor writing the value from the configmap, and any
// readers passing the clusterID to a broker.
clusterIDLock sync.RWMutex
clusterIDLock sync.RWMutex
provisionRetryQueue retryQueue
Contributor

new struct defined in controller_instance.

@@ -167,7 +167,8 @@ func NewController(
//DeleteFunc: controller.servicePlanDelete,
//})
}

controller.provisionRetryQueue.retryTime = make(map[string]time.Time)
controller.provisionRetryQueue.rateLimiter = workqueue.NewItemExponentialFailureRateLimiter(minBrokerProvisioningRetryDelay, maxBrokerProvisioningRetryDelay)
Contributor

the base controller.provisionRetryQueue is just the empty struct from basic object instantiation, right? Nothing extra we need to do before we start touching its pieces?

The OO/Java guy in me wants this done in a constructor of retryQueue, but we're in a controller constructor for an internal private object, so I think we're fine for now.

Contributor

I don't like the constants being defined in the other file.

Contributor Author

Right, I hear you. But this is very specific to the instance controller; I originally had it all contained there with initialization in an init method, but I think this is clearer.

Contributor

please no inits

// provision reattempt should be made to ensure we don't overwhelm broker
// with repeated requests.
retryTime map[string]time.Time
queueLastCleaned time.Time
Contributor

needs comment.

// retry time.
func (c *controller) purgeExpiredRetryEntries() {
now := time.Now()
if now.Before(c.provisionRetryQueue.queueLastCleaned.Add(maxBrokerProvisioningRetryDelay * 2)) {
Contributor

why * 2 ?

Contributor Author

I don't want to prematurely remove any entries, but I agree * 2 is arbitrary without comments. I'll update it.


// Ensure we only purge items that aren't being acted on by retries, this shouldn't
// have any work to do but we want to be certain this queue doesn't get overly large.
overDue := now.Add(-maxBrokerProvisioningRetryDelay)
Contributor

add negative seems weird.

Contributor Author

Yeah, but I have a duration and Time.Sub() takes another Time type. Adding a negative duration is the prescribed way AFAIK.
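
(The idiom in question, as a tiny standalone example; the duration value is arbitrary:)

package main

import (
    "fmt"
    "time"
)

func main() {
    maxDelay := 20 * time.Minute
    now := time.Now()
    // time.Time.Sub takes another Time, so to move a Time backwards by a
    // Duration the usual idiom is to Add the negated Duration.
    overDue := now.Add(-maxDelay)
    fmt.Println(now.Sub(overDue)) // 20m0s
}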

overDue := now.Add(-maxBrokerProvisioningRetryDelay)
for k := range c.provisionRetryQueue.retryTime {
if c.provisionRetryQueue.retryTime[k].Before(overDue) {
glog.V(5).Infof("removed %s from provisionRetry map which had retry time of %v", k, c.provisionRetryQueue.retryTime[k])
Contributor

is there a log message before this to let us know what time "now" is so that we can see how backed off this is?

c.provisionRetryQueue.rateLimiter.Forget(k)
}
}
glog.V(5).Infof("purged expired entries, provisionRetry queue length is %v", len(c.provisionRetryQueue.retryTime))
Contributor

before and after count? or how many were purged?

Contributor Author

My real concern was ensuring the map didn't grow - I'll add additional logging indicating how many remain in the map.

@@ -339,6 +433,18 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
return nil
}

// don't DOS the broker. If we already did a provision attempt that ended with a non-terminal
// error then we set a next retry time. Observe that.
Contributor

more context for "observe that".

Contributor Author

I'll update. @MHBauer thanks for the review and comments, I'll update accordingly.

Contributor

@MHBauer MHBauer left a comment

needs one tiny comment tweak.

// Ensure we only purge items that aren't being acted on by retries, this
// shouldn't have much work to do but we want to be certain this queue
// doesn't get overly large. Entries are removed one by one when deleted
// (not orphan mitigation)of successfully provisioned, this function ensures
Contributor

is "of" supposed to be "or"?

Contributor Author

You got it, thanks, fixed. And to your comment below - yes, as I updated this comment I realized I wasn't thinking of the happy path... I updated the provision success code to remove the entry from the map and verified the change.

// (not orphan mitigation)of successfully provisioned, this function ensures
// all others get purged eventually. Due to queues and potential delays,
// only remove entries that are at least maxBrokerProvisioningRetryDelay
// past next retry time to ensure entries are not prematurely removed
Contributor

Good comment. Thanks.

if now.Before(c.provisionRetryQueue.queueLastCleaned.Add(maxBrokerProvisioningRetryDelay * 2)) {

// run periodically ensuring we don't prematurely purge any entries
timeToPurge := c.provisionRetryQueue.queueLastCleaned.Add(maxBrokerProvisioningRetryDelay * 2)
Contributor

thanks for the variable name.

@@ -1685,7 +1809,7 @@ func (c *controller) processProvisionSuccess(instance *v1beta1.ServiceInstance,
if _, err := c.updateServiceInstanceStatus(instance); err != nil {
return err
}

c.removeInstanceFromRetryMap(instance)
Contributor

I think this is a new addition mentioned in the big new comment.

Contributor

I think there should also be a case when we hit a "terminal" error for a particular generation (e.g. 400 Bad Request) and won't retry until the spec is updated. We should invoke c.removeInstanceFromRetryMap(instance) in that case as well. Otherwise, when the spec is updated, we will still retry with the accumulated delay...

Contributor Author

Agreed, good catch.

Contributor

@nilebox nilebox left a comment

Overall looks fine; please check whether we need to purge the item from the retry queue in the processTerminalProvisionFailure and processTerminalUpdateServiceInstanceFailure methods, plus some nits.


// delayProvisionRetry returns a duration which should be observed before attempting to provision
func (c *controller) getDelayForProvisionRetry(instance *v1beta1.ServiceInstance) time.Duration {
c.purgeExpiredRetryEntries()
Contributor

get... in the method name suggests that this method should not have any side effects.
I would prefer to explicitly invoke purgeExpiredRetryEntries when needed, rather than doing it here hidden from the user.

Contributor Author

I created a new worker in the main controller that will periodically invoke the purge.

c.provisionRetryQueue.mutex.Lock()
c.provisionRetryQueue.retryTime[key] = time.Now().Add(duration)
glog.V(4).Infof("provisionRetry for %s after %v", key, duration)
c.provisionRetryQueue.mutex.Unlock()
Contributor

Do defer c.provisionRetryQueue.mutex.Unlock() right after acquiring the lock instead, to be safe?

Contributor

Agree with @nilebox that defer should be used here to ensure the lock is released.

Contributor Author

I thought it was short enough with no early returns, but I've updated it.

defer c.provisionRetryQueue.mutex.RUnlock()
now := time.Now()
if t := c.provisionRetryQueue.retryTime[key]; t.After(now) {
return t.Sub(now)
Contributor

There is still a small chance for this value to be negative (between t.After() and t.Sub() some time has passed).
To be safe, this whole block can be rewritten to:

t := c.provisionRetryQueue.retryTime[key]
delay := t.Sub(now)
if delay < 0 {
  return 0
}
return delay

Contributor Author

If I was using time.Now() I'd agree, but these variables are not going to change between checking and subtracting.

Contributor

ah true.

c.provisionRetryQueue.mutex.RLock()
defer c.provisionRetryQueue.mutex.RUnlock()
now := time.Now()
if t := c.provisionRetryQueue.retryTime[key]; t.After(now) {
Contributor

we should probably check for the key to be present in the map? i.e. if t, ok := ...; ok { ... }

Contributor

Ah, it will lead to t being the zero value in that case, fine.

purgedEntries := 0
for k := range c.provisionRetryQueue.retryTime {
if c.provisionRetryQueue.retryTime[k].Before(overDue) {
glog.V(5).Infof("removed %s from provisionRetry map which had retry time of %v", k, c.provisionRetryQueue.retryTime[k])
Contributor

I prefer to cache the value in a var if it's being used twice (c.provisionRetryQueue.retryTime[k]).

@@ -339,6 +443,18 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
return nil
}

// don't DOS the broker. If we already did a provision attempt that ended with a non-terminal
// error then we set a next retry time.
if delay := c.getDelayForProvisionRetry(instance); delay > 0 {
Contributor

What if the instance spec was updated? Have we purged the item from the provisionRetryQueue by this time?
We should reset the backoff when we start a new operation with the generation incremented.

Contributor Author

Good point. I've updated the map key to use the instance name with the generation appended. This ensures the backoff is reset if the instance is updated.
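
(Roughly along these lines; the exact key format in the PR may differ, and retryKey here is a hypothetical helper:)

package controller

import (
    "fmt"

    "github.com/kubernetes-incubator/service-catalog/pkg/apis/servicecatalog/v1beta1"
)

// retryKey keys the backoff map on namespace/name plus the spec generation,
// so an updated spec (new generation) misses the old entry and starts fresh.
func retryKey(instance *v1beta1.ServiceInstance) string {
    return fmt.Sprintf("%s/%s:%d", instance.Namespace, instance.Name, instance.Generation)
}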

Contributor

@pmorie pmorie left a comment

I agree with the comments left by @nilebox.

Contributor Author

@jboyd01 jboyd01 left a comment

@nilebox thanks very much for the review. I've reworked things a bit: the backoff is now generalized to cover both provision retries and update retries. The key in the map is now the instance name with the generation appended, to ensure the backoff is reset if there is an update.

Contributor

@nilebox nilebox left a comment

A few more changes requested.

@@ -1696,6 +1844,7 @@ func (c *controller) processTerminalProvisionFailure(instance *v1beta1.ServiceIn
if failedCond == nil {
return fmt.Errorf("failedCond must not be nil")
}
c.removeInstanceFromRetryMap(instance)
return c.processProvisionFailure(instance, readyCond, failedCond, shouldMitigateOrphan)
Contributor

Here and in other places: there is a chance that we will fail to update the status inside processProvisionFailure, but we have already removed the instance from the retry map, which means the operation will be retried without backoff.
Probably fine, given that the probability of this seems low and it won't break anything.

@@ -1721,6 +1870,10 @@ func (c *controller) processProvisionFailure(instance *v1beta1.ServiceInstance,
errorMessage = fmt.Errorf(readyCond.Message)
}

// assume a provision retry will happen, set a not-before time so we don't pound the Broker
// in a constant try to provision/fail/orphan mitigation/repeat loop.
c.setNextOperationRetryTime(instance)
Contributor

This is only needed for non-terminal errors. For terminal errors you have just removed the instance from the retry map in processTerminalProvisionFailure.
So the proper place to put this is in the "non-terminal" condition block below: https://github.com/kubernetes-incubator/service-catalog/blob/3837abf27fe5d9b090a7d530a81d6cebc634a79e/pkg/controller/controller_instance.go#L1747-L1752

Contributor Author

Agreed, thanks

@@ -1696,6 +1844,7 @@ func (c *controller) processTerminalProvisionFailure(instance *v1beta1.ServiceIn
if failedCond == nil {
return fmt.Errorf("failedCond must not be nil")
}
c.removeInstanceFromRetryMap(instance)
Contributor

We should probably remove the instance if !shouldMitigateOrphan. I guess we could just move this line inside processProvisionFailure under this condition:
https://github.com/kubernetes-incubator/service-catalog/blob/3837abf27fe5d9b090a7d530a81d6cebc634a79e/pkg/controller/controller_instance.go#L1752-L1755

Contributor

Which makes me wonder whether we could just invoke removeInstanceFromRetryMap inside the clearServiceInstanceCurrentOperation method and remove this invocation from all the other places?

Contributor Author

We should probably remove the instance if !shouldMitigateOrphan. I guess we could just move this line inside processProvisionFailure under this condition:

There are cases where we do retries (and want backoff) without doing orphan mitigation in between (retrying a failed update, retrying a provision after a 400 or 403). I'll do some additional review, but at this point I'm not convinced that doing the removeInstanceFromRetryMap within clearServiceInstanceCurrentOperation is the right place.

Contributor Author

clearServiceInstanceCurrentOperation is invoked a lot, and I believe I saw it invoked as part of the orphan mitigation in debug output. I am not comfortable doing the removeInstanceFromRetryMap in clearServiceInstanceCurrentOperation, as I think it would prematurely clear out the instance. @nilebox, if you feel strongly about this let me know and I'll spend more time verifying.

Contributor

@nilebox nilebox Jun 7, 2018

I believe I saw it invoked as part of the orphan mitigation

@jboyd01 We only invoke clearServiceInstanceAsyncOsbOperation in the case of orphan mitigation, instead of clearServiceInstanceCurrentOperation. Orphan mitigation is considered part of the original Service Catalog operation. clearServiceInstanceCurrentOperation should only be invoked when we have finished the operation and no backoff is needed anymore.

Please check, I think it would make the code cleaner.

Contributor

@jboyd01 You're right, we do invoke clearServiceInstanceCurrentOperation from the processDeprovisionSuccess method even if it was orphan mitigation. We also invoke it at the beginning of orphan mitigation from the recordStartOfServiceInstanceOperation method. So it's safer to invoke removeInstanceFromRetryMap explicitly like you do, agreed.

@jboyd01
Contributor Author

jboyd01 commented Jun 6, 2018

I found that processServiceInstanceOperationError() is invoked if orphan mitigation hits an error, yet we aren't doing any backoff for orphan mitigation retries... At this point it seems best not to include orphan mitigation in this scope, so I wanted to bump up the backoff time only for update errors here; otherwise we may incorrectly increase the backoff. This is very noticeable in integration tests where an update or provisioning has failures along with orphan mitigation deprovisioning. Is there a better way to check for update failure vs deprovision failure?

func (c *controller) processServiceInstanceOperationError(instance *v1beta1.ServiceInstance, readyCond *v1beta1.ServiceInstanceCondition) error {
    // If error is from an update, assume a retry will happen, set a not-before time so we
    // don't pound the Broker in a constant update/fail/update repeat loop.
    if strings.Contains(readyCond.Reason, "UpdateInstance") {
        c.setNextOperationRetryTime(instance)
    }

@nilebox
Contributor

nilebox commented Jun 7, 2018

@jboyd01 why don't you want to do backoff for orphan mitigation retries, sorry? Orphan mitigation is part of the original operation. If you don't do backoff for orphan mitigation, won't you end up with no backoff for the operation when you retry provisioning after a successful orphan mitigation? Isn't the whole point of this PR to keep the exponential backoff through provision -> provision fail -> orphan mitigation -> provision -> ...?

@jboyd01
Contributor Author

jboyd01 commented Jun 7, 2018

@nilebox The original issue is that provisioning a certain service keeps failing with a specific error. OM (orphan mitigation) worked fine, and the successful OM was part of the reason we didn't back off on the next provision. Given the way the original reconciliation & worker retry loops work, OM success was resetting the worker so there was no backoff on the retry to provision. The solution I had been aiming for was provision -> error -> OM -> backoff -> provision -> error -> OM -> backoff...

With your questioning, I see it would be pretty straightforward to use the same backoff value (if it's set) prior to doing OM (deprovision). Successful OM shouldn't reset the backoff though. If OM fails, it would make sense to increase the backoff prior to retrying OM. Once OM succeeds, the provision would fire immediately (it already spent the backoff prior to OM). I think this means the provision failure loop would look like this as long as OM doesn't get an error:

provision -> error -> backoff -> OM -> provision -> error -> backoff -> OM -> provision...

and if there is an error doing OM I think we add another backoff:

provision -> error -> backoff -> OM -> error -> backoff -> OM (success) -> provision -> error -> backoff -> OM -> provision...

again noting that OM (success) does not reset the backoff.

A bonus to this is that in addition to provision and update, we'd add deprovision & OM to the backoff retry.

WDYT?

@nilebox
Contributor

nilebox commented Jun 7, 2018

Not sure if I fully understand. I don't like waiting before performing OM.

Is your in-memory backoff flexible enough to set the backoff upon provision failure, but respect it only when we retry provisioning? That is what we really want in the end, I think.
i.e. in case of orphan mitigation:

provision -> error + update in-memory delay -> OM -> apply in-memory delay + provision -> ...

In other words, given that we consider OM to be part of the original operation, we don't want to apply your custom backoff there. Would it be possible?

Orphan mitigation could fail itself and should be retried with backoff delay as well... But in that case the normal rate limiting queue should take effect I think?

@nilebox
Contributor

nilebox commented Jun 7, 2018

To word it differently: we should only apply this custom backoff when we retry the operation.
Let me just give two examples:

  • For sync provisioning the operation we retry is: "provision -> error -> OM". Then we try provisioning again, and that's where delay should be applied.
  • For async updates the operation is "start updating -> poll last_operation -> failed". Then we try updating again, and that's where delay should be applied.

So we should be very clear about:

  • When do we set the delay? When a retriable error has occurred (so we 100% know we'll retry)
  • When do we apply the delay? In the beginning of retrying the operation (see above)
  • When do we reset the delay? When the operation succeeded, or finished with a terminal error.
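
(A minimal standalone model of those three rules, not the PR's actual code; the names and durations are illustrative:)

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

type retryTracker struct {
    rl    workqueue.RateLimiter
    until map[string]time.Time // instance key -> earliest next attempt
}

// Set the delay: a retriable error occurred, so we know a retry is coming.
func (t *retryTracker) setDelay(key string) {
    t.until[key] = time.Now().Add(t.rl.When(key))
}

// Apply the delay: at the start of the retried operation, before calling the
// broker; returns how long the caller should still wait (0 means go ahead).
func (t *retryTracker) remaining(key string) time.Duration {
    if d := time.Until(t.until[key]); d > 0 {
        return d
    }
    return 0
}

// Reset the delay: the operation succeeded or hit a terminal error.
func (t *retryTracker) reset(key string) {
    delete(t.until, key)
    t.rl.Forget(key)
}

func main() {
    t := &retryTracker{
        rl:    workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute),
        until: map[string]time.Time{},
    }
    t.setDelay("test-ns/my-instance")               // retriable provision failure
    fmt.Println(t.remaining("test-ns/my-instance")) // ~1s to wait before retrying
    t.reset("test-ns/my-instance")                  // success or terminal failure
}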

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2018
Contributor

@staebler staebler left a comment

Much nicer and easier to follow.

overDue := now.Add(-maxBrokerOperationRetryDelay)
purgedEntries := 0
for k := range c.instanceOperationRetryQueue.instances {
if due := c.instanceOperationRetryQueue.instances[k]; due.calculatedRetryTime.Before(overDue) {
Contributor

Get the value from the instance map as part of the for statement instead of finding the value using the key.

for k, v := range c.instanceOperationRetryQueue.instances {
   if v.calculatedRetryTime.Before(overDue) {

Contributor Author

Yes, thanks!

purgedEntries := 0
for k := range c.instanceOperationRetryQueue.instances {
if due := c.instanceOperationRetryQueue.instances[k]; due.calculatedRetryTime.Before(overDue) {
glog.V(5).Infof("removing %s from instanceOperationRetryQueue which had retry time of %v", k, due)
Contributor

Should due be due.calculatedRetryTime?

Contributor Author

absolutely, good catch

purgedEntries++
}
}
glog.V(5).Infof("purged %v expired entries, instanceOperationRetryQueue queue length is %v", purgedEntries, len(c.instanceOperationRetryQueue.instances))
Contributor

The log entry is not accurate since it is not really the queue length that is being shown.

Contributor Author

You are objecting to the text of the message, not the calculation of the number of entries, right?

Contributor

Yes, just the text of the message.

type backoffEntry struct {
generation int64
calculatedRetryTime time.Time // earliest time we should retry
dirty bool // true indicates new backoff should be calculated
Contributor

I find it cleaner to calculate the delay from start of request to start of request rather than from end of request to start of request. Asymptotically, it approaches the same delay in either case. I'm not convinced that it makes much difference operationally how the delay is calculated, so long as it is increasing. If the delay were calculated from start-to-start, then you would not need to keep track of a dirty bit. The calculatedRetryTime would be set directly in setRetryBackoffRequired.

Contributor Author

One issue I fought with here is that in one of the non-happy-path loops, you provision, fail, then have to go through orphan mitigation, which may fail multiple times, and then you do the retry. On short backoffs the previously calculated retry time added no delay because orphan mitigation took "so long".

I agree it's a bit convoluted, but this will be a short-term solution (measured in months) before it's replaced with a redesign for proper backoff/retry.

Contributor

OK. That makes sense.

@jboyd01
Contributor Author

jboyd01 commented Jul 23, 2018

The build failure in pull-service-catalog-xbuild is a pipeline problem - this builds fine and is ready for final review. Just need one more LGTM, please.

Contributor

@staebler staebler left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2018
@staebler staebler added the LGTM2 label Jul 23, 2018
@MHBauer
Contributor

MHBauer commented Jul 24, 2018

/retest

@jboyd01
Contributor Author

jboyd01 commented Jul 24, 2018

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jboyd01

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 24, 2018
Contributor

@nilebox nilebox left a comment

Just one nit, feel free to ignore.

glog.Errorf(pcb.Messagef("Couldn't create a key for object %+v: %v", instance, err))
return false
}
delay := time.Millisecond * 0
Contributor

You don't need to declare a zero delay var there, you could just do

delay := retryEntry.calculatedRetryTime.Sub(now) 

below (replace = with :=)

Contributor Author

Agreed @nilebox, I think this was left over after a refactor and I missed it. This was just merged; next time I'm in here I'll clean it up.

@k8s-ci-robot k8s-ci-robot merged commit f85e626 into kubernetes-retired:master Jul 24, 2018
c.instanceOperationRetryQueue.mutex.Lock()
defer c.instanceOperationRetryQueue.mutex.Unlock()
retryEntry, found := c.instanceOperationRetryQueue.instances[key]
if !found || retryEntry.generation != instance.Generation {
Contributor

@nilebox nilebox Aug 14, 2018

@jboyd01 I think we may need to also store and compare the instance's UID.
@joshk has reported that when he deletes an old instance and creates a new one under the same name, he gets a RetryBackoff event immediately. I suspect this can happen when the old instance had Generation = 1 (created and never changed), since the new instance will get the same generation.

Contributor Author

Interesting. Thanks for digging in on it @nilebox.

@nilebox
Contributor

nilebox commented Aug 15, 2018

@jboyd01 can we simply use the UID as the key for this rate-limiting map instead of namespace/name?

@jboyd01
Contributor Author

jboyd01 commented Aug 15, 2018

@nilebox Right, I was thinking the same thing. I'm also looking into purging the entry from the retry maps after successfully doing a non-orphan-mitigation deprovision.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. LGTM1 LGTM2 size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Successfully merging this pull request may close these issues.

Provide delay after performing orphan mitigation when Service Instance was failed
8 participants