delay between provisioning retries #2065
Conversation
093ee08 to 6a0f741
@luksa, @kibbles-n-bytes could you review? The Travis error is in building the docsite, an unrelated failure that I'm attempting to address, but this PR is ready to go, I believe.
@@ -91,6 +92,17 @@ const (
clusterIdentifierKey string = "clusterid"
)

type retryQueue struct {
mutex sync.RWMutex // lock to be used with provisionRetryTime
The comment references provisionRetryTime instead of retryTime.
@@ -60,6 +60,7 @@ const (
defaultLeaderElectionNamespace = "kube-system"
defaultReconciliationRetryDuration = 7 * 24 * time.Hour
defaultOperationPollingMaximumBackoffDuration = 20 * time.Minute
defaultProvisioningRetryInterval = 5 * time.Minute
My gut feeling is that this is way too long, especially when the error during provisioning is just a temporary glitch.
It's way too long when you have a single failure, but when the broker is in a bad state and just sits there returning a retryable error, it's pretty frequent. Optimally we'd do a set of quick retries and then fall back to every 5 or 10 minutes, but I don't think the effort for that is worth it as we need to rework this anyway (#2076). Suggestions on a better backoff? Once a minute? Note that this is overridable with the flag --provision-retry-interval
@@ -99,8 +111,16 @@ func (c *controller) instanceAdd(obj interface{}) {
glog.Errorf("Couldn't get key for object %+v: %v", obj, err)
return
}
c.instanceQueue.AddRateLimited(key)
I'm not 100% sure on this, but I think this is a bad idea. The instanceAdd() method gets called in instanceUpdate(), which gets called when the user changes the ServiceInstance object.
Now imagine what happens when the user initially posts a bad ServiceInstance, which causes the reconcile method to back off. Eventually, the backoff delay is a few minutes. If the user then updates the ServiceInstance, the controller won't process the update immediately, but minutes later.
Also, I believe every call to AddRateLimited increases the delay. So, with this change, every update to the object increases the delay.
I'm fine with reverting this so it isn't rate limited; I did it per #2025 (comment). But I do want to note that even if we do a backoff/retry the max number of retries (15), they are all completed in under 3 minutes. Check out the backoff times: https://github.com/kubernetes-incubator/service-catalog/blob/6a693a7661654faa2dffa2d50d44f8d676207b7c/pkg/controller/controller.go#L57
I'm very confident this needs to be reverted. Even if the maximum backoff time is "just" 82s, that's too much if it means that's how long it takes the controller to start acting on a change made by the user to the ServiceInstance.
return
}
c.provisionRetryQueue.mutex.Lock()
c.provisionRetryQueue.retryTime[key] = time.Now().Add(c.provisionRetryQueue.retryInterval)
Can you take a look at NewItemExponentialFailureRateLimiter() in github.com/kubernetes-incubator/service-catalog/vendor/k8s.io/client-go/util/workqueue/default_rate_limiters.go?
I think adding a nice exponential backoff instead of a constant delay should be fairly simple. The rate limiter has a When() method you can call to get the next delay.
But I haven't really looked at the rate limiters closely, so I may be wrong :)
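For reference, a minimal sketch of the rate-limiter API being suggested; the instance key here is illustrative, and this is not the PR's code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: 5s, 10s, 20s, ... capped at 5m.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 5*time.Minute)

	key := "test-ns/my-instance" // hypothetical instance key

	// Each call to When() for the same item returns a doubled delay.
	for i := 0; i < 4; i++ {
		fmt.Printf("failure %d: next retry in %v\n", i+1, limiter.When(key))
	}

	// Forget() resets the item so a later operation starts from the base delay again.
	limiter.Forget(key)
	fmt.Println("after Forget:", limiter.When(key))
}
```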
An exponential backoff is ideal; it is what we were supposed to be doing here, but we constantly reset the notion of failure/success when we do status updates or complete the orphan mitigation, and we never quit after max retries, it just restarts all over.
To use exponential backoff we need to maintain more state, including the number of retries.
Ideally we would have a queue much like today that has a very short backoff/retry for kube reconciliation, but we probably need another queue for remote broker operations that has a larger backoff. I 100% agree that exponential backoff is the right fix, but it's going to take a large rework. The only goal of this PR is to stop us from pounding on the broker.
No, I meant just create a new rate limiter instead of the retryTime map. The rate limiter keeps track of everything. All you need to do is call rateLimiter.When(instance) here and it'll keep track of the individual delays for each ServiceInstance.
I see what you are saying Marko, great idea. I think I'll have to keep the retryMap in addition to adding the rateLimiter. The rateLimiter doesn't have support for checking whether an item is in the queue or getting the next execution time. But I really like the idea of exponential backoff.
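A rough sketch of the combination being described here (a map that remembers the next allowed attempt per instance, plus a rate limiter that produces the growing delays); the names are illustrative, not the PR's exact code:

```go
package main

import (
	"sync"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// retryTracker pairs a rate limiter (which can only hand out the next delay)
// with a map (which can answer "when may this instance next be provisioned?").
type retryTracker struct {
	mutex       sync.RWMutex
	retryTime   map[string]time.Time
	rateLimiter workqueue.RateLimiter
}

// recordFailure asks the rate limiter for the next backoff and remembers the
// earliest time a new provision attempt should be made for key.
func (r *retryTracker) recordFailure(key string) {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	r.retryTime[key] = time.Now().Add(r.rateLimiter.When(key))
}

// delayFor reports how long to wait before retrying key; zero means go ahead.
func (r *retryTracker) delayFor(key string) time.Duration {
	r.mutex.RLock()
	defer r.mutex.RUnlock()
	if d := time.Until(r.retryTime[key]); d > 0 {
		return d
	}
	return 0
}

// forget clears the state after a success or a terminal failure.
func (r *retryTracker) forget(key string) {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	delete(r.retryTime, key)
	r.rateLimiter.Forget(key)
}

func main() {
	r := &retryTracker{
		retryTime:   map[string]time.Time{},
		rateLimiter: workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 5*time.Minute),
	}
	r.recordFailure("test-ns/my-instance")
	_ = r.delayFor("test-ns/my-instance") // roughly 5s after the first failure
	r.forget("test-ns/my-instance")
}
```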
@@ -339,6 +421,18 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
return nil
}

// don't DOS the broker. If we already did a provision attemp that ended with a non-terminal
s/attemp/attempt/
e28d57f to 9aed723
// with repeated requests.
retryTime map[string]time.Time
queueLastCleaned time.Time
rateLimter workqueue.RateLimiter
missing 'i' in rateLimter
test/util/util.go
Outdated
return
}
glog.Infof("Instance %v/%v: %+v", namespace, name, instance)
} |
I'm guessing you used this method during testing. It's not used anywhere.
Thanks @luksa, I've reverted back to the non-rate-limited add and corrected the other two issues.
LGTM. I'd /lgtm but I'm not in the kubernetes-incubator org yet. Hopefully they'll add me today.
c846d30 to 102dcbe
hmm... It appears changing the
@kibbles-n-bytes, @nilebox could you please review?
Seems okay so far. I'd like to take another pass at it later. Needs some more comments, I think.
)

type retryQueue struct {
this strikes me as the wrong place to define this, or it needs a better name, as it seems very provision+instance specific.
Granted, I'm not familiar with the existing queues, but is there a reason to avoid using some version of "k8s.io/client-go/util/workqueue"?
This is very instance & provision oriented. This isn't meant to be long term - it's a long read, but #2076 discusses many of the existing issues we're seeing in here. The instance controller really needs a major refactoring - I don't picture this new provisionRetryQueue code being part of it. This is purely to plug a hole specific to instance provisioning retry.
okay, thanks
pkg/controller/controller.go
Outdated
@@ -221,7 +222,8 @@ type controller struct {
// clusterIDLock protects access to clusterID between the
// monitor writing the value from the configmap, and any
// readers passing the clusterID to a broker.
clusterIDLock sync.RWMutex
clusterIDLock sync.RWMutex
provisionRetryQueue retryQueue
new struct defined in controller_instance.
pkg/controller/controller.go
Outdated
@@ -167,7 +167,8 @@ func NewController(
//DeleteFunc: controller.servicePlanDelete,
//})
}

controller.provisionRetryQueue.retryTime = make(map[string]time.Time)
controller.provisionRetryQueue.rateLimiter = workqueue.NewItemExponentialFailureRateLimiter(minBrokerProvisioningRetryDelay, maxBrokerProvisioningRetryDelay)
The base controller.provisionRetryQueue is just the empty struct from basic object instantiation, right? Nothing extra we need to do before we start touching its pieces?
The OO/Java guy in me wants this done in a constructor of retryQueue, but we're in a controller constructor for an internal private object, so I think we're fine for now.
I don't like the constants being defined in the other file.
Right, I hear you. But this is very specific to the instance controller; I originally had it all contained there with initialization in an init method, but I think this is clearer.
please no inits
// provision reattempt shoud be made to ensure we don't overwhelm broker
// with repeated requests.
retryTime map[string]time.Time
queueLastCleaned time.Time
needs comment.
// retry time.
func (c *controller) purgeExpiredRetryEntries() {
now := time.Now()
if now.Before(c.provisionRetryQueue.queueLastCleaned.Add(maxBrokerProvisioningRetryDelay * 2)) {
Why * 2?
I don't want to prematurely remove any entries, but I agree * 2 is arbitrary without comments. I'll update it.

// Ensure we only purge items that aren't being acted on by retries, this shouldn't
// have any work to do but we want to be certain this queue doesn't get overly large.
overDue := now.Add(-maxBrokerProvisioningRetryDelay)
add negative seems weird.
Yeah, but I have a duration and Time.Sub() takes another Time type. Adding a negative duration is the prescribed way AFAIK.
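For illustration, a tiny standard-library example of the time arithmetic in question: Add takes a Duration while Sub takes another Time, so shifting a Time backwards is done by adding a negative Duration.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	maxDelay := 10 * time.Minute
	now := time.Now()

	// Shift a Time backwards by adding a negative Duration...
	overDue := now.Add(-maxDelay)

	// ...which is equivalent to saying the gap between the two times is maxDelay.
	fmt.Println(now.Sub(overDue) == maxDelay) // true
}
```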
overDue := now.Add(-maxBrokerProvisioningRetryDelay)
for k := range c.provisionRetryQueue.retryTime {
if c.provisionRetryQueue.retryTime[k].Before(overDue) {
glog.V(5).Infof("removed %s from provisionRetry map which had retry time of %v", k, c.provisionRetryQueue.retryTime[k])
Is there a log message before this to let us know what time "now" is, so that we can see how backed off this is?
The initial calculation is done here and logged: https://github.com/kubernetes-incubator/service-catalog/blob/cb9ae55ef70a75e26f4344f64c5a8c9639e0a5d3/pkg/controller/controller_instance.go#L367
When we back off instead of doing a regularly scheduled provision attempt, we log it as well: https://github.com/kubernetes-incubator/service-catalog/blob/cb9ae55ef70a75e26f4344f64c5a8c9639e0a5d3/pkg/controller/controller_instance.go#L439
c.provisionRetryQueue.rateLimiter.Forget(k)
}
}
glog.V(5).Infof("purged expired entries, provisionRetry queue length is %v", len(c.provisionRetryQueue.retryTime))
Before and after count? Or how many were purged?
My real concern was ensuring the map didn't grow - I'll add additional logging indicating how many remain in the map.
@@ -339,6 +433,18 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
return nil
}

// don't DOS the broker. If we already did a provision attempt that ended with a non-terminal
// error then we set a next retry time. Observe that.
More context for "observe that", please.
I'll update. @MHBauer, thanks for the review and comments; I'll update accordingly.
needs one tiny comment tweak.
// Ensure we only purge items that aren't being acted on by retries, this
// shouldn't have much work to do but we want to be certain this queue
// doesn't get overly large. Entries are removed one by one when deleted
// (not orphan mitigation)of successfully provisioned, this function ensures
Is of supposed to be or?
you got it, thanks, fixed. And to your comment below - yes, as I updated this comment I realized I wasn't thinking of the happy-path... I updated the provision success code to remove the entry from the map and verified the change.
// (not orphan mitigation)of successfully provisioned, this function ensures
// all others get purged eventually. Due to queues and potential delays,
// only remove entries that are at least maxBrokerProvisioningRetryDelay
// past next retry time to ensure entries are not prematurely removed
Good comment. Thanks.
if now.Before(c.provisionRetryQueue.queueLastCleaned.Add(maxBrokerProvisioningRetryDelay * 2)) {

// run periodically ensuring we don't prematurely purge any entries
timeToPurge := c.provisionRetryQueue.queueLastCleaned.Add(maxBrokerProvisioningRetryDelay * 2)
thanks for the variable name.
@@ -1685,7 +1809,7 @@ func (c *controller) processProvisionSuccess(instance *v1beta1.ServiceInstance,
if _, err := c.updateServiceInstanceStatus(instance); err != nil {
return err
}

c.removeInstanceFromRetryMap(instance)
I think this is a new addition mentioned in the big new comment.
I think there should also be a case when we hit a "terminal" error for a particular generation (e.g. 400 Bad Request) and won't retry until the spec is updated. We should invoke c.removeInstanceFromRetryMap(instance) in that case as well. Otherwise, when the spec is updated, we will still retry with the accumulated delay...
I'm talking about processTerminalProvisionFailure (https://github.com/kubernetes-incubator/service-catalog/blob/491f227a4fc7453917e347ccc957da9a312a4774/pkg/controller/controller_instance.go#L1695:22) and processTerminalUpdateServiceInstanceFailure (https://github.com/kubernetes-incubator/service-catalog/blob/491f227a4fc7453917e347ccc957da9a312a4774/pkg/controller/controller_instance.go#L1801).
Agreed, good catch.
Overall looks fine; please check whether we need to purge the item from the retry queue in the processTerminalProvisionFailure and processTerminalUpdateServiceInstanceFailure methods, plus some nits.

// delayProvisionRetry returns a duration which should be observed before attempting to provision
func (c *controller) getDelayForProvisionRetry(instance *v1beta1.ServiceInstance) time.Duration {
c.purgeExpiredRetryEntries()
get... in the method name suggests that this method should not have any side effects. I would prefer to explicitly invoke purgeExpiredRetryEntries when needed, rather than doing it here hidden from the user.
I created a new worker in the main controller that will periodically invoke the purge.
c.provisionRetryQueue.mutex.Lock()
c.provisionRetryQueue.retryTime[key] = time.Now().Add(duration)
glog.V(4).Infof("provisionRetry for %s after %v", key, duration)
c.provisionRetryQueue.mutex.Unlock()
Do defer c.provisionRetryQueue.mutex.Unlock() right after acquiring the lock instead, to be safe?
Agree with @nilebox that defer should be used here to ensure the lock is released.
I thought it was short enough with no early returns, but I've updated it.
defer c.provisionRetryQueue.mutex.RUnlock()
now := time.Now()
if t := c.provisionRetryQueue.retryTime[key]; t.After(now) {
return t.Sub(now)
There is still a small chance for this value to be negative (between t.After() and t.Sub() some time has passed).
To be safe, this whole block can be rewritten to:
t := c.provisionRetryQueue.retryTime[key]
delay := t.Sub(now)
if delay < 0 {
return 0
}
return delay
If I was using time.Now() I'd agree, but these variables are not going to change between checking and subtracting.
ah true.
c.provisionRetryQueue.mutex.RLock()
defer c.provisionRetryQueue.mutex.RUnlock()
now := time.Now()
if t := c.provisionRetryQueue.retryTime[key]; t.After(now) {
We should probably check for the key being present in the map, i.e. if t, ok := ...; ok { ... }?
Ah, it will lead to t being equal to 0 in that case, fine.
purgedEntries := 0
for k := range c.provisionRetryQueue.retryTime {
if c.provisionRetryQueue.retryTime[k].Before(overDue) {
glog.V(5).Infof("removed %s from provisionRetry map which had retry time of %v", k, c.provisionRetryQueue.retryTime[k])
I prefer to cache the value in a var if it's being used twice (c.provisionRetryQueue.retryTime[k]).
@@ -339,6 +443,18 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
return nil
}

// don't DOS the broker. If we already did a provision attempt that ended with a non-terminal
// error then we set a next retry time.
if delay := c.getDelayForProvisionRetry(instance); delay > 0 {
What if the instance spec was updated? Have we purged the item from the provisionRetryQueue by this time? We should reset the backoff when we start a new operation with the generation incremented?
Good point. I've updated the map to key on the instance with the generation appended now. This ensures that if the instance is updated, the backoff will be reset.
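As a sketch of that change, with hypothetical stand-in types and helper names (not the PR's actual code): appending the generation to the key means an updated spec simply maps to a new entry and therefore starts with a fresh backoff.

```go
package main

import "fmt"

// instanceRef is an illustrative stand-in for the ServiceInstance fields
// that matter for the retry-map key.
type instanceRef struct {
	Namespace  string
	Name       string
	Generation int64
}

// retryKey folds the generation into the namespace/name key, so a spec
// update (which bumps the generation) no longer matches the old entry.
func retryKey(i instanceRef) string {
	return fmt.Sprintf("%s/%s:%d", i.Namespace, i.Name, i.Generation)
}

func main() {
	before := instanceRef{"test-ns", "my-instance", 1}
	after := instanceRef{"test-ns", "my-instance", 2}
	fmt.Println(retryKey(before) == retryKey(after)) // false: the backoff is effectively reset
}
```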
I agree with the comments left by @nilebox.
@nilebox thanks very much for the review. I've reworked things a bit: generalized to backoff for both provision retries and update retries. The key in the map is now the instance name along with the generation appended to ensure if there is an update the backoff is reset.
A few more changes requested.
@@ -1696,6 +1844,7 @@ func (c *controller) processTerminalProvisionFailure(instance *v1beta1.ServiceIn
if failedCond == nil {
return fmt.Errorf("failedCond must not be nil")
}
c.removeInstanceFromRetryMap(instance)
return c.processProvisionFailure(instance, readyCond, failedCond, shouldMitigateOrphan)
Here and in other places: there is a chance that we will fail to update the status inside processProvisionFailure, but we have already removed the instance from the retry map, which means that the operation will be retried without backoff. Probably fine, given that the probability of this seems low and it won't break anything.
@@ -1721,6 +1870,10 @@ func (c *controller) processProvisionFailure(instance *v1beta1.ServiceInstance,
errorMessage = fmt.Errorf(readyCond.Message)
}

// assume a provision retry will happen, set a not-before time so we don't pound the Broker
// in a constant try to provision/fail/orphan mitigation/repeat loop.
c.setNextOperationRetryTime(instance)
This is only needed for non-terminal errors. For terminal errors you have just removed the instance from the retry map in processTerminalProvisionFailure.
So the proper place to put this is in the "non-terminal" condition block below: https://github.com/kubernetes-incubator/service-catalog/blob/3837abf27fe5d9b090a7d530a81d6cebc634a79e/pkg/controller/controller_instance.go#L1747-L1752
Agreed, thanks
@@ -1696,6 +1844,7 @@ func (c *controller) processTerminalProvisionFailure(instance *v1beta1.ServiceIn
if failedCond == nil {
return fmt.Errorf("failedCond must not be nil")
}
c.removeInstanceFromRetryMap(instance)
We should probably remove the instance if !shouldMitigateOrphan. I guess we could just move this line inside processProvisionFailure under the condition at https://github.com/kubernetes-incubator/service-catalog/blob/3837abf27fe5d9b090a7d530a81d6cebc634a79e/pkg/controller/controller_instance.go#L1752-L1755
Which makes me wonder if we may just invoke removeInstanceFromRetryMap inside the clearServiceInstanceCurrentOperation method and remove this invocation from all other places?
> We should probably remove instance if !shouldMitigateOrphan. I guess we could just move this line inside processProvisionFailure under condition

There are cases where we do retries (and want backoff) without doing orphan mitigation in between (retrying a failed update, retrying a provisioning after a 400 or 403). I'll do some additional review, but at this point I'm not convinced doing the removeInstanceFromRetryMap within clearServiceInstanceCurrentOperation is the right place.
clearServiceInstanceCurrentOperation is invoked a lot; I believe I saw it invoked as part of orphan mitigation (from debug output). I am not comfortable doing the removeInstanceFromRetryMap in clearServiceInstanceCurrentOperation as I think it would prematurely clear out the instance. @nilebox, if you feel strongly about this, let me know and I'll spend more time verifying.
> I believe I saw it invoked as part of the orphan mitigation

@jboyd01 We only invoke clearServiceInstanceAsyncOsbOperation in case of orphan mitigation, instead of clearServiceInstanceCurrentOperation. Orphan mitigation is considered part of the original Service Catalog operation. clearServiceInstanceCurrentOperation should only be invoked when we have finished the operation and no backoff is needed anymore.
Please check, I think it would make the code cleaner.
@jboyd01 You're right, we do invoke clearServiceInstanceCurrentOperation from the processDeprovisionSuccess method even if it was orphan mitigation. We also invoke it at the beginning of orphan mitigation from the recordStartOfServiceInstanceOperation method. So it's safer to invoke removeInstanceFromRetryMap explicitly like you do, agreed.
I found

func (c *controller) processServiceInstanceOperationError(instance *v1beta1.ServiceInstance, readyCond *v1beta1.ServiceInstanceCondition) error {
	// If error is from an update, assume a retry will happen, set a not-before time so we
	// don't pound the Broker in a constant update/fail/update repeat loop.
	if strings.Contains(readyCond.Reason, "UpdateInstance") {
		c.setNextOperationRetryTime(instance)
	}
@jboyd01 why don't you want to do backoff for orphan mitigation retries, sorry? Orphan mitigation is part of the original operation. If you don't do backoff for orphan mitigation, you won't do backoff for the operation when you retry provisioning after successful orphan mitigation? Isn't the whole point of this PR to keep exponential backoff while retrying?
@nilebox The original issue is that provisioning a certain service keeps failing with a specific error. OM (orphan mitigation) worked fine, and the successful OM was part of the reason we didn't back off on the next provision. Given the way the original reconciliation & worker retry loops work, OM success was resetting the worker, so there was no backoff on the retry to provision. The solution I had been aiming for was to apply the backoff just before the provision retry. With your questioning, I see it would be pretty straightforward to use the same backoff value (if it's set) prior to doing OM (deprovision). Successful OM shouldn't reset the backoff, though. If OM fails, it would make sense to increase the backoff prior to retrying OM. Once OM succeeds, the provision would fire immediately (it already spent the backoff prior to OM). I think this means that, as long as OM doesn't get an error, the provision failure loop would be: provision, fail, back off, OM, then retry the provision immediately, with the backoff growing each time around.
And if there is an error doing OM, I think we add another backoff before retrying the OM itself, again noting that OM success does not reset the backoff. A bonus to this is that in addition to provision and update, we'd add deprovision & OM to the backoff retry. WDYT?
Not sure if I fully understand. I don't like waiting before performing OM. Is your in-memory backoff flexible enough to set the backoff upon provision failure, but respect it only when we retry provisioning? That is what we really want in the end, I think.
In other words, given that we consider OM to be part of the original operation, we don't want to apply your custom backoff there. Would that be possible? Orphan mitigation could fail itself and should be retried with a backoff delay as well... but in that case the normal rate-limiting queue should take effect, I think?
To word it differently: we should only apply this custom backoff when we retry the operation. So we should be very clear about which retries the backoff applies to.
…rovisioning fails
…Fixup per review comments
…cessProvisionFailure()
…e key and enclosed time and dirty flag within the entry.
7628528 to 8a137aa
Much nicer and easier to follow.
overDue := now.Add(-maxBrokerOperationRetryDelay)
purgedEntries := 0
for k := range c.instanceOperationRetryQueue.instances {
if due := c.instanceOperationRetryQueue.instances[k]; due.calculatedRetryTime.Before(overDue) {
Get the value from the instance map as part of the for statement instead of finding the value using the key:

for k, v := range c.instanceOperationRetryQueue.instances {
	if v.calculatedRetryTime.Before(overDue) {
Yes, thanks!
purgedEntries := 0
for k := range c.instanceOperationRetryQueue.instances {
if due := c.instanceOperationRetryQueue.instances[k]; due.calculatedRetryTime.Before(overDue) {
glog.V(5).Infof("removing %s from instanceOperationRetryQueue which had retry time of %v", k, due)
Should due be due.calculatedRetryTime?
absolutely, good catch
purgedEntries++
}
}
glog.V(5).Infof("purged %v expired entries, instanceOperationRetryQueue queue length is %v", purgedEntries, len(c.instanceOperationRetryQueue.instances))
The log entry is not accurate since it is not really the queue length that is being shown.
You are objecting to the text of the message, not the calculation of the number of entries, right?
Yes, just the text of the message.
type backoffEntry struct {
generation int64
calculatedRetryTime time.Time // earliest time we should retry
dirty bool // true indicates new backoff should be calculated
I find it cleaner to calculate the delay from start of request to start of request rather than from end of request to start of request. Asymptotically, it approaches the same delay in either case. I'm not convinced that it makes much difference operationally how the delay is calculated, so long as it is increasing. If the delay were calculated from start-to-start, then you would not need to keep track of a dirty bit. The calculatedRetryTime would be set directly in setRetryBackoffRequired.
One issue I fought with here is that in one of the non-happy-path loops, you provision, fail, then have to go through orphan mitigation, which may fail multiple times, then you do the retry. On short backoffs the prior calculated retry added no delay because orphan mitigation took "so long".
I agree it's a bit convoluted, but this will be a short-term solution (measured in months) before it's replaced with a re-design for proper backoff/retry.
OK. That makes sense.
The build failure in pull-service-catalog-xbuild is a pipeline problem - this builds fine and is ready for final review; just need one more LGTM please.
/lgtm
/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jboyd01
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Just one nit, feel free to ignore.
glog.Errorf(pcb.Messagef("Couldn't create a key for object %+v: %v", instance, err))
return false
}
delay := time.Millisecond * 0
You don't need to declare a zero delay var there; you could just do delay := retryEntry.calculatedRetryTime.Sub(now) below (replace = with :=).
Agreed @nilebox, I think this was left over after a refactor and I missed it. This was just merged; next time I'm in here I'll clean it up.
c.instanceOperationRetryQueue.mutex.Lock()
defer c.instanceOperationRetryQueue.mutex.Unlock()
retryEntry, found := c.instanceOperationRetryQueue.instances[key]
if !found || retryEntry.generation != instance.Generation {
@jboyd01 I think we may need to also store and compare the instance's UID.
@joshk has reported that when he deletes an old instance and creates a new one under the same name, he gets a RetryBackoff event immediately. I suspect this could happen when the old instance had Generation = 1 (created and never changed) - and the new instance will get the same.
Interesting. Thanks for digging in on it @nilebox.
@jboyd01 can we simply use the UID as part of the key?
@nilebox Right, I was thinking the same thing. I'm also looking into purging the entry from the retry maps after successfully doing a non-orphan-mitigation deprovision.
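A sketch of the fix being discussed, again with hypothetical stand-in types (not the PR's code): storing the UID alongside the generation so that a delete-and-recreate under the same name, which can legitimately reuse Generation = 1, does not inherit the old instance's backoff entry.

```go
package main

import "fmt"

// backoffEntry is an illustrative retry-map entry that remembers both the
// UID and the generation of the instance it was created for.
type backoffEntry struct {
	uid        string
	generation int64
}

// matches reports whether an existing entry still refers to the same
// underlying instance and spec. A recreated instance gets a fresh UID, so
// its stale entry no longer matches even when the generation is the same.
func (e backoffEntry) matches(uid string, generation int64) bool {
	return e.uid == uid && e.generation == generation
}

func main() {
	entries := map[string]backoffEntry{
		"test-ns/my-instance": {uid: "old-uid", generation: 1},
	}
	// The instance was deleted and recreated with the same name and Generation = 1.
	fmt.Println(entries["test-ns/my-instance"].matches("new-uid", 1)) // false: no inherited backoff
}
```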
Fixes #2025 - at least it prevents the controller from blasting brokers with a constant provision retry/orphan mitigation loop.
In addition to #2025, see #2076 for additional details on problems with instance reconciliation. This PR is a near-term band-aid to prevent pain at the brokers.
A new in-memory list of instances with provision retry times provides a delay to prevent a tight provision/provision-failure/orphan-mitigation loop that pounds brokers without pause when provisioning ends with a non-terminal failure.
This short-term fix adds the retryQueue, where we introduce an additional delay on top of what the reconciler loop was originally doing. Before doing a broker operation (Provision or Update) we check the backoffBeforeRetrying and retryTime maps to see if we should first delay. If the instance has an entry in backoffBeforeRetrying, that indicates the operation previously failed and we need to get a new exponential backoff delay from the RateLimiter. This is calculated, a new time is set in retryTime, and the instance is put back into the main worker queue with an AddAfter(time), which causes the worker to reprocess it after the delay. On this next pass, verification is done to ensure the time delay has expired, and we then set the dirty bit in backoffBeforeRetrying in anticipation of another failure. If the operation is successful, the entries are cleared from the retryQueue. A failure runs through the process again. A background task periodically purges old entries from the retryQueue.
The main point above is that all the backoff processing happens just before we execute the broker operation. There are a lot of edge cases and post-error processing that otherwise make it difficult to instrument the backoff properly in all the necessary places. This is a best-effort band-aid, a temporary workaround until the overall reconciliation flow can be reworked properly (issue #2076).
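As a self-contained sketch of that flow (hypothetical names standing in for the controller's real fields and helpers, not the PR's exact code), the check made just before the broker call either requeues the key with AddAfter or lets the operation proceed and arms the next backoff calculation:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// miniController models only the pieces needed to show the flow.
type miniController struct {
	queue     workqueue.DelayingInterface // the main worker queue
	limiter   workqueue.RateLimiter       // source of the exponential backoff
	retryTime map[string]time.Time        // key -> earliest next broker call
	dirty     map[string]bool             // key -> failure seen since last calculation
}

// beforeBrokerOperation is called just before a Provision or Update. It either
// requeues the key with AddAfter (still backing off) or lets the broker call happen.
func (c *miniController) beforeBrokerOperation(key string) (proceed bool) {
	if c.dirty[key] {
		// A prior attempt failed: compute a fresh, larger delay now.
		c.retryTime[key] = time.Now().Add(c.limiter.When(key))
		c.dirty[key] = false
	}
	if d := time.Until(c.retryTime[key]); d > 0 {
		c.queue.AddAfter(key, d) // the worker reprocesses the key after the delay
		return false
	}
	// The delay has expired (or there was never a failure). The failure handler
	// simply sets dirty[key] = true again if this attempt also fails, while a
	// success deletes both map entries and calls limiter.Forget(key).
	return true
}

func main() {
	c := &miniController{
		queue:     workqueue.NewDelayingQueue(),
		limiter:   workqueue.NewItemExponentialFailureRateLimiter(2*time.Second, 5*time.Minute),
		retryTime: map[string]time.Time{},
		dirty:     map[string]bool{},
	}
	key := "test-ns/my-instance"
	fmt.Println(c.beforeBrokerOperation(key)) // true: first attempt proceeds
	c.dirty[key] = true                       // pretend the provision failed
	fmt.Println(c.beforeBrokerOperation(key)) // false: requeued with ~2s backoff
}
```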
We do not do additional backoff for deprovisioning (regular or orphan mitigation), as the OM status handling is embedded within the delete and it's difficult to be certain whether a delete is part of OM, and to ensure we don't clear out the overall backoff if there are failures and then success within OM.
Binding reconciliation likely has the exact same issues.