Provide delay after performing orphan mitigation when a Service Instance has failed #2025
I'm using Service Catalog in version 0.1.17
@polskikiel Thank you for your report and suggestion. We have also run into this problem. And the version should be v0.1.17, right?
Thanks for reporting @polskikiel, the behavior you described suggests that we have a bug in our code.
That's exactly how we agreed it should behave, I think.
@jianzhangbjz Yes, sorry, I meant v0.1.17.
@nilebox Yes, that's the intended behavior, so we definitely have a bug. My guess is that the successful orphan mitigation is clearing the rate limiter for the instance queue, so there isn't a backoff when the provision retries, though I haven't looked at the code in a while.
It's easy to reproduce this. I took the ups-broker and modified it to return a 205 on instance provisioning. This triggers the tight loop described above. From what I'm seeing, each cycle basically passes through the controller's service instance work queue three times before it starts all over again:
The process then starts over again. The only place we do any backoff is in step 2, but the initial backoff is set to 5 milliseconds, and we keep resetting the requeues to zero in all steps but 2. So we never achieve any significant backoff, and we never get higher than 1 on the retry count. Unless steps 1 and 3 result in a requeue, we're not going to achieve a significant backoff from this queue, and I'm unaware of anything else that would be gating how frequently we retry. The behavior we see in step 1 (returning success to the worker) interrupts any notion of retries - I understand that we must separate status updates out in the reconciliation loop, but it looks to me like this breaks the fundamentals of how the backoff and retry are supposed to work here: we keep resetting any state that indicates we failed an overall operation and need to retry. I updated the code so that passes 1 and 3 result in the worker doing a requeue, but the backoff is never significant, and even after we stop requeueing because we exceeded maxRetries, the instance gets put right back into the queue and nothing really changes. I think the watch is what puts it back in the queue?
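For illustration only (this is not the service-catalog code, just a minimal sketch against the vendored client-go work queue), the item-exponential rate limiter only grows the delay if the per-item failure count is allowed to accumulate; forgetting the item after every pass pins the delay at the 5 millisecond base, which matches the behavior described above:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential per-item backoff starting at 5ms, like the controller's queue.
	rl := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	// If the per-item failure count is retained, the delay doubles each time:
	for i := 0; i < 5; i++ {
		fmt.Println("retained state:", rl.When("instance-a")) // 5ms, 10ms, 20ms, 40ms, 80ms
	}

	// But if every pass that "succeeds" (e.g. a status update) calls Forget,
	// the failure count resets and the delay never grows past the 5ms base:
	for i := 0; i < 5; i++ {
		fmt.Println("reset state:", rl.When("instance-b")) // 5ms every time
		rl.Forget("instance-b")
	}
}
```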
I agree, and the reason is that we treat three consecutive stages of the reconciliation as part of one operation to meet OSB requirements, whereas normally in Kubernetes there are no such stages.
Trying to debug this, I made the instance status update and orphan mitigation code return a failure to the worker so it would requeue with a backoff. I can see that logging as I'd expect. But I still see a very tight loop with the worker constantly being invoked; there is no backoff between invocations. The worker() (https://github.com/kubernetes-incubator/service-catalog/blob/master/pkg/controller/controller.go#L304-L335) seems to do the right thing when we indicate it's a failure (I added debug to https://github.com/kubernetes-incubator/service-catalog/blob/master/vendor/k8s.io/client-go/util/workqueue/default_rate_limiters.go and can see it's using increasing backoffs), but every time we exit the worker, instead of waiting for the backoff time to elapse, I see the worker immediately re-invoked for the instance. I believe the problem is that the instance is getting requeued by https://github.com/kubernetes-incubator/service-catalog/blob/master/pkg/controller/controller_instance.go#L103, which is invoked from a thread with this call stack:
which I think is a watch on the instance kicking in because it sees an update to the instance. Does this make sense? Should the watch be causing the reconciliation loop to fire immediately like this, rather than waiting for the backoff? @kibbles-n-bytes this kind of sounds like #1143
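For context, the standard client-go worker pattern looks roughly like the sketch below; the names (processNextInstance, reconcile, maxRetries) are illustrative stand-ins, not the actual service-catalog functions. The key point is that the backoff only applies to keys re-added via AddRateLimited; a key put back by a watch handler with a plain Add becomes available immediately.

```go
package controller

import (
	"k8s.io/client-go/util/workqueue"
)

const maxRetries = 15

// controller here is only a stand-in for the real service-catalog controller.
type controller struct {
	instanceQueue workqueue.RateLimitingInterface
	reconcile     func(key string) error
}

// processNextInstance mirrors the usual client-go worker loop.
func (c *controller) processNextInstance() bool {
	key, quit := c.instanceQueue.Get()
	if quit {
		return false
	}
	defer c.instanceQueue.Done(key)

	if err := c.reconcile(key.(string)); err == nil {
		// Success: drop the per-item failure history, so the next
		// backoff starts again from the base delay.
		c.instanceQueue.Forget(key)
	} else if c.instanceQueue.NumRequeues(key) < maxRetries {
		// Failure: re-enqueue with an exponentially increasing delay.
		c.instanceQueue.AddRateLimited(key)
	} else {
		// Too many failures: give up until something else re-adds the
		// key - e.g. a watch event calling plain Add(), which has no delay.
		c.instanceQueue.Forget(key)
	}
	return true
}
```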
@jboyd01 I believe there are issues with our rate-limiting of non-async operations. My guess is it has to do with our handlers here: https://github.com/kubernetes-incubator/service-catalog/blob/9678178de2bb7ec29c4be24915d5a4eac2b6f7e0/pkg/controller/controller_instance.go#L96-L114 Notice that on update we re-add the instance key to the work queue directly. Could you try changing that line to use the rate-limited add instead?
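To make the suggestion concrete, here is a sketch of what enqueuing through the rate limiter from the informer handlers could look like; the function names are hypothetical and the wiring is simplified relative to the real controller:

```go
package controller

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// enqueueInstanceRateLimited is a sketch: instead of queue.Add(key), which
// makes the key available immediately, AddRateLimited(key) respects the
// per-item backoff accumulated by earlier failures.
func enqueueInstanceRateLimited(queue workqueue.RateLimitingInterface, obj interface{}) {
	key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
	if err != nil {
		return
	}
	queue.AddRateLimited(key)
}

// registerInstanceHandlers shows where such a change would plug in.
func registerInstanceHandlers(informer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			enqueueInstanceRateLimited(queue, obj)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			// Status updates written by the controller itself come back
			// through this handler, so a plain Add here defeats any backoff.
			enqueueInstanceRateLimited(queue, newObj)
		},
		DeleteFunc: func(obj interface{}) {
			enqueueInstanceRateLimited(queue, obj)
		},
	})
}
```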
@kibbles-n-bytes that definitely helps - I actually now see some exponential backoff. By the time we get to max requeues (15), we have a 40 second delay. It feels like we could stand to take a second look at the starting delay factor - it's really small at 5 milliseconds with a max repeat of only 15. Eventually we hit max retries, so the worker doesn't requeue and does a queue.Forget(). But a watch immediately triggers and puts the instance back into the queue, and we start all over again, resuming reconciliation where we left off, which restarts the attempt-to-provision -> orphan-mitigation cycle all over again. If we want to use the max requeue count to determine when to stop this cycle of attempt to provision -> orphan mitigation -> reattempt... then I think we'll need some new logic that changes the instance's status to indicate we shouldn't re-attempt to provision again. I don't know that the queue's ...
I'm not certain why I have those ...
Hah! That's funny, but it makes total sense. It's an unfortunate consequence of having two separate areas where we're dealing with queuing. What is our desired behavior, and what matches other Kubernetes resource backoffs? The way it was intended, we were supposed to retry with backoff until we hit maxRetries, at which point we stop. And then on the next watch sync, we would restart the retry loop with our rate limiter reset. We could certainly fix it to do that, but depending on what other resources do, and what we desire, we may want to play around more with our queuing code. It may be that we should do quick retries only for things not hitting the broker, but have much longer retry intervals for things that communicate across the wire (like we do for async polling). It just depends on what we want.
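One possible shape for "quick retries locally, slow retries across the wire" is simply two queues with differently tuned limiters. The sketch below is only an illustration of that idea; the delays and limits are made-up values, not proposed defaults:

```go
package controller

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// newBrokerBoundRateLimiter sketches a limiter tuned for operations that hit a
// broker across the wire: a much larger base delay than the default 5ms,
// combined with an overall bucket limit so a burst of instances cannot flood a
// broker even on their first attempt.
func newBrokerBoundRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		// Per-item exponential backoff: 5s, 10s, 20s, ... capped at 10 minutes.
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 10*time.Minute),
		// Overall limit: at most 10 broker-bound retries per second, burst of 100.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

// A queue for quick, purely local retries could keep the default limiter,
// while broker-bound work goes through the slower one.
var (
	localQueue  = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	brokerQueue = workqueue.NewRateLimitingQueue(newBrokerBoundRateLimiter())
)
```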
I've just observed this issue with one of our failing ServiceInstances. Service Catalog retries without backoff.
@ash2k was this failure on synchronous provisioning? If not, any additional details would be great.
I see that when this scenario is run with synchronous provisioning and deprovisioning, with provisioning made to always fail, we actually invoke the provision call twice in a row and then the orphan mitigation twice in a row.
Proposed short-term fix: enforce a delay of X between retry attempts at provisioning an instance. X defaults to 10 minutes, but can be overridden with a controller parameter if customers have concerns about their brokers being hit too much. @pmorie and I have discussed this, and I've got the basics working. @kibbles-n-bytes does this sound ok to you for a short-term solution? It keeps us from pounding brokers. Obviously we have to come back in here and fix the real issue. The workaround looks to be achievable by maintaining an in-memory list of instances with a retry-no-earlier-than timestamp, perhaps setting it in orphan mitigation and adding a new check against it before we do an Add. I'm close to having a working fix.
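For discussion, a rough sketch of that workaround; the type and function names here are hypothetical, and the 10-minute default is the one mentioned above:

```go
package controller

import (
	"sync"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// retryGate records, per instance key, the earliest time another provision
// attempt should be queued. Sketch only; not the actual implementation.
type retryGate struct {
	mu       sync.Mutex
	earliest map[string]time.Time
	delay    time.Duration // e.g. 10 * time.Minute, overridable via a controller flag
}

func newRetryGate(delay time.Duration) *retryGate {
	return &retryGate{earliest: map[string]time.Time{}, delay: delay}
}

// BlockFor would be called after orphan mitigation finishes for the instance.
func (g *retryGate) BlockFor(key string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.earliest[key] = time.Now().Add(g.delay)
}

// Enqueue adds the key immediately if it is not gated, otherwise it defers
// the add until the retry-no-earlier-than timestamp has passed.
func (g *retryGate) Enqueue(queue workqueue.RateLimitingInterface, key string) {
	g.mu.Lock()
	notBefore, gated := g.earliest[key]
	g.mu.Unlock()

	if gated {
		if wait := time.Until(notBefore); wait > 0 {
			queue.AddAfter(key, wait)
			return
		}
		// The gate has expired; clean it up and fall through to a normal add.
		g.mu.Lock()
		delete(g.earliest, key)
		g.mu.Unlock()
	}
	queue.Add(key)
}
```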
Major flaws reported by the Kyma team have been fixed: kubernetes-retired/service-catalog#2025, kubernetes-retired/service-catalog#1879, kubernetes-retired/service-catalog#2006. Enabled the namespaced broker feature.
Hello,
We wrote about our issue in #2006. It's about an infinite loop when Service Instance provisioning fails, e.g. when some condition from our business logic isn't fulfilled.
The loop looks like this:
In issue #2006, @nilebox suggested using the reconciliation-retry-duration flag, but it didn't help us. That flag defines how much time the Service Catalog has to keep asking for the last operation. In our case, each step is a separate operation, so it won't work.
We need a similar solution, but for the whole loop shown above.
These are the logs from the controller-manager when I started to provision a Service Instance that fails:
and so on, forever...
As you can see, the Service Catalog continuously spams our broker with provision and deprovision requests, without any delay and without any maxRetries (or maxRetryDuration) limit.
We would expect a flow like this:
Our idea is to provide an exponential delay after performing orphan mitigation, to avoid spamming the brokers.
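As a sketch of what we have in mind (illustrative names and delays only, not a concrete implementation): after orphan mitigation completes, the instance key would be re-queued with an exponentially growing delay instead of immediately.

```go
package controller

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// provisionRetryBackoff tracks how often provisioning of each instance has
// already failed and derives an exponentially growing delay from it.
// Illustrative values: start at 15s and cap at 10 minutes.
var provisionRetryBackoff = workqueue.NewItemExponentialFailureRateLimiter(15*time.Second, 10*time.Minute)

// requeueAfterOrphanMitigation would run once orphan mitigation for the
// failed instance has completed: instead of re-adding the key right away,
// it schedules the next provision attempt after the computed backoff.
func requeueAfterOrphanMitigation(queue workqueue.RateLimitingInterface, instanceKey string) {
	delay := provisionRetryBackoff.When(instanceKey) // 15s, 30s, 1m, 2m, ... up to 10m
	queue.AddAfter(instanceKey, delay)
}

// resetProvisionBackoff would be called when a provision finally succeeds,
// so a later failure starts again from the shortest delay.
func resetProvisionBackoff(instanceKey string) {
	provisionRetryBackoff.Forget(instanceKey)
}
```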
Let us know what you think about it :)
Also let us know if you need the code snippets from the Service Catalog where the whole cycle is executed.
Thank you in advance.