ServiceCatalog floods brokers when provisioning is permanently not possible. #2006
@mszostok thanks for reporting with so much detail; there seems to be something to clarify in the docs, or to fix/alter in the Service Catalog behavior. TL;DR:
This doesn't seem right. The retry loop should have an exponential backoff delay. If it doesn't, that's a bug; please report an issue, preferably with a detailed log/example.
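For illustration, here is a minimal standalone sketch (my own, not the actual Service Catalog retry code) of the kind of exponential backoff delay meant here: each failed attempt waits roughly twice as long as the previous one, up to a cap and a retry limit.

```go
package main

import (
	"fmt"
	"time"
)

// retryWithBackoff retries attempt() with an exponentially growing delay,
// capped at maxDelay, and gives up after maxRetries attempts.
func retryWithBackoff(attempt func() error, maxDelay time.Duration, maxRetries int) error {
	delay := 100 * time.Millisecond
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = attempt(); err == nil {
			return nil
		}
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(func() error {
		calls++
		if calls < 4 {
			return fmt.Errorf("provisioning failed (attempt %d)", calls)
		}
		return nil
	}, 2*time.Second, 10)
	fmt.Println("result:", err, "after", calls, "attempts")
}
```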
First of all, I would like to provide more context. To answer why orphan mitigation might be needed for async operations, let's first reiterate why orphan mitigation is part of the OSB spec at all. In an ideal world, it would have been the OSB broker's concern whether to perform a cleanup or not, and there is a general agreement that this should change in the next major release of the OSB spec. In the current v2 spec, however, the cleanup is the platform's responsibility. That implies that a broker doesn't have to have a retry and cleanup loop; it could be just a "simple" proxy that reacts to the platform poking it via the REST API. Which effectively means that, as a result of a failed provisioning, the service managed by the OSB broker could end up in an "unknown" abandoned state, potentially having created some resources that waste users' money.

Now back to your question. A "simple stateless broker" that supports the async API could also have no retry loop, just like a sync one. And because the side effects of a failed provisioning are unknown per OSB spec v2 (both sync and async), the platform needs to perform orphan mitigation "just in case". If the broker has nothing to clean up, it can implement deprovisioning of a non-existent instance as a no-op.
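A minimal sketch of what such a no-op deprovision could look like for a stateless broker (a hypothetical handler, not taken from any real broker): per the OSB v2 spec, answering 410 Gone for an instance the broker doesn't know about is treated by the platform as a successful cleanup.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"sync"
)

// instances is a hypothetical in-memory record of what the broker actually
// created; a real broker would track this in its own backend.
var (
	mu        sync.Mutex
	instances = map[string]bool{}
)

// deprovision handles DELETE /v2/service_instances/{instance_id}.
// If the instance is unknown (failed provisioning never created anything, or
// it was already cleaned up), it replies 410 Gone, which the platform treats
// as a successful deprovision - effectively a no-op orphan mitigation.
func deprovision(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodDelete {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	id := strings.TrimPrefix(r.URL.Path, "/v2/service_instances/")
	mu.Lock()
	defer mu.Unlock()
	if !instances[id] {
		w.WriteHeader(http.StatusGone) // 410: nothing to clean up
		fmt.Fprint(w, "{}")
		return
	}
	delete(instances, id)
	fmt.Fprint(w, "{}") // 200 OK with an empty JSON object
}

func main() {
	http.HandleFunc("/v2/service_instances/", deprovision)
	http.ListenAndServe(":8080", nil)
}
```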
What the retry loop in Service Catalog is addressing is improving the UX, making it as close as possible to the Kubernetes-native UX. What if the failed provisioning was just the result of a temporary network blip and it would succeed on retry? That's how Kubernetes works: it keeps retrying to deploy a Pod even if the specified Docker image doesn't exist yet, because it might be present at the next retry.
Also, to clarify the scope of the issue: by "when provisioning is permanently not possible" in the title, it is implied that the platform should either treat any async provisioning failure as a permanent one, or be able to determine whether a failure is temporary or permanent. Unfortunately, with OSB API v2 it is impossible for the platform (Service Catalog) to know whether the failure is temporary or permanent; the broker just reports that the operation failed. In 0.1.3 Service Catalog treated any failure as permanent. There was a general agreement that this was a flaw in the Service Catalog code, justified at the time by our focus on a beta release that only supported the "happy path" well. Since the beta was released, we have invested a lot of effort into improving coverage of the "non-happy path" (see the issues filtered by the corresponding label). I hope that helps.
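For reference, a rough Go rendering (my own sketch) of the last_operation response body defined by the OSB v2 spec, which is all the platform gets back when polling:

```go
// Package osb sketches the body returned by the OSB v2 "last_operation"
// polling endpoint. "state" is one of "in progress", "succeeded" or "failed";
// there is no field that distinguishes a temporary failure from a permanent
// one, which is why the platform cannot tell them apart.
package osb

type LastOperationResponse struct {
	State       string `json:"state"`
	Description string `json:"description,omitempty"`
}
```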
@nilebox Thank you for your reply, it helps a lot! Now we know that this behaviour is by design. The previous comment was based on version 0.1.12, but now I can see that v0.1.17 is the newest one, so we will validate the new flow one more time. If we find the same problems, we will look into the code and report an issue with the details. PS. Sorry for such a late response, but last week I was at a conference and on vacation.
Major flaws reported by the Kyma team have been fixed:
- kubernetes-retired/service-catalog#2025
- kubernetes-retired/service-catalog#1879
- kubernetes-retired/service-catalog#2006
Enabled the namespaced broker feature.
Hi all :)
This is our issue in a nutshell:
The flow of ServiceInstance provisioning was changed after the refactoring (#1886). Now the Service Catalog has an infinite loop: when asynchronous provisioning fails, it performs orphan mitigation (deprovision) and then tries to provision again, and it will do so until it finally succeeds.
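A simplified, self-contained sketch of the behaviour we observe (the function names are hypothetical stand-ins, not actual Service Catalog code):

```go
package main

import "fmt"

// Hypothetical stand-ins for the real broker client calls; they simulate a
// broker whose asynchronous provisioning keeps failing.
func provision(id string) string         { fmt.Println("PUT    provision  ", id); return "op-1" }
func pollLastOperation(op string) string { return "failed" }
func deprovision(id string)              { fmt.Println("DELETE deprovision", id) }

func main() {
	const instance = "example-instance"
	// The reconciliation behaviour described above: a failed asynchronous
	// provisioning triggers orphan mitigation (deprovision) followed
	// immediately by another provisioning attempt, with no delay and no
	// retry limit, so the broker keeps receiving requests.
	for attempt := 1; ; attempt++ {
		op := provision(instance)
		if pollLastOperation(op) == "succeeded" {
			break // success is the only exit from the loop
		}
		deprovision(instance) // orphan mitigation
		if attempt == 5 {
			fmt.Println("(stopped here for the example; the real loop never stops)")
			return
		}
	}
}
```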
Details
Scenario - Service Catalog version 0.1.3 - correct flow
Deprovision Request
The Service Catalog marks deprovisioning as Required and does not perform any additional actions. When we call deprovision on such a failed instance, the Service Catalog also sends the deprovision request to the broker.

Scenario - Service Catalog version 0.1.11 - broken flow
Deprovision Request
The Service Catalog marks deprovisioning as Not Required and does not perform any additional actions. When we call deprovision on such a failed instance, the Service Catalog does not send a deprovision request to the Service Broker, so you created a ticket to fix this (#1879), but the PR that was supposed to fix it (#1886) caused something more. Now the Service Catalog in versions above 0.1.11 has an infinite loop.
This loop "spams" our brokers because there are no delays and no limit on retries. In some cases such an approach breaks our flow.
We can also consider an example where someone has an Azure Service Broker and wants to provision a database service. The instance can fail, e.g. because the quota in the given namespace was exhausted, so repeating the described action over and over does not make sense.
The question is: why are you performing the orphan mitigation if the Service Broker responded with a 200 (OK) status code?
In the spec, https://github.com/openservicebrokerapi/servicebroker/blob/v2.13/spec.md#orphans, we can find that in such a case you shouldn't do that.
Could you elaborate on your new approach to provisioning? I cannot find any documentation about it in the Service Catalog.
What's more, in my opinion you should restore the previous behaviour (from version 0.1.3). When the Service Broker responds with a 200 status code on the last_operation endpoint with a body indicating a failed state, you should only mark this ServiceInstance as failed and stop polling. Then, when the user sends a deprovision request, you should call the Service Broker. Thanks to that, the given Service Broker will perform a deprovision action (a kind of garbage collection) for the failed instance.
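A rough sketch of the handling proposed above (the function names are hypothetical, just to illustrate the flow):

```go
package main

import "fmt"

// Hypothetical stand-ins, not real Service Catalog functions.
func markFailed(id string)      { fmt.Println("mark", id, "as Failed") }
func stopPolling(id string)     { fmt.Println("stop polling", id) }
func sendDeprovision(id string) { fmt.Println("DELETE deprovision", id) }

// handleLastOperation sketches the proposed behaviour: on a 200 response
// whose body reports a failed state, mark the instance as failed and stop
// polling, instead of starting orphan mitigation plus another provision call.
func handleLastOperation(state, id string) {
	if state == "failed" {
		markFailed(id)
		stopPolling(id)
	}
}

// handleUserDelete sends the deprovision request to the broker only when the
// user explicitly deletes the failed ServiceInstance, letting the broker do
// its own cleanup.
func handleUserDelete(id string) {
	sendDeprovision(id)
}

func main() {
	handleLastOperation("failed", "example-instance")
	handleUserDelete("example-instance")
}
```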
Thank you in advance