ServiceCatalog floods brokers when provisioning is permanently not possible. #2006
@mszostok thanks for reporting with so much detail; there seems to be something to clarify in the docs, or to fix/alter in the Service Catalog behavior. TL;DR:
This doesn't seem right. The retry loop should have an exponential backoff delay. If it doesn't, that's a bug; please report an issue, preferably with a detailed log/example.
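For illustration, here is a minimal standalone sketch (my own, not the actual Service Catalog retry code) of the kind of exponential backoff delay meant here: each failed attempt waits roughly twice as long as the previous one, up to a cap and a retry limit.

```go
package main

import (
	"fmt"
	"time"
)

// retryWithBackoff retries attempt() with an exponentially growing delay,
// capped at maxDelay, and gives up after maxRetries attempts.
func retryWithBackoff(attempt func() error, maxDelay time.Duration, maxRetries int) error {
	delay := 100 * time.Millisecond
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = attempt(); err == nil {
			return nil
		}
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(func() error {
		calls++
		if calls < 4 {
			return fmt.Errorf("provisioning failed (attempt %d)", calls)
		}
		return nil
	}, 2*time.Second, 10)
	fmt.Println("result:", err, "after", calls, "attempts")
}
```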
First of all, I would like to provide more context. To answer why orphan mitigation might be needed for async operations, let's first reiterate why orphan mitigation is part of the OSB spec at all. In an ideal world, it would have been the OSB broker's concern whether to perform a cleanup or not, and there is a general agreement that this should change in the next major release of the OSB spec. In the current v2 spec, however, the cleanup is the platform's responsibility. That implies that a broker doesn't have to have a retry and cleanup loop; it could be just a "simple" proxy that reacts to the platform poking it via the REST API. Which effectively means that, as a result of a failed provisioning, the service managed by the OSB broker could end up in an "unknown" abandoned state, potentially having created some resources that waste users' money.

Now back to your question. A "simple stateless broker" that supports the async API could also have no retry loop, just like a sync one. And because the side effects of a failed provisioning are unknown per OSB spec v2 (both sync and async), the platform needs to perform orphan mitigation "just in case". If the broker has nothing to clean up, it can implement deprovisioning of a non-existent instance as a no-op.
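A minimal sketch of what such a no-op deprovision could look like for a stateless broker (a hypothetical handler, not taken from any real broker): per the OSB v2 spec, answering 410 Gone for an instance the broker doesn't know about is treated by the platform as a successful cleanup.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"sync"
)

// instances is a hypothetical in-memory record of what the broker actually
// created; a real broker would track this in its own backend.
var (
	mu        sync.Mutex
	instances = map[string]bool{}
)

// deprovision handles DELETE /v2/service_instances/{instance_id}.
// If the instance is unknown (failed provisioning never created anything, or
// it was already cleaned up), it replies 410 Gone, which the platform treats
// as a successful deprovision - effectively a no-op orphan mitigation.
func deprovision(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodDelete {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	id := strings.TrimPrefix(r.URL.Path, "/v2/service_instances/")
	mu.Lock()
	defer mu.Unlock()
	if !instances[id] {
		w.WriteHeader(http.StatusGone) // 410: nothing to clean up
		fmt.Fprint(w, "{}")
		return
	}
	delete(instances, id)
	fmt.Fprint(w, "{}") // 200 OK with an empty JSON object
}

func main() {
	http.HandleFunc("/v2/service_instances/", deprovision)
	http.ListenAndServe(":8080", nil)
}
```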
What the retry loop in Service Catalog is addressing is improving the UX, making it as close as possible to the Kubernetes-native UX. What if the failed provisioning was just the result of a temporary network blip and it would succeed on retry? That's how Kubernetes works: it keeps retrying to deploy a Pod even if the specified Docker image doesn't exist yet, because it might be present at the next retry.
Also, to clarify the scope of the issue: by "when provisioning is permanently not possible" in the title, it is implied that the platform should either treat any async provisioning failure as a permanent one, or be able to determine whether a failure is temporary or permanent. Unfortunately, with OSB API v2 it is impossible for the platform (Service Catalog) to know whether the failure is temporary or permanent; the broker just reports that the operation failed. In 0.1.3 Service Catalog treated any failure as permanent. There was a general agreement that this was a flaw in the Service Catalog code, justified at the time by our focus on a beta release that only supported the "happy path" well. Since the beta was released, we have invested a lot of effort into improving coverage of the "non-happy path" (see the issues filtered by the corresponding label). I hope that helps.
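For reference, a rough Go rendering (my own sketch) of the last_operation response body defined by the OSB v2 spec, which is all the platform gets back when polling:

```go
// Package osb sketches the body returned by the OSB v2 "last_operation"
// polling endpoint. "state" is one of "in progress", "succeeded" or "failed";
// there is no field that distinguishes a temporary failure from a permanent
// one, which is why the platform cannot tell them apart.
package osb

type LastOperationResponse struct {
	State       string `json:"state"`
	Description string `json:"description,omitempty"`
}
```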
@nilebox Thank you for your reply, it helps a lot! Now we know that this behaviour is by design. The previous comment was based on version 0.1.12, but now I can see that v0.1.17 is the newest one, so we will validate the new flow one more time. If we find the same problems, we will look into the code and report an issue with the details. PS. Sorry for such a late response, but last week I was at a conference and on vacation.
Major flaws reported by the Kyma team have been fixed:
- kubernetes-retired/service-catalog#2025
- kubernetes-retired/service-catalog#1879
- kubernetes-retired/service-catalog#2006
Enabled the namespaced broker feature.
Hi all :)
This is our issue in a nutshell:
The flow of ServiceInstance provisioning was changed after the refactoring (#1886). Now the Service Catalog has an infinite loop: when asynchronous provisioning fails, it performs orphan mitigation (deprovision) and then tries to provision again, and it will do so until it finally succeeds.
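A simplified, self-contained sketch of the behaviour we observe (the function names are hypothetical stand-ins, not actual Service Catalog code):

```go
package main

import "fmt"

// Hypothetical stand-ins for the real broker client calls; they simulate a
// broker whose asynchronous provisioning keeps failing.
func provision(id string) string         { fmt.Println("PUT    provision  ", id); return "op-1" }
func pollLastOperation(op string) string { return "failed" }
func deprovision(id string)              { fmt.Println("DELETE deprovision", id) }

func main() {
	const instance = "example-instance"
	// The reconciliation behaviour described above: a failed asynchronous
	// provisioning triggers orphan mitigation (deprovision) followed
	// immediately by another provisioning attempt, with no delay and no
	// retry limit, so the broker keeps receiving requests.
	for attempt := 1; ; attempt++ {
		op := provision(instance)
		if pollLastOperation(op) == "succeeded" {
			break // success is the only exit from the loop
		}
		deprovision(instance) // orphan mitigation
		if attempt == 5 {
			fmt.Println("(stopped here for the example; the real loop never stops)")
			return
		}
	}
}
```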
Details
Scenario - Service Catalog version 0.1.3 - correct flow
Deprovision Request
The Service Catalog marks deprovisioning as Required and does not perform any additional actions. When we call deprovision on such a failed instance, the Service Catalog also sends the deprovision request to the broker.

Scenario - Service Catalog version 0.1.11 - broken flow
Deprovision Request
The Service Catalog marks deprovisioning as Not Required and does not perform any additional actions. When we call deprovision on such a failed instance, the Service Catalog does not send a deprovision request to the Service Broker, so you created a ticket to fix this (#1879), but the PR that was supposed to fix it (#1886) caused something more. Now the Service Catalog in versions above 0.1.11 has an infinite loop.
This loop "spams" our brokers because there are no delays and no limit on retries. In some cases such an approach breaks our flow.
We can also consider an example where someone has an Azure Service Broker and wants to provision a database service. The instance can fail, e.g. because the quota in the given namespace was exhausted, so repeating the described action over and over does not make sense.
The question is: why are you performing the orphan mitigation if the Service Broker responded with a 200 (OK) status code?
In the spec, https://github.com/openservicebrokerapi/servicebroker/blob/v2.13/spec.md#orphans, we can find that in such a case you shouldn't do that.
Could you elaborate on your new approach to provisioning? I cannot find any documentation about it in the Service Catalog.
What's more, in my opinion you should restore the previous behaviour (from version 0.1.3). When the Service Broker responds with a 200 status code on the last_operation endpoint with a body indicating a failed state, you should only mark this ServiceInstance as failed and stop polling. Then, when the user sends a deprovision request, you should call the Service Broker. Thanks to that, the given Service Broker will perform a deprovision action (a kind of garbage collection) for the failed instance.
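A rough sketch of the handling proposed above (the function names are hypothetical, just to illustrate the flow):

```go
package main

import "fmt"

// Hypothetical stand-ins, not real Service Catalog functions.
func markFailed(id string)      { fmt.Println("mark", id, "as Failed") }
func stopPolling(id string)     { fmt.Println("stop polling", id) }
func sendDeprovision(id string) { fmt.Println("DELETE deprovision", id) }

// handleLastOperation sketches the proposed behaviour: on a 200 response
// whose body reports a failed state, mark the instance as failed and stop
// polling, instead of starting orphan mitigation plus another provision call.
func handleLastOperation(state, id string) {
	if state == "failed" {
		markFailed(id)
		stopPolling(id)
	}
}

// handleUserDelete sends the deprovision request to the broker only when the
// user explicitly deletes the failed ServiceInstance, letting the broker do
// its own cleanup.
func handleUserDelete(id string) {
	sendDeprovision(id)
}

func main() {
	handleLastOperation("failed", "example-instance")
	handleUserDelete("example-instance")
}
```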
Thank you in advance