-
Notifications
You must be signed in to change notification settings - Fork 233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Schema Resource Timeout Affects Synchronized Resources Globally With Context Aware Functions #675
Comments
You always give me complicated ones @bflad. :) This one's rough just because I don't know what the right behavior here is. Speaking naively here: not having timeouts at all seems like a poor user experience. I don't love that we enforced them unintentionally, but I'm not positive that the academically "correct" answer here is to allow opting out of them? Should providers be able to say "we're going to take how long we take, no you don't get to limit us at all"? That's effectively what allowing providers to ignore timeouts does. Worse, because timeouts are implicitly supported for each and every resource and there's no way to opt out of them, the user can set timeouts and have them be ignored. It's not lost on me that you didn't actually propose just not applying the timeouts to the contexts, but whenever we introduce new behavior by accident/take away behavior by accident, my first instinct is to see if reverting makes sense, and I want to call out in this case that I don't know that it does, and we may want to forge a path we're happy with instead of rolling back. As for what you did propose, it sounds like you want to do the locking behavior prior to the timeout timer starting. And have some really neat ideas for how we could provide optional hooks for providers making it sustainable, maintainable, and obvious for us. That's really cool! Here's my thing though. What's... the point of these timeouts? What are we trying to protect against? What user intent do the capture, and what representations do they make to the user? My suspicion is that the purpose of these timeouts is "I want to prevent In that light, I'm of two minds about your proposal. On the one hand, we're clearly circumventing the defensive protections for the part of the provider code most likely to have bugs. A lock that is never released will just freeze things up and disregard timeouts completely. And that seems not ideal. Mutexes seem like the exact place I'd like timeouts to apply, honestly. On the other hand, predicting what a "reasonable" timeout in that situation means is incredibly difficult. It depends on the number of resources being created in parallel, and the specifics of how the API is bottlenecked. It's not a great user experience to push that onto them at the risk of errors being thrown in their face. I'm not super clear on what to do here. I think more information is needed on what these timeouts are for and what they're intended to do so we can honor the spirit of the user's configuration. |
How can I help and what type of artifacts are you looking for? The original implementation for customizable resource timeouts can be found here: hashicorp/terraform#12311 (including a 20 minute default)
To me that reads that the original intention of this mechanism was for allowing practitioners to control the expected provisioning time of a single resource. A single resource may have multiple resource instances given a Whether other unintended usage of those customizable timeouts has arisen since that initial design decision, I think would require broadly looking across the publicly available Terraform Provider code. However, I am having a hard time discerning what else a provider developer could have done with it where it would change practitioner expectations. The semantics of the original timeout implementation meant that provider developers had to choose when to start the timeout during a resource function and from the "SDK perspective", the timeout was not enforced at the resource function level, only configured and available for resource functions to reference and optionally implement. As part of [RFC] TF-209: Context Aware Terraform SDK, there is an explicit mention about this new behavior in the CRUD(E) Operations > Caveats subsection:
However, there is no further details or resolution in that artifact. The pull request implementing the change also contains comments on this subject, which also mention potentially giving the ability to opt out. I think we are agreeing on all these points, but my additional two cents here is that this is a case where the SDK change to implement the timeouts is better for provider developers and practitioners. Certainly do not want to dispute that and certainly don't want to give the impression that reverting this or circumventing that is being requested. All resources automatically get customizable timeouts support and are closer to supporting operation cancellation, which seems great. (Upgrade problems for providers, like AWS, that had hardcoded timeouts in resource logic greater than this newly enforced default is its Own Thing™.) To highlight it more from a practitioner perspective, it may not be clear that if I do something like: resource "takes_really_time_and_can_only_be_serialized" "one" {}
# this may be many files/modules away from the above and repeated much more
resource "takes_really_time_and_can_only_be_serialized" "two" {} That my resource timeout errors are because the API had some requirement coded into the resource to prevent other errors. The first instinct might be to file a provider bug, where the provider maintainer will need to explain that really you want: resource "takes_really_time_and_can_only_be_serialized" "one" {
timeouts {
create = "??"
}
}
# this may be many files/modules away from the above and repeated much more
resource "takes_really_time_and_can_only_be_serialized" "two" {
timeouts {
create = "??"
}
} Where there may not be clear guidance on what to fill in there since it depends on the number of things you have times any documented API SLA (if any). Oh, and every time you adjust the number of these, you could need to adjust those timeouts more. Seems a little unfair to expect practitioners (especially new ones) to continually figure out this class of problem, even if there is technically a configuration fix that can be documented. In this sense, the timeouts configuration now represents a potentially confusing burden. If there is more of something I can do with regards to this last point or the timeouts investigation, let me know! I don't mind writing up an RFC so we can solicit broader opinions either. |
…t default timeout Reference: hashicorp#675 The Terraform Plugin SDK has always supported customizable timeouts, declared at the `helper/schema.Resource.Timeouts` level with the operation aware [`helper/schema.ResourceTimeout` type](https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema#ResourceTimeout). These were designed to allow practitioners to implement a timeout per resource, e.g. ```terraform resource "example_thing" "example" { timeouts { create = "60m" } } ``` Where the resource logic could fetch the timeout value via [`(helper/schema.ResourceData).Timeout()` function](https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema#ResourceData.Timeout). Recently, context-aware resource operation functions were introduced and the existing non-context-aware functions were deprecated. These new context-aware resource operations forcibly introduce an operation timeout (defaults to 20 minutes) for the called logic, e.g. https://github.com/hashicorp/terraform-plugin-sdk/blob/e21a5bbb73209dcb31f4c9dff3a8c0cec1ca65ac/helper/schema/resource.go#L283-L285 Previously, the non-context-aware resources could be implemented with `helper/schema.Resource.Timeouts.Create` defined and operation logic such as: ```go func resourceExampleThingCreate(d *schema.ResourceData, meta interface{}) error { mutexkv.Lock("somekey") defer mutexkv.Unlock("somekey") // ... reference to d.Timeout(schema.TimeoutCreate) ... } ``` Where in this case, the create timeout could be referenced in the code via `d.Timeout(schema.TimeoutCreate)` after the synchronization coordination. The creation timeout only affected this single operation, even if multiple operations were going to occur. In the case of the context-aware functionality however, the timeout is configured before the mutex. So when multiple operations are waiting on the mutex, they are now sharing the same resource operation timeout. Since the timeout cannot be extended within the resource logic, developers are stuck either setting an arbitrarily high timeout (losing its usefulness) or practitioners are arbitrarily hitting the timeout as they scale out their configuration. To characterize how some remote systems can require that operations must be serialized or otherwise synchronized: * Only 1 resource type can be created at a time * Only 1 child resource type can be created at a time per parent resource * Only 10 concurrent updates of a resource type can be performed at a time * Remote system may support configurable synchronization limits (either by the customer or remote system operators) This essentially means the semantics of the synchronization can be based off resource configuration, operation type, and potentially accordant to the limits set by a remote system. ### Attempted Solutions ```go &schema.Resource{ Timeouts: &schema.ResourceTimeout{ Create: schema.DefaultTimeout(3 * time.Hour), }, } ``` This introduces separate CRUD methods that are invoked before the context is set with the timeout, so as to not "force" the resource timeout. If timeout behavior is still desired, developers must manually implement timeout handling in the function logic when using these. Most developers should prefer the existing context-aware functions and this is documented as such. However, another minor side benefit of these functions is that it also allows developers to upgrade CRUD function signatures for context awareness without logically changing resource behavior, then slowly switch to the timeout functions over time.
Submitted #723 with proposal 2 implementation as discussed. Please let me know if you have any changes you would like made. |
…t default timeout Reference: #675 The Terraform Plugin SDK has always supported customizable timeouts, declared at the `helper/schema.Resource.Timeouts` level with the operation aware [`helper/schema.ResourceTimeout` type](https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema#ResourceTimeout). These were designed to allow practitioners to implement a timeout per resource, e.g. ```terraform resource "example_thing" "example" { timeouts { create = "60m" } } ``` Where the resource logic could fetch the timeout value via [`(helper/schema.ResourceData).Timeout()` function](https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema#ResourceData.Timeout). Recently, context-aware resource operation functions were introduced and the existing non-context-aware functions were deprecated. These new context-aware resource operations forcibly introduce an operation timeout (defaults to 20 minutes) for the called logic, e.g. https://github.com/hashicorp/terraform-plugin-sdk/blob/e21a5bbb73209dcb31f4c9dff3a8c0cec1ca65ac/helper/schema/resource.go#L283-L285 Previously, the non-context-aware resources could be implemented with `helper/schema.Resource.Timeouts.Create` defined and operation logic such as: ```go func resourceExampleThingCreate(d *schema.ResourceData, meta interface{}) error { mutexkv.Lock("somekey") defer mutexkv.Unlock("somekey") // ... reference to d.Timeout(schema.TimeoutCreate) ... } ``` Where in this case, the create timeout could be referenced in the code via `d.Timeout(schema.TimeoutCreate)` after the synchronization coordination. The creation timeout only affected this single operation, even if multiple operations were going to occur. In the case of the context-aware functionality however, the timeout is configured before the mutex. So when multiple operations are waiting on the mutex, they are now sharing the same resource operation timeout. Since the timeout cannot be extended within the resource logic, developers are stuck either setting an arbitrarily high timeout (losing its usefulness) or practitioners are arbitrarily hitting the timeout as they scale out their configuration. To characterize how some remote systems can require that operations must be serialized or otherwise synchronized: * Only 1 resource type can be created at a time * Only 1 child resource type can be created at a time per parent resource * Only 10 concurrent updates of a resource type can be performed at a time * Remote system may support configurable synchronization limits (either by the customer or remote system operators) This essentially means the semantics of the synchronization can be based off resource configuration, operation type, and potentially accordant to the limits set by a remote system. ### Attempted Solutions ```go &schema.Resource{ Timeouts: &schema.ResourceTimeout{ Create: schema.DefaultTimeout(3 * time.Hour), }, } ``` This introduces separate CRUD methods that are invoked before the context is set with the timeout, so as to not "force" the resource timeout. If timeout behavior is still desired, developers must manually implement timeout handling in the function logic when using these. Most developers should prefer the existing context-aware functions and this is documented as such. However, another minor side benefit of these functions is that it also allows developers to upgrade CRUD function signatures for context awareness without logically changing resource behavior, then slowly switch to the timeout functions over time.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further. |
SDK version
Use-cases
The Terraform Plugin SDK has always supported customizable timeouts, declared at the
helper/schema.Resource.Timeouts
level with the operation awarehelper/schema.ResourceTimeout
type. These were designed to allow practitioners to implement a timeout per resource, e.g.Where the resource logic could fetch the timeout value via
(helper/schema.ResourceData).Timeout()
function.Recently, context-aware resource operation functions were introduced and the existing non-context-aware functions were deprecated. These new context-aware resource operations forcibly introduce an operation timeout (defaults to 20 minutes) for the called logic, e.g.
terraform-plugin-sdk/helper/schema/resource.go
Lines 283 to 285 in e21a5bb
Previously, the non-context-aware resources could be implemented with
helper/schema.Resource.Timeouts.Create
defined and operation logic such as:Where in this case, the create timeout could be referenced in the code via
d.Timeout(schema.TimeoutCreate)
after the synchronization coordination. The creation timeout only affected this single operation, even if multiple operations were going to occur.In the case of the context-aware functionality however, the timeout is configured before the mutex. So when multiple operations are waiting on the mutex, they are now sharing the same resource operation timeout. Since the timeout cannot be extended within the resource logic, developers are stuck either setting an arbitrarily high timeout (losing its usefulness) or practitioners are arbitrarily hitting the timeout as they scale out their configuration.
To characterize how some remote systems can require that operations must be serialized or otherwise synchronized:
This essentially means the semantics of the synchronization can be based off resource configuration, operation type, and potentially accordant to the limits set by a remote system.
Attempted Solutions
Proposals
Proposal 1: New Resource Methods for Locking
Introduce a resource operation hook before the context timeout is set for the purposes of handling the synchronization. To keep this generic, we can lean on the
sync.Locker
interface while allowing developers access to the resource data (e.g. attribute values) and meta (e.g. API client).In the Terraform Plugin SDK, this would be something like:
Where a simple provider implementation may look like:
Proposal 2: New Context-Aware Methods Without Timeout
Another option is to provide a separate receiver method that is called before the context is set with the timeout so as to not "force" the resource timeout (instead it would be up to developers to manually implement it).
Where a provider implementation may look like:
The text was updated successfully, but these errors were encountered: