Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schema Resource Timeout Affects Synchronized Resources Globally With Context Aware Functions #675

Closed
bflad opened this issue Jan 7, 2021 · 4 comments · Fixed by #723
Closed
Assignees
Labels
enhancement New feature or request

Comments

@bflad
Copy link
Contributor

bflad commented Jan 7, 2021

SDK version

v2.4.0

Use-cases

The Terraform Plugin SDK has always supported customizable timeouts, declared at the helper/schema.Resource.Timeouts level with the operation aware helper/schema.ResourceTimeout type. These were designed to allow practitioners to implement a timeout per resource, e.g.

resource "example_thing" "example" {
  timeouts {
    create = "60m"
  }
}

Where the resource logic could fetch the timeout value via (helper/schema.ResourceData).Timeout() function.

Recently, context-aware resource operation functions were introduced and the existing non-context-aware functions were deprecated. These new context-aware resource operations forcibly introduce an operation timeout (defaults to 20 minutes) for the called logic, e.g.

ctx, cancel := context.WithTimeout(ctx, d.Timeout(TimeoutCreate))
defer cancel()
return r.CreateContext(ctx, d, meta)

Previously, the non-context-aware resources could be implemented with helper/schema.Resource.Timeouts.Create defined and operation logic such as:

func resourceExampleThingCreate(d *schema.ResourceData, meta interface{}) error {
  mutexkv.Lock("somekey")
  defer mutexkv.Unlock("somekey")
  // ... reference to d.Timeout(schema.TimeoutCreate) ...
}

Where in this case, the create timeout could be referenced in the code via d.Timeout(schema.TimeoutCreate) after the synchronization coordination. The creation timeout only affected this single operation, even if multiple operations were going to occur.

In the case of the context-aware functionality however, the timeout is configured before the mutex. So when multiple operations are waiting on the mutex, they are now sharing the same resource operation timeout. Since the timeout cannot be extended within the resource logic, developers are stuck either setting an arbitrarily high timeout (losing its usefulness) or practitioners are arbitrarily hitting the timeout as they scale out their configuration.

To characterize how some remote systems can require that operations must be serialized or otherwise synchronized:

  • Only 1 resource type can be created at a time
  • Only 1 child resource type can be created at a time per parent resource
  • Only 10 concurrent updates of a resource type can be performed at a time
  • Remote system may support configurable synchronization limits (either by the customer or remote system operators)

This essentially means the semantics of the synchronization can be based off resource configuration, operation type, and potentially accordant to the limits set by a remote system.

Attempted Solutions

&schema.Resource{
	Timeouts: &schema.ResourceTimeout{
		Create: schema.DefaultTimeout(3 * time.Hour),
	},
}

Proposals

Proposal 1: New Resource Methods for Locking

Introduce a resource operation hook before the context timeout is set for the purposes of handling the synchronization. To keep this generic, we can lean on the sync.Locker interface while allowing developers access to the resource data (e.g. attribute values) and meta (e.g. API client).

In the Terraform Plugin SDK, this would be something like:

type ResourceLockFunc func (context.Context, *ResourceData, interface{}) (sync.Locker, diag.Diagnostics)

type Resource struct{
	CreateLock ResourceLockFunc
	DeleteLock ResourceLockFunc
	ReadLock   ResourceLockFunc
	UpdateLock ResourceLockFunc
}

func (r *Resource) create(ctx context.Context, d *ResourceData, meta interface{}) diag.Diagnostics {
	if r.CreateLock != nil {
		locker, diags := r.CreateLock(ctx, d, meta)
		
		if diags.HasError() {
			return diags
		}

		if locker != nil {
			locker.Lock()
			defer locker.Unlock()
		}
	}

	// ... existing logic ...
}

Where a simple provider implementation may look like:

var resourceExampleThingMutex *sync.Mutex

&schema.Resource{
	CreateContext: resourceExampleThingCreate,
	CreateLock:    resourceExampleThingLock,
	Timeouts:      &schema.ResourceTimeout{
		Create: schema.DefaultTimeout(30 * time.Minute),
	},
}

func resourceExampleThingCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// ...
}

func resourceExampleThingLock(ctx context.Context, d *ResourceData, meta interface{}) (sync.Locker, diag.Diagnostics) {
	return resourceExampleThingMutex, nil
}

Proposal 2: New Context-Aware Methods Without Timeout

Another option is to provide a separate receiver method that is called before the context is set with the timeout so as to not "force" the resource timeout (instead it would be up to developers to manually implement it).

type Resource struct{
	CreateWithoutTimeout CreateContextFunc
	DeleteWithoutTimeout DeleteContextFunc
	ReadWithoutTimeout   ReadContextFunc
	UpdateWithoutTimeout UpdateContextFunc
}

func (r *Resource) create(ctx context.Context, d *ResourceData, meta interface{}) diag.Diagnostics {
	if r.Create != nil {
		if err := r.Create(d, meta); err != nil {
			return diag.FromErr(err)
		}
		return nil
	}

	if r.CreateWithoutTimeout != nil {
		return r.CreateWithoutTimeout(ctx, d, meta)
	}

	ctx, cancel := context.WithTimeout(ctx, d.Timeout(TimeoutCreate))
	defer cancel()
	return r.CreateContext(ctx, d, meta)
}

Where a provider implementation may look like:

&schema.Resource{
	CreateContext: resourceExampleThingCreate,
	Timeouts:      &schema.ResourceTimeout{
		Create: schema.DefaultTimeout(30 * time.Minute),
	},
}

func resourceExampleThingCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	mutexkv.Lock("somekey")
	defer mutexkv.Unlock("somekey")
	// ... reference to d.Timeout(schema.TimeoutCreate) ...
}
@bflad bflad added the enhancement New feature or request label Jan 7, 2021
@bflad bflad changed the title Schema Resource Timeout Affects Synchronized Resources Globally Schema Resource Timeout Affects Synchronized Resources Globally With Context Aware Functions Jan 7, 2021
@paddycarver
Copy link
Contributor

You always give me complicated ones @bflad. :)

This one's rough just because I don't know what the right behavior here is.

Speaking naively here: not having timeouts at all seems like a poor user experience. I don't love that we enforced them unintentionally, but I'm not positive that the academically "correct" answer here is to allow opting out of them? Should providers be able to say "we're going to take how long we take, no you don't get to limit us at all"? That's effectively what allowing providers to ignore timeouts does. Worse, because timeouts are implicitly supported for each and every resource and there's no way to opt out of them, the user can set timeouts and have them be ignored.

It's not lost on me that you didn't actually propose just not applying the timeouts to the contexts, but whenever we introduce new behavior by accident/take away behavior by accident, my first instinct is to see if reverting makes sense, and I want to call out in this case that I don't know that it does, and we may want to forge a path we're happy with instead of rolling back.

As for what you did propose, it sounds like you want to do the locking behavior prior to the timeout timer starting. And have some really neat ideas for how we could provide optional hooks for providers making it sustainable, maintainable, and obvious for us. That's really cool!

Here's my thing though. What's... the point of these timeouts? What are we trying to protect against? What user intent do the capture, and what representations do they make to the user?

My suspicion is that the purpose of these timeouts is "I want to prevent terraform apply from getting caught on a bug and spinning indefinitely, so I'm going to set a timeout that should not be reasonably hit in normal operation so that if something goes wrong it doesn't spin forever, but if things are working as they're supposed to, it doesn't get triggered." That is, I think they're a purely defensive mechanism to protect against bugs in the provider or the API. I'm open to further ideas about what these are for and what people want them to do.

In that light, I'm of two minds about your proposal. On the one hand, we're clearly circumventing the defensive protections for the part of the provider code most likely to have bugs. A lock that is never released will just freeze things up and disregard timeouts completely. And that seems not ideal. Mutexes seem like the exact place I'd like timeouts to apply, honestly. On the other hand, predicting what a "reasonable" timeout in that situation means is incredibly difficult. It depends on the number of resources being created in parallel, and the specifics of how the API is bottlenecked. It's not a great user experience to push that onto them at the risk of errors being thrown in their face.

I'm not super clear on what to do here. I think more information is needed on what these timeouts are for and what they're intended to do so we can honor the spirit of the user's configuration.

@bflad
Copy link
Contributor Author

bflad commented Jan 29, 2021

I think more information is needed on what these timeouts are for and what they're intended to do so we can honor the spirit of the user's configuration.

How can I help and what type of artifacts are you looking for?

The original implementation for customizable resource timeouts can be found here: hashicorp/terraform#12311 (including a 20 minute default)

This adds to helper/schema the concept of Timeouts, allowing Provider developers to opt-in to offering customizable timeouts for a given Create, Read, Update, Delete action. By exposing timeouts per action, operators can customize the amount of time Terraform is allowed to operate a specific action before timing out.

A concrete example can be found in aws_db_instance.go, a resource that notoriously takes a very long time to provision. In the past there have been frequent requests to increase the amount of time allowed for provisioning, modification, and deletion. Because these actions vary on configuration (instance size, backup creation requirements), some require more time than others. While bumping the timeout is generally fine, it does have drawbacks, mainly being that any change requires a new release of Terraform, and there exists possibilities of unknown issues that would cause an operation to get stuck in a retry loop, thus taking a very long time to surface a currently uncaught error.

To me that reads that the original intention of this mechanism was for allowing practitioners to control the expected provisioning time of a single resource. A single resource may have multiple resource instances given a count or for_each configuration where the intention of those configuration language features is to overlay a consistent configuration across them, so in this case where each resource instance inherits its own timeout, intended for its own operations. The Go documentation for the (helper/schema.Resource).Timeouts functionality has stated the same intention since the beginning and still has the same documentation today.

Whether other unintended usage of those customizable timeouts has arisen since that initial design decision, I think would require broadly looking across the publicly available Terraform Provider code. However, I am having a hard time discerning what else a provider developer could have done with it where it would change practitioner expectations. The semantics of the original timeout implementation meant that provider developers had to choose when to start the timeout during a resource function and from the "SDK perspective", the timeout was not enforced at the resource function level, only configured and available for resource functions to reference and optionally implement.

As part of [RFC] TF-209: Context Aware Terraform SDK, there is an explicit mention about this new behavior in the CRUD(E) Operations > Caveats subsection:

The SDK has a hardcoded 20 minute default timeout, this means the contexts will timeout after this time if no timeout was set for the CRUD operation. We may want to allow the ability to opt out of this behavior.

However, there is no further details or resolution in that artifact. The pull request implementing the change also contains comments on this subject, which also mention potentially giving the ability to opt out.


I think we are agreeing on all these points, but my additional two cents here is that this is a case where the SDK change to implement the timeouts is better for provider developers and practitioners. Certainly do not want to dispute that and certainly don't want to give the impression that reverting this or circumventing that is being requested. All resources automatically get customizable timeouts support and are closer to supporting operation cancellation, which seems great. (Upgrade problems for providers, like AWS, that had hardcoded timeouts in resource logic greater than this newly enforced default is its Own Thing™.)

To highlight it more from a practitioner perspective, it may not be clear that if I do something like:

resource "takes_really_time_and_can_only_be_serialized" "one" {}

# this may be many files/modules away from the above and repeated much more
resource "takes_really_time_and_can_only_be_serialized" "two" {}

That my resource timeout errors are because the API had some requirement coded into the resource to prevent other errors. The first instinct might be to file a provider bug, where the provider maintainer will need to explain that really you want:

resource "takes_really_time_and_can_only_be_serialized" "one" {
  timeouts {
    create = "??"
  }
}

# this may be many files/modules away from the above and repeated much more
resource "takes_really_time_and_can_only_be_serialized" "two" {
  timeouts {
    create = "??"
  }
}

Where there may not be clear guidance on what to fill in there since it depends on the number of things you have times any documented API SLA (if any). Oh, and every time you adjust the number of these, you could need to adjust those timeouts more. Seems a little unfair to expect practitioners (especially new ones) to continually figure out this class of problem, even if there is technically a configuration fix that can be documented. In this sense, the timeouts configuration now represents a potentially confusing burden.


If there is more of something I can do with regards to this last point or the timeouts investigation, let me know! I don't mind writing up an RFC so we can solicit broader opinions either.

bflad added a commit to bflad/terraform-plugin-sdk that referenced this issue Mar 9, 2021
…t default timeout

Reference: hashicorp#675

The Terraform Plugin SDK has always supported customizable timeouts, declared at the `helper/schema.Resource.Timeouts` level with the operation aware [`helper/schema.ResourceTimeout` type](https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema#ResourceTimeout). These were designed to allow practitioners to implement a timeout per resource, e.g.

```terraform
resource "example_thing" "example" {
  timeouts {
    create = "60m"
  }
}
```

Where the resource logic could fetch the timeout value via [`(helper/schema.ResourceData).Timeout()` function](https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema#ResourceData.Timeout).

Recently, context-aware resource operation functions were introduced and the existing non-context-aware functions were deprecated. These new context-aware resource operations forcibly introduce an operation timeout (defaults to 20 minutes) for the called logic, e.g.

https://github.com/hashicorp/terraform-plugin-sdk/blob/e21a5bbb73209dcb31f4c9dff3a8c0cec1ca65ac/helper/schema/resource.go#L283-L285

Previously, the non-context-aware resources could be implemented with `helper/schema.Resource.Timeouts.Create` defined and operation logic such as:

```go
func resourceExampleThingCreate(d *schema.ResourceData, meta interface{}) error {
  mutexkv.Lock("somekey")
  defer mutexkv.Unlock("somekey")
  // ... reference to d.Timeout(schema.TimeoutCreate) ...
}
```

Where in this case, the create timeout could be referenced in the code via `d.Timeout(schema.TimeoutCreate)` after the synchronization coordination. The creation timeout only affected this single operation, even if multiple operations were going to occur.

In the case of the context-aware functionality however, the timeout is configured before the mutex. So when multiple operations are waiting on the mutex, they are now sharing the same resource operation timeout. Since the timeout cannot be extended within the resource logic, developers are stuck either setting an arbitrarily high timeout (losing its usefulness) or practitioners are arbitrarily hitting the timeout as they scale out their configuration.

To characterize how some remote systems can require that operations must be serialized or otherwise synchronized:

* Only 1 resource type can be created at a time
* Only 1 child resource type can be created at a time per parent resource
* Only 10 concurrent updates of a resource type can be performed at a time
* Remote system may support configurable synchronization limits (either by the customer or remote system operators)

This essentially means the semantics of the synchronization can be based off resource configuration, operation type, and potentially accordant to the limits set by a remote system.

### Attempted Solutions

```go
&schema.Resource{
	Timeouts: &schema.ResourceTimeout{
		Create: schema.DefaultTimeout(3 * time.Hour),
	},
}
```

This introduces separate CRUD methods that are invoked before the context is set with the timeout, so as to not "force" the resource timeout. If timeout behavior is still desired, developers must manually implement timeout handling in the function logic when using these.

Most developers should prefer the existing context-aware functions and this is documented as such. However, another minor side benefit of these functions is that it also allows developers to upgrade CRUD function signatures for context awareness without logically changing resource behavior, then slowly switch to the timeout functions over time.
@bflad
Copy link
Contributor Author

bflad commented Mar 9, 2021

Submitted #723 with proposal 2 implementation as discussed. Please let me know if you have any changes you would like made.

paddycarver pushed a commit that referenced this issue Mar 18, 2021
…t default timeout

Reference: #675

The Terraform Plugin SDK has always supported customizable timeouts, declared at the `helper/schema.Resource.Timeouts` level with the operation aware [`helper/schema.ResourceTimeout` type](https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema#ResourceTimeout). These were designed to allow practitioners to implement a timeout per resource, e.g.

```terraform
resource "example_thing" "example" {
  timeouts {
    create = "60m"
  }
}
```

Where the resource logic could fetch the timeout value via [`(helper/schema.ResourceData).Timeout()` function](https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema#ResourceData.Timeout).

Recently, context-aware resource operation functions were introduced and the existing non-context-aware functions were deprecated. These new context-aware resource operations forcibly introduce an operation timeout (defaults to 20 minutes) for the called logic, e.g.

https://github.com/hashicorp/terraform-plugin-sdk/blob/e21a5bbb73209dcb31f4c9dff3a8c0cec1ca65ac/helper/schema/resource.go#L283-L285

Previously, the non-context-aware resources could be implemented with `helper/schema.Resource.Timeouts.Create` defined and operation logic such as:

```go
func resourceExampleThingCreate(d *schema.ResourceData, meta interface{}) error {
  mutexkv.Lock("somekey")
  defer mutexkv.Unlock("somekey")
  // ... reference to d.Timeout(schema.TimeoutCreate) ...
}
```

Where in this case, the create timeout could be referenced in the code via `d.Timeout(schema.TimeoutCreate)` after the synchronization coordination. The creation timeout only affected this single operation, even if multiple operations were going to occur.

In the case of the context-aware functionality however, the timeout is configured before the mutex. So when multiple operations are waiting on the mutex, they are now sharing the same resource operation timeout. Since the timeout cannot be extended within the resource logic, developers are stuck either setting an arbitrarily high timeout (losing its usefulness) or practitioners are arbitrarily hitting the timeout as they scale out their configuration.

To characterize how some remote systems can require that operations must be serialized or otherwise synchronized:

* Only 1 resource type can be created at a time
* Only 1 child resource type can be created at a time per parent resource
* Only 10 concurrent updates of a resource type can be performed at a time
* Remote system may support configurable synchronization limits (either by the customer or remote system operators)

This essentially means the semantics of the synchronization can be based off resource configuration, operation type, and potentially accordant to the limits set by a remote system.

### Attempted Solutions

```go
&schema.Resource{
	Timeouts: &schema.ResourceTimeout{
		Create: schema.DefaultTimeout(3 * time.Hour),
	},
}
```

This introduces separate CRUD methods that are invoked before the context is set with the timeout, so as to not "force" the resource timeout. If timeout behavior is still desired, developers must manually implement timeout handling in the function logic when using these.

Most developers should prefer the existing context-aware functions and this is documented as such. However, another minor side benefit of these functions is that it also allows developers to upgrade CRUD function signatures for context awareness without logically changing resource behavior, then slowly switch to the timeout functions over time.
@ghost
Copy link

ghost commented Apr 18, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 18, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
2 participants