Investigate methods of handling failure / cancellation mid-apply #10065
Comments
My thoughts here: I don't think I like Option 1, given the interpolation problems and how it means misreporting success at creating the resource (even though creation often does succeed). It means we can't perform follow-up actions on the resource properly, such as deleting the default node pool in GKE.
Customer Impact: low (see b/195658189)
(removing the question label - I think that stops this from showing up in team triage, and I'd like to bring this up there)
I'm debating creating a separate issue. Our pain with the google provider is partial state in pipelines when we hit a timeout waiting on an API response. The result is rerunning the pipeline, which must destroy the previously deployed resources, compounding the delay in new environment creation. I wonder if the google provider is implementing partial mode.
We used to use partial mode and probably do in some places, but have since stopped using it. It was discovered by the SDK Lead at HashiCorp that it effectively didn't work, unfortunately! There was a period of time and set of assumptions where/when it did, but that's no longer generally true. That's reflected by partial not appearing in the successor to that guide, https://developer.hashicorp.com/terraform/tutorials/providers/provider-update, and the SDK function being informally deprecated.
That's a different issue than this one covers - please do!
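(For reference, a minimal sketch of roughly what partial mode usage looked like under terraform-plugin-sdk v1 - the informally deprecated mechanism discussed above. The resource and helper functions here are hypothetical, not actual provider code.)

```go
package example

import (
	"github.com/hashicorp/terraform-plugin-sdk/helper/schema"
)

func resourceExampleUpdate(d *schema.ResourceData, meta interface{}) error {
	// Enable partial state mode: only keys flagged via SetPartial were meant
	// to be persisted if this function returned an error.
	d.Partial(true)

	if d.HasChange("description") {
		if err := updateDescription(d, meta); err != nil {
			// In theory only the "description" change would be saved here;
			// in practice the mechanism didn't work reliably, which is why
			// SetPartial was removed in SDK v2.
			return err
		}
		d.SetPartial("description")
	}

	// Disable partial mode so the full state is saved on success.
	d.Partial(false)
	return resourceExampleRead(d, meta)
}

// Hypothetical stand-ins for the real API call and read function.
func updateDescription(d *schema.ResourceData, meta interface{}) error   { return nil }
func resourceExampleRead(d *schema.ResourceData, meta interface{}) error { return nil }
```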
Affected Resource(s)
Community Note
Description
We've been working with users running Terraform in a somewhat unusual environment, where SIGTERMs are sent regularly to the provider. This is handled by killing the provider run immediately, which means resources mid-provisioning get dropped and cause subsequent 409s. I feel this is the expected "naive" behaviour - the provider was asked to die by Terraform Core and did, and users can manually reconcile their state using `terraform import`.
This works less well in automated systems, though, and is exacerbated by systems that regularly time out or otherwise kill Terraform. In the absence of features to automatically pick up these resources, like hashicorp/terraform#19017 or import directives in config, there are a few viable solutions to investigate:
Option 1: Store the operation id and report success
We've trialed this for a few resources - GKE Cluster, GKE Node Pool, and IGM - with GoogleCloudPlatform/magic-modules#2857 as an example. Note that it's highly undesirable to wait for the operation in read, as `terraform plan` and `terraform refresh` are expected to return quickly.

The problem here is that because Terraform reports success, we've got to block in read, persist the empty resource through the next `terraform apply`, or make a synthetic diff to be able to run (and block on) update (e.g. make the `operation` field `Optional` instead of `Computed`, and use the diff from some value -> `""` to enter update).

Also note that reporting success on a creation implies that all values are set to the value present in the user's config. Terraform gives us some leeway right now, but future SDK versions will require that we do so. All unknown values will also be reported as known. We'll effectively be reporting false results, which can cause interesting interpolation problems if the resource hasn't finished creation by the next apply.
We've seen this with GKE - it can take as long as an hour after sending the initial request for the LRO to succeed, and any subsequent runs in that time will have broken interpolation (e.g. using the helm or kubernetes providers).
Worth investigating: Could we supplement GoogleCloudPlatform/magic-modules#2857 with a synthetic diff and have creation of child resources work correctly? It's likely not the case, though.
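As a rough illustration of Option 1 (a sketch, not the provider's actual implementation): persist the operation name in state, set the ID, and report success even if the wait is interrupted, resuming the wait on a later run. `startClusterCreate`, `waitForOperation`, `isInterrupted`, and `buildClusterID` are hypothetical helpers.

```go
package example

import (
	"fmt"

	"github.com/hashicorp/terraform-plugin-sdk/helper/schema"
)

func resourceExampleCreate(d *schema.ResourceData, meta interface{}) error {
	// Kick off the LRO and immediately record its name so a later run can
	// find it even if this process is killed.
	op, err := startClusterCreate(d, meta)
	if err != nil {
		return err
	}
	if err := d.Set("operation", op); err != nil {
		return err
	}
	d.SetId(buildClusterID(d))

	// Wait for the LRO. If the wait is interrupted (e.g. the run is being
	// cancelled), report success anyway: the operation name is in state, so
	// a later run can resume blocking on it (in read, or via a synthetic
	// diff into update as discussed above).
	if err := waitForOperation(op, meta); err != nil {
		if isInterrupted(err) {
			return nil
		}
		return fmt.Errorf("waiting for create operation %q: %w", op, err)
	}
	return d.Set("operation", "")
}

// Hypothetical stand-ins for the real request, polling, and ID logic.
func startClusterCreate(d *schema.ResourceData, meta interface{}) (string, error) {
	return "operation-12345", nil
}
func waitForOperation(op string, meta interface{}) error { return nil }
func isInterrupted(err error) bool                       { return false }
func buildClusterID(d *schema.ResourceData) string {
	return "projects/p/locations/l/clusters/c"
}
```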
Option 2: Store the operation id and report failure (tainting the resource)
This is largely the same as Option 1, but will mitigate the interpolation issues by tainting the resource first. We will have reported an error, and Terraform will recognise that unknown values are unknown, and should order operations differently.
`terraform refresh` and `terraform plan` can update the state, but if blocking is needed to finish creating the resource, it can be done in the delete method, where we can safely block.
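A corresponding sketch of Option 2 (again hypothetical, continuing the example package and reusing the stand-in helpers from the Option 1 sketch above): set the ID and operation, then return an error so Core records the resource as tainted, and block on any pending operation in delete.

```go
// (Continuing the hypothetical package from the Option 1 sketch; reuses
// startClusterCreate, waitForOperation, and buildClusterID from there.)

func resourceExampleCreateTainting(d *schema.ResourceData, meta interface{}) error {
	op, err := startClusterCreate(d, meta)
	if err != nil {
		return err
	}
	if err := d.Set("operation", op); err != nil {
		return err
	}
	d.SetId(buildClusterID(d))

	if err := waitForOperation(op, meta); err != nil {
		// Returning an error after SetId means Terraform saves the resource
		// as tainted: downstream references stay unknown rather than being
		// interpolated from a half-created resource.
		return fmt.Errorf("create interrupted; operation %q still pending", op)
	}
	return d.Set("operation", "")
}

func resourceExampleDelete(d *schema.ResourceData, meta interface{}) error {
	// Delete only runs during apply, so blocking on the pending create here
	// is safe before issuing the actual delete call.
	if op, ok := d.Get("operation").(string); ok && op != "" {
		if err := waitForOperation(op, meta); err != nil {
			return err
		}
	}
	return deleteCluster(d, meta)
}

// Hypothetical stand-in for the real delete call.
func deleteCluster(d *schema.ResourceData, meta interface{}) error { return nil }
```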
Option 3: Use GCP's built-in idempotency tokens
GCP has built-in idempotency support that we don't use: the `requestId` field at https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert. I haven't used this personally, and some open questions remain.
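A sketch (not how the provider works today) of what using the token could look like with the Compute Go client (google.golang.org/api/compute/v1); deriving and persisting a stable `requestId` per planned create is one of those open questions.

```go
package example

import (
	compute "google.golang.org/api/compute/v1"
)

// insertWithRequestID passes an idempotency token on the insert call. The
// caller would need to derive requestID deterministically (e.g. generate it
// before the first attempt and persist it) so a retried apply resends the
// same value instead of creating a duplicate or hitting a 409.
func insertWithRequestID(svc *compute.Service, project, zone string, inst *compute.Instance, requestID string) (*compute.Operation, error) {
	return svc.Instances.Insert(project, zone, inst).RequestId(requestID).Do()
}
```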
References