Investigate methods of handling failure / cancellation mid-apply #10065

Open
rileykarson opened this issue Sep 13, 2021 · 5 comments

Comments

@rileykarson
Collaborator

rileykarson commented Sep 13, 2021

Affected Resource(s)

  • google_*

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment. If the issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If the issue is assigned to a user, that user is claiming responsibility for the issue. If the issue is assigned to "hashibot", a community member has claimed the issue already.

Description

We've been working with users who run Terraform in a somewhat unusual environment, where SIGTERMs are regularly sent to the provider. Today this is handled by killing the provider run immediately, which means resources that are mid-provisioning get dropped and cause 409s on subsequent runs. I feel this is the expected "naive" behaviour- the provider was asked to die by Terraform Core and did, and users can manually reconcile their state using terraform import.
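
To make the failure mode concrete, here is a minimal, hypothetical sketch of an operation wait loop (not the provider's actual code; pollOperation and the polling interval are placeholders): when Terraform cancels the context in response to SIGTERM, the wait returns early and, unless create persisted something first, the half-created resource never reaches state.

```go
// Hypothetical sketch only: pollOperation and the 10s interval are placeholders,
// not the provider's real operation-wait helpers.
package main

import (
	"context"
	"fmt"
	"time"
)

// pollOperation stands in for a single GET on the LRO; here it never completes.
func pollOperation(ctx context.Context, name string) (done bool, err error) {
	return false, nil
}

// waitForOperation blocks until the LRO finishes or the context is cancelled.
// When Terraform Core sends SIGTERM, ctx is cancelled, this returns early, and
// the half-created resource is dropped unless create persisted something first.
func waitForOperation(ctx context.Context, name string) error {
	for {
		done, err := pollOperation(ctx, name)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("operation %s still running: %w", name, ctx.Err())
		case <-time.After(10 * time.Second):
		}
	}
}

func main() {
	// Simulate a cancellation arriving one second into the wait.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(waitForOperation(ctx, "operation-12345"))
}
```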

This works less well in automated systems, though, and is exacerbated by systems that regularly time out or otherwise kill Terraform. In the absence of features to automatically pick these resources back up, like hashicorp/terraform#19017 or import directives in config, there are a few viable solutions to investigate:

Option 1: Store the operation id and report success

We've trialed this for a few resources- GKE Cluster, GKE Node Pool, and IGM- with GoogleCloudPlatform/magic-modules#2857 as an example. Note that it's highly undesirable to wait for the operation in read, as terraform plan and terraform refresh are expected to return quickly.

The problem here is that, because Terraform reports success, we've got to either block in read, persist the empty resource through the next terraform apply, or make a synthetic diff so we can run (and block on) update (ex: make the operation field Optional instead of Computed, and use the diff from some value -> "" to enter update).
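
A rough sketch of that shape, assuming SDK v2-style resource functions; the "operation" field name and the startCreateLRO/waitForOperation/resourceID helpers are placeholders rather than the provider's real code:

```go
package example

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Placeholder helpers so the sketch compiles; the real provider has its own
// request and operation-wait utilities.
func startCreateLRO(ctx context.Context, d *schema.ResourceData, meta interface{}) (string, error) {
	return "operation-12345", nil
}
func waitForOperation(ctx context.Context, op string) error { return ctx.Err() }
func resourceID(d *schema.ResourceData) string              { return "example-id" }

func resourceExample() *schema.Resource {
	return &schema.Resource{
		CreateContext: resourceExampleCreate,
		ReadContext:   resourceExampleRead,
		UpdateContext: resourceExampleUpdate,
		DeleteContext: resourceExampleDelete,
		Schema: map[string]*schema.Schema{
			// Records the LRO when a create was interrupted. Keeping it
			// Optional (rather than Computed) means a later plan can diff it
			// back to "" and force a pass through Update, where we can block.
			"operation": {
				Type:     schema.TypeString,
				Optional: true,
			},
		},
	}
}

func resourceExampleCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	op, err := startCreateLRO(ctx, d, meta)
	if err != nil {
		return diag.FromErr(err)
	}
	d.SetId(resourceID(d))
	if err := waitForOperation(ctx, op); err != nil {
		// Interrupted (e.g. SIGTERM): remember the operation and report
		// success so the resource isn't dropped from state.
		d.Set("operation", op)
		return nil
	}
	return resourceExampleRead(ctx, d, meta)
}

func resourceExampleUpdate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Finish a previously interrupted create before doing anything else.
	if op := d.Get("operation").(string); op != "" {
		if err := waitForOperation(ctx, op); err != nil {
			return diag.FromErr(err)
		}
		d.Set("operation", "")
	}
	return resourceExampleRead(ctx, d, meta)
}

func resourceExampleRead(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	return nil
}

func resourceExampleDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	d.SetId("")
	return nil
}
```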

Also note that reporting success on a creation implies that all values are set to the values present in the user's config. Terraform gives us some leeway right now, but future SDK versions will require that we do so. All unknown values will also be reported as known. We'll effectively be reporting false results, which can cause interesting interpolation problems if the resource hasn't finished creation by the next apply.

We've seen this with GKE- it can take as long as an hour after sending the initial request for the LRO to succeed, and any subsequent runs in that window will have broken interpolation (e.g. when using the helm or kubernetes providers).

Worth investigating: Could we supplement GoogleCloudPlatform/magic-modules#2857 with a synthetic diff and have creation of child resources work correctly? It's likely not the case, though.

Option 2: Store the operation id and report failure (tainting the resource)

This is largely the same as Option 1, but mitigates the interpolation issues by tainting the resource first. Because we will have reported an error, Terraform will recognise that unknown values are unknown and should order operations differently. terraform refresh and terraform plan can update the state, but if blocking is needed to finish creating the resource, it can be done in the delete method, where we can safely block.
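
Sketching the difference from Option 1, reusing the placeholder helpers, imports, and "operation" field from the sketch above (again an assumption, not real provider code):

```go
func resourceExampleCreateTainted(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	op, err := startCreateLRO(ctx, d, meta)
	if err != nil {
		return diag.FromErr(err)
	}
	d.SetId(resourceID(d))
	if err := waitForOperation(ctx, op); err != nil {
		d.Set("operation", op)
		// Returning an error after SetId leaves the resource in state but
		// tainted, so Terraform keeps treating its attributes as unknown
		// rather than assuming the config values were applied.
		return diag.Errorf("create interrupted; operation %s still running: %s", op, err)
	}
	return resourceExampleRead(ctx, d, meta)
}

func resourceExampleDeleteTainted(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Replacement of a tainted resource starts with delete, which is a safe
	// place to block until the original create operation settles.
	if op := d.Get("operation").(string); op != "" {
		if err := waitForOperation(ctx, op); err != nil {
			return diag.FromErr(err)
		}
	}
	// ... issue the delete request and wait for it ...
	d.SetId("")
	return nil
}
```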

Option 3: Use GCP's built-in idempotency tokens

GCP has built-in idempotency support that we don't use: the requestId field, as documented at https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert.
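
For illustration, a hedged sketch of what using requestId might look like with the generated Compute Go client; the project, zone, and instance details are placeholders, and where the token would actually be generated and stored is exactly the open question below.

```go
// Illustrative only: how and where the provider would persist the request ID
// (state, plan-time, per-attempt) is not decided here.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/uuid"
	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Generate the idempotency token once, before the first attempt, and reuse
	// it on retries; the API should then return the original operation instead
	// of a 409 if the first insert actually went through.
	requestID := uuid.New().String()

	instance := &compute.Instance{Name: "example-instance" /* other required fields elided */}
	op, err := svc.Instances.Insert("my-project", "us-central1-a", instance).
		RequestId(requestID).
		Context(ctx).
		Do()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("operation:", op.Name)
}
```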

I haven't used this personally, and some open questions are:

  • Where do we create the id? Do we use a persistent one stored in state? At what level- provider or resource?
  • What do the messages returned from the API look like? What happens if I use the same id:
    • Between different resource kinds
    • Between different resources of the same kind
    • Between the same resource
    • Between the same resource but with different values

References

@rileykarson
Collaborator Author

My thoughts here: I don't think I like Option 1, given the interpolation problems and the fact that it misreports success at creating the resource (even though creation often does eventually succeed). It also means we can't properly perform follow-up actions on the resource, such as deleting the default node pool in GKE. Option 2 seems safer, but needs deeper investigation. Option 3 is the one I've considered the least, and definitely needs to be unraveled.

@btleedev

Customer Impact: low (see b/195658189)
Follow up work: b/202743046

@rileykarson
Collaborator Author

(removing the question label- I think having it stops this from showing up in team triage, and I'd like to bring this up there)

@codeangler

I'm debating whether to create a separate issue.

Our pain with the google provider is partial state in pipelines when we hit a timeout waiting on an API response. The result is rerunning the pipeline, which must destroy the previously deployed resources, compounding the delay in creating the new environment.

I wonder if the google provider is implementing partial mode:

Partial mode is a mode that can be enabled by a callback that tells Terraform that it is possible for partial state to occur. When this mode is enabled, the provider must explicitly tell Terraform what is safe to persist and what is not. (docs)
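
For reference, the pattern that (SDK v1) guide describes looks roughly like the sketch below; updateAddress/updateTags are hypothetical calls, not real provider code, and as the next reply notes this mechanism turned out not to work reliably and is effectively deprecated.

```go
// Sketch of the partial-mode pattern from the old SDK v1 guide.
package example

import (
	"github.com/hashicorp/terraform-plugin-sdk/helper/schema"
)

// Hypothetical per-field API calls.
func updateAddress(d *schema.ResourceData, meta interface{}) error { return nil }
func updateTags(d *schema.ResourceData, meta interface{}) error    { return nil }

func resourceExampleUpdate(d *schema.ResourceData, meta interface{}) error {
	// Enable partial state mode: only keys explicitly marked with SetPartial
	// are meant to be persisted if we return an error part-way through.
	d.Partial(true)

	if d.HasChange("address") {
		if err := updateAddress(d, meta); err != nil {
			return err // "address" was not marked, so it should not persist
		}
		d.SetPartial("address")
	}

	if d.HasChange("tags") {
		if err := updateTags(d, meta); err != nil {
			return err // "address" is kept, "tags" is not
		}
		d.SetPartial("tags")
	}

	// Disable partial mode so the full state is saved on success.
	d.Partial(false)
	return nil
}
```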

@rileykarson
Collaborator Author

I wonder if the google provider is implementing partial mode

We used to use partial mode, and probably still do in some places, but have since stopped relying on it. The SDK lead at HashiCorp discovered that it effectively didn't work, unfortunately! There was a period of time and a set of assumptions under which it did, but that's no longer generally true. That's reflected by partial mode not appearing in the successor to that guide, https://developer.hashicorp.com/terraform/tutorials/providers/provider-update, and by the SDK function being informally deprecated.

Our pain with google provider is partial state in pipelines when we have a timeout waiting on API response.

That's a different issue than this one covers- please do!
