Investigate methods of handling failure / cancellation mid-apply #10065

Open
rileykarson opened this issue Sep 13, 2021 · 5 comments

Comments

@rileykarson
Collaborator

rileykarson commented Sep 13, 2021

Affected Resource(s)

  • google_*

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment. If the issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If the issue is assigned to a user, that user is claiming responsibility for the issue. If the issue is assigned to "hashibot", a community member has claimed the issue already.

Description

We've been working with users who run Terraform in a somewhat unusual environment, where SIGTERMs are regularly sent to the provider. Today this is handled by killing the provider run immediately, which means resources that are mid-provisioning get dropped and cause 409s on subsequent runs. I feel this is the expected "naive" behaviour- the provider was asked to die by Terraform Core and did, and users can manually reconcile their state using terraform import.
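
To make the failure mode concrete, here is a minimal, hypothetical sketch of an operation wait loop (not the provider's actual code; pollOperation and the polling interval are placeholders): when Terraform cancels the context in response to SIGTERM, the wait returns early and, unless create persisted something first, the half-created resource never reaches state.

```go
// Hypothetical sketch only: pollOperation and the 10s interval are placeholders,
// not the provider's real operation-wait helpers.
package main

import (
	"context"
	"fmt"
	"time"
)

// pollOperation stands in for a single GET on the LRO; here it never completes.
func pollOperation(ctx context.Context, name string) (done bool, err error) {
	return false, nil
}

// waitForOperation blocks until the LRO finishes or the context is cancelled.
// When Terraform Core sends SIGTERM, ctx is cancelled, this returns early, and
// the half-created resource is dropped unless create persisted something first.
func waitForOperation(ctx context.Context, name string) error {
	for {
		done, err := pollOperation(ctx, name)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("operation %s still running: %w", name, ctx.Err())
		case <-time.After(10 * time.Second):
		}
	}
}

func main() {
	// Simulate a cancellation arriving one second into the wait.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(waitForOperation(ctx, "operation-12345"))
}
```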

This works less well in automated systems, though, and is exacerbated by systems that regularly time out or otherwise kill Terraform. In the absence of features to automatically pick these resources back up, like hashicorp/terraform#19017 or import directives in config, there are a few viable solutions to investigate:

Option 1: Store the operation id and report success

We've trialed this for a few resources- GKE Cluster, GKE Node Pool, and IGM- with GoogleCloudPlatform/magic-modules#2857 as an example. Note that it's highly undesirable to wait for the operation in read, as terraform plan and terraform refresh are expected to return quickly.

The problem here is that, because Terraform reports success, we've got to either block in read, persist the empty resource through the next terraform apply, or make a synthetic diff so we can run (and block on) update (ex: make the operation field Optional instead of Computed, and use the diff from some value -> "" to enter update).
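
A rough sketch of that shape, assuming SDK v2-style resource functions; the "operation" field name and the startCreateLRO/waitForOperation/resourceID helpers are placeholders rather than the provider's real code:

```go
package example

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Placeholder helpers so the sketch compiles; the real provider has its own
// request and operation-wait utilities.
func startCreateLRO(ctx context.Context, d *schema.ResourceData, meta interface{}) (string, error) {
	return "operation-12345", nil
}
func waitForOperation(ctx context.Context, op string) error { return ctx.Err() }
func resourceID(d *schema.ResourceData) string              { return "example-id" }

func resourceExample() *schema.Resource {
	return &schema.Resource{
		CreateContext: resourceExampleCreate,
		ReadContext:   resourceExampleRead,
		UpdateContext: resourceExampleUpdate,
		DeleteContext: resourceExampleDelete,
		Schema: map[string]*schema.Schema{
			// Records the LRO when a create was interrupted. Keeping it
			// Optional (rather than Computed) means a later plan can diff it
			// back to "" and force a pass through Update, where we can block.
			"operation": {
				Type:     schema.TypeString,
				Optional: true,
			},
		},
	}
}

func resourceExampleCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	op, err := startCreateLRO(ctx, d, meta)
	if err != nil {
		return diag.FromErr(err)
	}
	d.SetId(resourceID(d))
	if err := waitForOperation(ctx, op); err != nil {
		// Interrupted (e.g. SIGTERM): remember the operation and report
		// success so the resource isn't dropped from state.
		d.Set("operation", op)
		return nil
	}
	return resourceExampleRead(ctx, d, meta)
}

func resourceExampleUpdate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Finish a previously interrupted create before doing anything else.
	if op := d.Get("operation").(string); op != "" {
		if err := waitForOperation(ctx, op); err != nil {
			return diag.FromErr(err)
		}
		d.Set("operation", "")
	}
	return resourceExampleRead(ctx, d, meta)
}

func resourceExampleRead(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	return nil
}

func resourceExampleDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	d.SetId("")
	return nil
}
```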

Also note that reporting success on a creation implies that all values are set to the values present in the user's config. Terraform gives us some leeway right now, but future SDK versions will require that we do so. All unknown values will also be reported as known. We'll effectively be reporting false results, which can cause interesting interpolation problems if the resource hasn't finished creation by the next apply.

We've seen this with GKE- it can take as long as an hour after sending the initial request for the LRO to succeed, and any subsequent runs in that window will have broken interpolation (e.g. when using the helm or kubernetes providers).

Worth investigating: Could we supplement GoogleCloudPlatform/magic-modules#2857 with a synthetic diff and have creation of child resources work correctly? It's likely not the case, though.

Option 2: Store the operation id and report failure (tainting the resource)

This is largely the same as Option 1, but mitigates the interpolation issues by tainting the resource first. Because we will have reported an error, Terraform will recognise that unknown values are unknown and should order operations differently. terraform refresh and terraform plan can update the state, but if blocking is needed to finish creating the resource, it can be done in the delete method, where we can safely block.
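
Sketching the difference from Option 1, reusing the placeholder helpers, imports, and "operation" field from the sketch above (again an assumption, not real provider code):

```go
func resourceExampleCreateTainted(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	op, err := startCreateLRO(ctx, d, meta)
	if err != nil {
		return diag.FromErr(err)
	}
	d.SetId(resourceID(d))
	if err := waitForOperation(ctx, op); err != nil {
		d.Set("operation", op)
		// Returning an error after SetId leaves the resource in state but
		// tainted, so Terraform keeps treating its attributes as unknown
		// rather than assuming the config values were applied.
		return diag.Errorf("create interrupted; operation %s still running: %s", op, err)
	}
	return resourceExampleRead(ctx, d, meta)
}

func resourceExampleDeleteTainted(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Replacement of a tainted resource starts with delete, which is a safe
	// place to block until the original create operation settles.
	if op := d.Get("operation").(string); op != "" {
		if err := waitForOperation(ctx, op); err != nil {
			return diag.FromErr(err)
		}
	}
	// ... issue the delete request and wait for it ...
	d.SetId("")
	return nil
}
```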

Option 3: Use GCP's built-in idempotency tokens

GCP has built-in idempotency support that we don't use: the requestId field, as documented at https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert.
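
For illustration, a hedged sketch of what using requestId might look like with the generated Compute Go client; the project, zone, and instance details are placeholders, and where the token would actually be generated and stored is exactly the open question below.

```go
// Illustrative only: how and where the provider would persist the request ID
// (state, plan-time, per-attempt) is not decided here.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/uuid"
	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Generate the idempotency token once, before the first attempt, and reuse
	// it on retries; the API should then return the original operation instead
	// of a 409 if the first insert actually went through.
	requestID := uuid.New().String()

	instance := &compute.Instance{Name: "example-instance" /* other required fields elided */}
	op, err := svc.Instances.Insert("my-project", "us-central1-a", instance).
		RequestId(requestID).
		Context(ctx).
		Do()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("operation:", op.Name)
}
```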

I haven't used this personally, and some open questions are:

  • Where do we create the id? Do we use a persistent one stored in state? At what level- provider or resource?
  • What do the messages returned from the API look like? What happens if I use the same id:
    • Between different resource kinds
    • Between different resources of the same kind
    • Between the same resource
    • Between the same resource but with different values

References

@rileykarson
Collaborator Author

My thoughts here: I don't think I like Option 1, given the interpolation problems and the fact that it misreports success at creating the resource (even though creation often does eventually succeed). It also means we can't properly perform follow-up actions on the resource, such as deleting the default node pool in GKE. Option 2 seems safer, but needs deeper investigation. Option 3 is the one I've considered the least, and definitely needs to be unraveled.

@btleedev

Customer Impact: low (see b/195658189)
Follow up work: b/202743046

@rileykarson
Collaborator Author

(removing the question label- I think having it stops this from showing up in team triage, and I'd like to bring this up there)

@codeangler

I'm debating whether to create a separate issue.

Our pain with the google provider is partial state in pipelines when we hit a timeout waiting on an API response. The result is rerunning the pipeline, which must destroy the previously deployed resources, compounding the delay in creating the new environment.

I wonder if the google provider is implementing partial mode:

Partial mode is a mode that can be enabled by a callback that tells Terraform that it is possible for partial state to occur. When this mode is enabled, the provider must explicitly tell Terraform what is safe to persist and what is not. (docs)
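
For reference, the pattern that (SDK v1) guide describes looks roughly like the sketch below; updateAddress/updateTags are hypothetical calls, not real provider code, and as the next reply notes this mechanism turned out not to work reliably and is effectively deprecated.

```go
// Sketch of the partial-mode pattern from the old SDK v1 guide.
package example

import (
	"github.com/hashicorp/terraform-plugin-sdk/helper/schema"
)

// Hypothetical per-field API calls.
func updateAddress(d *schema.ResourceData, meta interface{}) error { return nil }
func updateTags(d *schema.ResourceData, meta interface{}) error    { return nil }

func resourceExampleUpdate(d *schema.ResourceData, meta interface{}) error {
	// Enable partial state mode: only keys explicitly marked with SetPartial
	// are meant to be persisted if we return an error part-way through.
	d.Partial(true)

	if d.HasChange("address") {
		if err := updateAddress(d, meta); err != nil {
			return err // "address" was not marked, so it should not persist
		}
		d.SetPartial("address")
	}

	if d.HasChange("tags") {
		if err := updateTags(d, meta); err != nil {
			return err // "address" is kept, "tags" is not
		}
		d.SetPartial("tags")
	}

	// Disable partial mode so the full state is saved on success.
	d.Partial(false)
	return nil
}
```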

@rileykarson
Collaborator Author

I wonder if the google provider is implementing partial mode

We used to use partial mode, and probably still do in some places, but have since stopped relying on it. The SDK lead at HashiCorp discovered that it effectively didn't work, unfortunately! There was a period of time and a set of assumptions under which it did, but that's no longer generally true. That's reflected by partial mode not appearing in the successor to that guide, https://developer.hashicorp.com/terraform/tutorials/providers/provider-update, and by the SDK function being informally deprecated.

Our pain with google provider is partial state in pipelines when we have a timeout waiting on API response.

That's a different issue than this one covers- please do!
