Set resource id before polling operation and re-create failed deployments #59

alexmunda · 2021-02-04T18:52:49Z

Set the ID of resources that need to poll for an operation before polling the operation. This prevents the case where a resource is created in the DB but the async resource operation fails, leaving Terraform with no state entry. In that case Terraform thinks the resource does not exist (even though it does, in a failed state), so the user can't run terraform destroy to clean up the failed resource.
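To illustrate the ordering, here is a minimal, self-contained sketch. It is not the provider's actual code: `resourceData` is a hypothetical stand-in for the SDK's resource data type, and `waitForOperation`, `createCluster`, and the example paths are names assumed for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// resourceData is a hypothetical stand-in for the SDK's resource data type;
// only the ID handling is modelled here.
type resourceData struct{ id string }

func (d *resourceData) SetId(id string) { d.id = id }
func (d *resourceData) Id() string      { return d.id }

// waitForOperation is a hypothetical poller. It always fails here, to
// simulate an async operation that errors after the API accepted the request.
func waitForOperation(operationID string) error {
	return errors.New("operation " + operationID + " failed")
}

// createCluster records the resource ID before polling, so the ID is kept
// even when the operation fails and the function returns an error.
func createCluster(d *resourceData, resourcePath, operationID string) error {
	d.SetId(resourcePath)
	if err := waitForOperation(operationID); err != nil {
		return fmt.Errorf("unable to create Consul cluster: %w", err)
	}
	return nil
}

func main() {
	d := &resourceData{}
	err := createCluster(d, "/project/example/hashicorp.consul.cluster/example", "op-1")
	fmt.Println("error:", err)
	fmt.Println("id kept:", d.Id())
}
```

Because the ID is set before the wait, the failed resource still has an ID when the create call returns its error, which is what lets Terraform track it for a later destroy.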

This PR also sets the Terraform ID of resources in a failed state to "" during read. This allows HCP resources (e.g. a Consul cluster) to be automatically replaced if they failed to provision in the first place. Without this logic, the user would need to look at the HCP UI to determine that their cluster/HVN/peering is in a FAILED state, then delete and recreate it manually using Terraform (or delete it in the UI and remove it from state manually).
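A rough sketch of the read-side check, under the same caveats: `resourceData`, `cluster`, `stateFailed`, and `readCluster` are hypothetical names, and the real API presumably exposes the state as an enum rather than a plain string.

```go
package main

import (
	"fmt"
	"log"
)

// resourceData is a hypothetical stand-in for the SDK's resource data type.
type resourceData struct{ id string }

func (d *resourceData) SetId(id string) { d.id = id }
func (d *resourceData) Id() string      { return d.id }

// stateFailed is an assumed string form of the FAILED state.
const stateFailed = "FAILED"

// cluster is a hypothetical, trimmed-down API response.
type cluster struct {
	ID    string
	State string
}

// readCluster clears the ID when the remote resource is FAILED, so the next
// plan proposes re-creating it. It logs and returns nil rather than an error.
func readCluster(d *resourceData, c cluster) error {
	if c.State == stateFailed {
		log.Printf("[WARN] cluster %q is in a FAILED state, removing it from state", c.ID)
		d.SetId("")
		return nil
	}
	// A real read would copy the remaining attributes into state here.
	return nil
}

func main() {
	d := &resourceData{id: "/project/example/hashicorp.consul.cluster/example"}
	_ = readCluster(d, cluster{ID: "example", State: stateFailed})
	fmt.Printf("id after read: %q\n", d.Id())
}
```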

Failed provision:

❯ terraform apply --auto-approve
hcp_hvn.example_hvn: Refreshing state... [id=/project/51545062-b7a8-4066-8e32-ca283cb147fc/hashicorp.network.hvn/hcp-tf-example-hvn]
hcp_consul_cluster.example_consul_cluster: Creating...
hcp_consul_cluster.example_consul_cluster: Still creating... [10s elapsed]
hcp_consul_cluster.example_consul_cluster: Still creating... [20s elapsed]

Error: unable to create Consul cluster (hcp-tf-example-consul-cluster): create Consul cluster operation (7920f945-933d-4bcc-bdfa-b151e6e50e8b) failed [code=3, message=failed to deploy consul cluster: failed to generate Consul config file: failed to create consul config generator: invalid configuration options: 1 error occurred:
	* rpc error: code = InvalidArgument desc = datacenter cannot be "HCP-TF-EXAMPLE-CONSUL-CLUSTER". Please use only [a-z0-9-_].

]

On next terraform plan:

❯ tf plan
hcp_hvn.example_hvn: Refreshing state... [id=/project/51545062-b7a8-4066-8e32-ca283cb147fc/hashicorp.network.hvn/hcp-tf-example-hvn]
hcp_consul_cluster.example_consul_cluster: Refreshing state... [id=/project/51545062-b7a8-4066-8e32-ca283cb147fc/hashicorp.consul.cluster/hcp-tf-example-consul-cluster]

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # hcp_consul_cluster.example_consul_cluster must be replaced
-/+ resource "hcp_consul_cluster" "example_consul_cluster" {
      + cloud_provider                = (known after apply)
      + cluster_id                    = "hcp-tf-example-consul-cluster"
      + connect_enabled               = true
      + consul_automatic_upgrades     = (known after apply)
      + consul_ca_file                = (known after apply)
      + consul_config_file            = (known after apply)
      + consul_private_endpoint_url   = (known after apply)
      + consul_public_endpoint_url    = (known after apply)
      + consul_root_token_accessor_id = (known after apply)
      + consul_root_token_secret_id   = (sensitive value)
      + consul_snapshot_interval      = (known after apply)
      + consul_snapshot_retention     = (known after apply)
      + consul_version                = (known after apply)
      + datacenter                    = (known after apply)
      + hvn_id                        = "hcp-tf-example-hvn"
      + id                            = (known after apply)
      + organization_id               = (known after apply)
      + project_id                    = (known after apply)
      + public_endpoint               = false
      + region                        = (known after apply)
      + scale                         = (known after apply)
      + tier                          = "development"
    }

Plan: 1 to add, 0 to change, 1 to destroy.

I really wish we could return a warning diag instead of just logging and returning nil, but 🤷

roaks3

Definitely makes sense, but also wondering how things will behave on the next run (assuming there was a failure), since we wouldn't want Terraform to assume that the resource is ready to be used. Perhaps this is a common pattern though? Or maybe this is a place where a state field would help us?

alexmunda · 2021-02-04T19:17:18Z

@roaks3 Yes exactly! I was thinking about how to make the subsequent run obvious that there was a failure and we came to the same conclusion 😄. I am adding a state check on read right now to remove failed resources.

alexmunda · 2021-02-05T00:54:28Z

If hashicorp/terraform-plugin-sdk#657 (comment) gets resolved in the future, we should return a warning diag.

alexmunda · 2021-02-05T02:09:21Z

If we add a state field on these resources, it would show a diff when it shows the diff on the re-create. Thoughts?

xargs-P · 2021-02-05T20:39:39Z

> If we add a state field on these resources, it would show a diff when it shows the diff on the re-create. Thoughts?

I like this idea. That way users are aware there is an issue, and why (on the next run) TF wants to replace it. Do we know how AWS EC2 instances are treated in a similar situation? 🤔

alexmunda · 2021-02-05T21:09:55Z

@xargs-P Looks like AWS removes the resource based on the terminated state https://github.com/hashicorp/terraform-provider-aws/blob/9c9de45b8e36fc5cffc7f59b3f882a8ec22bee1d/aws/resource_aws_instance.go#L769-L773

xargs-P · 2021-02-05T21:30:52Z

ah ok, so they expose the instance state to TF. 👌

alexmunda · 2021-02-05T21:33:37Z

> ah ok, so they expose the instance state to TF. 👌

@xargs-P It doesn't appear to be on the resource schema (which matches what we have now), but they do check it on the response, so this PR implements similar behavior.

alexmunda · 2021-02-05T21:41:47Z

> If we add a state field on these resources, it would show a diff when it shows the diff on the re-create. Thoughts?

Turns out this isn't true. Since state is Computed: true, it will always show (known after apply)

bcmdarroch

Code looks good!
I tried to test this out locally but unfortunately ran into some issues running my local provider. 😩 Hoping to investigate next week!

xargs-P · 2021-02-06T01:00:44Z

> > If we add a state field on these resources, it would show a diff when it shows the diff on the re-create. Thoughts?
>
> Turns out this isn't true. Since state is Computed: true, it will always show (known after apply)

🤔 Would we need to mirror a solution closer to AWS's?

bcmdarroch · 2021-02-08T18:35:50Z

Already approved - but now I can say I've exercised this locally too 😎

point at internal Go SDK

Set resource id before polling operation

d7f52d4

alexmunda requested review from roaks3, bcmdarroch and aclaygray February 4, 2021 18:52

roaks3 approved these changes Feb 4, 2021


Recreate cluster/hvn/snapshot/peering if they are in a FAILED state

dd2e722

alexmunda changed the title ~~Set resource id before polling operation (in case of operation failure)~~ Set resource id before polling operation and re-create failed deployments Feb 5, 2021

bcmdarroch approved these changes Feb 6, 2021


roaks3 mentioned this pull request Feb 9, 2021

HCPE-830 - Add TGW attachment resource #58

Merged

alexmunda merged commit b04ab5e into main Feb 9, 2021

alexmunda deleted the set-id-before-wait branch February 9, 2021 17:18

jjti mentioned this pull request Jun 13, 2022

Store Consul cluster/snapshot state to fix failed cluster behavior #326

Merged

3 tasks

aidan-mundy pushed a commit that referenced this pull request Sep 8, 2023

Merge pull request #59 from hashicorp/replace-internal-go-sdk

d4f3298

point at internal Go SDK
