
Destroy Action #649

Closed · Nalum opened this issue Dec 11, 2014 · 46 comments

Comments

@Nalum

Nalum commented Dec 11, 2014

Is there a way to have a resource run a command before it is destroyed?

Example: I have an aws_instance that, when provisioned, was added to a Chef server's node and client lists; when this resource is destroyed, I want to remove it from those lists.

@armon
Member

armon commented Dec 11, 2014

Not currently, but we want to extend the provisioners to support running them at different points in the lifecycle, e.g. post-create, pre-destroy, etc.

@Nalum
Author

Nalum commented Dec 11, 2014

To contribute code, do I just fork the repo, write the code, make sure it doesn't break anything and does what I expect, and then create a pull request, or are there other things to be done?

@sethvargo
Contributor

@Nalum that and add some tests 😄. You can read more about it in the Contributing.md file at the root of the repo. Feel free to ping us if you have any questions!

@armon
Member

armon commented Dec 11, 2014

Something like this is a pretty major change, both code wise and in terms of UX and core interaction. So I would definitely start with a design doc to open up the discussion. I'd hate for you to do a bunch of work and have us disagree on it.

/cc: @mitchellh

@knuckolls
Contributor

Just a +1 to all of this. Was thinking yesterday that perhaps the cleanest way to implement the logical_mutate resource that was referenced in #580 would just be to run a provisioner later in the dependency graph based upon post-create, post-update, or post-destroy events.

Along these lines, I'd also love to be able to run a provisioner based on the fact that multiple resources fired created, updated, or deleted events. I haven't dug down into the internals far enough to suggest how this might work but it's certainly something I'm hoping Terraform will support.

@Nalum
Author

Nalum commented Dec 11, 2014

Is there an example or a previous design doc I can look at, so I can follow your existing process?

@Nalum
Author

Nalum commented Dec 15, 2014

I've not really done any design docs before, so I'm sure I'm missing quite a bit; let me know. Here it is: https://docs.google.com/document/d/1fg5cxNDqWMDFioDXIpNbcSYiaOD7EGrnfhspfWvdkVg/edit

@armon
Member

armon commented Dec 16, 2014

@Nalum I think we need to address a few major concerns with the lifecycle. One is partial failures. With the current post-create provisioner, if it fails, we mark the resource as tainted, and re-create it on the next run. With pre-create I assume we could simply abort the apply of that instance. I think the delete ones are more complex. What does it mean if a pre-destroy fails? Do we taint the resource? Do we destroy anyways? I think post-destroy is the most confusing, as the resource is already gone. It's not clear at first thought what is best to do.

The other is remote provisioning. Those clearly only make sense for post-create and pre-destroy. I guess it is probably sensible to have a validation pass that is dependent on lifecycle so that declaring a "remote-exec" in the other life cycles is aborted.

@mitchellh
Contributor

Isn't there another issue that asks for this exact same thing? /cc @armon

@Nalum
Author

Nalum commented Dec 16, 2014

@mitchellh If this is better discussed elsewhere, I'll go there if I can.

@armon I had thought of implementing it in the following way:

provisioner "local-exec" {
    command = "..."
    pre-create = true
}

Instead of a single lifecycle attribute, each pre or post option would be its own attribute; this would allow you to have pre-create on local-exec and not on remote-exec.

Would you only allow for one value, so if you have pre-create you cannot have post-create, pre-destroy or post-destroy?

Maybe post-destroy isn't a valid option as the resource is gone at this point. The only reason I put it as an option is if you had some generated data locally that you wanted to remove with the resource, but I don't know how common or rare that would be.

For me, if pre-destroy were to fail I would want the resource to be tainted, but doing so with local-exec could be problematic, since command accepts a string value and you can run multiple commands in that string. So, if this were implemented, would it be best to have command accept a list, so that you can track which command failed and the next run can continue from that point?
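
For illustration, here is a rough sketch combining the two ideas above (the per-lifecycle attribute and a list-valued command). Neither pre-destroy nor the list form of command exists in Terraform today; this only reflects the proposal being discussed, and the knife commands are placeholders:

resource "aws_instance" "web" {
    # ... instance arguments ...

    provisioner "local-exec" {
        # Hypothetical attribute from the proposal; not implemented yet.
        pre-destroy = true

        # Hypothetical list form, so a failed step can be recorded and the
        # next run can resume from that point.
        command = [
            "knife node delete -y ${self.tags.Name}",
            "knife client delete -y ${self.tags.Name}",
        ]
    }
}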

Also should I open that document up so anyone can edit it or are you happy with it being in suggestion mode? I can add anyone to it who should be able to edit it.

@mitchellh
Contributor

There is definitely an issue tracking this somewhere, but this looks about along the lines of what I was thinking. I think our point of discussion revolved around "what does tainting mean" in this scenario.

@armon
Member

armon commented Dec 17, 2014

@Nalum I would just open the doc up for edits as well.

I think you bring up an interesting point: can you run a provisioner at multiple lifecycle points? I can't think of a case where you would want to, but maybe there is one...

I think the partial-failure behaviors, at least for pre-create (abort), post-create (taint), and pre-destroy (taint), make sense. Maybe by avoiding post-destroy we can avoid that complication.

@Nalum
Author

Nalum commented Dec 17, 2014

I've updated the doc and opened it up for anyone to edit.

@Nalum
Author

Nalum commented Dec 23, 2014

@armon @mitchellh Are you happy for me to go at this as defined in the doc?

It will probably take me some time as I'm only getting into Go now.

@knuckolls
Contributor

@Nalum FWIW, I just read through your Google doc and I like the format defined there.

@bitglue

bitglue commented Jan 29, 2015

I would think a much cleaner way to address this problem would be to create a Terraform resource that represents the entry in the Chef client list. Then the Chef client entry gets created and destroyed through the normal Terraform lifecycle.
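
As a rough sketch of that shape (the chef_node / chef_client resource types and all values here are made up for illustration, not an existing provider):

resource "aws_instance" "app" {
    ami           = "ami-123456"    # placeholder
    instance_type = "m3.medium"
}

# Hypothetical resources representing the node and client entries on the
# Chef server. Because they reference aws_instance.app, Terraform creates
# them after the instance and destroys them before it, without any
# provisioner hooks.
resource "chef_node" "app" {
    name = "${aws_instance.app.id}"
}

resource "chef_client" "app" {
    name = "${aws_instance.app.id}"
}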

Provisioners are a hack at best, and this highlights the reason why. If you think of your infrastructure as a state machine, then Terraform is based on modelling the desired state, and it then selects the right CRUD operations from the providers to traverse the edges to that state. Provisioners don't model desired state, they model edges. Consequently, Terraform doesn't know anything about how that provisioner altered state, and you have a bad time.

If you want a tool that models edges and not state, that tool already exists. It's called Ansible. Or Chef. Or Puppet. There's no reason you can't do both. I don't see any reason to dilute Terraform's model to support a use case that's already well-addressed by other tools.

@knuckolls
Contributor

I agree with you about the possibility of using custom Terraform resources for things like the "clean up the Chef client on instance destruction" case. However, I do respectfully disagree with you about the need for a before-destroy provisioner specifically. I have a blog post largely written up about this idea as part of a grander scheme of other things we've been working on, but here's the part that's pertinent to this issue specifically.

Assume you have a cluster of stateful machines. An elasticsearch cluster for example. Anything that is clustered by default with a >1 replication factor fits this model quite nicely actually. Now, assume that we want to cut a new server image with packer and roll the cluster so that all the machines are on the new image. There are a variety of good reasons to do this. Right now Terraform does not have a safe way to drain old AMIs one by one. It can either delete the whole cluster and re-create it, or stand up the new nodes and then delete the old nodes. I believe we require stronger guarantees about how our stateful instances get destroyed. For example, we should never delete all the nodes that contain a given block worth of data until we have confirmed that the data has been replicated to the new set of instances. Create before destroy doesn't give us that assurance. It merely waits for the new instance to be up before terminating the old instance.

Terraform is very close to providing the correct primitives for such graceful draining of stateful instances. It requires two things. First, the parallelism semaphore deep within terraform/context.go should be exposed up to the command-line UI as a "--parallelism" flag, so you can enforce that a specific apply is done one node at a time. Right now parallelism is hardcoded to a default of 10 items at a time if it isn't overridden, and the current implementation of what sets up the ContextOpts doesn't override it as far as I can tell. Second, stateful instances controlled by Terraform need to gracefully decommission themselves and then spin until their blocks have been replicated to the other nodes before the next destroy starts. This decommission-and-blocking dance could be done within before-destroy provisioners on the instance resource. Either I or someone on our team will be PRing some of this work as we get to it, and I'd like to come to some sort of agreement on how it could be done that the community and HashiCorp are comfortable with.
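
A rough sketch of the drain-and-block idea, written against the hypothetical before-destroy provisioner discussed in this thread (the pre-destroy attribute does not exist, and the Elasticsearch calls and AMI are only illustrative):

resource "aws_instance" "es" {
    count         = 3
    ami           = "ami-123456"    # placeholder
    instance_type = "m3.large"

    provisioner "remote-exec" {
        # Hypothetical lifecycle attribute from the proposal in this issue.
        pre-destroy = true

        # First ask the cluster to relocate shards away from this node,
        # then block until no shards remain here, so Terraform does not
        # proceed with the destroy until the data has moved.
        inline = [
            "curl -s -XPUT localhost:9200/_cluster/settings -d '{\"transient\":{\"cluster.routing.allocation.exclude._ip\":\"${self.private_ip}\"}}'",
            "while curl -s localhost:9200/_cat/shards | grep -q ${self.private_ip}; do sleep 10; done",
        ]
    }
}

Combined with the "--parallelism" flag proposed above (so only one node is replaced at a time), this would give the ordering guarantee described here.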

Furthermore, tainting a stateful node may not be a strong enough assurance when it comes to the automated destruction of stateful instances. I believe a new "decommission-on-delete" flag may need to be created, so that we can provide the worst-case assurance of flagging specific instances as ones we want humans to follow up on and destroy, just in case. Decommissioning a resource would merely take it out of the statefile, or move it into a decommissioned section and whine about how the user needs to clean these up on every terraform plan. This is because we don't want Terraform to accidentally cause a "we need to recover from cold backups" scenario because too many nodes were deleted too quickly. Decommission-on-delete would put the onus on the operator to be quite certain that the state was replicated off that node before initiating its destruction.

I'll eventually publish my thoughts about this and related things much more extensively, in a more general sort of way. It did seem prudent to defend my stance that adding an on-destroy provisioner is not a dilution of Terraform's model and is instead an integral piece of the graceful rolling of stateful instances. As always, I'd be willing to make time to discuss these sorts of things on a hangout, or move this discussion to its own dedicated issue if needed.

Upon re-reading this I realize this diatribe likely falls under the umbrella of "but someone on the internet is wrong". My apologies if it comes across as gruff in any way. This sort of thing has been a primary concern of mine for a while now. While this may not be the final solution for graceful rolling of stateful nodes with Terraform, I believe it's one possible way that we just haven't gotten around to yet. :)

@bitglue

bitglue commented Jan 29, 2015

Second, stateful instances controlled by Terraform need to gracefully decommission themselves and then spin until their blocks have been replicated to the other nodes before the next destroy starts. This decommission-and-blocking dance could be done within before-destroy provisioners on the instance resource.

But it can also be done by making a resource that represents whatever it is you did when you provisioned the thing. This "provisioning" resource, which might represent some bit of configuration in a cluster (like a Chef client list, or whatever), gets created after the EC2 instance is created; on destroy, the provisioning resource is destroyed before the instance is destroyed.
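
A minimal sketch of that ordering, using a made-up resource type purely for illustration:

resource "aws_instance" "db" {
    ami           = "ami-123456"    # placeholder
    instance_type = "m3.large"
}

# Made-up resource type standing in for "whatever you did when you
# provisioned the thing" (a Chef client entry, a cluster membership, ...).
# Because it depends on aws_instance.db, it is created after the instance
# and destroyed before it; its provider's Delete function is where the
# deregistration or decommission logic would live.
resource "mycluster_membership" "db" {
    instance_id = "${aws_instance.db.id}"
}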

Furthermore, tainting a stateful node may not be strong enough of an assurance when it comes to the automated destruction of stateful instances.

You are absolutely right. "Tainting" is just a half-baked, ambiguously defined state. Since it's also an unknown state, it's impossible to do anything automatically from this state without first resetting to a known state (for example by destroying the resource) or without making restrictions on how provisioners can work (for example requiring that they are idempotent).

I believe a new "decommission-on-delete" flag may need to be created

But you already have that: it's the Delete function of a resource provider. If your provisioning is done with resources, then you don't need the half-baked "tainted" state: you have the real deal, actual state, defined exactly as you need it by your provider implementation.

What you are describing with a "deprovisioning" mechanism is CRUD without the R or U, plus an incomplete description of the known state (tainting). It's just a half-baked implementation of what Terraform already does, but one that can't recover from failures of any of the operations unless you make all operations idempotent; and if you do that, what you've done is implement Ansible.

@dalehamel

Not currently, but we want to extend the provisioners to support running them at different points in the lifecycle, e.g. post-create, pre-destroy, etc.

I agree this is a decent approach.

This is definitely related to #386

One thing that @thegedge and I discussed is that provisioners almost seem like they should just be their own resource, rather than a nested attribute of some types of resources.

@thegedge
Contributor

thegedge commented Jul 7, 2015

To explain why I see them as a resource: they add data to the chef server (environment, run list, etc) that follows CRUD. To me this doesn't seem terribly different from, for example, an aws_ecs_task_definition or consul_keys resource.

It really is a key component of our infrastructure at the moment, which fits with the definition of a resource. The actual bootstrapping run could be part of the provisioning block, but the client/node setup in the Chef server is better modeled as a resource. Would it make sense to have a chef-server provider?

@samdunne
Contributor

Has there been any further movement on this? It's becoming a serious issue when coupled with #2720

@davidblewett

+1 on this feature (mainly for ensuring Chef stays current). In the meantime, does someone have some kind of script to clean out expired data from Chef?

@icebourg
Contributor

icebourg commented Nov 9, 2015

I've taken to using a local provisioner to delete the Chef node and client. This is a horrible hack but it at least lets me create new instances without having to manually delete the old ones. Unfortunately, it only runs when an instance is provisioned, so if you de-provision an instance, you'll still have stale instances in Chef's database. But at least I can recreate instances without worrying what's in Chef.

  # Remove the Chef node since a terraform destroy won't. Return true no matter what.
  provisioner "local-exec" {
    command = "knife node delete -y ${var.instance_name}; knife client delete -y ${var.instance_name}; true"
  }

@calvn

calvn commented Dec 9, 2015

👍 for this feature/enhancement

@Tails

Tails commented Dec 28, 2015

+1.

This would come in handy so Terraform could have DigitalOcean make a snapshot of a server image instead of deleting it and building from scratch.

@eidge

eidge commented Jan 14, 2016

+1.

Our use case is unregistering app services from consul so that they are taken out of the load balancer before killing the server.

@davehodgson

+1. My use case is that some of the servers are joined to an Active Directory domain, and it would be very handy to have them remove themselves from the domain pre-destroy. Without it, if I change the server in a way that results in a destroy and recreate, the bootstrap step that joins the domain fails because the account name already exists.

@dhoer

dhoer commented Feb 23, 2016

+1 See #3605

@avdhoot

avdhoot commented Mar 2, 2016

+1 desperately need this 😀

@jseaidou

Has there been any activity on this?

@tjlee0909

Is this issue still blocked on a philosophical debate regarding whether this feature is appropriate? Or is it a matter of developing a refined product spec for how the feature should behave, and having the time to implement it? (I'm not taking sides or passing judgment, I'm just honestly asking).

@dmourati

@DigitalRoadi3s

+1

@tjlee0909

tjlee0909 commented May 24, 2016

Hey guys, we (Box) are about to start design work on implementing a wrapper around "terraform destroy" for internal use. Are there any updates on this feature request?

edit: whoops, just saw apparentlymart's proposal

@apparentlymart
Contributor

Don't let my complex proposal discourage a simple solution here... I was trying to collect together a number of use-cases, but my main target was the cases where we use null_resource to trigger actions on updates, rather than destroy actions.

There are a few different ways to attack destroy actions in particular so it would be nice to have a few different approaches to think about.

There are some great ideas in this thread that take a different approach to what I suggested. Trying out some of these ideas with real examples is probably the best way to converge on a solution.

@eyalzek

eyalzek commented Jun 9, 2016

+1

@PaulusTM

I mainly have two use cases where this would be really helpful.

  1. Windows Server 2012 R2 instances that are domain members. Unjoining them before destroy with a PowerShell command is quite easy, but I need a trigger somehow (see the sketch after this list).
  2. As in "provisioner/chef: unregister the client from chef server when replacing an instance" (#3605), which mentions that a Chef client needs to unregister itself.
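
For the first case, a rough sketch of what the requested hook might look like (the pre-destroy attribute is still hypothetical, the AMI is a placeholder, and the credential handling for the unjoin is omitted):

resource "aws_instance" "win" {
    ami           = "ami-123456"    # placeholder Windows AMI
    instance_type = "m4.large"

    provisioner "remote-exec" {
        # Hypothetical pre-destroy hook requested in this issue.
        pre-destroy = true

        # Unjoin from the AD domain before the instance is terminated;
        # real usage would also pass -UnjoinDomainCredential.
        inline = [
            "powershell.exe -Command \"Remove-Computer -Force\"",
        ]
    }
}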

@dmourati

How about /sbin/halt.local or equivalent?

https://www.redhat.com/archives/rhl-list/2007-August/msg00314.html

@ashald
Contributor

ashald commented Oct 15, 2016

We have similar requirements: we need to manually de-register Consul servers when we destroy the ASG containing them.

@saurabh-hirani

+1 - this is a must-have, even if it's just rudimentary script execution hooks at various points in the infra lifecycle.

@saurabh-hirani

saurabh-hirani commented Oct 25, 2016

I am relatively new to Terraform, but building on @icebourg's example and #2831, one could do the following:

terraform apply -var="deregister=1" -target=null_resource.deregister_my_resource

with

variable "deregister" {
  description = "Deregister from chef - 1 or 0"
  default = 0
}

and

resource "null_resource" "deregister_my_resource" {
  count = "${var.deregister}"
  provisioner "local-exec" {
    command = "./bin/deregister-chef.sh my_resource; true"
  }
}

with ./bin/deregister-chef.sh containing

#!/bin/bash
knife node delete -y $1
knife client delete -y $1

That way, this can be invoked after a destroy.

@5tefan

5tefan commented Nov 3, 2016

Another use case: when removing a Kubernetes worker, it would be nice to move pods off the node and remove it before the resource is destroyed:

kubectl drain <nodename>
kubectl delete node <nodename>

Thanks @DocBradfordSoftware hermanjunge/kubernetes-digitalocean-terraform#21
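
Wired into the hypothetical destroy-time provisioner requested in this issue, that might look roughly like the following (the droplet arguments are placeholders):

resource "digitalocean_droplet" "k8s_worker" {
    image  = "coreos-stable"    # placeholder
    name   = "k8s-worker-1"
    region = "nyc3"
    size   = "2gb"

    provisioner "local-exec" {
        # Hypothetical pre-destroy hook: drain pods off the node and
        # deregister it before the droplet is destroyed.
        pre-destroy = true
        command     = "kubectl drain ${self.name} --ignore-daemonsets && kubectl delete node ${self.name}"
    }
}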

@amclain

amclain commented Nov 3, 2016

Another use case: when removing a Kubernetes worker

I had the same issue except with Nomad. When Terraform would recreate the instance(s), the containers wouldn't migrate to the live instances until nomad node-drain was run, which I ended up having to do manually.

@ajbouh

ajbouh commented Nov 3, 2016

I also need this for Nomad and Kubernetes.


@mitchellh
Contributor

I know there is a lot of discussion here, but this is identical to #386, and there is an equally large amount of discussion there, so I'm going to close this and centralize there.

@ghost

ghost commented Apr 20, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 20, 2020