
Destroy: based on openshift/hive/contrib: aws_tag_deprovision #324

Merged
merged 5 commits into openshift:master from core
Sep 25, 2018

Conversation

abhinavdahiya
Contributor

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 25, 2018
@@ -19,6 +20,6 @@ go_binary(
embed = [":go_default_library"],
# Use pure to build a pure-go binary.
# This has the nice side effect of making the binary statically linked.
pure = "on",
pure = "off",
Member

Do you know why we need this? It may break @steveej (#243, #300).

Contributor Author

@abhinavdahiya abhinavdahiya Sep 25, 2018

github.com/libvirt/libvirt-go requires cgo; the libvirt destroy is turned off because bazel cannot build it yet.

I can drop it for now.

Contributor

I can drop it for now.

I'd favor that :-)
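
For context, one alternative (purely illustrative, not what this PR does) would be to keep the cgo-only dependency behind a build tag instead of commenting the libvirt destroyer out, so the default pure-Go/Bazel build never sees it. A minimal sketch, assuming the libvirt package from this PR:

// +build libvirt

// Package libvirt would only be compiled when the "libvirt" build tag is
// set, keeping the cgo-backed github.com/libvirt/libvirt-go dependency out
// of the default build. Illustrative sketch only.
package libvirt

import (
	libvirt "github.com/libvirt/libvirt-go"
)

// connect is a trivial helper that exercises the cgo-backed client.
func connect(uri string) (*libvirt.Connect, error) {
	return libvirt.NewConnect(uri)
}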

"os"
"path/filepath"

atd "github.com/openshift/hive/contrib/pkg/aws_tag_deprovision"
Contributor

Apologies, openshift/hive#13 is about to move this to a package name without the underscores. I will try to remember to send you a PR as soon as we get it merged.

if err != nil {
log.Fatalf("failed to generate asset: %v", err)
log.Fatalf("failed to create destroyer: %v", err)
Contributor

Minor note but Fatalf will os.Exit(1) for you I think.
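
For reference, a minimal sketch of the point being made: whether this is the standard library log or logrus, Fatalf logs the message and then calls os.Exit(1) itself, so any explicit exit after it would be unreachable. The newDestroyer helper below is hypothetical.

package main

import (
	log "github.com/sirupsen/logrus"
)

// newDestroyer is a hypothetical stand-in for the constructor under review.
func newDestroyer() (interface{}, error) { return nil, nil }

func main() {
	d, err := newDestroyer()
	if err != nil {
		// Fatalf logs at fatal level and then calls os.Exit(1) on its own;
		// no separate os.Exit call is needed (or reachable) after this line.
		log.Fatalf("failed to create destroyer: %v", err)
	}
	_ = d
}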

func NewAWSDestroyer(level log.Level, metadata *types.ClusterMetadata) *atd.ClusterUninstaller {
return &atd.ClusterUninstaller{
Filters: metadata.ClusterPlatformMetadata.AWS.Identifier,
Region: metadata.ClusterPlatformMetadata.AWS.Region,
Contributor

What do you think about having optional CLI flags to bypass loading the metadata file, if not necessary? This would help us in Hive, but it's not urgent.

Contributor Author

@dgoodwin there are plans to drop clusterID from InstallConfig;
currently the clusterID is used inside the cluster as its identity => that will move to the ClusterVersionOperator,
and the clusterID is also used to tag the AWS resources => that will become a local identifier generated by the installer, different from the one used by the CVO.
cc @crawford

So metadata.json will become the only source from which you'll be able to extract the identifier the installer used to tag AWS resources.

Contributor

We were still hoping to use openshift-install to both provision and deprovision clusters in Hive (we should talk about moving the uninstall code to your repo at some point), but we're running in pods, so there is no persistent disk state and thus no metadata.json. If there were a CLI flag or env var to just specify the cluster ID, we could use openshift-install without requiring any on-disk state, and then we're all using the same uninstall path.

It's also a nice override for a user who lost their install dir and metadata, but can see the cluster UUID tag on their AWS objects.

NOTE: we do plan to extract the cluster UUID after provisioning and store it in the status on the API object, so we would have it available when deprovisioning. Technically we could use that to generate a metadata.json and inject it into the pod; it's just more of a pain with Kube than being able to specify command params or env vars.
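
A minimal sketch of the kind of override being discussed, using the standard flag package; the flag names and the idea of building the uninstaller filters directly from them are hypothetical and not part of this PR:

package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical flags that would let a caller skip metadata.json entirely.
	clusterID := flag.String("aws-cluster-id", "", "cluster identifier tag used on AWS resources")
	region := flag.String("aws-region", "", "AWS region the cluster was installed into")
	flag.Parse()

	if *clusterID != "" && *region != "" {
		// Build the uninstaller filters in-process instead of reading metadata.json.
		fmt.Printf("would deprovision cluster %s in %s without on-disk state\n", *clusterID, *region)
		return
	}
	fmt.Println("falling back to metadata.json in the install dir")
}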

Contributor

Discussed with @crawford on slack, he suggests just pushing the entire metadata file up as a configmap and mounting back in for deprovision, and initially this sounds like it should be ok. We'll go with this for now.

// NewAWSDestroyer returns aws Uninstaller from ClusterMetadata.
func NewAWSDestroyer(level log.Level, metadata *types.ClusterMetadata) *atd.ClusterUninstaller {
return &atd.ClusterUninstaller{
Filters: metadata.ClusterPlatformMetadata.AWS.Identifier,
Contributor

Out of curiosity what exactly ends up in this metadata file for the identifier tags?


@abhinavdahiya abhinavdahiya force-pushed the core branch 2 times, most recently from 44ea90b to 932081a Compare September 25, 2018 18:22
@wking
Member

wking commented Sep 25, 2018

/lint

Looks like pkg/destroy is missing a // Package destroy ... comment; I'm not sure why golint (or go-vet?) isn't complaining...

Contributor

@openshift-ci-robot openshift-ci-robot left a comment

@wking: 5 warnings.

In response to this:

/lint

Looks like pkg/destroy is missing a // Package destroy ... comment; I'm not sure why golint (or go-vet?) isn't complaining...

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -0,0 +1,26 @@
package types
Contributor

Golint comments: should have a package comment, unless it's in another file for this package. More info.

@@ -0,0 +1,80 @@
package metadata
Contributor

Golint comments: should have a package comment, unless it's in another file for this package. More info.

@@ -0,0 +1,31 @@
package metadata
Contributor

Golint comments: should have a package comment, unless it's in another file for this package. More info.

@@ -0,0 +1,78 @@
package destroy
Contributor

Golint comments: should have a package comment, unless it's in another file for this package. More info.

@@ -0,0 +1,241 @@
package libvirt
Contributor

Golint comments: should have a package comment, unless it's in another file for this package. More info.

@abhinavdahiya
Contributor Author

@wking is a package-level comment required? The pkg/destroy name seems enough.

}

// NewDestroyer returns Destroyer based on `metadata.json` in `rootDir`.
func NewDestroyer(level log.Level, rootDir string) (Destroyer, error) {
Member

@wking wking Sep 25, 2018

I'd rather this took a *ClusterMetadata or some such, with the file loading and deserialization happening externally (or in a NewDestroyerFromFile helper, or some such). That would allow folks to create destroyers without hitting the disk (e.g. if they pulled metadata out of a cluster asset).
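
A rough sketch of the split being suggested, assuming the Destroyer interface and types.ClusterMetadata introduced in this PR; the import path and the stubbed NewDestroyer body are placeholders:

package destroy

import (
	"encoding/json"
	"io/ioutil"
	"path/filepath"

	log "github.com/sirupsen/logrus"

	"github.com/openshift/installer/pkg/types" // assumed import path
)

// NewDestroyer would keep the existing platform dispatch but accept the
// already-deserialized metadata (body stubbed for this sketch).
func NewDestroyer(level log.Level, metadata *types.ClusterMetadata) (Destroyer, error) {
	_ = level
	_ = metadata
	return nil, nil
}

// NewDestroyerFromFile is the thin wrapper: it only loads and deserializes
// metadata.json from rootDir before delegating to NewDestroyer.
func NewDestroyerFromFile(level log.Level, rootDir string) (Destroyer, error) {
	raw, err := ioutil.ReadFile(filepath.Join(rootDir, "metadata.json"))
	if err != nil {
		return nil, err
	}
	var metadata types.ClusterMetadata
	if err := json.Unmarshal(raw, &metadata); err != nil {
		return nil, err
	}
	return NewDestroyer(level, &metadata)
}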


@wking
Member

wking commented Sep 25, 2018

@wking is a package-level comment required? The pkg/destroy name seems enough.

From here, "Every package should have a package comment...". Doesn't have to be more than a line. Maybe:

// Package destroy contains tools for destroying clusters based on their metadata.
package destroy

// returns true, when the name should be handled.
type filterFunc func(name string) bool

// ClusterNamePrefixFilter returns true for name
Member

nit: "name" -> "names".

Contributor Author

I don't get "names"? It is clustername.

Member

I don't get "names"? It is clustername.

I'm fine with "clustername" (matching the argument), "clusterName" (potentially more idiomatic Go vs. clustername), or "names" (generic English).

Contributor Author

keeping it as is.

Member

"for name that are" doesn't read smoothly to me. "name" is singular; "are" is for plurals. "...for names that are prefixed with clustername..." reads smoothly to me. But whatever, it's just a comment ;).

Two additional review threads on pkg/destroy/libvirt/libvirt_prefix_deprovision.go were marked outdated and resolved.
// wait for them to finish
for i := 0; i < len(deleteFuncs); i++ {
select {
case res := <-returnChannel:
Member

Can we take a context.Context in this function? This sort of select looks like something that should also be watching for ctx.Done().

Contributor Author

This is waiting for all runners to report done/complete; a single context will not be useful.

Member

... single context will not be useful...

Single context lets you cancel cleanly.
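
A minimal sketch of the suggestion, reusing the returnChannel/deleteFuncs names from the snippet above (their exact types are assumptions); the context plumbing is the reviewer's proposal, not the code as merged:

package libvirt

import (
	"context"

	log "github.com/sirupsen/logrus"
)

// waitForRunners blocks until every delete goroutine has reported on
// returnChannel, or until the context is cancelled or times out.
func waitForRunners(ctx context.Context, returnChannel <-chan string, deleteFuncs map[string]func()) error {
	for i := 0; i < len(deleteFuncs); i++ {
		select {
		case res := <-returnChannel:
			log.Debugf("goroutine %s complete", res)
		case <-ctx.Done():
			// Cancellation lets the caller bail out cleanly instead of
			// blocking forever on a stuck runner.
			return ctx.Err()
		}
	}
	return nil
}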

})

if err != nil {
logger.Fatalf("Unrecoverable error/timed out: %v", err)
Member

A more graceful way to handle this would be to wait on the runners with an error channel. Send nil when a runner completes successfully, and the error if a runner fails unrecoverably. The waiter would collect responses until it received the correct number (because you'd no longer have access to deleteFuncName), and error out if any of the responses were non-nil.

Contributor Author

An error from one of the runners means it cannot continue, and all runners must be running to make progress.

Member

An error from one of the runners means it cannot continue, and all runners must be running to make progress.

Then if the caller receives an error it can Cancel() the Context, wait out the canceled goroutines, and exit cleanly. But I'm fine rolling with what you have and revisiting this later if you don't want to pick it up now.
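
For illustration, a sketch of the error-channel variant being described; the runner signature and names are invented for the example:

package libvirt

// runAndWait starts each runner in its own goroutine and collects exactly
// one result per runner: nil on success, an error on unrecoverable failure.
// The first error (if any) is returned once every runner has reported.
func runAndWait(runners []func() error) error {
	errCh := make(chan error, len(runners))
	for _, run := range runners {
		go func(run func() error) {
			errCh <- run()
		}(run)
	}
	var firstErr error
	for range runners {
		if err := <-errCh; err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}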

domains, err := conn.ListAllDomains(0)
if err != nil {
logger.Errorf("Error listing domains: %v", err)
return false, nil
Member

At some point this should count as an unrecoverable error. Maybe with a context.Context that we can time out?

Contributor Author

@abhinavdahiya abhinavdahiya Sep 25, 2018

The function is called using https://godoc.org/k8s.io/apimachinery/pkg/util/wait#Backoff; transient errors resolve themselves, others time out for now.

Member

The function is called using https://godoc.org/k8s.io/apimachinery/pkg/util/wait#Backoff; transient errors resolve themselves, others time out for now.

Ah, so there's no way to pass through a Context now besides putting it into a struct and making these functions methods. Where does the timeout get set?
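
For reference, one common way to drive such a condition function with wait.Backoff is wait.ExponentialBackoff; the wrapper choice and the specific values below are illustrative, not necessarily what this PR uses:

package libvirt

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// retryStep keeps calling step until it reports done, returns an error, or
// the backoff runs out of Steps (which yields wait.ErrWaitTimeout).
func retryStep(step func() (done bool, err error)) error {
	backoff := wait.Backoff{
		Duration: time.Second, // initial delay
		Factor:   1.3,         // grow the delay by this factor each attempt
		Steps:    10,          // give up after this many attempts
	}
	// Returning (false, nil) from step means "not done yet, retry";
	// returning an error aborts immediately.
	return wait.ExponentialBackoff(backoff, step)
}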

Another review thread on pkg/destroy/libvirt/libvirt_prefix_deprovision.go was marked outdated and resolved.
}
network, err := conn.LookupNetworkByName(nName)
if err != nil {
logger.Errorf("Error getting networl %s: %v", nName, err)
Member

nit: networl -> network

Contributor Author

done

@@ -79,7 +81,8 @@ func (c *Cluster) Generate(parents map[asset.Asset]*asset.State) (*asset.State,

stateFile, err := terraform.Apply(tmpDir, terraform.InfraStep, templateDir)
if err != nil {
return nil, err
// we should try to fetch the terraform state file.
Member

For cleanup? For a later retry? I'm fine just leaving the dead cluster and using the new tag/prefix cleanup you're adding to prune it.

Contributor Author

Terraform always writes the tfstate file, even if it failed midway. It can be used for debugging, to fix errors and use terraform to continue from where it left off.

Member

It can be used for debugging...

This sounds useful.

... use terraform to continue from where it left...

This sounds like something we don't want to support. Hopefully install/destroy are both quick enough that the caller can just start clean.
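
A sketch of the "fetch the state file even when Apply fails" flow being discussed, assuming terraform.Apply returns the state-file path and an error as in the snippet above; the helper name, the import paths, the terraform.tfstate filename, and the logging are illustrative assumptions:

package cluster

import (
	"io/ioutil"
	"path/filepath"

	log "github.com/sirupsen/logrus"

	"github.com/openshift/installer/pkg/terraform" // assumed import path
)

// applyAndRecoverState runs terraform apply and, even on failure, tries to
// read whatever state Terraform left behind so it can be kept for debugging.
func applyAndRecoverState(tmpDir, templateDir string) (string, []byte, error) {
	stateFile, applyErr := terraform.Apply(tmpDir, terraform.InfraStep, templateDir)
	if applyErr == nil {
		return stateFile, nil, nil
	}
	// Terraform usually writes terraform.tfstate even when apply dies midway;
	// recovering it lets the user inspect or resume the partial cluster.
	data, readErr := ioutil.ReadFile(filepath.Join(tmpDir, "terraform.tfstate"))
	if readErr != nil {
		log.Debugf("could not recover terraform state: %v", readErr)
		return stateFile, nil, applyErr
	}
	return stateFile, data, applyErr
}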

The libvirt destroyer implementation is working but has been commented out because
`bazel build tarball` cannot build with `github.com/libvirt/libvirt-go` as one of the
dependencies, due to some missing cgo dependencies. `go build cmd/openshift-install` works
perfectly fine.
Revisit adding the libvirt destroyer when bazel is dropped.
@crawford
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,crawford]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit d8844be into openshift:master Sep 25, 2018
@abhinavdahiya abhinavdahiya deleted the core branch September 25, 2018 23:03
wking added a commit to wking/openshift-installer that referenced this pull request Sep 27, 2018
We haven't set it in the cluster struct yet; that's a few lines below.
Fixes [1]:

  # openshift-install cluster
  ? Email Address [email protected]
  ? Password [? for help] ********
  ? SSH Public Key <none>
  ? Base Domain REDACTED
  ? Cluster Name dgoodwin1
  ? Pull Secret REDACTED
  ? Platform aws
  ? Region us-east-1 (N. Virginia)
  FATA[0046] failed to generate asset: failed to determine default AMI: MissingRegion: could not find region configuration

and a copy/paste bug from c6f02ba (pkg/cluster/config: add missing
ami fetch, 2018-09-25, openshift#324).

[1]: openshift#335 (comment)
wking added a commit to wking/openshift-installer that referenced this pull request Nov 12, 2018
This decouples our platforms a bit and makes it easier to distinguish
between platform-specific and platform-agnostic code.  It also gives
us much more compact struct names, since now we don't need to
distinguish between many flavors of machine pool, etc. in a single
package.

I've also updated pkg/types/doc.go; pkg/types includes more than
user-specified configuration since 78c3118 (pkg: add ClusterMetadata
asset,type that can be used for destroy, 2018-09-25, openshift#324).

I've also added OWNERS files for some OpenStack-specific directories
that were missing them before.

There's still more work to go in this direction (e.g. pushing default
logic into subdirs), but this seems like a reasonable chunk.
wking added a commit to wking/openshift-installer that referenced this pull request Nov 13, 2018
And I've rerolled deletion to use a single call to each deleter,
failing fast if they error.  That should address cases where we cannot
destroy a shut-off domain [1]:

  $ virsh -c $OPENSHIFT_INSTALL_LIBVIRT_URI list --all
   Id    Name                           State
  ----------------------------------------------------
   -     master0                        shut off
   -     test1-worker-0-zd7hd           shut off

  $ bin/openshift-install destroy cluster --dir test --log-level debug
  DEBUG Deleting libvirt volumes
  DEBUG Deleting libvirt domains
  DEBUG Deleting libvirt network
  DEBUG Exiting deleting libvirt network
  DEBUG goroutine deleteNetwork complete
  ERROR Error destroying domain test1-worker-0-zd7hd: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')
  DEBUG Exiting deleting libvirt domains
  DEBUG Exiting deleting libvirt volumes
  DEBUG goroutine deleteVolumes complete
  DEBUG Deleting libvirt domains
  ERROR Error destroying domain test1-worker-0-zd7hd: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')
  [...]

Now we'll fail-fast in those cases, allowing the caller to clear the
stuck domains, after which they can restart deletion.

The previous goroutine approach was borrowed from the AWS destroyer.
But AWS has a large, complicated resource dependency graph which
includes cycles.  Libvirt is much simpler, with volumes and a network
that are all independent, followed by domains which depend on the
network and some of the volumes.  With this commit we now take a
single pass at destroying those resources starting at the leaf domains
and working our way rootwards.

I've retained some looping (although no longer in a separate
goroutine) for domain deletion.  This guards against racing domain
creation, as discussed in the new godocs for deleteDomains.

Also:

* Rename from libvirt_prefix_deprovision.go to libvirt.go.  The name
  is from 998ba30 (cmd,pkg/destroy: add non-terraform destroy,
  2018-09-25, openshift#324), but the implementation doesn't need to be
  represented in the filename.  This commit renames to libvirt.go to
  match the package name, since this file is the guts of this package.

* Simplify the AlwaysTrueFilter implementation.  No semantic changes,
  but this saves us a few lines of code.

* Add trailing periods for godocs to comply with [2].

[1]: openshift#656 (comment)
[2]: https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences
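
A sketch of the single-pass, fail-fast ordering that commit message describes: leaf domains first, then the network and volumes they depend on. The deleteDomains/deleteNetwork/deleteVolumes names mirror the log messages quoted above; the receiver type and wiring are illustrative.

package libvirt

// clusterUninstaller is a minimal stand-in for the real uninstaller type.
type clusterUninstaller struct {
	deleteDomains, deleteNetwork, deleteVolumes func() error
}

// Run calls each deleter once, in dependency order, and the first error
// aborts the run so the caller can clear the stuck resource and retry.
func (o *clusterUninstaller) Run() error {
	for _, del := range []func() error{
		o.deleteDomains, // leaf resources: domains depend on the network and some volumes
		o.deleteNetwork,
		o.deleteVolumes,
	} {
		if err := del(); err != nil {
			return err
		}
	}
	return nil
}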
wking added a commit to wking/openshift-installer that referenced this pull request Feb 4, 2019
Avoid the confusing:

  ERROR Failed to read tfstate: open : no such file or directory

and backing syscall when terraform.Apply returns an empty string.

We're not exiting immediately on Apply errors (and are instead
attempting to recover the Terraform state regardless of Apply errors)
since 78c3118 (pkg: add ClusterMetadata asset,type that can be used
for destroy, 2018-09-25, openshift#324).  The goal there is to retain the
current resource state in case the user wants to investigate the
resources associated with the partially-launched cluster and
potentially invoke Terraform themselves to finish the install instead
of deleting and re-creating their cluster.

Apply can return an empty-string state file since 3f4fe57
(cmd/openshift-install: Add 'destroy bootstrap' command, 2018-10-18),
when an early failure leaves the temporary directory without a
terraform state file.  And we don't care about recovering the state
file in those cases, because we haven't created any external resources
at that point (before the apply call).

Also flatten some unnecessary nesting in the err2 handling by using an
'else if' clause.
wking added a commit to wking/openshift-installer that referenced this pull request Apr 1, 2019
We've been using the tagged private zone to look up the public zone
since the non-Terraform destroy code landed in a8fc89b (vendor: add
aws deprovision, openshift#324).  When there's an existing cluster with a given
domain, that can lead to false-positive removals like [1]:

1. Cluster 1 comes up like usual.
2. Cluster 2 creates a private zone.
3. Cluster 2 dies when its public record conflicts with cluster 1 (new
   since d1c17b7, terraform/exec/plugins/vendor: Bump
   terraform-provider-aws to v2.2.0, 2019-03-19, openshift#1442).
4. 'destroy cluster' on cluster 2's metadata.json removes cluster 2
   resources (good) and cluster 1's public record (bad).

With the explicit dependency in this commit, we ensure that we only
ever create the private zone after we have successfully claimed
ownership of the public record.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1659970#c7
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
7 participants