Destroy: based on openshift/hive/contrib: aws_tag_deprovision
#324
Conversation
cmd/openshift-install/BUILD.bazel
Outdated
@@ -19,6 +20,6 @@ go_binary(
    embed = [":go_default_library"],
    # Use pure to build a pure-go binary.
    # This has the nice side effect of making the binary statically linked.
-   pure = "on",
+   pure = "off",
`github.com/libvirt/libvirt-go` requires cgo; the libvirt destroy is turned off because bazel cannot build it yet. I can drop it for now.
> I can drop it for now.

I'd favor that :-)
"os" | ||
"path/filepath" | ||
|
||
atd "github.com/openshift/hive/contrib/pkg/aws_tag_deprovision" |
Apologies, openshift/hive#13 is about to move this to a package name without the underscores. I will try to remember to send you a PR as soon as we get it merged.
    if err != nil {
-       log.Fatalf("failed to generate asset: %v", err)
+       log.Fatalf("failed to create destroyer: %v", err)
Minor note but Fatalf will os.Exit(1) for you I think.
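A tiny sketch of the point, assuming the `github.com/sirupsen/logrus` logger that the installer imports as `log`; the `doWork` helper is hypothetical:

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

// doWork is a hypothetical stand-in for building the destroyer.
func doWork() error { return nil }

func main() {
	if err := doWork(); err != nil {
		// logrus's Fatalf logs the message and then calls os.Exit(1),
		// so no explicit os.Exit or return is needed after it.
		log.Fatalf("failed to create destroyer: %v", err)
	}
	log.Info("destroyer created") // only reached when doWork succeeds
}
```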
pkg/destroy/destroyer.go
Outdated
func NewAWSDestroyer(level log.Level, metadata *types.ClusterMetadata) *atd.ClusterUninstaller {
    return &atd.ClusterUninstaller{
        Filters: metadata.ClusterPlatformMetadata.AWS.Identifier,
        Region:  metadata.ClusterPlatformMetadata.AWS.Region,
What do you think about having optional CLI flags to bypass loading the metadata file, if not necessary? This would help us in Hive, but it's not urgent.
@dgoodwin there are plans to drop `clusterID`; it will be dropped from `InstallConfig`. Currently the clusterID is used inside the cluster as its identity => that will be moved to the ClusterVersionOperator. The clusterID is also used to tag the AWS resources => this will become a local identifier generated by the installer, different from the one used by the CVO.

cc @crawford

So `metadata.json` will become the only source from which you'll be able to extract the identifier used by the installer to tag AWS resources.
We were still hoping to use openshift-install to both provision and deprovision clusters in Hive (we should talk about moving the uninstall code to your repo at some point), but we're running in pods: there is no persistent disk state and thus no metadata.json. If there were a CLI flag or env var to just specify the cluster ID, we could use openshift-install without requiring any on-disk state, and then we're all using the same uninstall path.
It's also a nice override for a user who lost their install dir and metadata, but can see the cluster UUID tag on their AWS objects.
NOTE: we do plan to extract the cluster UUID after provisioning and store it in the status on the API object, so we would have it available when deprovisioning. Technically we could use that to generate a metadata.json and inject it into the pod; it's just more of a pain with Kube than being able to specify command params or env vars.
Discussed with @crawford on slack, he suggests just pushing the entire metadata file up as a configmap and mounting back in for deprovision, and initially this sounds like it should be ok. We'll go with this for now.
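For readers following the metadata discussion above, here is a rough sketch of the shape implied by the accesses in this diff (`metadata.ClusterPlatformMetadata.AWS.Identifier` and `.Region`). The field names, JSON tags, and the `map[string]string` type for `Identifier` are assumptions for illustration, not the authoritative definitions from the PR:

```go
// Package types is a simplified sketch only; the real types.ClusterMetadata
// in this PR may differ in field names, tags, and types.
package types

// ClusterMetadata is what gets serialized to metadata.json (and, per the
// discussion above, could be pushed to a ConfigMap and mounted back into a
// deprovision pod).
type ClusterMetadata struct {
	ClusterName             string                  `json:"clusterName"`
	ClusterPlatformMetadata ClusterPlatformMetadata `json:"clusterPlatformMetadata"`
}

// ClusterPlatformMetadata holds the per-platform sections; only AWS is
// sketched here.
type ClusterPlatformMetadata struct {
	AWS *AWSMetadata `json:"aws,omitempty"`
}

// AWSMetadata carries the two values NewAWSDestroyer feeds into the hive
// ClusterUninstaller: the region and the identifying tags that were stamped
// on the cluster's AWS resources at install time.
type AWSMetadata struct {
	Region     string            `json:"region"`
	Identifier map[string]string `json:"identifier"`
}
```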
pkg/destroy/destroyer.go
Outdated
// NewAWSDestroyer returns aws Uninstaller from ClusterMetadata.
func NewAWSDestroyer(level log.Level, metadata *types.ClusterMetadata) *atd.ClusterUninstaller {
    return &atd.ClusterUninstaller{
        Filters: metadata.ClusterPlatformMetadata.AWS.Identifier,
Out of curiosity what exactly ends up in this metadata file for the identifier tags?
Force-pushed from 44ea90b to 932081a.
/lint

Looks like `pkg/destroy` is missing a `// Package destroy ...` comment; I'm not sure why `golint` (or `go-vet`?) isn't complaining...
@wking: 5 warnings.
In response to this:

> /lint
>
> Looks like `pkg/destroy` is missing a `// Package destroy ...` comment; I'm not sure why `golint` (or `go-vet`?) isn't complaining...
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -0,0 +1,26 @@
package types
Golint comments: should have a package comment, unless it's in another file for this package. More info.
@@ -0,0 +1,80 @@
package metadata
Golint comments: should have a package comment, unless it's in another file for this package. More info.
@@ -0,0 +1,31 @@
package metadata
Golint comments: should have a package comment, unless it's in another file for this package. More info.
@@ -0,0 +1,78 @@
package destroy
Golint comments: should have a package comment, unless it's in another file for this package. More info.
@@ -0,0 +1,241 @@
package libvirt
Golint comments: should have a package comment, unless it's in another file for this package. More info.
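For anyone unfamiliar with this golint rule: it just wants a `// Package <name> ...` comment above one of the package clauses, conventionally in a doc.go. A sketch for `pkg/destroy`; the wording is illustrative, not the comment that eventually landed:

```go
// Package destroy provides the entry points used by `openshift-install
// destroy` to tear down the resources created for a cluster.
package destroy
```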
@wking
}

// NewDestroyer returns Destroyer based on `metadata.json` in `rootDir`.
func NewDestroyer(level log.Level, rootDir string) (Destroyer, error) {
I'd rather this took a `*ClusterMetadata` or some such, with the file loading and deserialization happening externally (or in a `NewDestroyerFromFile` helper, or some such). That would allow folks to create destroyers without hitting the disk (e.g. if they pulled metadata out of a cluster asset).
https://github.com/openshift/installer/pull/324/files/932081a8a8ef7920f6b8f5deedcbeb60135269ec#diff-db418465d1b29f16c8b7e7dcd99862eaR65 already takes `*types.ClusterMetadata`.
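For readers following along, the split being discussed looks roughly like this: a constructor that works on in-memory metadata, plus a thin file-loading wrapper. `Destroyer` and `ClusterMetadata` below are simplified stand-ins for the PR's real types, and the bodies are illustrative:

```go
package destroy

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"

	log "github.com/sirupsen/logrus"
)

// Destroyer is a simplified stand-in for the PR's destroyer interface.
type Destroyer interface {
	Run() error
}

// ClusterMetadata is a simplified stand-in for types.ClusterMetadata.
type ClusterMetadata struct {
	// platform-specific fields elided
}

// NewDestroyer builds a Destroyer from metadata the caller already has in
// memory (e.g. pulled out of a cluster asset), so no disk access is needed.
func NewDestroyer(level log.Level, metadata *ClusterMetadata) (Destroyer, error) {
	// ...dispatch on the platform recorded in metadata...
	return nil, fmt.Errorf("no supported platform found in metadata")
}

// NewDestroyerFromFile loads and deserializes metadata.json from rootDir
// and delegates to NewDestroyer.
func NewDestroyerFromFile(level log.Level, rootDir string) (Destroyer, error) {
	raw, err := os.ReadFile(filepath.Join(rootDir, "metadata.json"))
	if err != nil {
		return nil, err
	}
	var metadata ClusterMetadata
	if err := json.Unmarshal(raw, &metadata); err != nil {
		return nil, err
	}
	return NewDestroyer(level, &metadata)
}
```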
// returns true, when the name should be handled.
type filterFunc func(name string) bool

// ClusterNamePrefixFilter returns true for name
nit: "name" -> "names".
I don't get `names`? It is `clustername`.
> I don't get `names`? It is `clustername`.

I'm fine with "clustername" (matching the argument), "clusterName" (potentially more idiomatic Go vs. `clustername`), or "names" (generic English).
keeping it as is.
"for name that are" doesn't read smoothly to me. "name" is singular; "are" is for plurals. "...for names that are prefixed with clustername..." reads smoothly to me. But whatever, it's just a comment ;).
// wait for them to finish
for i := 0; i < len(deleteFuncs); i++ {
    select {
    case res := <-returnChannel:
Can we take a `context.Context` in this function? This sort of `select` looks like something that should also be watching for `ctx.Done()`.
This is waiting for all runners to respond done/complete. A single context will not be useful.
> ... a single context will not be useful ...

A single context lets you cancel cleanly.
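A sketch of the pattern being suggested: the wait loop selects on both the result channel and `ctx.Done()`, so a single context can cancel the wait cleanly. The `returnChannel` name mirrors the diff; everything else is illustrative:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForRunners collects one result per delete function, but gives up
// early if the context is cancelled or times out.
func waitForRunners(ctx context.Context, returnChannel <-chan string, count int) error {
	for i := 0; i < count; i++ {
		select {
		case res := <-returnChannel:
			fmt.Printf("runner %q finished\n", res)
		case <-ctx.Done():
			// Cancellation or timeout: stop waiting and report why.
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	returnChannel := make(chan string, 2)
	go func() { returnChannel <- "deleteDomains" }()
	go func() { returnChannel <- "deleteNetwork" }()

	if err := waitForRunners(ctx, returnChannel, 2); err != nil {
		fmt.Println("wait aborted:", err)
	}
}
```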
})

if err != nil {
    logger.Fatalf("Unrecoverable error/timed out: %v", err)
A more graceful way to handle this would be to wait on the runners with an `error` channel. Send `nil` when a runner completes successfully, and the error if a runner fails unrecoverably. The waiter would collect responses until it received the correct number (because you'd no longer have access to `deleteFuncName`), and error out if any of the responses were non-`nil`.
An error from one of the runners means it cannot continue, and all runners must be running to make progress.
> An error from one of the runners means it cannot continue, and all runners must be running to make progress.

Then if the caller receives an error it can `Cancel()` the `Context`, wait out the canceled goroutines, and exit cleanly. But I'm fine rolling with what you have and revisiting this later if you don't want to pick it up now.
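The alternative described above could look roughly like this: each runner sends `nil` on success or its unrecoverable error on failure, and the waiter collects exactly `len(deleteFuncs)` responses and fails if any were non-`nil`. The delete functions here are placeholders, not the PR's code:

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	deleteFuncs := []func() error{
		func() error { return nil },                           // e.g. deleteVolumes
		func() error { return errors.New("domain is stuck") }, // e.g. deleteDomains
	}

	// Each runner reports exactly once: nil on success, its error otherwise.
	errCh := make(chan error, len(deleteFuncs))
	for _, f := range deleteFuncs {
		go func(f func() error) { errCh <- f() }(f)
	}

	// Collect one response per runner; we don't need the runners' names,
	// just the right number of answers.
	var firstErr error
	for i := 0; i < len(deleteFuncs); i++ {
		if err := <-errCh; err != nil && firstErr == nil {
			firstErr = err
		}
	}
	if firstErr != nil {
		fmt.Println("deprovision failed:", firstErr)
	}
}
```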
domains, err := conn.ListAllDomains(0)
if err != nil {
    logger.Errorf("Error listing domains: %v", err)
    return false, nil
At some point this should count as an unrecoverable error. Maybe with a `context.Context` that we can time out?
The function is called using https://godoc.org/k8s.io/apimachinery/pkg/util/wait#Backoff; transient errors resolve themselves, others time out for now.
> The function is called using https://godoc.org/k8s.io/apimachinery/pkg/util/wait#Backoff; transient errors resolve themselves, others time out for now.

Ah, so no way to pass through a `Context` now besides putting it into a struct and making these methods. Where does the timeout get set?
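For reference, the wait helper linked above drives a condition function with exactly this `(bool, error)` contract: returning `(false, nil)` means "retry", a non-nil error aborts immediately, and the overall give-up point lives in the `Backoff` values. A hedged sketch; the domain-listing body is a stand-in and the specific backoff numbers are not the PR's:

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	backoff := wait.Backoff{
		Duration: 2 * time.Second, // initial delay
		Factor:   2,               // grow the delay each retry
		Steps:    5,               // attempts before giving up
	}

	attempts := 0
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		attempts++
		// Stand-in for conn.ListAllDomains: pretend the first two calls
		// hit a transient error.
		if attempts < 3 {
			fmt.Println("transient error listing domains, will retry")
			return false, nil // retry; a non-nil error here would abort immediately
		}
		return true, nil // done
	})
	if errors.Is(err, wait.ErrWaitTimeout) {
		fmt.Println("gave up after exhausting the backoff steps")
	} else if err != nil {
		fmt.Println("unrecoverable error:", err)
	}
}
```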
}
network, err := conn.LookupNetworkByName(nName)
if err != nil {
    logger.Errorf("Error getting networl %s: %v", nName, err)
nit: `networl` -> `network`
done
@@ -79,7 +81,8 @@ func (c *Cluster) Generate(parents map[asset.Asset]*asset.State) (*asset.State,

    stateFile, err := terraform.Apply(tmpDir, terraform.InfraStep, templateDir)
    if err != nil {
-       return nil, err
+       // we should try to fetch the terraform state file.
For cleanup? For a later retry? I'm fine just leaving the dead cluster and using the new tag/prefix cleanup you're adding to prune it.
Terraform always writes the tfstate file, even if it failed midway. It can be used for debugging, to fix errors, and to let Terraform continue from where it left off.
> It can be used for debugging...

This sounds useful.

> ... use Terraform to continue from where it left off...

This sounds like something we don't want to support. Hopefully install/destroy are both quick enough that the caller can just start clean.
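A sketch of what "try to fetch the terraform state file even when Apply fails" can look like, under the assumption that the PR's terraform.Apply helper returns the path of the state file it wrote (possibly empty); `applyTerraform` below is a hypothetical stand-in, not the PR's code:

```go
package main

import (
	"fmt"
	"os"
)

// applyTerraform stands in for the PR's terraform.Apply helper: it returns
// the tfstate path it wrote (possibly empty) plus any error from the apply.
func applyTerraform(tmpDir string) (string, error) {
	return "", fmt.Errorf("apply failed midway")
}

func main() {
	stateFile, applyErr := applyTerraform(os.TempDir())
	if applyErr != nil {
		// Don't bail out immediately: Terraform usually writes
		// terraform.tfstate even when it fails midway, and that state is
		// useful for debugging the partially-created cluster.
		fmt.Fprintf(os.Stderr, "terraform apply failed: %v\n", applyErr)
	}
	if stateFile != "" { // guard against an empty path when nothing was created
		if data, err := os.ReadFile(stateFile); err != nil {
			fmt.Fprintf(os.Stderr, "failed to read tfstate: %v\n", err)
		} else {
			fmt.Printf("recovered %d bytes of terraform state\n", len(data))
		}
	}
	if applyErr != nil {
		os.Exit(1)
	}
}
```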
The libvirt destroyer impl is working but has been commented out because `bazel build tarball` cannot build with `github.com/libvirt/libvirt-go` as one of the dependencies, due to some missing cgo dependencies. `go build cmd/openshift-install` works perfectly fine. Revisit adding the libvirt destroyer when bazel is dropped.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
We haven't set it in the cluster struct yet; that's a few lines below. Fixes [1]:

    # openshift-install cluster
    ? Email Address [email protected]
    ? Password [? for help] ********
    ? SSH Public Key <none>
    ? Base Domain REDACTED
    ? Cluster Name dgoodwin1
    ? Pull Secret REDACTED
    ? Platform aws
    ? Region us-east-1 (N. Virginia)
    FATA[0046] failed to generate asset: failed to determine default AMI: MissingRegion: could not find region configuration

and a copy/paste bug from c6f02ba (pkg/cluster/config: add missing ami fetch, 2018-09-25, openshift#324).

[1]: openshift#335 (comment)
This decouples our platforms a bit and makes it easier to distinguish between platform-specific and platform-agnostic code. It also gives us much more compact struct names, since now we don't need to distinguish between many flavors of machine pool, etc. in a single package. I've also updated pkg/types/doc.go; pkg/types includes more than user-specified configuration since 78c3118 (pkg: add ClusterMetadata asset,type that can be used for destroy, 2018-09-25, openshift#324). I've also added OWNERS files for some OpenStack-specific directories that were missing them before. There's still more work to go in this direction (e.g. pushing default logic into subdirs), but this seems like a reasonable chunk.
And I've rerolled deletion to use a single call to each deleter, failing fast if they error. That should address cases where we cannot destroy a shut-off domain [1]:

    $ virsh -c $OPENSHIFT_INSTALL_LIBVIRT_URI list --all
     Id    Name                    State
    ----------------------------------------------------
     -     master0                 shut off
     -     test1-worker-0-zd7hd    shut off
    $ bin/openshift-install destroy cluster --dir test --log-level debug
    DEBUG Deleting libvirt volumes
    DEBUG Deleting libvirt domains
    DEBUG Deleting libvirt network
    DEBUG Exiting deleting libvirt network
    DEBUG goroutine deleteNetwork complete
    ERROR Error destroying domain test1-worker-0-zd7hd: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')
    DEBUG Exiting deleting libvirt domains
    DEBUG Exiting deleting libvirt volumes
    DEBUG goroutine deleteVolumes complete
    DEBUG Deleting libvirt domains
    ERROR Error destroying domain test1-worker-0-zd7hd: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')
    [...]

Now we'll fail-fast in those cases, allowing the caller to clear the stuck domains, after which they can restart deletion.

The previous goroutine approach was borrowed from the AWS destroyer. But AWS has a large, complicated resource dependency graph which includes cycles. Libvirt is much simpler, with volumes and a network that are all independent, followed by domains which depend on the network and some of the volumes. With this commit we now take a single pass at destroying those resources, starting at the leaf domains and working our way rootwards.

I've retained some looping (although no longer in a separate goroutine) for domain deletion. This guards against racing domain creation, as discussed in the new godocs for deleteDomains.

Also:

* Rename from libvirt_prefix_deprovision.go to libvirt.go. The name is from 998ba30 (cmd,pkg/destroy: add non-terraform destroy, 2018-09-25, openshift#324), but the implementation doesn't need to be represented in the filename. This commit renames to libvirt.go to match the package name, since this file is the guts of this package.
* Simplify the AlwaysTrueFilter implementation. No semantic changes, but this saves us a few lines of code.
* Add trailing periods for godocs to comply with [2].

[1]: openshift#656 (comment)
[2]: https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences
Avoid the confusing:

    ERROR Failed to read tfstate: open : no such file or directory

and the backing syscall when terraform.Apply returns an empty string.

We're not exiting immediately on Apply errors (and are instead attempting to recover the Terraform state regardless of Apply errors) since 78c3118 (pkg: add ClusterMetadata asset,type that can be used for destroy, 2018-09-25, openshift#324). The goal there is to retain the current resource state in case the user wants to investigate the resources associated with the partially-launched cluster and potentially invoke Terraform themselves to finish the install instead of deleting and re-creating their cluster.

Apply can return an empty-string state file since 3f4fe57 (cmd/openshift-install: Add 'destroy bootstrap' command, 2018-10-18), if Apply errors leave the temporary directory without a terraform state file. And we don't care about recovering the state file in those cases, because we haven't created any external resources at that point (before the apply call).

Also flatten some unnecessary nesting in the err2 handling by using an 'else if' clause.
We've been using the tagged private zone to look up the public zone since the non-Terraform destroy code landed in a8fc89b (vendor: add aws deprovision, openshift#324). When there's an existing cluster with a given domain, that can lead to false-positive removals like [1]:

1. Cluster 1 comes up like usual.
2. Cluster 2 creates a private zone.
3. Cluster 2 dies when its public record conflicts with cluster 1 (new since d1c17b7, terraform/exec/plugins/vendor: Bump terraform-provider-aws to v2.2.0, 2019-03-19, openshift#1442).
4. 'destroy cluster' on cluster 2's metadata.json removes cluster 2 resources (good) and cluster 1's public record (bad).

With the explicit dependency in this commit, we ensure that we only ever create the private zone after we have successfully claimed ownership of the public record.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1659970#c7
/cc @dgoodwin @crawford