
Destroy: based on openshift/hive/contrib: aws_tag_deprovision #324

Merged
merged 5 commits into openshift:master from core
Sep 25, 2018

Conversation

abhinavdahiya
Contributor

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 25, 2018
@@ -19,6 +20,6 @@ go_binary(
embed = [":go_default_library"],
# Use pure to build a pure-go binary.
# This has the nice side effect of making the binary statically linked.
pure = "on",
pure = "off",
Member

Do you know why we need this? It may break @steveej (#243, #300).

Contributor Author

@abhinavdahiya abhinavdahiya Sep 25, 2018

github.com/libvirt/libvirt-go requires cgo; the libvirt destroy is turned off because bazel cannot build it yet.

I can drop it for now.

Contributor

I can drop it for now.

I'd favor that :-)
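
For context, one alternative (purely illustrative, not what this PR does) would be to keep the cgo-only dependency behind a build tag instead of commenting the libvirt destroyer out, so the default pure-Go/Bazel build never sees it. A minimal sketch, assuming the libvirt package from this PR:

// +build libvirt

// Package libvirt would only be compiled when the "libvirt" build tag is
// set, keeping the cgo-backed github.com/libvirt/libvirt-go dependency out
// of the default build. Illustrative sketch only.
package libvirt

import (
	libvirt "github.com/libvirt/libvirt-go"
)

// connect is a trivial helper that exercises the cgo-backed client.
func connect(uri string) (*libvirt.Connect, error) {
	return libvirt.NewConnect(uri)
}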

"os"
"path/filepath"

atd "github.com/openshift/hive/contrib/pkg/aws_tag_deprovision"
Contributor

Apologies, openshift/hive#13 is about to move this to a package name without the underscores. I will try to remember to send you a PR as soon as we get it merged.

if err != nil {
log.Fatalf("failed to generate asset: %v", err)
log.Fatalf("failed to create destroyer: %v", err)
Contributor

Minor note but Fatalf will os.Exit(1) for you I think.
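
For reference, a minimal sketch of the point being made: whether this is the standard library log or logrus, Fatalf logs the message and then calls os.Exit(1) itself, so any explicit exit after it would be unreachable. The newDestroyer helper below is hypothetical.

package main

import (
	log "github.com/sirupsen/logrus"
)

// newDestroyer is a hypothetical stand-in for the constructor under review.
func newDestroyer() (interface{}, error) { return nil, nil }

func main() {
	d, err := newDestroyer()
	if err != nil {
		// Fatalf logs at fatal level and then calls os.Exit(1) on its own;
		// no separate os.Exit call is needed (or reachable) after this line.
		log.Fatalf("failed to create destroyer: %v", err)
	}
	_ = d
}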

func NewAWSDestroyer(level log.Level, metadata *types.ClusterMetadata) *atd.ClusterUninstaller {
return &atd.ClusterUninstaller{
Filters: metadata.ClusterPlatformMetadata.AWS.Identifier,
Region: metadata.ClusterPlatformMetadata.AWS.Region,
Contributor

What do you think about having optional CLI flags to bypass loading the metadata file, if not necessary? This would help us in Hive, but it's not urgent.

Contributor Author

@dgoodwin there are plans to drop clusterID from InstallConfig;
currently the clusterID is used inside the cluster as its identity => that will move to the ClusterVersionOperator,
and the clusterID is also used to tag the AWS resources => that will become a local identifier generated by the installer, different from the one used by the CVO.
cc @crawford

So metadata.json will become the only source from which you'll be able to extract the identifier the installer used to tag AWS resources.

Contributor

We were still hoping to use openshift-install to both provision and deprovision clusters in Hive (we should talk about moving the uninstall code to your repo at some point), but we're running in pods, so there is no persistent disk state and thus no metadata.json. If there were a CLI flag or env var to just specify the cluster ID, we could use openshift-install without requiring any on-disk state, and then we're all using the same uninstall path.

It's also a nice override for a user who lost their install dir and metadata, but can see the cluster UUID tag on their AWS objects.

NOTE: we do plan to extract the cluster UUID after provisioning and store it in the status on the API object, so we would have it available when deprovisioning. Technically we could use that to generate a metadata.json and inject it into the pod; it's just more of a pain with Kube than being able to specify command params or env vars.
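
A minimal sketch of the kind of override being discussed, using the standard flag package; the flag names and the idea of building the uninstaller filters directly from them are hypothetical and not part of this PR:

package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical flags that would let a caller skip metadata.json entirely.
	clusterID := flag.String("aws-cluster-id", "", "cluster identifier tag used on AWS resources")
	region := flag.String("aws-region", "", "AWS region the cluster was installed into")
	flag.Parse()

	if *clusterID != "" && *region != "" {
		// Build the uninstaller filters in-process instead of reading metadata.json.
		fmt.Printf("would deprovision cluster %s in %s without on-disk state\n", *clusterID, *region)
		return
	}
	fmt.Println("falling back to metadata.json in the install dir")
}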

Contributor

Discussed with @crawford on slack, he suggests just pushing the entire metadata file up as a configmap and mounting back in for deprovision, and initially this sounds like it should be ok. We'll go with this for now.

// NewAWSDestroyer returns aws Uninstaller from ClusterMetadata.
func NewAWSDestroyer(level log.Level, metadata *types.ClusterMetadata) *atd.ClusterUninstaller {
return &atd.ClusterUninstaller{
Filters: metadata.ClusterPlatformMetadata.AWS.Identifier,
Contributor

Out of curiosity what exactly ends up in this metadata file for the identifier tags?


@abhinavdahiya abhinavdahiya force-pushed the core branch 2 times, most recently from 44ea90b to 932081a Compare September 25, 2018 18:22
@wking
Member

wking commented Sep 25, 2018

/lint

Looks like pkg/destroy is missing a // Package destroy ... comment; I'm not sure why golint (or go-vet?) isn't complaining...

Contributor

@openshift-ci-robot openshift-ci-robot left a comment

@wking: 5 warnings.

In response to this:

/lint

Looks like pkg/destroy is missing a // Package destroy ... comment; I'm not sure why golint (or go-vet?) isn't complaining...

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -0,0 +1,26 @@
package types
Contributor

Golint comments: should have a package comment, unless it's in another file for this package. More info.

@@ -0,0 +1,80 @@
package metadata
Contributor

Golint comments: should have a package comment, unless it's in another file for this package. More info.

@@ -0,0 +1,31 @@
package metadata
Contributor

Golint comments: should have a package comment, unless it's in another file for this package. More info.

@@ -0,0 +1,78 @@
package destroy
Contributor

Golint comments: should have a package comment, unless it's in another file for this package. More info.

@@ -0,0 +1,241 @@
package libvirt
Contributor

Golint comments: should have a package comment, unless it's in another file for this package. More info.

@abhinavdahiya
Contributor Author

@wking is a package-level comment required? The pkg/destroy name seems enough.

}

// NewDestroyer returns Destroyer based on `metadata.json` in `rootDir`.
func NewDestroyer(level log.Level, rootDir string) (Destroyer, error) {
Member

@wking wking Sep 25, 2018

I'd rather this took a *ClusterMetadata or some such, with the file loading and deserialization happening externally (or in a NewDestroyerFromFile helper, or some such). That would allow folks to create destroyers without hitting the disk (e.g. if they pulled metadata out of a cluster asset).
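
A rough sketch of the split being suggested, assuming the Destroyer interface and types.ClusterMetadata introduced in this PR; the import path and the stubbed NewDestroyer body are placeholders:

package destroy

import (
	"encoding/json"
	"io/ioutil"
	"path/filepath"

	log "github.com/sirupsen/logrus"

	"github.com/openshift/installer/pkg/types" // assumed import path
)

// NewDestroyer would keep the existing platform dispatch but accept the
// already-deserialized metadata (body stubbed for this sketch).
func NewDestroyer(level log.Level, metadata *types.ClusterMetadata) (Destroyer, error) {
	_ = level
	_ = metadata
	return nil, nil
}

// NewDestroyerFromFile is the thin wrapper: it only loads and deserializes
// metadata.json from rootDir before delegating to NewDestroyer.
func NewDestroyerFromFile(level log.Level, rootDir string) (Destroyer, error) {
	raw, err := ioutil.ReadFile(filepath.Join(rootDir, "metadata.json"))
	if err != nil {
		return nil, err
	}
	var metadata types.ClusterMetadata
	if err := json.Unmarshal(raw, &metadata); err != nil {
		return nil, err
	}
	return NewDestroyer(level, &metadata)
}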


@wking
Member

wking commented Sep 25, 2018

@wking is a package-level comment required? The pkg/destroy name seems enough.

From here, "Every package should have a package comment...". Doesn't have to be more than a line. Maybe:

// Package destroy contains tools for destroying clusters based on their metadata.
package destroy

// returns true, when the name should be handled.
type filterFunc func(name string) bool

// ClusterNamePrefixFilter returns true for name
Member

nit: "name" -> "names".

Contributor Author

I don't get "names"? It is clustername.

Member

I don't get "names"? It is clustername.

I'm fine with "clustername" (matching the argument), "clusterName" (potentially more idiomatic Go vs. clustername), or "names" (generic English).

Contributor Author

keeping it as is.

Member

"for name that are" doesn't read smoothly to me. "name" is singular; "are" is for plurals. "...for names that are prefixed with clustername..." reads smoothly to me. But whatever, it's just a comment ;).

Two additional review threads on pkg/destroy/libvirt/libvirt_prefix_deprovision.go were marked outdated and resolved.
// wait for them to finish
for i := 0; i < len(deleteFuncs); i++ {
select {
case res := <-returnChannel:
Member

Can we take a context.Context in this function? This sort of select looks like something that should also be watching for ctx.Done().

Contributor Author

This is waiting for all runners to report done/complete; a single context will not be useful.

Member

... single context will not be useful...

Single context lets you cancel cleanly.
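
A minimal sketch of the suggestion, reusing the returnChannel/deleteFuncs names from the snippet above (their exact types are assumptions); the context plumbing is the reviewer's proposal, not the code as merged:

package libvirt

import (
	"context"

	log "github.com/sirupsen/logrus"
)

// waitForRunners blocks until every delete goroutine has reported on
// returnChannel, or until the context is cancelled or times out.
func waitForRunners(ctx context.Context, returnChannel <-chan string, deleteFuncs map[string]func()) error {
	for i := 0; i < len(deleteFuncs); i++ {
		select {
		case res := <-returnChannel:
			log.Debugf("goroutine %s complete", res)
		case <-ctx.Done():
			// Cancellation lets the caller bail out cleanly instead of
			// blocking forever on a stuck runner.
			return ctx.Err()
		}
	}
	return nil
}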

})

if err != nil {
logger.Fatalf("Unrecoverable error/timed out: %v", err)
Member

A more graceful way to handle this would be to wait on the runners with an error channel. Send nil when a runner completes successfully, and the error if a runner fails unrecoverably. The waiter would collect responses until it received the correct number (because you'd no longer have access to deleteFuncName), and error out if any of the responses were non-nil.

Contributor Author

An error from one of the runners means it cannot continue, and all runners must be running to make progress.

Member

An error from one of the runners means it cannot continue, and all runners must be running to make progress.

Then if the caller receives an error it can Cancel() the Context, wait out the canceled goroutines, and exit cleanly. But I'm fine rolling with what you have and revisiting this later if you don't want to pick it up now.
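
For illustration, a sketch of the error-channel variant being described; the runner signature and names are invented for the example:

package libvirt

// runAndWait starts each runner in its own goroutine and collects exactly
// one result per runner: nil on success, an error on unrecoverable failure.
// The first error (if any) is returned once every runner has reported.
func runAndWait(runners []func() error) error {
	errCh := make(chan error, len(runners))
	for _, run := range runners {
		go func(run func() error) {
			errCh <- run()
		}(run)
	}
	var firstErr error
	for range runners {
		if err := <-errCh; err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}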

domains, err := conn.ListAllDomains(0)
if err != nil {
logger.Errorf("Error listing domains: %v", err)
return false, nil
Member

At some point this should count as an unrecoverable error. Maybe with a context.Context that we can time out?

Contributor Author

@abhinavdahiya abhinavdahiya Sep 25, 2018

The function is called using https://godoc.org/k8s.io/apimachinery/pkg/util/wait#Backoff; transient errors resolve themselves, others time out for now.

Member

The function is called using https://godoc.org/k8s.io/apimachinery/pkg/util/wait#Backoff; transient errors resolve themselves, others time out for now.

Ah, so there's no way to pass through a Context now besides putting it into a struct and making these functions methods. Where does the timeout get set?
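
For reference, one common way to drive such a condition function with wait.Backoff is wait.ExponentialBackoff; the wrapper choice and the specific values below are illustrative, not necessarily what this PR uses:

package libvirt

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// retryStep keeps calling step until it reports done, returns an error, or
// the backoff runs out of Steps (which yields wait.ErrWaitTimeout).
func retryStep(step func() (done bool, err error)) error {
	backoff := wait.Backoff{
		Duration: time.Second, // initial delay
		Factor:   1.3,         // grow the delay by this factor each attempt
		Steps:    10,          // give up after this many attempts
	}
	// Returning (false, nil) from step means "not done yet, retry";
	// returning an error aborts immediately.
	return wait.ExponentialBackoff(backoff, step)
}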

Another review thread on pkg/destroy/libvirt/libvirt_prefix_deprovision.go was marked outdated and resolved.
}
network, err := conn.LookupNetworkByName(nName)
if err != nil {
logger.Errorf("Error getting networl %s: %v", nName, err)
Member

nit: networl -> network

Contributor Author

done

@@ -79,7 +81,8 @@ func (c *Cluster) Generate(parents map[asset.Asset]*asset.State) (*asset.State,

stateFile, err := terraform.Apply(tmpDir, terraform.InfraStep, templateDir)
if err != nil {
return nil, err
// we should try to fetch the terraform state file.
Member

For cleanup? For a later retry? I'm fine just leaving the dead cluster and using the new tag/prefix cleanup you're adding to prune it.

Contributor Author

Terraform always writes the tfstate file, even if it failed midway. It can be used for debugging, to fix errors and use terraform to continue from where it left off.

Member

It can be used for debugging...

This sounds useful.

... use terraform to continue from where it left...

This sounds like something we don't want to support. Hopefully install/destroy are both quick enough that the caller can just start clean.
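
A sketch of the "fetch the state file even when Apply fails" flow being discussed, assuming terraform.Apply returns the state-file path and an error as in the snippet above; the helper name, the import paths, the terraform.tfstate filename, and the logging are illustrative assumptions:

package cluster

import (
	"io/ioutil"
	"path/filepath"

	log "github.com/sirupsen/logrus"

	"github.com/openshift/installer/pkg/terraform" // assumed import path
)

// applyAndRecoverState runs terraform apply and, even on failure, tries to
// read whatever state Terraform left behind so it can be kept for debugging.
func applyAndRecoverState(tmpDir, templateDir string) (string, []byte, error) {
	stateFile, applyErr := terraform.Apply(tmpDir, terraform.InfraStep, templateDir)
	if applyErr == nil {
		return stateFile, nil, nil
	}
	// Terraform usually writes terraform.tfstate even when apply dies midway;
	// recovering it lets the user inspect or resume the partial cluster.
	data, readErr := ioutil.ReadFile(filepath.Join(tmpDir, "terraform.tfstate"))
	if readErr != nil {
		log.Debugf("could not recover terraform state: %v", readErr)
		return stateFile, nil, applyErr
	}
	return stateFile, data, applyErr
}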

The libvirt destroyer implementation is working but has been commented out because
`bazel build tarball` cannot build with `github.com/libvirt/libvirt-go` as one of the
dependencies, due to some missing cgo dependencies. `go build cmd/openshift-install` works
perfectly fine.
Revisit adding the libvirt destroyer when bazel is dropped.
@crawford
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,crawford]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit d8844be into openshift:master Sep 25, 2018
@abhinavdahiya abhinavdahiya deleted the core branch September 25, 2018 23:03
wking added a commit to wking/openshift-installer that referenced this pull request Sep 27, 2018
We haven't set it in the cluster struct yet; that's a few lines below.
Fixes [1]:

  # openshift-install cluster
  ? Email Address [email protected]
  ? Password [? for help] ********
  ? SSH Public Key <none>
  ? Base Domain REDACTED
  ? Cluster Name dgoodwin1
  ? Pull Secret REDACTED
  ? Platform aws
  ? Region us-east-1 (N. Virginia)
  FATA[0046] failed to generate asset: failed to determine default AMI: MissingRegion: could not find region configuration

and a copy/paste bug from c6f02ba (pkg/cluster/config: add missing
ami fetch, 2018-09-25, openshift#324).

[1]: openshift#335 (comment)
wking added a commit to wking/openshift-installer that referenced this pull request Nov 12, 2018
This decouples our platforms a bit and makes it easier to distinguish
between platform-specific and platform-agnostic code.  It also gives
us much more compact struct names, since now we don't need to
distinguish between many flavors of machine pool, etc. in a single
package.

I've also updated pkg/types/doc.go; pkg/types includes more than
user-specified configuration since 78c3118 (pkg: add ClusterMetadata
asset,type that can be used for destroy, 2018-09-25, openshift#324).

I've also added OWNERS files for some OpenStack-specific directories
that were missing them before.

There's still more work to go in this direction (e.g. pushing default
logic into subdirs), but this seems like a reasonable chunk.
wking added a commit to wking/openshift-installer that referenced this pull request Nov 13, 2018
And I've rerolled deletion to use a single call to each deleter,
failing fast if they error.  That should address cases where we cannot
destroy a shut-off domain [1]:

  $ virsh -c $OPENSHIFT_INSTALL_LIBVIRT_URI list --all
   Id    Name                           State
  ----------------------------------------------------
   -     master0                        shut off
   -     test1-worker-0-zd7hd           shut off

  $ bin/openshift-install destroy cluster --dir test --log-level debug
  DEBUG Deleting libvirt volumes
  DEBUG Deleting libvirt domains
  DEBUG Deleting libvirt network
  DEBUG Exiting deleting libvirt network
  DEBUG goroutine deleteNetwork complete
  ERROR Error destroying domain test1-worker-0-zd7hd: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')
  DEBUG Exiting deleting libvirt domains
  DEBUG Exiting deleting libvirt volumes
  DEBUG goroutine deleteVolumes complete
  DEBUG Deleting libvirt domains
  ERROR Error destroying domain test1-worker-0-zd7hd: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')
  [...]

Now we'll fail-fast in those cases, allowing the caller to clear the
stuck domains, after which they can restart deletion.

The previous goroutine approach was borrowed from the AWS destroyer.
But AWS has a large, complicated resource dependency graph which
includes cycles.  Libvirt is much simpler, with volumes and a network
that are all independent, followed by domains which depend on the
network and some of the volumes.  With this commit we now take a
single pass at destroying those resources starting at the leaf domains
and working our way rootwards.

I've retained some looping (although no longer in a separate
goroutine) for domain deletion.  This guards against racing domain
creation, as discussed in the new godocs for deleteDomains.

Also:

* Rename from libvirt_prefix_deprovision.go to libvirt.go.  The name
  is from 998ba30 (cmd,pkg/destroy: add non-terraform destroy,
  2018-09-25, openshift#324), but the implementation doesn't need to be
  represented in the filename.  This commit renames to libvirt.go to
  match the package name, since this file is the guts of this package.

* Simplify the AlwaysTrueFilter implementation.  No semantic changes,
  but this saves us a few lines of code.

* Add trailing periods for godocs to comply with [2].

[1]: openshift#656 (comment)
[2]: https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences
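
A sketch of the single-pass, fail-fast ordering that commit message describes: leaf domains first, then the network and volumes they depend on. The deleteDomains/deleteNetwork/deleteVolumes names mirror the log messages quoted above; the receiver type and wiring are illustrative.

package libvirt

// clusterUninstaller is a minimal stand-in for the real uninstaller type.
type clusterUninstaller struct {
	deleteDomains, deleteNetwork, deleteVolumes func() error
}

// Run calls each deleter once, in dependency order, and the first error
// aborts the run so the caller can clear the stuck resource and retry.
func (o *clusterUninstaller) Run() error {
	for _, del := range []func() error{
		o.deleteDomains, // leaf resources: domains depend on the network and some volumes
		o.deleteNetwork,
		o.deleteVolumes,
	} {
		if err := del(); err != nil {
			return err
		}
	}
	return nil
}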
wking added a commit to wking/openshift-installer that referenced this pull request Feb 4, 2019
Avoid the confusing:

  ERROR Failed to read tfstate: open : no such file or directory

and backing syscall when terraform.Apply returns an empty string.

We're not exiting immediately on Apply errors (and are instead
attempting to recover the Terraform state regardless of Apply errors)
since 78c3118 (pkg: add ClusterMetadata asset,type that can be used
for destroy, 2018-09-25, openshift#324).  The goal there is to retain the
current resource state in case the user wants to investigate the
resources associated with the partially-launched cluster and
potentially invoke Terraform themselves to finish the install instead
of deleting and re-creating their cluster.

Apply can return an empty-string state file since 3f4fe57
(cmd/openshift-install: Add 'destroy bootstrap' command, 2018-10-18),
when an early failure leaves the temporary directory without a
terraform state file.  And we don't care about recovering the state
file in those cases, because we haven't created any external resources
at that point (before the apply call).

Also flatten some unnecessary nesting in the err2 handling by using an
'else if' clause.
wking added a commit to wking/openshift-installer that referenced this pull request Apr 1, 2019
We've been using the tagged private zone to look up the public zone
since the non-Terraform destroy code landed in a8fc89b (vendor: add
aws deprovision, openshift#324).  When there's an existing cluster with a given
domain, that can lead to false-positive removals like [1]:

1. Cluster 1 comes up like usual.
2. Cluster 2 creates a private zone.
3. Cluster 2 dies when its public record conflicts with cluster 1 (new
   since d1c17b7, terraform/exec/plugins/vendor: Bump
   terraform-provider-aws to v2.2.0, 2019-03-19, openshift#1442).
4. 'destroy cluster' on cluster 2's metadata.json removes cluster 2
   resources (good) and cluster 1's public record (bad).

With the explicit dependency in this commit, we ensure that we only
ever create the private zone after we have successfully claimed
ownership of the public record.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1659970#c7
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
7 participants