
pkg/destroy/libvirt: Use prefix-based deletion #660

Merged
merged 3 commits into openshift:master from libvirt-destroy on Nov 13, 2018

Conversation

wking
Member

@wking wking commented Nov 13, 2018

To avoid wiping out the caller's whole libvirt environment, regardless of whether it was associated with our cluster or not. Using cluster-name prefixes still makes me a bit jumpy, so I've added warnings to both the environment-variable and asset-prompt docs warning libvirt users to pick something sufficiently unique.
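
For readers skimming the thread, here is a minimal sketch of the prefix-filter idea; the names and signatures below are made up for illustration and are not the installer's actual API:

package main

import (
	"fmt"
	"strings"
)

// filterFunc reports whether a named libvirt resource should be treated
// as belonging to the cluster and therefore be deleted.
type filterFunc func(name string) bool

// clusterNamePrefixFilter matches the bare cluster name (e.g. the network)
// and anything prefixed with "<clusterName>-" (domains, volumes, ...).
func clusterNamePrefixFilter(clusterName string) filterFunc {
	return func(name string) bool {
		return name == clusterName || strings.HasPrefix(name, clusterName+"-")
	}
}

func main() {
	filter := clusterNamePrefixFilter("test1")
	for _, name := range []string{"test1", "test1-master-0", "unrelated-vm"} {
		fmt.Printf("%-15s delete=%v\n", name, filter(name))
	}
}

This is also why the docs now warn libvirt users to pick a sufficiently unique cluster name: anything else sharing the prefix would be swept up too.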

There are also a number of other libvirt-destroy pivots in separate commits. I can pull any or all of those out into separate pull requests if that makes review easier. The only other change which seems user-visible is 926b351, which converts our looping goroutine deletion to a single pass.

Fixes #354.
Fixes #647.
Fixes #656.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 13, 2018
@abhinavdahiya
Contributor

abhinavdahiya commented Nov 13, 2018

926b351
you will be racing with machine-api-operator which might recreate the domains.

also needs a change to the naming for pkg/asset/machines/libvirt

also I don't see how this fixes #656

@wking
Member Author

wking commented Nov 13, 2018

also I don't see how this fixes #656

From #656:

It should delete or just error out without looping continue.

With 926b351, we take the "just error out without looping" approach to resolving #656.

I'm not sure what to do about the machine-api-operator race. How about taking two passes over the domains? It seems unlikely that a domain that squeaks through the first pass gets alive enough to launch another domain before the second pass kills it.

@wking
Member Author

wking commented Nov 13, 2018

also needs a change to the naming for pkg/asset/machines/libvirt

What change? Are you saying 1d6fab56818 (which already touches pkg/asset/machines/libvirt) is missing something?

@abhinavdahiya
Contributor

abhinavdahiya commented Nov 13, 2018

With 926b351, we take the "just error out without looping" approach to resolving #656.

What is the correct response to an already shut-off domain?

  • error out because the domain is already shut off: I think we should at least continue to delete the rest
  • skip the already shut-off domain: IMO an inform-the-user-and-continue approach is more useful

I'm not sure what to do about the machine-api-operator race. How about taking two passes over the domains? It seems unlikely that a domain that squeaks through the first pass gets alive enough to launch another domain before the second pass kills it.

if you do not delete the master domain that is running the MAO first, it will keep trying to create domains to match its Machine objects. So you'll leak newly created domains that you didn't get in the list.

@abhinavdahiya
Contributor

also needs a change to the naming for pkg/asset/machines/libvirt

What change? Are you saying 1d6fab5 (which already touches pkg/asset/machines/libvirt) is missing something?

1d6fab5#diff-56a33e306cc10f32c5f102c247a572e7L43

the name needs to change to match the changes in the tf file.

@wking
Member Author

wking commented Nov 13, 2018

With 926b351, we take the "just error out without looping" approach to resolving #656.

What is the correct response to an already shut-off domain?

  • error out because the domain is already shut off: I think we should at least continue to delete the rest
  • skip the already shut-off domain: IMO an inform-the-user-and-continue approach is more useful

But we don't want to continue so far that we delete their metadata.json while there are outstanding resources we know about. I'm fine continuing on to attempt deletion of other domains. I'm not sure we can delete the network before all domains are deleted. And while we could delete any volumes not used by the stuck domains, I don't know if it's worth the trouble of teaching Go about specific domain -> volume dependencies. If we need the caller to unstick a particular domain anyway, I don't see a problem with failing fast; they can always relaunch after clearing the domain.

I'm not sure what to do about the machine-api-operator race. How about taking two passes over the domains? It seems unlikely that a domain that squeaks through the first pass gets alive enough to launch another domain before the second pass kills it.

if you do not delete the master domain that is running the MAO first, it will keep trying to create domains to match its Machine objects. So you'll leak newly created domains that you didn't get in the list.

But if I take a second pass:

for _, del := range []deleteFunc{
  deleteDomains,
  deleteDomains, // catch any children launched by the machine-API operator
  deleteNetwork,
  deleteVolumes,
} {
  err = del(conn, o.Filter, o.Logger)
  if err != nil {
    return err
  }
}

then any newly-created domains will be caught by the second deleteDomains, right? We'd only have to worry about grandchildren, and the children caught by the second deleteDomains would have to get the grandchildren launched very fast to break through.

@wking
Member Author

wking commented Nov 13, 2018

also needs a change to the naming for pkg/asset/machines/libvirt

What change? Are you saying 1d6fab5 (which already touches pkg/asset/machines/libvirt) is missing something?

1d6fab5#diff-56a33e306cc10f32c5f102c247a572e7L43

the name needs to change to match the changes in the tf file.

Ah. I've pushed 9918d51 -> 0eb0adc. Does the new 0eb0adc have what you want?

@abhinavdahiya
Contributor

But we don't want to continue so far that we delete their metadata.json while there are outstanding resources we know about. I'm fine continuing on to attempt deletion of other domains. I'm not sure we can delete the network before all domains are deleted. And while we could delete any volumes not used by the stuck domains, I don't know if it's worth the trouble of teaching Go about specific domain -> volume dependencies. If we need the caller to unstick a particular domain anyway, I don't see a problem with failing fast; they can always relaunch after clearing the domain.

I prefer the current impl, but I think we need to do more regarding marking domains for deletion: only running, pending, etc.

then any newly-created domains will be caught by the second deleteDomains, right? We'd only have to worry about grandchildren, and the children caught by the second deleteDomains would have to get the grandchildren launched very fast to break through.

we should not exit / say we have completed deleting domains until we have actually deleted all of them. I don't like the 2 calls.

@wking
Member Author

wking commented Nov 13, 2018

Ok, I've pushed 0eb0adc -> 80bb4ed, which leaves a single deleteDomains that only returns nil once all domains have been deleted (including ones potentially launched by the machine-API operator). Now the deleteDomains function loops over the new deleteDomainsSinglePass until we hit an error or deleteDomainsSinglePass reports nothingToDelete (showing that no new domains were racily created between the last two ListAllDomains calls). That looping is a lot like the old goroutine approach, but now that we only need it for domains, we no longer need to offload it to a goroutine.
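
A self-contained sketch of that loop shape, with the single-pass function stubbed out (the real one takes the libvirt connection, filter, and logger and actually destroys/undefines matching domains):

package main

import "fmt"

// pendingDomains simulates the domains a real single pass would find;
// it shrinks on each call so the example terminates.
var pendingDomains = []string{"test1-master-0", "test1-worker-0-zd7hd"}

// deleteDomainsSinglePass reports nothingToDelete=true once no domains
// matching the cluster-name filter remain.
func deleteDomainsSinglePass() (nothingToDelete bool, err error) {
	if len(pendingDomains) == 0 {
		return true, nil
	}
	fmt.Println("deleting domain", pendingDomains[0])
	pendingDomains = pendingDomains[1:]
	return false, nil
}

// deleteDomains keeps taking passes until a pass finds nothing to delete,
// guarding against domains racily created between passes (e.g. by the
// machine-API operator), and fails fast on any error.
func deleteDomains() error {
	for {
		nothingToDelete, err := deleteDomainsSinglePass()
		if err != nil {
			return err
		}
		if nothingToDelete {
			return nil
		}
	}
}

func main() {
	if err := deleteDomains(); err != nil {
		fmt.Println("error:", err)
	}
}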

@wking
Member Author

wking commented Nov 13, 2018

e2e-aws:

error openshift-ingress/ds/router-default did not come up
Waiting for daemon set "router-default" rollout to finish: 1 of 3 updated pods are available...
E1113 05:53:39.982961     674 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)
error: watch closed before Until timeout
error openshift-ingress/ds/router-default did not come up
Waiting for daemon set "router-default" rollout to finish: 1 of 3 updated pods are available...
E1113 05:55:10.455454     694 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)
error: watch closed before Until timeout

Checking on the nodes:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/nodes.json | jq '[.items[] | .metadata | {"name": .name, "machine": .annotations.machine}]'
[
  {
    "name": "ip-10-0-13-33.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-0"
  },
  {
    "name": "ip-10-0-130-112.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1a-5ttjv"
  },
  {
    "name": "ip-10-0-154-238.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1b-t9gzw"
  },
  {
    "name": "ip-10-0-168-175.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1c-fd75b"
  },
  {
    "name": "ip-10-0-17-71.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-1"
  },
  {
    "name": "ip-10-0-33-245.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-2"
  }
]

And the pods:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods.json | jq '[.items[] | {"namespace": .metadata.namespace, "name": .metadata.name, "phase": .status.phase, "node": .spec.nodeName} | select(.phase != "Running" and .phase != "Succeeded")]'
[
  {
    "namespace": "openshift-cluster-dns",
    "name": "dns-default-q5vkq",
    "phase": "Pending",
    "node": "ip-10-0-154-238.ec2.internal"
  },
  {
    "namespace": "openshift-cluster-dns",
    "name": "dns-default-v6tv5",
    "phase": "Pending",
    "node": "ip-10-0-130-112.ec2.internal"
  },
  {
    "namespace": "openshift-ingress",
    "name": "router-default-7jhr9",
    "phase": "Pending",
    "node": "ip-10-0-130-112.ec2.internal"
  },
  {
    "namespace": "openshift-ingress",
    "name": "router-default-rxq5c",
    "phase": "Pending",
    "node": "ip-10-0-154-238.ec2.internal"
  }
]

So all of those are pending for worker nodes. Finally, checking the tail of the event log:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/events.json | jq -r '[.items[-10:][] | {lastTimestamp, message}]'
[
  {
    "lastTimestamp": "2018-11-13T05:36:58Z",
    "message": "pulling image \"quay.io/coreos/kube-addon-operator-dev:70cae49142ff69e83ed7b41fa81a585b02cdea7d\""
  },
  {
    "lastTimestamp": "2018-11-13T05:36:13Z",
    "message": "Failed to pull image \"quay.io/coreos/kube-addon-operator-dev:70cae49142ff69e83ed7b41fa81a585b02cdea7d\": rpc error: code = Unknown desc = Error reading manifest 70cae49142ff69e83ed7b41fa81a585b02cdea7d in quay.io/coreos/kube-addon-operator-dev: unauthorized: access to the requested resource is not authorized"
  },
  {
    "lastTimestamp": "2018-11-13T05:36:13Z",
    "message": "Error: ErrImagePull"
  },
  {
    "lastTimestamp": "2018-11-13T05:36:43Z",
    "message": "Back-off pulling image \"quay.io/coreos/kube-addon-operator-dev:70cae49142ff69e83ed7b41fa81a585b02cdea7d\""
  },
  {
    "lastTimestamp": "2018-11-13T05:36:43Z",
    "message": "Error: ImagePullBackOff"
  },
  {
    "lastTimestamp": "2018-11-13T05:37:05Z",
    "message": "Successfully pulled image \"quay.io/coreos/kube-addon-operator-dev:70cae49142ff69e83ed7b41fa81a585b02cdea7d\""
  },
  {
    "lastTimestamp": "2018-11-13T05:37:08Z",
    "message": "Created container"
  },
  {
    "lastTimestamp": "2018-11-13T05:37:08Z",
    "message": "Started container"
  },
  {
    "lastTimestamp": "2018-11-13T05:34:31Z",
    "message": "Created pod: kube-addon-operator-654588755f-kb9wj"
  },
  {
    "lastTimestamp": "2018-11-13T05:34:31Z",
    "message": "Scaled up replica set kube-addon-operator-654588755f to 1"
  }
]

That looks fine (we have an initial issue pulling the kube-addon operator, but a minute later the same pull goes through).

Events specific to openshift-cluster-dns:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/events.json | jq '[.items[] | select(.involvedObject.namespace == "openshift-cluster-dns") | {lastTimestamp, message}]' | tail -n 30
  },
  {
    "lastTimestamp": "2018-11-13T05:53:16Z",
    "message": "network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]"
  },
  {
    "lastTimestamp": "2018-11-13T05:35:34Z",
    "message": "Created pod: dns-default-pfss6"
  },
  {
    "lastTimestamp": "2018-11-13T05:35:34Z",
    "message": "Created pod: dns-default-kp94s"
  },
  {
    "lastTimestamp": "2018-11-13T05:35:34Z",
    "message": "Created pod: dns-default-jh6dm"
  },
  {
    "lastTimestamp": "2018-11-13T05:38:12Z",
    "message": "Created pod: dns-default-v6tv5"
  },
  {
    "lastTimestamp": "2018-11-13T05:38:14Z",
    "message": "Created pod: dns-default-q5vkq"
  },
  {
    "lastTimestamp": "2018-11-13T05:38:28Z",
    "message": "Created pod: dns-default-rp9b9"
  }
]

Logs from one of the routers:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods/openshift-ingress_router-default-dbzf8_router.log.gz | gunzip | tail
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I1113 05:45:34.470678       1 router.go:481] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
W1113 05:45:39.436042       1 router.go:1035] a reencrypt terminated route with host alertmanager-main-openshift-monitoring.router.default.svc.cluster.local does not have the required certificates.  The route will still be created but no certificates will be written
I1113 05:45:39.463703       1 router.go:481] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
W1113 05:55:12.399744       1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.

Logs for some of the DNS containers:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods/openshift-cluster-dns_dns-default-jh6dm_dns.log.gz | gunzip
.:5353
2018/11/13 05:35:40 [INFO] CoreDNS-1.1.3
2018/11/13 05:35:40 [INFO] linux/amd64, go1.9.7, 
CoreDNS-1.1.3
linux/amd64, go1.9.7, 
2018/11/13 05:35:40 [INFO] plugin/reload: Running configuration MD5 = 6dfacbfa08660b953611ad25ea5c84fc
E1113 05:37:17.916905       1 reflector.go:322] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to watch *v1.Service: Get https://10.3.0.1:443/api/v1/services?resourceVersion=2456&timeoutSeconds=319&watch=true: dial tcp 10.3.0.1:443: getsockopt: connection refused
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods/openshift-cluster-dns-operator_cluster-dns-operator-66c5d9c4bd-z2z4g_cluster-dns-operator.log.gz | gunzip
time="2018-11-13T05:35:34Z" level=info msg="Go Version: go1.10.3"
time="2018-11-13T05:35:34Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-11-13T05:35:34Z" level=info msg="operator-sdk Version: 0.0.6+git"
time="2018-11-13T05:35:34Z" level=info msg="Metrics service cluster-dns-operator created"
time="2018-11-13T05:35:34Z" level=info msg="Watching dns.openshift.io/v1alpha1, ClusterDNS, openshift-cluster-dns-operator, 600000000000"
time="2018-11-13T05:35:34Z" level=info msg="reconciling for update to clusterdns \"default\""
time="2018-11-13T05:35:37Z" level=info msg="syncOperatorStatus: created ClusterOperator openshift-cluster-dns-operator/openshift-dns (UID f344daee-e705-11e8-a7b2-128c46dc2b14)"
ERROR: logging before flag.Parse: W1113 05:37:17.982687       1 reflector.go:341] github.com/openshift/cluster-dns-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:91: watch of *unstructured.Unstructured ended with: unexpected object: &{map[status:Failure message:too old resource version: 1634 (2522) reason:Gone code:410 kind:Status apiVersion:v1 metadata:map[]]}
ERROR: logging before flag.Parse: W1113 05:37:18.067760       1 reflector.go:341] github.com/openshift/cluster-dns-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:91: watch of *unstructured.Unstructured ended with: unexpected object: &{map[kind:Status apiVersion:v1 metadata:map[] status:Failure message:too old resource version: 1436 (2858) reason:Gone code:410]}
time="2018-11-13T05:37:19Z" level=info msg="reconciling for update to clusterdns \"default\""
ERROR: logging before flag.Parse: E1113 05:37:34.879222       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
time="2018-11-13T05:47:19Z" level=info msg="reconciling for update to clusterdns \"default\""

@squeed, thoughts?

@wking
Member Author

wking commented Nov 13, 2018

The e2e-aws flake is probably openshift/cluster-network-operator#32, which may be due to CRI-O's runAsUser handling.

@squeed
Contributor

squeed commented Nov 13, 2018

Hard to tell if that's a network flake (it could be...). It's not the runAsUser bug; that's already fixed and merged.

I think the openshift-sdn is not as tolerant of apiservers moving around. The next time this happens, I just need the logs from the sdn pod (kubectl logs -n openshift-sdn sdn-<ID>).

for _, domain := range domains {
	defer domain.Free()
	dName, err := domain.GetName()
	if err != nil {
		logger.Errorf("Error getting name for domain: %v", err)
		return false, nil
		return false, errors.Wrap(err, "get domain name")
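
The snippet above is the inline review context and appears to show the log-and-skip branch being replaced by a wrapped error return. A minimal, self-contained sketch of that fail-fast pattern, with a stub standing in for the real libvirt domain type:

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

// fakeDomain stands in for libvirt.Domain in this sketch.
type fakeDomain struct {
	name string
	err  error
}

func (d fakeDomain) GetName() (string, error) { return d.name, d.err }

// domainNames collects domain names, failing fast with a wrapped error
// instead of logging the failure and skipping the domain.
func domainNames(domains []fakeDomain) ([]string, error) {
	names := make([]string, 0, len(domains))
	for _, d := range domains {
		name, err := d.GetName()
		if err != nil {
			return nil, errors.Wrap(err, "get domain name")
		}
		names = append(names, name)
	}
	return names, nil
}

func main() {
	names, err := domainNames([]fakeDomain{{name: "test1-master-0"}, {name: "test1-worker-0-zd7hd"}})
	fmt.Println(names, err)
}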

@abhinavdahiya
Contributor

@wking do you think we should do something similar for deleteVolumes, like 80bb4ed?

maybe it is not required, as nobody is recreating them once all domains are removed.

@abhinavdahiya
Contributor

abhinavdahiya commented Nov 13, 2018

Also, do you think 2 commits should be sufficient? Maybe {data/data/libvirt,pkg/asset/machines/libvirt} for the renaming and pkg/destroy/libvirt for the destroy changes?

@wking
Member Author

wking commented Nov 13, 2018

@wking do you think we should do something similar for deleteVolumes, like 80bb4ed?

maybe it is not required, as nobody is recreating them once all domains are removed.

Yeah, once we kill the domains, we should be able to clear the other resources with a single pass each.
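
Put together, the overall destroy flow ends up shaped roughly like this (assumed names and a leaf-first ordering, matching the discussion above; not the installer's exact code):

package main

import "fmt"

// deleteFunc is the assumed common signature of the per-resource deleters.
type deleteFunc func() error

// destroyCluster runs the deleters leaf-first: domains (which retry
// internally until nothing matching the prefix remains), then the
// network, then the volumes, failing fast on the first error.
func destroyCluster(deleters ...deleteFunc) error {
	for _, del := range deleters {
		if err := del(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	step := func(what string) deleteFunc {
		return func() error { fmt.Println("deleted", what); return nil }
	}
	if err := destroyCluster(step("domains"), step("network"), step("volumes")); err != nil {
		fmt.Println("error:", err)
	}
}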

@wking
Member Author

wking commented Nov 13, 2018

Also, do you think 2 commits should be sufficient? Maybe {data/data/libvirt,pkg/asset/machines/libvirt} for the renaming and pkg/destroy/libvirt for the destroy changes?

So I'm clear, I currently have:

$ git log --oneline origin/master..libvirt-destroy
bca7e1e data/data/libvirt: {cluster-name}-master-{count} naming
232d306 pkg/destroy/libvirt: Single pass (instead of looping goroutines)
3b69de2 pkg/types/installconfig: Drop LibvirtNetwork.Name
bb724a3 pkg/destroy/libvirt: Use prefix-based deletion
6e4fc34 data/data/libvirt: Rename module.libvirt_base_volume -> module.volume
3db91b6 pkg/destroy/libvirt: Trailing periods for godocs
d14ceef pkg/destroy/libvirt: Simplify AlwaysTrueFilter implementation
0fa0e8c pkg/destroy/libvirt/libvirt: Rename from libvirt_prefix_deprovision.go

You're suggesting squashing bca7e1e, bb724a3, and 6e4fc34 into a renaming commit and 232d306, 3db91b6, and d14ceef into a destroy-implementation commit? Would you put 0fa0e8c into the destroy-implementation commit? What about 3b69de2?

@abhinavdahiya
Contributor

You're suggesting squashing bca7e1e, bb724a3, and 6e4fc34 into a renaming commit and 232d306, 3db91b6, and d14ceef into a destroy-implementation commit?

Yes, one that renames, another that changes the destroy.

Would you put 0fa0e8c into the destroy-implementation commit?

sure

What about 3b69de2?

this looks like it should not be part of this PR at all; it seems to have nothing to do with pkg/destroy/libvirt: Use prefix-based deletion

@wking
Member Author

wking commented Nov 13, 2018

What about 3b69de2?

this looks like it should not be part of this PR at all; it seems to have nothing to do with pkg/destroy/libvirt: Use prefix-based deletion

The connection is that with the prefix-based deletion, we're relying on cluster-name prefixes. Not "cluster-name prefixes for most things, but whatever you called the network for that" ;). Dropping the ability to configure a separate network name ensures the vanilla cluster-name prefix search picks up the network as well.

@wking wking force-pushed the libvirt-destroy branch 2 times, most recently from a6ba0a1 to 01823b7 on November 13, 2018 at 20:47
@wking
Member Author

wking commented Nov 13, 2018

Would you put 0fa0e8c into the destroy-implementation commit?

sure

Well, that's a rename that changes the destroy ;). I've made these squashes with bca7e1e -> 01823b7, leaving the Network.Name change (previously 3b69de2, now b3ce24328) as a separate commit for now.

@cgwalters
Member

I only skimmed the commits here but - sounds awesome to me!

We'd been defaulting it to ClusterName in InstallConfig.Generate, and
I see no reason for the user to want to create a separate name for the
network alone.  The variable dates back to 4a08942 (steps: bootstrap
/ etcd / topology support for libvirtm 2018-04-24,
coreos/tectonic-installer#3213), where it is not explicitly motivated.
And I've rerolled deletion to use a single call to each deleter,
failing fast if they error.  That should address cases where we cannot
destroy a shut-off domain [1]:

  $ virsh -c $OPENSHIFT_INSTALL_LIBVIRT_URI list --all
   Id    Name                           State
  ----------------------------------------------------
   -     master0                        shut off
   -     test1-worker-0-zd7hd           shut off

  $ bin/openshift-install destroy cluster --dir test --log-level debug
  DEBUG Deleting libvirt volumes
  DEBUG Deleting libvirt domains
  DEBUG Deleting libvirt network
  DEBUG Exiting deleting libvirt network
  DEBUG goroutine deleteNetwork complete
  ERROR Error destroying domain test1-worker-0-zd7hd: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')
  DEBUG Exiting deleting libvirt domains
  DEBUG Exiting deleting libvirt volumes
  DEBUG goroutine deleteVolumes complete
  DEBUG Deleting libvirt domains
  ERROR Error destroying domain test1-worker-0-zd7hd: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')
  [...]

Now we'll fail-fast in those cases, allowing the caller to clear the
stuck domains, after which they can restart deletion.

The previous goroutine approach was borrowed from the AWS destroyer.
But AWS has a large, complicated resource dependency graph which
includes cycles.  Libvirt is much simpler, with volumes and a network
that are all independent, followed by domains which depend on the
network and some of the volumes.  With this commit we now take a
single pass at destroying those resources starting at the leaf domains
and working our way rootwards.

I've retained some looping (although no longer in a separate
goroutine) for domain deletion.  This guards against racing domain
creation, as discussed in the new godocs for deleteDomains.

Also:

* Rename from libvirt_prefix_deprovision.go to libvirt.go.  The name
  is from 998ba30 (cmd,pkg/destroy: add non-terraform destroy,
  2018-09-25, openshift#324), but the implementation doesn't need to be
  represented in the filename.  This commit renames to libvirt.go to
  match the package name, since this file is the guts of this package.

* Simplify the AlwaysTrueFilter implementation.  No semantic changes,
  but this saves us a few lines of code.

* Add trailing periods for godocs to comply with [2].

[1]: openshift#656 (comment)
[2]: https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences
To avoid wiping out the caller's whole libvirt environment, regardless
of whether it was associated with our cluster or not.  Using
cluster-name prefixes still makes me a bit jumpy, so I've added
warnings to both the environment-variable and asset-prompt docs
warning libvirt users to pick something sufficiently unique.

Also:

* Use {cluster-name}-master-{count} naming.  We used to use
  master{count}, which diverged from other usage (e.g. AWS, which has
  used master-{count} since way back in ca443c5 (openstack/nova:
  replace cloud-init with ignition, 2017-02-27,
  coreos/tectonic-installer#7)).

* Rename module.libvirt_base_volume -> module.volume.  There's no
  reason to diverge from the module source for that name.
@abhinavdahiya
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 13, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 32a15a5 into openshift:master Nov 13, 2018
@wking wking deleted the libvirt-destroy branch November 14, 2018 00:04
wking added a commit to wking/openshift-installer that referenced this pull request Nov 14, 2018
Catching up with 18ca912 (pkg/destroy/libvirt: Use prefix-based
deletion, 2018-11-12, openshift#660).