pkg/destroy/libvirt: Use prefix-based deletion #660
Conversation
(force-pushed from e89343f to 9918d51)
From #656:
With 926b351, we take the "just error out without looping" approach to resolving #656. I'm not sure what to do about the machine-api-operator race. How about taking two passes over the domains? It seems unlikely that a domain that squeaks through the first pass gets alive enough to launch another domain before the second pass kills it.
What change? Are you saying 1d6fab56818 (which already touches …
What is the correct response to an already shut-off domain?
If you do not delete the master domain that is running the MAO first, it will keep trying to create domains to match its Machine objects. So you'll leak newly created domains, as you didn't get in the …
1d6fab5#diff-56a33e306cc10f32c5f102c247a572e7L43: the name needs to change to match the changes in the tf file.
But we don't want to continue so far that we delete their …
But if I take a second pass:

```go
for _, del := range []deleteFunc{
	deleteDomains,
	deleteDomains, // catch any children launched by the machine-API operator
	deleteNetwork,
	deleteVolumes,
} {
	err = del(conn, o.Filter, o.Logger)
	if err != nil {
		return err
	}
}
```

then any newly-created domains will be caught by the second `deleteDomains`.
(force-pushed from 9918d51 to 0eb0adc)
Ah. I've pushed 9918d51 -> 0eb0adc. Does the new 0eb0adc have what you want?
I prefer the current impl, but I think we need to do more regarding marking domains for deletion: only running, pending, etc. We should not exit / say we've completed deleting domains until we have actually deleted them all. I don't like the two calls.
(force-pushed from 0eb0adc to 80bb4ed)
OK, I've pushed 0eb0adc -> 80bb4ed, which leaves a single …
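For readers following along, here is a minimal sketch, not the PR's actual code, of what a single, internally looping `deleteDomains` could look like against the libvirt-go bindings. The `filter` predicate (in practice a cluster-name prefix check) and the one-second retry pause are illustrative assumptions:

```go
package libvirt

import (
	"time"

	libvirt "github.com/libvirt/libvirt-go"
	"github.com/pkg/errors"
	"github.com/sirupsen/logrus"
)

// deleteDomains keeps listing and destroying filter-matched (in practice,
// cluster-name-prefixed) domains until a pass finds none, guarding against
// the machine-API operator racing to recreate domains during teardown.
func deleteDomains(conn *libvirt.Connect, filter func(name string) bool, logger logrus.FieldLogger) error {
	for {
		domains, err := conn.ListAllDomains(0) // 0 = no flags: every domain, active or not
		if err != nil {
			return errors.Wrap(err, "list domains")
		}

		deleted := 0
		for _, domain := range domains {
			name, err := domain.GetName()
			if err != nil {
				domain.Free()
				return errors.Wrap(err, "get domain name")
			}
			if !filter(name) {
				domain.Free()
				continue
			}
			logger.Debugf("Deleting domain %s", name)
			// Destroy fails if the domain is already shut off; per the
			// discussion above, we fail fast and let the caller clean up
			// before re-running the destroyer.
			if err := domain.Destroy(); err != nil {
				domain.Free()
				return errors.Wrapf(err, "destroy domain %s", name)
			}
			if err := domain.Undefine(); err != nil {
				domain.Free()
				return errors.Wrapf(err, "undefine domain %s", name)
			}
			domain.Free()
			deleted++
		}

		if deleted == 0 {
			return nil // a clean pass: nothing matched, nothing left to respawn from
		}
		time.Sleep(time.Second) // let any racing creations show up before the next pass
	}
}
```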
Checking on the nodes:

```console
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/nodes.json | jq '[.items[] | .metadata | {"name": .name, "machine": .annotations.machine}]'
[
{
"name": "ip-10-0-13-33.ec2.internal",
"machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-0"
},
{
"name": "ip-10-0-130-112.ec2.internal",
"machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1a-5ttjv"
},
{
"name": "ip-10-0-154-238.ec2.internal",
"machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1b-t9gzw"
},
{
"name": "ip-10-0-168-175.ec2.internal",
"machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1c-fd75b"
},
{
"name": "ip-10-0-17-71.ec2.internal",
"machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-1"
},
{
"name": "ip-10-0-33-245.ec2.internal",
"machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-2"
}
]
```

And the pods:

```console
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods.json | jq '[.items[] | {"namespace": .metadata.namespace, "name": .metadata.name, "phase": .status.phase, "node": .spec.nodeName} | select(.phase != "Running" and .phase != "Succeeded")]'
[
{
"namespace": "openshift-cluster-dns",
"name": "dns-default-q5vkq",
"phase": "Pending",
"node": "ip-10-0-154-238.ec2.internal"
},
{
"namespace": "openshift-cluster-dns",
"name": "dns-default-v6tv5",
"phase": "Pending",
"node": "ip-10-0-130-112.ec2.internal"
},
{
"namespace": "openshift-ingress",
"name": "router-default-7jhr9",
"phase": "Pending",
"node": "ip-10-0-130-112.ec2.internal"
},
{
"namespace": "openshift-ingress",
"name": "router-default-rxq5c",
"phase": "Pending",
"node": "ip-10-0-154-238.ec2.internal"
}
]
```

So all of those are pending for worker nodes. Finally, checking the tail of the event log:

```console
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/events.json | jq -r '[.items[-10:][] | {lastTimestamp, message}]'
[
{
"lastTimestamp": "2018-11-13T05:36:58Z",
"message": "pulling image \"quay.io/coreos/kube-addon-operator-dev:70cae49142ff69e83ed7b41fa81a585b02cdea7d\""
},
{
"lastTimestamp": "2018-11-13T05:36:13Z",
"message": "Failed to pull image \"quay.io/coreos/kube-addon-operator-dev:70cae49142ff69e83ed7b41fa81a585b02cdea7d\": rpc error: code = Unknown desc = Error reading manifest 70cae49142ff69e83ed7b41fa81a585b02cdea7d in quay.io/coreos/kube-addon-operator-dev: unauthorized: access to the requested resource is not authorized"
},
{
"lastTimestamp": "2018-11-13T05:36:13Z",
"message": "Error: ErrImagePull"
},
{
"lastTimestamp": "2018-11-13T05:36:43Z",
"message": "Back-off pulling image \"quay.io/coreos/kube-addon-operator-dev:70cae49142ff69e83ed7b41fa81a585b02cdea7d\""
},
{
"lastTimestamp": "2018-11-13T05:36:43Z",
"message": "Error: ImagePullBackOff"
},
{
"lastTimestamp": "2018-11-13T05:37:05Z",
"message": "Successfully pulled image \"quay.io/coreos/kube-addon-operator-dev:70cae49142ff69e83ed7b41fa81a585b02cdea7d\""
},
{
"lastTimestamp": "2018-11-13T05:37:08Z",
"message": "Created container"
},
{
"lastTimestamp": "2018-11-13T05:37:08Z",
"message": "Started container"
},
{
"lastTimestamp": "2018-11-13T05:34:31Z",
"message": "Created pod: kube-addon-operator-654588755f-kb9wj"
},
{
"lastTimestamp": "2018-11-13T05:34:31Z",
"message": "Scaled up replica set kube-addon-operator-654588755f to 1"
}
]
```

That looks fine (we have an initial issue pulling the kube-addon operator, but a minute later the same pull goes through).
Events specific to `openshift-cluster-dns`:

```console
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/events.json | jq '[.items[] | select(.involvedObject.namespace == "openshift-cluster-dns") | {lastTimestamp, message}]' | tail -n 30
},
{
"lastTimestamp": "2018-11-13T05:53:16Z",
"message": "network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]"
},
{
"lastTimestamp": "2018-11-13T05:35:34Z",
"message": "Created pod: dns-default-pfss6"
},
{
"lastTimestamp": "2018-11-13T05:35:34Z",
"message": "Created pod: dns-default-kp94s"
},
{
"lastTimestamp": "2018-11-13T05:35:34Z",
"message": "Created pod: dns-default-jh6dm"
},
{
"lastTimestamp": "2018-11-13T05:38:12Z",
"message": "Created pod: dns-default-v6tv5"
},
{
"lastTimestamp": "2018-11-13T05:38:14Z",
"message": "Created pod: dns-default-q5vkq"
},
{
"lastTimestamp": "2018-11-13T05:38:28Z",
"message": "Created pod: dns-default-rp9b9"
}
]
```

Logs from one of the routers:

```console
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods/openshift-ingress_router-default-dbzf8_router.log.gz | gunzip | tail
- Checking http://localhost:80 ...
- Health check ok : 0 retry attempt(s).
I1113 05:45:34.470678 1 router.go:481] Router reloaded:
- Checking http://localhost:80 ...
- Health check ok : 0 retry attempt(s).
W1113 05:45:39.436042 1 router.go:1035] a reencrypt terminated route with host alertmanager-main-openshift-monitoring.router.default.svc.cluster.local does not have the required certificates. The route will still be created but no certificates will be written
I1113 05:45:39.463703 1 router.go:481] Router reloaded:
- Checking http://localhost:80 ...
- Health check ok : 0 retry attempt(s).
W1113 05:55:12.399744 1 reflector.go:272] github.com/openshift/origin/pkg/router/controller/factory/factory.go:112: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
```

Logs for some of the DNS containers:

```console
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods/openshift-cluster-dns_dns-default-jh6dm_dns.log.gz | gunzip
.:5353
2018/11/13 05:35:40 [INFO] CoreDNS-1.1.3
2018/11/13 05:35:40 [INFO] linux/amd64, go1.9.7,
CoreDNS-1.1.3
linux/amd64, go1.9.7,
2018/11/13 05:35:40 [INFO] plugin/reload: Running configuration MD5 = 6dfacbfa08660b953611ad25ea5c84fc
E1113 05:37:17.916905 1 reflector.go:322] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to watch *v1.Service: Get https://10.3.0.1:443/api/v1/services?resourceVersion=2456&timeoutSeconds=319&watch=true: dial tcp 10.3.0.1:443: getsockopt: connection refused
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods/openshift-cluster-dns-operator_cluster-dns-operator-66c5d9c4bd-z2z4g_cluster-dns-operator.log.gz | gunzip
time="2018-11-13T05:35:34Z" level=info msg="Go Version: go1.10.3"
time="2018-11-13T05:35:34Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-11-13T05:35:34Z" level=info msg="operator-sdk Version: 0.0.6+git"
time="2018-11-13T05:35:34Z" level=info msg="Metrics service cluster-dns-operator created"
time="2018-11-13T05:35:34Z" level=info msg="Watching dns.openshift.io/v1alpha1, ClusterDNS, openshift-cluster-dns-operator, 600000000000"
time="2018-11-13T05:35:34Z" level=info msg="reconciling for update to clusterdns \"default\""
time="2018-11-13T05:35:37Z" level=info msg="syncOperatorStatus: created ClusterOperator openshift-cluster-dns-operator/openshift-dns (UID f344daee-e705-11e8-a7b2-128c46dc2b14)"
ERROR: logging before flag.Parse: W1113 05:37:17.982687 1 reflector.go:341] github.com/openshift/cluster-dns-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:91: watch of *unstructured.Unstructured ended with: unexpected object: &{map[status:Failure message:too old resource version: 1634 (2522) reason:Gone code:410 kind:Status apiVersion:v1 metadata:map[]]}
ERROR: logging before flag.Parse: W1113 05:37:18.067760 1 reflector.go:341] github.com/openshift/cluster-dns-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:91: watch of *unstructured.Unstructured ended with: unexpected object: &{map[kind:Status apiVersion:v1 metadata:map[] status:Failure message:too old resource version: 1436 (2858) reason:Gone code:410]}
time="2018-11-13T05:37:19Z" level=info msg="reconciling for update to clusterdns \"default\""
ERROR: logging before flag.Parse: E1113 05:37:34.879222 1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
time="2018-11-13T05:47:19Z" level=info msg="reconciling for update to clusterdns \"default\"" @squeed, thoughts? |
The e2e-aws flake is probably openshift/cluster-network-operator#32, which may be due to CRI-O's …
Hard to tell if that's a network flake (it could be…). It's not the runAsUser bug; that's already fixed and merged. I think the openshift-sdn is not as tolerant about apiservers moving around. The next time this happens, I just need the logs from the sdn pod (…
```diff
 	for _, domain := range domains {
 		defer domain.Free()
 		dName, err := domain.GetName()
 		if err != nil {
 			logger.Errorf("Error getting name for domain: %v", err)
-			return false, nil
+			return false, errors.Wrap(err, "get domain name")
```
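As an aside, the hunk above trades log-and-continue for a wrapped error. A tiny illustration of that pattern with `github.com/pkg/errors`; `getDomainName` is a hypothetical helper for the sketch, not a function in the package:

```go
import (
	libvirt "github.com/libvirt/libvirt-go"
	"github.com/pkg/errors"
)

// getDomainName shows the wrap-and-return style: the error keeps its cause
// plus a short context string, and the caller decides whether to log, retry,
// or abort, instead of this function logging the failure and then pretending
// it succeeded.
func getDomainName(domain *libvirt.Domain) (string, error) {
	name, err := domain.GetName()
	if err != nil {
		return "", errors.Wrap(err, "get domain name")
	}
	return name, nil
}
```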
same as 73617ac#r233161650
Also, do you think 2 commits should be sufficient? Maybe …
Yeah, once we kill the domains, we should be able to clear the other resources with a single pass each.
(force-pushed from 80bb4ed to bca7e1e)
So I'm clear, I currently have:

```console
$ git log --oneline origin/master..libvirt-destroy
bca7e1e data/data/libvirt: {cluster-name}-master-{count} naming
232d306 pkg/destroy/libvirt: Single pass (instead of looping goroutines)
3b69de2 pkg/types/installconfig: Drop LibvirtNetwork.Name
bb724a3 pkg/destroy/libvirt: Use prefix-based deletion
6e4fc34 data/data/libvirt: Rename module.libvirt_base_volume -> module.volume
3db91b6 pkg/destroy/libvirt: Trailing periods for godocs
d14ceef pkg/destroy/libvirt: Simplify AlwaysTrueFilter implementation
0fa0e8c pkg/destroy/libvirt/libvirt: Rename from libvirt_prefix_deprovision.go
```

You're suggesting squashing bca7e1e, bb724a3, and 6e4fc34 into a renaming commit and 232d306, 3db91b6, and d14ceef into a destroy-implementation commit? Would you put 0fa0e8c into the destroy-implementation commit? What about 3b69de2?
Yes, one that renames, another that changes the destroy.

Sure.

This looks like it should not be part of this PR at all; it looks like it has nothing to do with …
The connection is that with the prefix-based deletion, we're relying on cluster-name prefixes. Not "cluster-name prefixes for most things, but whatever you called the network for that" ;). Dropping the ability to configure a separate network name ensures the vanilla cluster-name prefix search picks up the network as well.
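To make the "prefix search" concrete, a rough sketch follows; the `filterFunc` type and function names are assumptions for illustration rather than the package's actual identifiers:

```go
package libvirt

import "strings"

// filterFunc reports whether a libvirt resource name belongs to the cluster
// being destroyed.
type filterFunc func(name string) bool

// clusterNamePrefixFilter matches every resource whose name starts with the
// cluster name. This only works if domains, volumes, and the network all
// share that prefix, which is why a separately configurable network name
// had to go.
func clusterNamePrefixFilter(clusterName string) filterFunc {
	return func(name string) bool {
		return strings.HasPrefix(name, clusterName)
	}
}

// alwaysTrueFilter matches everything: handy for wiping a dedicated libvirt
// environment, dangerous anywhere else.
func alwaysTrueFilter() filterFunc {
	return func(string) bool { return true }
}
```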
(force-pushed from a6ba0a1 to 01823b7)
Well, that's a rename that changes the destroy ;). I've made these squashes with bca7e1e -> 01823b7, leaving the …
(force-pushed from 01823b7 to 61d2e55)
I only skimmed the commits here, but it sounds awesome to me!
We'd been defaulting it to ClusterName in InstallConfig.Generate, and I see no reason for the user to want to create a separate name for the network alone. The variable dates back to 4a08942 (steps: bootstrap / etcd / topology support for libvirt, 2018-04-24, coreos/tectonic-installer#3213), where it is not explicitly motivated.
And I've rerolled deletion to use a single call to each deleter, failing fast if they error. That should address cases where we cannot destroy a shut-off domain [1]:

```console
$ virsh -c $OPENSHIFT_INSTALL_LIBVIRT_URI list --all
 Id    Name                           State
----------------------------------------------------
 -     master0                        shut off
 -     test1-worker-0-zd7hd           shut off

$ bin/openshift-install destroy cluster --dir test --log-level debug
DEBUG Deleting libvirt volumes
DEBUG Deleting libvirt domains
DEBUG Deleting libvirt network
DEBUG Exiting deleting libvirt network
DEBUG goroutine deleteNetwork complete
ERROR Error destroying domain test1-worker-0-zd7hd: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')
DEBUG Exiting deleting libvirt domains
DEBUG Exiting deleting libvirt volumes
DEBUG goroutine deleteVolumes complete
DEBUG Deleting libvirt domains
ERROR Error destroying domain test1-worker-0-zd7hd: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')
[...]
```

Now we'll fail fast in those cases, allowing the caller to clear the stuck domains, after which they can restart deletion.

The previous goroutine approach was borrowed from the AWS destroyer. But AWS has a large, complicated resource dependency graph which includes cycles. Libvirt is much simpler, with volumes and a network that are all independent, followed by domains which depend on the network and some of the volumes. With this commit we now take a single pass at destroying those resources, starting at the leaf domains and working our way rootwards.

I've retained some looping (although no longer in a separate goroutine) for domain deletion. This guards against racing domain creation, as discussed in the new godocs for deleteDomains.

Also:

* Rename from libvirt_prefix_deprovision.go to libvirt.go. The name is from 998ba30 (cmd,pkg/destroy: add non-terraform destroy, 2018-09-25, openshift#324), but the implementation doesn't need to be represented in the filename. This commit renames to libvirt.go to match the package name, since this file is the guts of this package.
* Simplify the AlwaysTrueFilter implementation. No semantic changes, but this saves us a few lines of code.
* Add trailing periods for godocs to comply with [2].

[1]: openshift#656 (comment)
[2]: https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences
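For illustration, the single-pass shape described in that commit message might look roughly like the sketch below. Paired with the `deleteDomains` sketch earlier in this thread it compiles as an approximation of the flow; the `ClusterUninstaller` fields are assumed, and `deleteNetwork`/`deleteVolumes` are left as stubs:

```go
package libvirt

import (
	libvirt "github.com/libvirt/libvirt-go"
	"github.com/pkg/errors"
	"github.com/sirupsen/logrus"
)

// deleteFunc is the shared shape of the per-resource deleters.
type deleteFunc func(conn *libvirt.Connect, filter func(name string) bool, logger logrus.FieldLogger) error

// deleteNetwork and deleteVolumes are placeholder stubs here; in a real
// implementation they would mirror the deleteDomains sketch (list matching
// resources, delete them, wrap any error).
func deleteNetwork(conn *libvirt.Connect, filter func(name string) bool, logger logrus.FieldLogger) error {
	return nil
}

func deleteVolumes(conn *libvirt.Connect, filter func(name string) bool, logger logrus.FieldLogger) error {
	return nil
}

// ClusterUninstaller carries what the destroyer needs: where to connect, how
// to recognize the cluster's resources, and where to log.
type ClusterUninstaller struct {
	LibvirtURI string
	Filter     func(name string) bool
	Logger     logrus.FieldLogger
}

// Run makes a single pass from the leaf resources (domains) toward the roots
// (network, then volumes), failing fast on the first error so the caller can
// clear the stuck resource and simply re-run the destroyer.
func (o *ClusterUninstaller) Run() error {
	conn, err := libvirt.NewConnect(o.LibvirtURI)
	if err != nil {
		return errors.Wrap(err, "connect to libvirt")
	}
	defer conn.Close()

	for _, del := range []deleteFunc{
		deleteDomains, // the looping deleter sketched earlier in this thread
		deleteNetwork,
		deleteVolumes,
	} {
		if err := del(conn, o.Filter, o.Logger); err != nil {
			return err
		}
	}
	return nil
}
```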
To avoid wiping out the caller's whole libvirt environment, regardless of whether it was associated with our cluster or not. Using cluster-name prefixes still makes me a bit jumpy, so I've added warnings to both the environment-variable and asset-prompt docs warning libvirt users to pick something sufficiently unique.

Also:

* Use {cluster-name}-master-{count} naming. We used to use master{count}, which diverged from other usage (e.g. AWS, which has used master-{count} since way back in ca443c5 (openstack/nova: replace cloud-init with ignition, 2017-02-27, coreos/tectonic-installer#7)).
* Rename module.libvirt_base_volume -> module.volume. There's no reason to diverge from the module source for that name.
(force-pushed from 61d2e55 to 18ca912)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
Catching up with 18ca912 (pkg/destroy/libvirt: Use prefix-based deletion, 2018-11-12, openshift#660).
To avoid wiping out the caller's whole libvirt environment, regardless of whether it was associated with our cluster or not. Using cluster-name prefixes still makes me a bit jumpy, so I've added warnings to both the environment-variable and asset-prompt docs warning libvirt users to pick something sufficiently unique.
There are also a number of other libvirt-destroy pivots in separate commits. I can pull any or all of those out into separate pull requests if that makes review easier. The only other change which seems user-visible is 926b351, which converts our looping goroutine deletion to a single pass.
Fixes #354.
Fixes #647.
Fixes #656.