Machine deletion fails if there is no ELB #1084

vivgoyal · 2019-08-29T05:21:27Z (edited by ncdc)

/kind bug

What steps did you take and what happened:
I deleted AWS resources manually and then initiated machine and cluster deletion from CAPA.
As a result, machine deployment deletions kept failing with the following error:

I0814 17:10:39.243788       1 instances.go:67] [machine-actuator]/cluster.k8s.io/v1alpha1/4d95faba9cb7ee388671ac3cef6ee79b39c25f15/bf038fa5/worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh "level"=2 "msg"="Looking for existing machine instance by tags"
I0814 17:10:39.288157       1 machine_controller.go:181] Deleting node "ip-10-0-0-20.us-west-2.compute.internal" for machine "worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh"
E0814 17:10:39.301721       1 machine_controller.go:183] Error deleting node "ip-10-0-0-20.us-west-2.compute.internal" for machine "worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh": Delete https://bf038fa5-apiserver-261840232.us-west-2.elb.amazonaws.com:6443/api/v1/nodes/ip-10-0-0-20.us-west-2.compute.internal: dial tcp: lookup bf038fa5-apiserver-261840232.us-west-2.elb.amazonaws.com on 10.96.0.10:53: no such host

What did you expect to happen:
I expected the deletions to succeed, since the resources were already gone (NOT FOUND).

Anything else you would like to add:
Also, if I delete the ELB, it could be recreated, unless the Cluster is being deleted. However, a recreated ELB would have a different DNS name, which implies that anything referencing the old DNS name would need to be updated, and that may not happen automatically. That would include the kubeconfig secret, but more importantly the client config/kubeadm config for all of the existing Machines in the cluster.

Environment:

Cluster-api-provider-aws version: v0.3.7
Kubernetes version: (use kubectl version): 1.14.1
OS (e.g. from /etc/os-release):


vincepri · 2019-08-29T16:28:33Z

How would we connect to the cluster without an elb? 🤔

vivgoyal · 2019-08-29T18:38:41Z

@vincepri Should we reconcile elb on "NotFound" and then try deletion?
What is more worrying is the part I mentioned under "Anything else you would like to add".

detiber · 2019-08-30T12:38:16Z

/priority important-soon

ncdc · 2019-09-12T14:01:16Z

It will be difficult for us to account for all situations where someone manually manipulates the AWS resources that CAPA creates and manages. If you make changes (edits or deletes) to these resources, it's probably up to you to resolve any issues, such as this one. I'm inclined to close this. WDYT @detiber @vincepri @sethp-nr @rudoi?

ncdc · 2019-09-12T14:02:48Z

I would also be willing to accept a code change to CAPI that attempts to delete the node up to n times and then gives up, without erroring.

detiber · 2019-09-12T14:04:36Z

@ncdc I'm a bit torn on it. There are potentially things we could do to keep the load balancer from being a blocker like it is today, such as using a static DNS entry for the apiserver endpoint rather than the LB DNS name. I'm not sure that's something we can do in the short term, but longer term I don't think this should fail in a catastrophic way as it does today.

ncdc · 2019-09-12T14:06:14Z

How about if we start with my suggestion about not failing on node deletion and see what else trips us up, if anything?

detiber · 2019-09-12T14:21:18Z

> How about if we start with my suggestion about not failing on node deletion and see what else trips us up, if anything?

In the general case of worker nodes, that should probably be fine.

ncdc · 2019-09-12T14:24:20Z

I'm not sure it makes sense to distinguish?

detiber · 2019-09-12T14:25:30Z

Today probably not, but once we have control plane management, we'd need to ensure that we handle etcd membership properly when deleting a control plane node, which, based on the model we are using with cluster-api-upgrade-tool, would require apiserver access.

ncdc · 2019-09-26T15:02:22Z

I filed kubernetes-sigs/cluster-api#1446 to try to delete the Node multiple times, then move on without considering it an error.

ncdc · 2019-09-27T14:53:56Z

The node deletion issue was fixed by kubernetes-sigs/cluster-api#1452, which will be in a future CAPI v0.2.x release. @vivgoyal PTAL and let us know if you think that's sufficient. Thanks!

vivgoyal · 2019-10-02T17:18:37Z

Looks good to close. I haven't tested it yet, though. If I find any issues, I can always open a new one.

ncdc · 2019-10-02T17:19:21Z

Let's leave it open until you can test it with CAPI v0.2.3 or newer (v1alpha2)

liztio · 2019-10-11T16:36:09Z

/priority awaiting-evidence

k8s-ci-robot · 2019-10-11T16:36:11Z

@liztio: The label(s) priority/awaiting-evidence cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/priority awaiting-evidence

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

liztio · 2019-10-11T16:37:49Z

/priority awaiting-more-evidence

fejta-bot · 2020-01-09T16:41:00Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

vincepri · 2020-01-10T17:47:24Z

This should have been fixed

/close

k8s-ci-robot · 2020-01-10T17:47:26Z

@vincepri: Closing this issue.

In response to this:

This should have been fixed

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 29, 2019

detiber added this to the v0.3.x (v1alpha1) milestone Aug 30, 2019

k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 30, 2019

ncdc mentioned this issue Sep 26, 2019 in "Machine deletion: try up to n times to delete the Node, then move on" kubernetes-sigs/cluster-api#1446 (Closed)

ncdc modified the milestones: v0.3.x, v0.4.x Oct 10, 2019

k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Oct 11, 2019

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2020

k8s-ci-robot closed this as completed Jan 10, 2020
