Machine deletions fails if there is no ELB #1084

Closed
vivgoyal opened this issue Aug 29, 2019 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@vivgoyal

vivgoyal commented Aug 29, 2019

/kind bug

What steps did you take and what happened:
I deleted the cluster's AWS resources manually and then initiated machine and cluster deletion from CAPA.
As a result, deletion of the machine deployments fails continuously with the following error:

I0814 17:10:39.243788       1 instances.go:67] [machine-actuator]/cluster.k8s.io/v1alpha1/4d95faba9cb7ee388671ac3cef6ee79b39c25f15/bf038fa5/worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh "level"=2 "msg"="Looking for existing machine instance by tags"  
I0814 17:10:39.288157       1 machine_controller.go:181] Deleting node "ip-10-0-0-20.us-west-2.compute.internal" for machine "worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh"
E0814 17:10:39.301721       1 machine_controller.go:183] Error deleting node "ip-10-0-0-20.us-west-2.compute.internal" for machine "worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh": Delete https://bf038fa5-apiserver-261840232.us-west-2.elb.amazonaws.com:6443/api/v1/nodes/ip-10-0-0-20.us-west-2.compute.internal: dial tcp: lookup bf038fa5-apiserver-261840232.us-west-2.elb.amazonaws.com on 10.96.0.10:53: no such host

What did you expect to happen:
I expected the deletions to succeed, since the resources are not found.

Anything else you would like to add:
Also, if I delete the ELB, it could be recreated, unless the Cluster is being deleted. That said, recreating it would mean a different DNS name, which implies that anything referencing the old DNS name would need to be updated, and that may not happen automatically. That includes the kubeconfig secret but, more importantly, the client config/kubeadm config for all of the existing Machines in the cluster.

Environment:

  • Cluster-api-provider-aws version: v0.3.7
  • Kubernetes version: (use kubectl version): 1.14.1
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 29, 2019
@vincepri
Member

How would we connect to the cluster without an elb? 🤔

@vivgoyal
Author

@vincepri Should we reconcile the ELB on "NotFound" and then try the deletion?
What is more worrying, though, is the part I mentioned under "Anything else you would like to add".

@detiber detiber added this to the v0.3.x (v1alpha1) milestone Aug 30, 2019
@detiber
Member

detiber commented Aug 30, 2019

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 30, 2019
@ncdc
Contributor

ncdc commented Sep 12, 2019

It will be difficult for us to account for all situations where someone manually manipulates the AWS resources that CAPA creates and manages. If you make changes (edits or deletes) to these resources, it's probably up to you to resolve any issues, such as this one. I'm inclined to close this. WDYT @detiber @vincepri @sethp-nr @rudoi?

@ncdc
Contributor

ncdc commented Sep 12, 2019

I would also be willing to accept a code change to CAPI that attempts to delete the node up to n times and then gives up, without erroring.

@detiber
Member

detiber commented Sep 12, 2019

@ncdc I'm a bit torn on it. There are potentially things we could do to work around the load balancer being a blocker like it is today, such as using a static DNS entry for the apiserver endpoint rather than the LB DNS name. I'm not sure that's something we can do in the short term, but longer term I don't think this should fail in a catastrophic way as it does today.

@ncdc
Contributor

ncdc commented Sep 12, 2019

How about if we start with my suggestion about not failing on node deletion and see what else trips us up, if anything?

@detiber
Member

detiber commented Sep 12, 2019

> How about if we start with my suggestion about not failing on node deletion and see what else trips us up, if anything?

In the general case of worker nodes, that should probably be fine.

@ncdc
Contributor

ncdc commented Sep 12, 2019

I'm not sure it makes sense to distinguish?

@detiber
Member

detiber commented Sep 12, 2019

Today probably not, but when we have control plane management, we'd need to ensure that we handle etcd membership properly when deleting a control plane node, which, based on the model we are using with cluster-api-upgrade-tool, would require apiserver access.

@ncdc
Contributor

ncdc commented Sep 26, 2019

I filed kubernetes-sigs/cluster-api#1446 to try to delete the Node multiple times, then move on without considering it an error.

@ncdc
Contributor

ncdc commented Sep 27, 2019

The node deletion issue was fixed by kubernetes-sigs/cluster-api#1452, which will be in a future CAPI v0.2.x release. @vivgoyal PTAL and let us know if you think that's sufficient. Thanks!

@vivgoyal
Author

vivgoyal commented Oct 2, 2019

Looks good to close. I haven't tested it yet, though. If I find any issues, I can always open a new one.

@ncdc
Contributor

ncdc commented Oct 2, 2019

Let's leave it open until you can test it with CAPI v0.2.3 or newer (v1alpha2).

@ncdc ncdc modified the milestones: v0.3.x, v0.4.x Oct 10, 2019
@liztio
Contributor

liztio commented Oct 11, 2019

/priority awaiting-evidence

@k8s-ci-robot
Contributor

@liztio: The label(s) priority/awaiting-evidence cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other


@liztio
Contributor

liztio commented Oct 11, 2019

/priority awaiting-more-evidence

@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Oct 11, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2020
@vincepri
Member

This should have been fixed

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.


7 participants