
Machine deletion: try up to n times to delete the Node, then move on #1446

Closed · ncdc opened this issue Sep 26, 2019 · 13 comments · Fixed by #1452
Labels: help wanted · kind/bug · priority/important-soon
Milestone: v0.2.x

ncdc (Contributor) commented Sep 26, 2019

What steps did you take and what happened:

  1. Do something catastrophic such as manually deleting a CAPA cluster's ELB
  2. kubectl delete machine/foo
  3. Error in the logs (note: this is from v1alpha1, but the issue is still present in v1alpha2):

I0814 17:10:39.243788       1 instances.go:67] [machine-actuator]/cluster.k8s.io/v1alpha1/4d95faba9cb7ee388671ac3cef6ee79b39c25f15/bf038fa5/worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh "level"=2 "msg"="Looking for existing machine instance by tags"
I0814 17:10:39.288157       1 machine_controller.go:181] Deleting node "ip-10-0-0-20.us-west-2.compute.internal" for machine "worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh"
E0814 17:10:39.301721       1 machine_controller.go:183] Error deleting node "ip-10-0-0-20.us-west-2.compute.internal" for machine "worker-bf038fa5-nodepool-bf038fa5default-6c86565b8rvqdh": Delete https://bf038fa5-apiserver-261840232.us-west-2.elb.amazonaws.com:6443/api/v1/nodes/ip-10-0-0-20.us-west-2.compute.internal: dial tcp: lookup bf038fa5-apiserver-261840232.us-west-2.elb.amazonaws.com on 10.96.0.10:53: no such host

  4. The Machine is never deleted

What did you expect to happen:
This error shouldn't block Machine deletion.

Anything else you would like to add:
I think it would be reasonable to attempt to delete the Node multiple times over the span of 30-60 seconds. If the deletion fails, we can record an Event, then allow the Machine deletion to continue.

Environment:

  • Cluster-api version: v0.1.x and v0.2.x

/kind bug

xref kubernetes-sigs/cluster-api-provider-aws#1084 (comment) and my next comment as well

@ncdc added the help wanted and priority/important-soon labels Sep 26, 2019
@ncdc added this to the v0.2.x milestone Sep 26, 2019
@k8s-ci-robot added the kind/bug label Sep 26, 2019
tahsinrahman (Contributor)

/assign

tahsinrahman (Contributor)

vincepri (Member)

That seems like an AWS-specific function; for this we should probably just retry a fixed number of times. We should be able to use https://github.com/kubernetes/apimachinery/blob/master/pkg/util/wait/wait.go#L333 to achieve this, wdyt?
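
(The helper at that line is presumably one of apimachinery's polling functions; below is a minimal sketch assuming wait.PollImmediate, which runs a condition immediately, then on an interval, until it succeeds or a timeout elapses.)

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	attempts := 0
	// Run the condition immediately, then every 2s, giving up after 10s total.
	err := wait.PollImmediate(2*time.Second, 10*time.Second, func() (bool, error) {
		attempts++
		// Returning (false, nil) means "not done yet, keep polling";
		// returning a non-nil error aborts the poll right away.
		return attempts >= 3, nil
	})
	fmt.Printf("finished after %d attempts, err=%v\n", attempts, err)
}
```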

tahsinrahman (Contributor)

Yes, I'm actually talking about this!

tahsinrahman (Contributor)

Any suggestions for the interval and timeout durations?

vincepri (Member)

Sounds good, I saw the linked AWS code and was confused :D

I'd retry maybe every 2 seconds, for a max of 10 seconds? @ncdc

ncdc (Contributor, Author) commented Sep 26, 2019

Do you think it makes sense to try for up to either 30 or 60 seconds? Or is it more likely that if it fails once, it will probably fail every time, in which case trying for that long is an unnecessary delay?

ncdc (Contributor, Author) commented Sep 26, 2019

Or should we just try once and not even bother with any more attempts?

vincepri (Member)

A single failure might be temporary. I'd limit the retries to 10-15 seconds; if it keeps failing for that long, there's a good chance we won't be able to reach it at all.

ncdc (Contributor, Author) commented Sep 26, 2019

Ok, I'm good with interval=2s, timeout=10s
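
(A minimal sketch of the behavior agreed on here. deleteNode is a hypothetical stand-in for the controller's call to the workload cluster's API server, not the actual cluster-api helper:)

```go
package machine

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	nodeDeleteRetryInterval = 2 * time.Second
	nodeDeleteRetryTimeout  = 10 * time.Second
)

// deleteNodeWithRetry retries Node deletion every 2s, for up to 10s total.
func deleteNodeWithRetry(ctx context.Context, nodeName string, deleteNode func(context.Context, string) error) error {
	return wait.PollImmediate(nodeDeleteRetryInterval, nodeDeleteRetryTimeout, func() (bool, error) {
		if err := deleteNode(ctx, nodeName); err != nil {
			// Swallow the error so PollImmediate keeps retrying until the
			// timeout; a non-nil error here would abort the poll immediately.
			return false, nil
		}
		return true, nil
	})
}
```

If this returns wait.ErrWaitTimeout, the caller would record an Event on the Machine and allow the deletion to continue, rather than blocking forever on an unreachable Node.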

tahsinrahman (Contributor)

Should we do the same for the bastion?

ncdc (Contributor, Author) commented Sep 26, 2019

The bastion is CAPA specific and is unrelated to this issue (the bastion doesn't have a corresponding Kubernetes Node).

tahsinrahman (Contributor)

Oops, I mistook this for a CAPA issue; that's why I was linking the CAPA function :(
