
Wait for underlying instances to be terminated before removing node finalizers in termination #947

Closed
njtran opened this issue Jan 19, 2024 · 6 comments · Fixed by #1164
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@njtran
Contributor

njtran commented Jan 19, 2024

Description

What problem are you trying to solve?
The termination controller calls cloudProvider.Delete(nodeClaim), which deletes the underlying instance, and removes the node finalizer once the cloudProvider.Delete() call succeeds (code).

Within AWS, Karpenter currently treats the delete as successful and proceeds to remove the finalizer once the ec2.TerminateInstances() call has succeeded, which can result in nodes being deregistered before the instance reaches the Terminated state (as opposed to the Terminating state it first enters when the API call succeeds).
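
To make the ordering concrete, here is a minimal sketch of the flow described above (not the actual Karpenter source; the interface and helper names are simplified stand-ins):

```go
package termination

import "context"

// CloudProvider is a trimmed-down stand-in for Karpenter's cloud provider
// interface; only Delete matters for this sketch.
type CloudProvider interface {
	Delete(ctx context.Context, nodeClaimName string) error
}

// finalize mirrors the current flow: delete the instance, then immediately
// remove the Node finalizer. With the AWS provider, Delete returns as soon as
// ec2.TerminateInstances succeeds, i.e. while the instance is still in the
// Terminating state, so the Node can be deregistered before termination completes.
func finalize(ctx context.Context, cp CloudProvider, removeFinalizer func(context.Context) error, nodeClaimName string) error {
	if err := cp.Delete(ctx, nodeClaimName); err != nil {
		return err
	}
	// Nothing here waits for the instance to reach the Terminated state.
	return removeFinalizer(ctx)
}
```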

(@tallaxes please correct me if I'm wrong) Within Azure, Karpenter models the return value of the cloudProvider.Delete() call as a retryable error, so Karpenter continues to call cloudProvider.Delete() until it eventually succeeds once the instance is terminated.

One way this could be implemented generically is by creating a cloudProvider.RetryableError() that lets Karpenter recognize an instance on which terminate has been called successfully but which hasn't fully terminated yet, re-enqueueing the node into the termination controller's reconciliation loop.
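
A rough sketch of what such a retryable error could look like, assuming a hypothetical cloudprovider package API (none of this exists in Karpenter today):

```go
package cloudprovider

import (
	"errors"
	"time"
)

// RetryableError is a hypothetical error a cloud provider could return from
// Delete when termination has been initiated but the instance is not yet gone.
type RetryableError struct {
	// RequeueAfter suggests how long the termination controller should wait
	// before calling Delete again.
	RequeueAfter time.Duration
}

func (e *RetryableError) Error() string {
	return "instance termination still in progress"
}

// IsRetryable lets the termination controller decide whether to keep the node
// finalizer in place and re-enqueue instead of treating Delete as complete.
func IsRetryable(err error) (time.Duration, bool) {
	re := &RetryableError{}
	if errors.As(err, &re) {
		return re.RequeueAfter, true
	}
	return 0, false
}
```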

How important is this feature to you?
Important, as currently any pods that tolerate the Karpenter NoSchedule taint are leaked and are never given a chance to clean up. In addition, there may be other processes that are interfered with if the node is deregistered prematurely.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@njtran added the kind/feature and needs-triage labels Jan 19, 2024
@tallaxes

Within Azure, Karpenter models the return value of the cloudProvider.Delete() call as a retryable error, so Karpenter continues to call cloudProvider.Delete() until it eventually succeeds once the instance is terminated.

No, in the Azure provider, cloudProvider.Delete() currently waits for instance termination, so when it returns the instance is already terminated. Early Node deletion is therefore not currently an issue in the Azure provider.
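
For contrast with the AWS behavior described in the issue, that synchronous pattern could be sketched roughly like this (illustrative only, not the AKS provider source; both function arguments are hypothetical stand-ins for the provider's SDK calls):

```go
package instance

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// deleteAndWait initiates instance deletion and only returns once the
// instance no longer exists, so the caller (cloudProvider.Delete) never
// reports success for an instance that is still terminating.
func deleteAndWait(ctx context.Context,
	deleteInstance func(ctx context.Context) error,
	instanceExists func(ctx context.Context) (bool, error)) error {
	if err := deleteInstance(ctx); err != nil {
		return err
	}
	// Poll until the instance is gone; retries happen inside the provider.
	return wait.PollUntilContextCancel(ctx, 5*time.Second, true, func(ctx context.Context) (bool, error) {
		exists, err := instanceExists(ctx)
		if err != nil {
			return false, err
		}
		return !exists, nil
	})
}
```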

Independently of this specific issue, yes, we would be interested in:

  • Considering letting the CP affect controller reconciliation retries. This would be useful primarily to eliminate some provider-internal retries and to have all retries driven by controllers. Being able to specify RequeueAfter is especially useful in this context, as in some cases (e.g. API throttling) a delay is desirable (e.g. based on "Retry-After") rather than an immediate requeue; see the sketch after this list.
  • More generally, a better understanding of the possible sync/async patterns and expectations around the Karpenter CP API. Right now it seems the AWS provider is on the side of "fast, but maybe the operation did not complete" (e.g. this issue) and the AKS provider is on the side of "complete, but could take a long time (and will retry internally)".
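
As a sketch of the first point, assuming a hypothetical ThrottledError type and a standard controller-runtime reconcile.Result (neither is existing Karpenter API):

```go
package termination

import (
	"errors"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// ThrottledError is a hypothetical provider error carrying a server-supplied
// Retry-After hint, e.g. from a throttled cloud API response.
type ThrottledError struct {
	RetryAfter time.Duration
}

func (e *ThrottledError) Error() string { return "cloud provider API throttled" }

// resultForDeleteError maps a cloudProvider.Delete error to a reconcile
// result, letting the provider drive the controller's retry timing instead
// of retrying internally.
func resultForDeleteError(err error) (reconcile.Result, error) {
	te := &ThrottledError{}
	if errors.As(err, &te) {
		// Back off for the interval the cloud API asked for rather than
		// requeueing immediately.
		return reconcile.Result{RequeueAfter: te.RetryAfter}, nil
	}
	return reconcile.Result{}, err
}
```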

Now, whether controller retries (provider-controlled or not) are an appropriate solution to the issue of early Node deletion in general, I am not sure, but it seems like it would work. It would not currently affect the AKS provider.

creating a cloudProvider.RetryableError() that lets Karpenter recognize an instance

It seems like controlling retries would be independent of being able to determine the state of an instance. Generally, it would be good to avoid introducing instance-state awareness into core, as instance states differ across providers, and the corresponding state machines are their own cans of worms.

@njtran removed the needs-triage label Jan 22, 2024
@raychinov

👋 Hey! The initially approved fix seems to have been reverted; are there plans to address the issue in a different way?

@jonathan-innis
Member

jonathan-innis commented Jun 4, 2024

Yeah, we reverted the change after realizing that we couldn't continually call TerminateInstances against the EC2 API, because doing so would increase the write TPS we perform against it. On top of that, the change increased the time it takes all of our NodeClaims and Nodes to terminate, which slowed down the tests we run on the AWS provider repo to validate the change.

We're still working on more testing around this feature before we can formally merge it (the change is now being tracked in #1195). I'll reopen this for now since the feature hasn't been merged after the revert.

@jonathan-innis reopened this Jun 4, 2024
@k8s-ci-robot added the needs-triage label Jun 4, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jmdeal
Member

jmdeal commented Jun 17, 2024

The original PR has now been merged
/close

@k8s-ci-robot
Contributor

@jmdeal: Closing this issue.

In response to this:

The original PR has now been merged
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
