Wait for underlying instances to be terminated before removing node finalizers in termination #947
Comments
No, in the Azure provider, `cloudProvider.Delete()` currently waits for instance termination, so when it returns the instance is already terminated. Early Node deletion is therefore not currently an issue in the Azure provider. Independently of this specific issue: yes, we would be interested in
Now, whether controller retries (provider-controlled or not) are an appropriate solution to the general issue of early Node deletion, I am not sure, but it seems like it would work. It would not currently affect the AKS provider.
Controlling retries seems independent of being able to determine the state of an instance. Generally, it would be good to avoid introducing instance-state awareness into core, as the states differ across providers and the corresponding state machines are their own cans of worms.
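For illustration, here is a minimal, self-contained Go sketch of the "Delete waits for termination" behavior described above, where all instance-state handling stays inside the provider. The `fakeCloud` type, state names, and poll interval are hypothetical stand-ins, not the actual Azure provider code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// instanceState and fakeCloud are hypothetical stand-ins for a real cloud SDK.
type instanceState string

const (
	stateTerminating instanceState = "terminating"
	stateTerminated  instanceState = "terminated"
)

type fakeCloud struct{ pollsUntilGone int }

func (c *fakeCloud) beginTerminate(id string) {}

func (c *fakeCloud) describe(id string) instanceState {
	if c.pollsUntilGone <= 0 {
		return stateTerminated
	}
	c.pollsUntilGone--
	return stateTerminating
}

// Delete requests termination and then polls until the instance is fully
// terminated, so the caller can safely remove finalizers once it returns nil.
// The instance state machine never leaks out of the provider.
func (c *fakeCloud) Delete(ctx context.Context, id string) error {
	c.beginTerminate(id)
	ticker := time.NewTicker(100 * time.Millisecond) // illustrative poll interval
	defer ticker.Stop()
	for {
		if c.describe(id) == stateTerminated {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for instance termination")
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	cloud := &fakeCloud{pollsUntilGone: 3}
	if err := cloud.Delete(ctx, "instance-123"); err != nil {
		panic(err)
	}
	fmt.Println("Delete returned: instance fully terminated, safe to remove finalizer")
}
```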
👋 Hey! The initially approved fix seems to have been reverted; are there plans to address the issue in a different way?
Yeah, we reverted the change after realizing that we couldn't continually call TerminateInstances against the EC2 API, because doing so would affect the number of Write TPS that we perform against it. On top of this, the change increased the amount of time it takes all of our NodeClaims and Nodes to terminate, which affected the test times we rely on in the AWS provider repo to validate the change. We're still working on doing some more testing around this feature before we can formally merge it (the change is now being tracked in #1195). I'll reopen this for now since the feature hasn't been merged after the revert.
This issue is currently awaiting triage. If Karpenter contributors determine this is a relevant issue, they will accept it by applying the appropriate label.
The original PR has now been merged.
@jmdeal: Closing this issue.
Description
What problem are you trying to solve?
The termination controller calls `cloudProvider.Delete(nodeClaim)`, which deletes the underlying instance, and it removes the finalizer once the `cloudProvider.Delete()` call succeeds.
Within AWS, Karpenter currently treats the delete as successful and proceeds to remove the finalizer once the `ec2.TerminateInstances()` call has succeeded, which can result in nodes being deregistered before the instance reaches a `Terminated` state (as opposed to the `Terminating` state it first enters when the API call succeeds).
(@tallaxes please correct me if I'm wrong) Within Azure, Karpenter models the return value of the `cloudProvider.Delete()` call with a retryable error, so Karpenter will continue to call `cloudProvider.Delete()`, which will eventually succeed once the instance is terminated.
One way this could be implemented generically is by introducing a `cloudProvider.RetryableError()` that allows Karpenter to recognize the state of an instance whose termination has been successfully requested but not yet completed, re-enqueueing the node into the termination controller's reconciliation loop.
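As a rough illustration of that idea, the following self-contained Go sketch models a hypothetical retryable-error wrapper and the requeue branch a termination controller could take. The names (`RetryableError`, `IsRetryable`, `reconcileTermination`) and the requeue interval are assumptions made for this sketch, not Karpenter's actual API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryableError is a hypothetical wrapper a cloud provider could use to mark
// "the operation was accepted but is not finished" (e.g. instance still Terminating).
type retryableError struct{ err error }

func (e *retryableError) Error() string { return e.err.Error() }
func (e *retryableError) Unwrap() error { return e.err }

// RetryableError wraps err so core can detect it without knowing instance states.
func RetryableError(err error) error { return &retryableError{err: err} }

// IsRetryable reports whether err (or anything it wraps) was marked retryable.
func IsRetryable(err error) bool {
	var re *retryableError
	return errors.As(err, &re)
}

// CloudProvider is a trimmed-down stand-in for the real provider interface.
type CloudProvider interface {
	Delete(ctx context.Context, nodeClaimName string) error
}

// result mimics a controller requeue decision.
type result struct {
	requeueAfter time.Duration
	done         bool
}

// reconcileTermination sketches the finalizer decision in the termination controller.
func reconcileTermination(ctx context.Context, cp CloudProvider, nodeClaim string) (result, error) {
	if err := cp.Delete(ctx, nodeClaim); err != nil {
		if IsRetryable(err) {
			// Termination was requested but has not completed;
			// keep the finalizer and check again shortly.
			return result{requeueAfter: 10 * time.Second}, nil
		}
		return result{}, err // unexpected failure; surface it
	}
	// Instance is fully terminated; safe to remove the finalizer now.
	fmt.Printf("removing finalizer from %s\n", nodeClaim)
	return result{done: true}, nil
}

// fakeProvider simulates an instance that finishes terminating on the third poll.
type fakeProvider struct{ calls int }

func (f *fakeProvider) Delete(ctx context.Context, name string) error {
	f.calls++
	if f.calls < 3 {
		return RetryableError(errors.New("instance still terminating"))
	}
	return nil
}

func main() {
	cp := &fakeProvider{}
	for {
		res, err := reconcileTermination(context.Background(), cp, "nodeclaim-abc")
		if err != nil {
			panic(err)
		}
		if res.done {
			break
		}
		fmt.Printf("requeue after %s\n", res.requeueAfter)
	}
}
```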
How important is this feature to you?
Important, as currently any pods that tolerate the Karpenter NoSchedule taint are leaked and never given a chance to clean up. In addition, other processes may be interfered with if the node is deregistered prematurely.