Identifying cloud provider deleted nodes #5054
Conversation
/assign @x13n
/assign @MaciekPytel
// Seek nodes that may have been deleted since last update
// cloudProviderNodeInstances are retrieved by nodeGroup,
// so nodes that are not autoscaled will not be included
for _, instances := range csr.cloudProviderNodeInstances {
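For context, a minimal, self-contained sketch of the kind of diff this loop performs (the function and parameter names are illustrative, not the actual ClusterStateRegistry fields, and instances are reduced to plain string IDs):

```go
package main

import "fmt"

// findDeletedCandidates returns the IDs of instances that the cloud provider
// reported on the previous update but no longer reports now, while the
// corresponding node is still registered in Kubernetes.
func findDeletedCandidates(previous, current map[string][]string, registered map[string]bool) []string {
	// Collect every instance the cloud provider reports right now.
	currentSet := map[string]bool{}
	for _, instances := range current {
		for _, id := range instances {
			currentSet[id] = true
		}
	}
	var deleted []string
	// Instances are cached per node group, so nodes that are not autoscaled
	// never show up here and cannot be flagged by mistake.
	for _, instances := range previous {
		for _, id := range instances {
			if !currentSet[id] && registered[id] {
				deleted = append(deleted, id)
			}
		}
	}
	return deleted
}

func main() {
	previous := map[string][]string{"ng-1": {"node-a", "node-b"}}
	current := map[string][]string{"ng-1": {"node-a"}}
	registered := map[string]bool{"node-a": true, "node-b": true}
	fmt.Println(findDeletedCandidates(previous, current, registered)) // [node-b]
}
```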
Won't relying on previous state cause issues in case of CA restart? IIUC some nodes will be incorrectly considered as Ready/Unready/NotStarted, potentially leading to bad scaling decisions.
That would definitely occur on a restart. Would it make sense to create a new taint and add it onto the nodes? There might be some additional overhead for the implementation, but should prevent that scenario.
I also thought about adding a check for each node in the deletedNodes (lines 1001 - 1006) to confirm it doesn't appear in the cloud provider nodes. I don't have enough insight to know whether that is a case that could actually happen, though. I wouldn't want to keep flagging a node as deleted if the cloud provider node exists.
I don't particularly like the idea of introducing more taints to carry over CA internal state between restarts. That feels brittle and complex. I think maybe it is ok to just leave it as is: if the instance was deleted right before CA restart, there will be a slight risk of not scaling up until k8s API catches up with the deletion. This should be fine.
Re-checking deletedNodes would catch the case in which an instance disappears from the cloud provider due to some temporary problem and then comes back. Indeed, right now we would treat it as deleted forever, and it would be better to avoid code that cannot recover from such a scenario (even though it is unlikely).
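A rough sketch of the re-check being discussed, assuming a deletedNodes set keyed by node name and a set of instances currently reported by the cloud provider (both names are illustrative, not the code in this PR):

```go
package main

import "fmt"

// pruneRecoveredNodes drops entries from deletedNodes whose instance has
// reappeared in the cloud provider listing, so a temporary listing problem
// does not mark a node as deleted forever.
func pruneRecoveredNodes(deletedNodes map[string]struct{}, currentInstances map[string]bool) {
	for name := range deletedNodes {
		if currentInstances[name] {
			delete(deletedNodes, name)
		}
	}
}

func main() {
	deleted := map[string]struct{}{"node-b": {}}
	pruneRecoveredNodes(deleted, map[string]bool{"node-b": true}) // node-b came back
	fmt.Println(len(deleted))                                     // 0
}
```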
That makes sense to me. I agree that the API should eventually correct itself, and the scaling should only momentarily be impacted. I've created a new commit that includes the backfill safety check mentioned and some other minor cosmetic code changes.
Thanks for the changes!
I just realized one more problem with this approach, though. Since we cannot tell the difference between nodes that were deleted and nodes that are not autoscaled, this implementation is going to treat nodes as deleted whenever someone opts a node group out of autoscaling. As a result, CA may trigger large scale ups without a good reason. One way to handle this would be to add a TTL to deletedNodes to avoid keeping nodes there forever in this scenario. (Perhaps an even better option would be some extension of the cloud provider API to distinguish deleted from not autoscaled, but I'm not sure that's feasible.) WDYT?
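For illustration only, one shape the TTL idea could take; this is not what ultimately landed in the PR, and the type and function names are made up:

```go
package main

import (
	"fmt"
	"time"
)

// deletedNodeEntry records when a node was first considered deleted.
type deletedNodeEntry struct {
	firstSeen time.Time
}

// expireDeletedNodes drops entries that have been flagged for longer than ttl,
// so nodes from a node group that was merely opted out of autoscaling do not
// stay flagged forever and trigger unnecessary scale ups.
func expireDeletedNodes(deletedNodes map[string]deletedNodeEntry, ttl time.Duration, now time.Time) {
	for name, entry := range deletedNodes {
		if now.Sub(entry.firstSeen) > ttl {
			delete(deletedNodes, name)
		}
	}
}

func main() {
	now := time.Now()
	deleted := map[string]deletedNodeEntry{"node-b": {firstSeen: now.Add(-time.Hour)}}
	expireDeletedNodes(deleted, 30*time.Minute, now)
	fmt.Println(len(deleted)) // 0: the entry aged out
}
```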
That is definitely a concern. I can add a TTL to the deletedNodes entries. Should I add a configurable value for better control, or just a common default value? I had a similar thought about extending the cloud provider implementation. After a brief investigation, it looks like most of the interactions are handled by using information in the node (the …
I think extending the API would be much cleaner, but the need to implement it for all cloud providers calls for a broader discussion. I added this topic to the SIG meeting agenda to discuss it on Monday.
So I guess the conclusion after yesterday's SIG meeting is that we should extend the cloud provider interface and preserve the existing (taint-based) behavior in case a specific cloud provider doesn't implement the new function. That way each cloud provider will be able to fix the bug by implementing a function distinguishing deleted from non-autoscaled nodes.
Should this PR be closed until the cloud provider interface can be extended? What would be the next steps on fixing the bug in the future? Please let me know if there's anything I can offer for assistance. Thanks!
The cloud provider interface can be extended right now; it's just that all implementations would have to get a default implementation that returns a NotImplemented error. I thought this PR could be re-purposed into extending the interface and using it to detect deleted nodes. Each cloud provider would then need to implement their part in a separate PR. Are you willing to make that first, cloud provider agnostic, change?
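A rough, simplified sketch of the pattern being agreed on here: a new interface method whose default implementation returns a NotImplemented error, with the caller falling back to the taint-based behavior. The method name follows what the later commits call HasInstance; everything else below is illustrative rather than the real cluster-autoscaler types:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotImplemented signals that a cloud provider has not implemented the new
// method yet, so the caller should keep the existing taint-based behavior.
var errNotImplemented = errors.New("not implemented")

// CloudProvider is a trimmed-down stand-in for the real interface; only the
// new method is shown, and the node is reduced to its name.
type CloudProvider interface {
	HasInstance(nodeName string) (bool, error)
}

// legacyProvider represents a provider that has not implemented HasInstance.
type legacyProvider struct{}

func (legacyProvider) HasInstance(string) (bool, error) {
	return false, errNotImplemented
}

// nodeIsDeleted prefers the provider's answer and falls back to the previous
// taint-based detection when the provider cannot answer.
func nodeIsDeleted(p CloudProvider, nodeName string, hasDeletionTaint bool) bool {
	hasInstance, err := p.HasInstance(nodeName)
	if err != nil {
		// errNotImplemented (or any other error) preserves the old behavior.
		return hasDeletionTaint
	}
	return !hasInstance
}

func main() {
	fmt.Println(nodeIsDeleted(legacyProvider{}, "node-a", false)) // false: taint fallback
}
```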
Got it! I'll start working on a pass at the implementation and submit it as soon as I can. Thanks again!
If this requires more time, maybe it would make sense to first submit the change to taint cleanup (ready -> all nodes) on startup separately?
Sorry for the delay here. I've split the taint removal into another PR #5200, and will continue working on the effort here. Please let me know if there's anything else needed. Thanks!
Thanks! The other PR already got merged.
Force-pushed from 48b6328 to f445a6a
…d provider that are still registered within Kubernetes. Avoids misidentifying not autoscaled nodes as deleted. Simplified implementation to use apiv1.Node instead of new struct. Expanded test cases to include not autoscaled nodes and tracking deleted nodes over multiple updates. Adding check to backfill loop to confirm cloud provider node no longer exists before flagging the node as deleted. Modifying some comments to be more accurate. Replacing erroneous line deletion.
* Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or are not-autoscaled. Updated cloud providers to provide initial implementation of new method that will return an ErrNotImplemented to maintain existing taint-based deletion clusterstate calculation.
Thanks for the changes! My main concern is whether simplifying the logic is feasible - I think some of the previous approach got carried over, but isn't really necessary with the new interface.
…eRegistry, and remove old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance.
…tance. Changing deletedNodes to store empty struct instead of node values, and modifying the helper function to utilize that information for tests.
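For readers unfamiliar with the pattern mentioned in that commit, a tiny illustration of storing an empty struct instead of node values so the map acts as a plain set (identifiers are illustrative):

```go
package main

import "fmt"

func main() {
	// struct{} occupies no space; only membership in the map matters.
	deletedNodes := make(map[string]struct{})
	deletedNodes["node-b"] = struct{}{} // mark node-b as deleted

	_, isDeleted := deletedNodes["node-b"]
	fmt.Println(isDeleted) // true
}
```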
Thanks for following up on this! /lgtm
Hm, looks like I should be able to approve this now as well, let's see: /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: fookenc, x13n. The full list of commands accepted by this bot can be found here. The pull request process is described here.
@fookenc Will this release fix be backported to previous CA versions (including 1.22)?
Unfortunately, this release alone won't fix the issues until the cloud providers have implemented the new interface method. Once they have, it could be backported, but I don't know how soon that will occur. @x13n do you know if this could be backported to earlier CA versions?
1.22 is past the k8s support window at this point, so unless there's a critical bug to fix I don't think any backport is going to be released.
- this is a follow-up to kubernetes#5054 - this might fix kubernetes#4456 Signed-off-by: vadasambar <[email protected]>
- this is a follow-up to kubernetes#5054 - this might fix kubernetes#4456 Signed-off-by: vadasambar <[email protected]> fix: make `HasInstance` in aws provider thread-safe Signed-off-by: vadasambar <[email protected]>
- this is a follow-up to kubernetes#5054 - this might fix kubernetes#4456 Signed-off-by: vadasambar <[email protected]> fix: make `HasInstance` in aws provider thread-safe Signed-off-by: vadasambar <[email protected]> (cherry picked from commit 1cb55fe)
Which component this PR applies to?
cluster-autoscaler
What type of PR is this?
/kind bug
What this PR does / why we need it:
Attempts to address the issue raised in #5022 from the previously approved PR. The original implementation (previous work was reverted in #4896 & #5023) changed how Deleted nodes are determined in the cluster state. The new iteration utilizes the previously cached cloud provider nodes in the ClusterStateRegistry to perform a diff. Cloud provider node instances are retrieved by node group, which should prevent not-autoscaled nodes from being flagged erroneously. The test case has been modified to include this scenario. An additional scenario has been included to ensure that previously identified cloud provider nodes will still be tracked until they are no longer registered in Kubernetes.
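A minimal sketch of that tracking behavior, assuming a deletedNodes set and the set of node names currently registered in Kubernetes (both names are illustrative, not the actual fields):

```go
package main

import "fmt"

// retainWhileRegistered keeps a node in the deleted set only while Kubernetes
// still has the node object registered; once the API server drops it there is
// nothing left to track.
func retainWhileRegistered(deletedNodes map[string]struct{}, registered map[string]bool) {
	for name := range deletedNodes {
		if !registered[name] {
			delete(deletedNodes, name)
		}
	}
}

func main() {
	deleted := map[string]struct{}{"node-b": {}}
	retainWhileRegistered(deleted, map[string]bool{"node-a": true}) // node-b gone from k8s
	fmt.Println(len(deleted))                                       // 0
}
```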
This PR includes the initial work first introduced in #4211. Feedback from that PR indicated that the original approach to determining deleted nodes was incorrect, which led to the issues reported by other users: nodes tainted with ToBeDeleted were misidentified as Deleted instead of Ready/Unready, which caused them to be miscounted as Upcoming. This caused the problems described in #3949 and #4456.
Which issue(s) this PR fixes:
Fixes #3949, #4456, #5022
Special notes for your reviewer:
Does this PR introduce a user-facing change?
NONE
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: