Identifying cloud provider deleted nodes #5054
Won't relying on previous state cause issues in case of CA restart? IIUC some nodes will be incorrectly considered as Ready/Unready/NotStarted, potentially leading to bad scaling decisions.
That would definitely occur on a restart. Would it make sense to create a new taint and add it onto the nodes? There might be some additional overhead for the implementation, but it should prevent that scenario.

I also thought about adding a check for each node in deletedNodes (lines 1001 - 1006) to confirm it doesn't appear in the cloud provider nodes. I don't have good enough insight to know whether that is a case that could actually happen, but I wouldn't want to keep flagging a node as deleted if the cloud provider node exists.
I don't particularly like the idea of introducing more taints to carry CA internal state over between restarts. That feels brittle and complex. I think it is probably ok to just leave it as is: if the instance was deleted right before a CA restart, there will be a slight risk of not scaling up until the k8s API catches up with the deletion. This should be fine.

Re-checking deletedNodes would catch the case in which an instance disappears from the cloud provider due to some temporary problem and then comes back. As it stands, we would treat it as deleted forever, and it would be better to avoid code that cannot recover from such a scenario (even though it is unlikely).
That makes sense to me. I agree that the API should eventually correct itself, and the scaling should only momentarily be impacted. I've created a new commit that includes the backfill safety check mentioned and some other minor cosmetic code changes.
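For illustration, a minimal Go sketch of the kind of re-check discussed above: before continuing to treat a node as deleted, confirm its instance is still absent from the cloud provider's current node listing. The names here (pruneDeletedNodes, deletedNodes, cloudProviderNodeNames) are hypothetical and do not correspond to the actual code in this PR.

```go
package main

import "fmt"

// pruneDeletedNodes drops entries from deletedNodes whose instances have
// reappeared in the cloud provider, so a temporary listing glitch does not
// leave a node flagged as deleted forever. Both maps are keyed by node name.
// (Illustrative only; not the PR's actual data structures.)
func pruneDeletedNodes(deletedNodes, cloudProviderNodeNames map[string]bool) {
	for name := range deletedNodes {
		if cloudProviderNodeNames[name] {
			// The instance is visible again in the cloud provider,
			// so stop treating it as deleted.
			delete(deletedNodes, name)
		}
	}
}

func main() {
	deleted := map[string]bool{"node-a": true, "node-b": true}
	// node-b has come back in the cloud provider's listing.
	current := map[string]bool{"node-b": true, "node-c": true}
	pruneDeletedNodes(deleted, current)
	fmt.Println(deleted) // map[node-a:true]
}
```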