KCP should pause reconciliation while machines are failed but not eligible for automatic remediation #3230
Comments
/milestone v0.3.x
/help
@CecileRobertMichon: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I see some potential downside in the change requested by this issue. Instead, I think that KCP should always continue to try to do its work, and block only if the next action might lead the cluster into a bad state, e.g. an action that would affect quorum.
@fabriziopandini 100% agree that once we can tell how an operation would affect quorum, it would be preferable to use that to block actions rather than this approach.
/milestone v0.4.0
/priority important-soon
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closing this issue.
User Story
As an operator, I would like KCP to stop performing scale up / scale down operations when a machine is in a failed state but is not marked for remediation, so that I can triage the problem without it being compounded by additional operations KCP might decide to take.
Detailed Description
Right now we have code in place to not do scaling operations while we are waiting on machine provisioning and deletion. With #3185 we will have code that handles machines marked by MHC for remediation. @detiber pointed out that we may still have machines that have `FailureMessage` and/or `FailureReason` set on them, but not the MHC conditions for automatic remediation. This could happen if the user is not using MHC, or if the MHC has hit a `maxUnhealthy` quota.

We could instead pause reconciliation of this cluster until all the machines are no longer failing. Users could manually remediate and either bring the machines back to healthy states or scale them down. This will also help us ensure we aren't going to interfere with external remediation.
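The check described above could be sketched as a simple guard that KCP evaluates before scaling. This is a minimal illustration, not the actual Cluster API implementation: `Machine` here is a hypothetical simplified struct whose fields only mirror the `FailureReason`/`FailureMessage` status fields mentioned above.

```go
package main

import "fmt"

// Machine is a simplified stand-in for the Cluster API Machine object.
// Only the failure markers relevant to this issue are modeled.
type Machine struct {
	Name           string
	FailureReason  *string // terminal failure reason, nil when healthy
	FailureMessage *string // terminal failure message, nil when healthy
}

// hasFailedMachines reports whether any machine carries a terminal
// failure marker; if so, KCP would pause scale up / scale down until
// the operator remediates or deletes the machine.
func hasFailedMachines(machines []Machine) bool {
	for _, m := range machines {
		if m.FailureReason != nil || m.FailureMessage != nil {
			return true
		}
	}
	return false
}

func main() {
	reason := "InvalidConfiguration"
	machines := []Machine{
		{Name: "cp-0"},
		{Name: "cp-1", FailureReason: &reason},
	}
	if hasFailedMachines(machines) {
		fmt.Println("pausing reconciliation: failed machine present")
	} else {
		fmt.Println("proceeding with scale operations")
	}
}
```

Note that this guard intentionally ignores the MHC remediation conditions: machines already marked for automatic remediation would be handled by the #3185 flow, while this check covers the remaining failed-but-not-eligible machines.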
/kind feature
/area control-plane