
KCP should pause reconciliation while machines are failed but not eligible for automatic remediation #3230

Closed
benmoss opened this issue Jun 22, 2020 · 13 comments
Labels
area/control-plane: Issues or PRs related to control-plane lifecycle management
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone
v0.4.0
Comments

@benmoss

benmoss commented Jun 22, 2020

User Story

As an operator, I would like KCP to stop performing scale up / scale down operations while a machine is in a failed state but not marked for remediation, so that I can triage the problem without having it compounded by additional operations KCP might decide to take.

Detailed Description

Right now we have code in place that avoids scaling operations while we are waiting on machine provisioning and deletion. With #3185 we will have code that handles machines marked by MHC for remediation. @detiber pointed out that we may still have machines that have FailureMessage and/or FailureReason set on them but not the MHC conditions for automatic remediation. This could happen if the user is not using MHC, or if the MHC has hit its maxUnhealthy quota.

We could instead pause reconciliation of this cluster until none of the machines are failing. Users could manually remediate and either bring them back to a healthy state or scale them down. This would also help us ensure we aren't going to interfere with external remediation.
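
To make the proposed behavior concrete, here is a minimal, hypothetical Go sketch of the check described above. The Machine struct, MarkedForRemediation field, and shouldPauseScaling helper are simplified stand-ins (in the real API the failure info lives in Machine.Status.FailureReason/FailureMessage and remediation eligibility comes from MHC conditions); this is not the actual KCP code:

```go
// Illustrative sketch only: pause scaling when any control-plane machine
// reports a failure but has not been marked for automatic remediation,
// so the operator can triage first.
package main

import "fmt"

// Machine is a simplified stand-in for a Cluster API Machine.
type Machine struct {
	Name                 string
	FailureReason        *string // set when the machine has permanently failed
	FailureMessage       *string
	MarkedForRemediation bool // stand-in for the MHC remediation condition
}

// shouldPauseScaling returns true when at least one machine is failed but
// not eligible for automatic remediation, plus the names of those machines.
func shouldPauseScaling(machines []Machine) (bool, []string) {
	var blocked []string
	for _, m := range machines {
		failed := m.FailureReason != nil || m.FailureMessage != nil
		if failed && !m.MarkedForRemediation {
			blocked = append(blocked, m.Name)
		}
	}
	return len(blocked) > 0, blocked
}

func main() {
	reason := "instance terminated externally"
	machines := []Machine{
		{Name: "cp-0"},
		{Name: "cp-1", FailureReason: &reason}, // failed, no MHC remediation
	}
	if pause, names := shouldPauseScaling(machines); pause {
		fmt.Printf("pausing scale up/down; failed machines awaiting manual triage: %v\n", names)
	}
}
```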

/kind feature
/area control-plane

@k8s-ci-robot added the kind/feature and area/control-plane labels Jun 22, 2020
@vincepri
Member

/milestone v0.3.x

@k8s-ci-robot added this to the v0.3.x milestone Jun 24, 2020
@CecileRobertMichon
Contributor

/help

@k8s-ci-robot
Contributor

@CecileRobertMichon:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted label Jun 24, 2020
@fabriziopandini
Member

I see a potential downside in the change requested by this issue:
if we are in the middle of the first deployment of a 5-node CP (scaling up from 0 to 5) and there is e.g. a problem on the 4th node, this change will block the CP creation and wait for user intervention before going up to 5 nodes.

Instead, I think KCP should always continue trying to do its work, and block only if the next action might lead the cluster into a potentially bad state; e.g.:

  • in case of scale up, you have to have enough healthy members to maintain quorum if the new replica does not come up (except the 1 to 2 case); see the sketch below.
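
For illustration only, a rough Go sketch of the kind of quorum-aware scale-up check being suggested here; quorum and canScaleUp are hypothetical helpers, not actual KCP logic:

```go
// Illustrative sketch: allow a scale up only when the currently healthy
// members would still hold quorum of the post-scale-up cluster even if the
// new replica never becomes healthy. The 1 -> 2 case is carved out as noted.
package main

import "fmt"

// quorum returns the number of members needed for etcd quorum.
func quorum(members int) int { return members/2 + 1 }

// canScaleUp reports whether adding one replica is safe under the assumption
// that the new replica might fail to join.
func canScaleUp(healthy, current int) bool {
	if current == 1 { // exception called out above: 1 -> 2 is always attempted
		return true
	}
	// After scale up the cluster has current+1 members; if the new one never
	// comes up, the healthy count stays the same.
	return healthy >= quorum(current + 1)
}

func main() {
	fmt.Println(canScaleUp(3, 3)) // true: 3 healthy >= quorum(4) = 3
	fmt.Println(canScaleUp(2, 3)) // false: 2 healthy < quorum(4) = 3
	fmt.Println(canScaleUp(1, 1)) // true: 1 -> 2 exception
}
```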

@detiber
Member

detiber commented Jun 29, 2020

@fabriziopandini 100% agree that once we can tell how an operation would affect quorum, it would be preferable to use that to block actions rather than this approach.

@vincepri
Member

vincepri commented Aug 3, 2020

/milestone v0.4.0

@k8s-ci-robot modified the milestones: v0.3.x, v0.4.0 Aug 3, 2020
@vincepri
Member

vincepri commented Aug 3, 2020

/priority important-soon

@k8s-ci-robot added the priority/important-soon label Aug 3, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 1, 2020
@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Nov 1, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 30, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 1, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
