
KCP should pause reconciliation while machines are failed but not eligible for automatic remediation #3230

Closed
benmoss opened this issue Jun 22, 2020 · 13 comments
Labels
area/control-plane: Issues or PRs related to control-plane lifecycle management
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone
v0.4.0
Comments

@benmoss

benmoss commented Jun 22, 2020

User Story

As an operator, I would like KCP to stop performing scale up / scale down operations while a machine is in a failed state but not marked for remediation, so that I can triage the problem without having it compounded by additional operations KCP might decide to take.

Detailed Description

Right now we have code in place that avoids scaling operations while we are waiting on machine provisioning and deletion. With #3185 we will have code that handles machines marked by MHC for remediation. @detiber pointed out that we may still have machines that have FailureMessage and/or FailureReason set on them but not the MHC conditions for automatic remediation. This could happen if the user is not using MHC, or if the MHC has hit its maxUnhealthy quota.

We could instead pause reconciliation of this cluster until none of the machines are failing. Users could manually remediate and either bring them back to a healthy state or scale them down. This would also help us ensure we aren't going to interfere with external remediation.
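
To make the proposed behavior concrete, here is a minimal, hypothetical Go sketch of the check described above. The Machine struct, MarkedForRemediation field, and shouldPauseScaling helper are simplified stand-ins (in the real API the failure info lives in Machine.Status.FailureReason/FailureMessage and remediation eligibility comes from MHC conditions); this is not the actual KCP code:

```go
// Illustrative sketch only: pause scaling when any control-plane machine
// reports a failure but has not been marked for automatic remediation,
// so the operator can triage first.
package main

import "fmt"

// Machine is a simplified stand-in for a Cluster API Machine.
type Machine struct {
	Name                 string
	FailureReason        *string // set when the machine has permanently failed
	FailureMessage       *string
	MarkedForRemediation bool // stand-in for the MHC remediation condition
}

// shouldPauseScaling returns true when at least one machine is failed but
// not eligible for automatic remediation, plus the names of those machines.
func shouldPauseScaling(machines []Machine) (bool, []string) {
	var blocked []string
	for _, m := range machines {
		failed := m.FailureReason != nil || m.FailureMessage != nil
		if failed && !m.MarkedForRemediation {
			blocked = append(blocked, m.Name)
		}
	}
	return len(blocked) > 0, blocked
}

func main() {
	reason := "instance terminated externally"
	machines := []Machine{
		{Name: "cp-0"},
		{Name: "cp-1", FailureReason: &reason}, // failed, no MHC remediation
	}
	if pause, names := shouldPauseScaling(machines); pause {
		fmt.Printf("pausing scale up/down; failed machines awaiting manual triage: %v\n", names)
	}
}
```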

/kind feature
/area control-plane

@k8s-ci-robot added the kind/feature and area/control-plane labels Jun 22, 2020
@vincepri
Member

/milestone v0.3.x

@k8s-ci-robot added this to the v0.3.x milestone Jun 24, 2020
@CecileRobertMichon
Contributor

/help

@k8s-ci-robot
Contributor

@CecileRobertMichon:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted label Jun 24, 2020
@fabriziopandini
Member

I see a potential downside in the change requested by this issue:
if we are in the middle of the first deployment of a 5-node CP (scaling up from 0 to 5) and there is e.g. a problem on the 4th node, this change will block the CP creation and wait for user intervention before going up to 5 nodes.

Instead, I think KCP should always continue trying to do its work, and block only if the next action might lead the cluster into a potentially bad state; e.g.:

  • in case of scale up, you have to have enough healthy members to maintain quorum if the new replica does not come up (except the 1 to 2 case); see the sketch below.
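
For illustration only, a rough Go sketch of the kind of quorum-aware scale-up check being suggested here; quorum and canScaleUp are hypothetical helpers, not actual KCP logic:

```go
// Illustrative sketch: allow a scale up only when the currently healthy
// members would still hold quorum of the post-scale-up cluster even if the
// new replica never becomes healthy. The 1 -> 2 case is carved out as noted.
package main

import "fmt"

// quorum returns the number of members needed for etcd quorum.
func quorum(members int) int { return members/2 + 1 }

// canScaleUp reports whether adding one replica is safe under the assumption
// that the new replica might fail to join.
func canScaleUp(healthy, current int) bool {
	if current == 1 { // exception called out above: 1 -> 2 is always attempted
		return true
	}
	// After scale up the cluster has current+1 members; if the new one never
	// comes up, the healthy count stays the same.
	return healthy >= quorum(current + 1)
}

func main() {
	fmt.Println(canScaleUp(3, 3)) // true: 3 healthy >= quorum(4) = 3
	fmt.Println(canScaleUp(2, 3)) // false: 2 healthy < quorum(4) = 3
	fmt.Println(canScaleUp(1, 1)) // true: 1 -> 2 exception
}
```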

@detiber
Member

detiber commented Jun 29, 2020

@fabriziopandini 100% agree that once we can tell how an operation would affect quorum, it would be preferable to use that to block actions rather than this approach.

@vincepri
Member

vincepri commented Aug 3, 2020

/milestone v0.4.0

@k8s-ci-robot modified the milestones: v0.3.x, v0.4.0 Aug 3, 2020
@vincepri
Member

vincepri commented Aug 3, 2020

/priority important-soon

@k8s-ci-robot added the priority/important-soon label Aug 3, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 1, 2020
@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Nov 1, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 30, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 1, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
