
Provide a way to hook into the rolling update #4229

Closed

sbueringer opened this issue Feb 24, 2021 · 33 comments
Labels
  • area/control-plane: Issues or PRs related to control-plane lifecycle management
  • area/machine: Issues or PRs related to machine lifecycle management
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.

Comments

@sbueringer
Member

sbueringer commented Feb 24, 2021

User Story

As a service provider, I would like to hook into the rolling update of control plane
and worker nodes so that I can implement additional checks before the rolling update continues.

Detailed Description

Some context: we provide Kubernetes as a Service and are currently reimplementing our
existing solution. To provide a smooth update experience, we have several checks that verify whether
a Node is ready to be used by customers. In our current solution we update Nodes sequentially:
after we have updated a Node, we verify that it is completely functional and only then continue the
update by deleting and creating the next Node, and so on.

As far as I'm aware, there is currently no way to do this in CAPI. To me it looks like, e.g., the
MachineSet controller only evaluates the Ready condition of the Node in the workload cluster when
deciding if a Machine is ready (code). I assume
the Machine readiness then also informs the decision of when the next Machine will be updated.

Some examples of what we're checking right now before we declare a Node ready (a rough sketch of such a check follows below):

  • Verify that all our Node daemons are up (kube-proxy, calico, fluent-bit, metric exporter, ...)
  • Verify that the CSI plugin is up and registered on the Node
  • Verify that the cloud controller manager reconciled the load balancer members for Services of type LoadBalancer
  • Apply dynamic KubeletConfiguration and wait until it's active
  • Remove our own not-ready taint, which prevents customer Pods from being scheduled on the new Node until we consider the Node ready

An important point is that we always want to ensure we have a minimum number of "completely" ready Nodes.
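
To make this concrete, here is a minimal sketch (in Go, using client-go against the workload cluster) of the kind of per-Node readiness check described above. It is not part of Cluster API; the `node-daemon=true` label and the kube-system namespace are made-up placeholders for whatever a provider actually uses.

```go
package readiness

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// NodeFullyReady returns true once the Node reports Ready and every pod of
// our node daemons (kube-proxy, CNI, CSI, ...) scheduled on it is Ready.
func NodeFullyReady(ctx context.Context, c kubernetes.Interface, nodeName string) (bool, error) {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	nodeReady := false
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
			nodeReady = true
		}
	}
	if !nodeReady {
		return false, nil
	}

	// "node-daemon=true" is a hypothetical label we would put on our
	// DaemonSets' pod templates; it is not an existing convention.
	pods, err := c.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
		LabelSelector: "node-daemon=true",
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
	})
	if err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		podReady := false
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				podReady = true
			}
		}
		if !podReady {
			return false, nil
		}
	}
	return true, nil
}
```

A controller (or hook) driving the rollout could call something like this for the newest Machine's Node and only continue replacing the next Machine once it returns true.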

Anything else you would like to add:

References:

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 24, 2021
@fabriziopandini fabriziopandini added the area/machine Issues or PRs related to machine lifecycle management label Mar 5, 2021
@fabriziopandini
Member

/milestone v0.4.0
/area control plane
/area machine

@k8s-ci-robot
Contributor

@fabriziopandini: The label(s) area/control, area/plane cannot be applied, because the repository doesn't have them.

In response to this:

/milestone v0.4.0
/area control plane
/area machine

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Mar 5, 2021
@fabriziopandini fabriziopandini added the area/control-plane Issues or PRs related to control-plane lifecycle management label Mar 5, 2021
@vincepri
Member

@sbueringer Are you interested in having this change for v1alpha4?

@sbueringer
Member Author

@vincepri Yes.

@vincepri
Member

There are some good aspects that we could define as part of the MachineHealthCheck (maybe?) or a similar struct. For example, for generic "pod readiness", we could require that a certain number of pods show up as ready before proceeding with an update.

Then there is a customizable aspect, for things that aren't generic and would need custom code, which could be handled with special annotations.
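
For illustration only, a rough sketch of how such a generic "pod readiness" check could be modelled as an API type. None of these names or fields exist in Cluster API today; they are assumptions to make the idea concrete.

```go
package v1beta1 // hypothetical API package, for illustration only

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PodReadinessCheck is a made-up type sketching how a MachineHealthCheck-like
// struct could express "a certain number of matching pods must be Ready on
// the new Node before the rollout proceeds".
type PodReadinessCheck struct {
	// Namespace and LabelSelector identify the pods that must be ready.
	Namespace     string               `json:"namespace"`
	LabelSelector metav1.LabelSelector `json:"labelSelector"`
	// MinReadyPods is how many matching pods must report the Ready condition.
	MinReadyPods int32 `json:"minReadyPods"`
	// Timeout bounds how long to wait before the check is considered failed.
	Timeout metav1.Duration `json:"timeout"`
}
```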

@vincepri
Member

Nevertheless, this effort might require a small proposal to continue, and if we're aiming for v1alpha4, it should be made non-breaking and delivered potentially in a patch release.

/milestone v0.4.x

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.4.0, v0.4.x Mar 15, 2021
@sbueringer
Member Author

sbueringer commented Mar 15, 2021

@vincepri sounds good. I'll think about it and come back with a few ideas. Just to limit the scope of the proposal a bit.

Just so I get it right: you meant specifying it as part of MachineHealthCheck so that those readiness criteria can be used for MachineHealthChecks as well as during updates? Not extending MachineHealthChecks and leveraging them through MachineHealthChecks during updates?

@vincepri
Member

vincepri commented Mar 15, 2021

You meant specifying it as part of MachineHealthCheck so that those readiness criteria can be used for MachineHealthChecks as well as during updates?

Yes, and if we have some data structure that allows us to define pod-based health checks, we could use the same to define checks in MachineDeployment and KCP -- and possibly share the codebase too :)

@CecileRobertMichon CecileRobertMichon modified the milestones: v0.4.x, v0.4 Mar 22, 2021
@sbueringer
Member Author

Just a short update. I'm still interested in this. Just taking a bit of time to get more familiar with the current state of the project, to be able to better judge how this fits.

@sbueringer
Member Author

Currently not really working on it
/unassign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2021
@vincepri vincepri modified the milestones: v0.4, v1.1 Oct 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 21, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vincepri
Member

/reopen
/lifecycle frozen

@k8s-ci-robot
Contributor

@vincepri: Reopened this issue.

In response to this:

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@killianmuldoon
Contributor

Can we mark this as done as part of #6546? @sbueringer WDYT?

@chrischdi
Member

👍 for ControlPlane nodes.

Regarding MachineDeployments: it would currently not be possible to do things on a per-MachineDeployment basis, only on an "after all MachineDeployments" basis. It also would not reflect state back to MachineDeployments. But these points are up for discussion I think :-)

@sbueringer
Member Author

I don't think that Runtime Hooks as they exist today work for the use case described above. The idea was to do additional actions per Machine.

@sbueringer
Member Author

But I'm also fine with just closing this issue. I don't have that requirement anymore, and there has been no other demand for it in the last almost 1.5 years.

@killianmuldoon
Contributor

Given the feedback above, let's leave it open - it seems like this might be a use case for an additional runtime hook if someone from the community is interested in it.
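
To sketch what such an additional hook could look like from the extension side, here is a minimal, purely hypothetical example. Cluster API does not define a per-Machine hook like this today; the endpoint name and the request/response types below are invented for illustration. Conceptually, the rollout would pause after a Machine is replaced until the extension answers with proceed=true for that Machine.

```go
// Everything in this file is invented for illustration; Cluster API does not
// define a per-Machine hook like this today.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// BeforeMachineReadyRequest/Response are hypothetical payloads the rollout
// controller would send for each replaced Machine.
type BeforeMachineReadyRequest struct {
	ClusterName string `json:"clusterName"`
	MachineName string `json:"machineName"`
	NodeName    string `json:"nodeName"`
}

type BeforeMachineReadyResponse struct {
	Proceed bool   `json:"proceed"`
	Message string `json:"message,omitempty"`
}

func handleBeforeMachineReady(w http.ResponseWriter, r *http.Request) {
	var req BeforeMachineReadyRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// This is where the provider-specific checks from the issue description
	// would run (node daemons up, CSI registered on the node, LB members
	// reconciled, not-ready taint removed, ...).
	resp := BeforeMachineReadyResponse{Proceed: true, Message: "node " + req.NodeName + " is fully ready"}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(resp)
}

func main() {
	http.HandleFunc("/hooks/before-machine-ready", handleBeforeMachineReady)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```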

@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@faiq
Contributor

faiq commented Aug 26, 2022

/assign @faiq

I'll try my hand at it.

@fabriziopandini
Member

/triage accepted
@faiq any update on this? Let me know if we can help to move this forward.

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Oct 3, 2022
@bavarianbidi
Contributor

@faiq are you still working on this? I'm very interested in a hook at the machine level.

@faiq
Contributor

faiq commented Feb 8, 2023

@bavarianbidi please by all means go ahead. I haven't had as much time to work on this as I'd like!

@faiq
Contributor

faiq commented Feb 8, 2023

/unassign

@bavarianbidi
Contributor

/assign @bavarianbidi

@fabriziopandini
Member

It seems there are some overlaps between what is discussed in this issue and #7647; it might be better to reconcile the two efforts before moving on with the implementation.

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Apr 4, 2024
@bavarianbidi
Contributor

/unassign

as I'm not working with CAPI ATM 😞

I've already pitched the issue to some colleagues and shared a very rough implementation with them - so 🤞 they will jump into this 🙏

@fabriziopandini
Member

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Apr 12, 2024
@fabriziopandini
Member

The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs.

If there is concrete interest in moving this forward and more details to discuss about what we want to do, we can re-open.

/close

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs.

If there is concrete interest in moving this forward and more details to discuss about what we want to do, we can re-open.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
