
Provide a way to hook into the rolling update #4229

Closed

sbueringer opened this issue Feb 24, 2021 · 33 comments
Labels
  • area/control-plane: Issues or PRs related to control-plane lifecycle management
  • area/machine: Issues or PRs related to machine lifecycle management
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.

Comments

@sbueringer
Member

sbueringer commented Feb 24, 2021

User Story

As a service provider, I would like to hook into the rolling update of control plane
and worker nodes so that I can implement additional checks before the rolling update continues.

Detailed Description

Some context: we provide Kubernetes as a Service and are currently reimplementing our
existing solution. To provide a smooth update experience, we have several checks that verify whether
a Node is ready to be used by customers. In our current solution we update Nodes sequentially:
after we have updated a Node, we verify that it is completely functional and only then continue the
update by deleting and creating the next Node, and so on.

As far as I'm aware, there is currently no way to do this in CAPI. To me it looks like, e.g., the
MachineSet controller only evaluates the Ready condition of the Node in the workload cluster when
deciding if a Machine is ready (code). I assume
the Machine readiness then also informs the decision of when the next Machine will be updated.

Some examples of what we're checking right now before we declare a Node ready (a rough sketch of such a check follows below):

  • Verify that all our Node daemons are up (kube-proxy, calico, fluent-bit, metric exporter, ...)
  • Verify that the CSI plugin is up and registered on the Node
  • Verify that the cloud controller manager reconciled the load balancer members for Services of type LoadBalancer
  • Apply dynamic KubeletConfiguration and wait until it's active
  • Remove our own not-ready taint, which prevents customer Pods from being scheduled on the new Node until we consider the Node ready

An important point is that we always want to ensure we have a minimum number of "completely" ready Nodes.
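
To make this concrete, here is a minimal sketch (in Go, using client-go against the workload cluster) of the kind of per-Node readiness check described above. It is not part of Cluster API; the `node-daemon=true` label and the kube-system namespace are made-up placeholders for whatever a provider actually uses.

```go
package readiness

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// NodeFullyReady returns true once the Node reports Ready and every pod of
// our node daemons (kube-proxy, CNI, CSI, ...) scheduled on it is Ready.
func NodeFullyReady(ctx context.Context, c kubernetes.Interface, nodeName string) (bool, error) {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	nodeReady := false
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
			nodeReady = true
		}
	}
	if !nodeReady {
		return false, nil
	}

	// "node-daemon=true" is a hypothetical label we would put on our
	// DaemonSets' pod templates; it is not an existing convention.
	pods, err := c.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
		LabelSelector: "node-daemon=true",
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
	})
	if err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		podReady := false
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				podReady = true
			}
		}
		if !podReady {
			return false, nil
		}
	}
	return true, nil
}
```

A controller (or hook) driving the rollout could call something like this for the newest Machine's Node and only continue replacing the next Machine once it returns true.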

Anything else you would like to add:

References:

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 24, 2021
@fabriziopandini fabriziopandini added the area/machine Issues or PRs related to machine lifecycle management label Mar 5, 2021
@fabriziopandini
Member

/milestone v0.4.0
/area control plane
/area machine

@k8s-ci-robot
Contributor

@fabriziopandini: The label(s) area/control, area/plane cannot be applied, because the repository doesn't have them.

In response to this:

/milestone v0.4.0
/area control plane
/area machine

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Mar 5, 2021
@fabriziopandini fabriziopandini added the area/control-plane Issues or PRs related to control-plane lifecycle management label Mar 5, 2021
@vincepri
Member

@sbueringer Are you interested in having this change for v1alpha4?

@sbueringer
Member Author

@vincepri Yes.

@vincepri
Member

There are some good aspects that we could define as part of the MachineHealthCheck (maybe?) or a similar struct. For example, for generic "pod readiness", we could require that a certain number of pods show up as ready before proceeding with an update.

Then there is a customizable aspect, for things that aren't generic and would need custom code, which could be handled with special annotations.
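
For illustration only, a rough sketch of how such a generic "pod readiness" check could be modelled as an API type. None of these names or fields exist in Cluster API today; they are assumptions to make the idea concrete.

```go
package v1beta1 // hypothetical API package, for illustration only

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PodReadinessCheck is a made-up type sketching how a MachineHealthCheck-like
// struct could express "a certain number of matching pods must be Ready on
// the new Node before the rollout proceeds".
type PodReadinessCheck struct {
	// Namespace and LabelSelector identify the pods that must be ready.
	Namespace     string               `json:"namespace"`
	LabelSelector metav1.LabelSelector `json:"labelSelector"`
	// MinReadyPods is how many matching pods must report the Ready condition.
	MinReadyPods int32 `json:"minReadyPods"`
	// Timeout bounds how long to wait before the check is considered failed.
	Timeout metav1.Duration `json:"timeout"`
}
```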

@vincepri
Member

Nevertheless, this effort might require a small proposal to continue, and if we're aiming for v1alpha4, it should be made non-breaking and delivered potentially in a patch release.

/milestone v0.4.x

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.4.0, v0.4.x Mar 15, 2021
@sbueringer
Member Author

sbueringer commented Mar 15, 2021

@vincepri sounds good. I'll think about it and come back with a few ideas. Just to limit the scope of the proposal a bit.

Just so I get it right: you meant specifying it as part of MachineHealthCheck so that those readiness criteria can be used for MachineHealthChecks as well as during updates? Not extending MachineHealthChecks and leveraging them through MachineHealthChecks during updates?

@vincepri
Member

vincepri commented Mar 15, 2021

You meant specifying it as part of MachineHealthCheck so that those readiness criteria can be used for MachineHealthChecks as well as during updates?

Yes, and if we have some data structure that allows us to define pod-based health checks, we could use the same to define checks in MachineDeployment and KCP -- and possibly share the codebase too :)

@CecileRobertMichon CecileRobertMichon modified the milestones: v0.4.x, v0.4 Mar 22, 2021
@sbueringer
Member Author

Just a short update. I'm still interested in this. Just taking a bit of time to get more familiar with the current state of the project, to be able to better judge how this fits.

@sbueringer
Member Author

Currently not really working on it
/unassign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2021
@vincepri vincepri modified the milestones: v0.4, v1.1 Oct 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 21, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vincepri
Member

/reopen
/lifecycle frozen

@k8s-ci-robot
Contributor

@vincepri: Reopened this issue.

In response to this:

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@killianmuldoon
Contributor

Can we mark this as done as part of #6546? @sbueringer WDYT?

@chrischdi
Member

👍 for ControlPlane nodes.

Regarding MachineDeployments: it would currently not be possible to do things on a per-MachineDeployment basis, only on an "after all MachineDeployments" basis. It also would not reflect state back to MachineDeployments. But these points are up for discussion I think :-)

@sbueringer
Member Author

I don't think that Runtime Hooks as they exist today work for the use case described above. The idea was to do additional actions per Machine.

@sbueringer
Member Author

But I'm also fine with just closing this issue. I don't have that requirement anymore, and there has been no other demand for it in the last almost 1.5 years.

@killianmuldoon
Contributor

Given the feedback above, let's leave it open - it seems like this might be a use case for an additional runtime hook if someone from the community is interested in it.
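
To sketch what such an additional hook could look like from the extension side, here is a minimal, purely hypothetical example. Cluster API does not define a per-Machine hook like this today; the endpoint name and the request/response types below are invented for illustration. Conceptually, the rollout would pause after a Machine is replaced until the extension answers with proceed=true for that Machine.

```go
// Everything in this file is invented for illustration; Cluster API does not
// define a per-Machine hook like this today.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// BeforeMachineReadyRequest/Response are hypothetical payloads the rollout
// controller would send for each replaced Machine.
type BeforeMachineReadyRequest struct {
	ClusterName string `json:"clusterName"`
	MachineName string `json:"machineName"`
	NodeName    string `json:"nodeName"`
}

type BeforeMachineReadyResponse struct {
	Proceed bool   `json:"proceed"`
	Message string `json:"message,omitempty"`
}

func handleBeforeMachineReady(w http.ResponseWriter, r *http.Request) {
	var req BeforeMachineReadyRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// This is where the provider-specific checks from the issue description
	// would run (node daemons up, CSI registered on the node, LB members
	// reconciled, not-ready taint removed, ...).
	resp := BeforeMachineReadyResponse{Proceed: true, Message: "node " + req.NodeName + " is fully ready"}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(resp)
}

func main() {
	http.HandleFunc("/hooks/before-machine-ready", handleBeforeMachineReady)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```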

@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@faiq
Contributor

faiq commented Aug 26, 2022

/assign @faiq

I'll try my hand at it.

@fabriziopandini
Member

/triage accepted
@faiq any update on this? Let me know if we can help to move this forward.

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Oct 3, 2022
@bavarianbidi
Contributor

@faiq are you still working on this? I'm very interested in a hook at the machine level.

@faiq
Contributor

faiq commented Feb 8, 2023

@bavarianbidi please by all means go ahead. I haven't had as much time to work on this as I'd like!

@faiq
Contributor

faiq commented Feb 8, 2023

/unassign

@bavarianbidi
Contributor

/assign @bavarianbidi

@fabriziopandini
Member

It seems there are some overlaps between what is discussed in this issue and #7647; it might be better to reconcile the two efforts before moving on with the implementation.

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Apr 4, 2024
@bavarianbidi
Contributor

/unassign

as I'm not working with CAPI ATM 😞

I've already pitched the issue to some colleagues and shared a very rough implementation with them - so 🤞 they will jump into this 🙏

@fabriziopandini
Member

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Apr 12, 2024
@fabriziopandini
Member

The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs.

If there is concrete interest in moving this forward and more details to discuss about what we want to do, we can re-open.

/close

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs.

If there is concrete interest in moving this forward and more details to discuss about what we want to do, we can re-open.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
