
add non-graceful node shutdown KEP #1116

Merged (2 commits) on Feb 3, 2022

Conversation

@yastij (Member) commented Jun 26, 2019:

Signed-off-by: Yassine TIJANI [email protected]

Opening a PR for a first round of reviews; documented other alternatives and how to handle version skew.

/assign @smarterclayton @liggitt @yujuhong @saad-ali
/cc @jingxu97 @derekwaynecarr @andrewsykim

/sig node
/sig cloud-provider
/sig storage
/sig scalability

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/storage Categorizes an issue or PR as relevant to SIG Storage. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Jun 26, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 26, 2019
@yastij yastij force-pushed the node-shutdown-kep branch 2 times, most recently from a424833 to 90bdbf3 Compare June 26, 2019 15:13
@dashpole (Contributor) left a comment:

It might be helpful for others to think about the kubelet changes in two parts:

  1. Introduce a new taint which will prevent the kubelet from mounting any volumes
  2. The kubelet gains new behavior in scenarios where it can't acquire its node lease. The behavior is equivalent to having a set of taints applied to the node. These taints can still be tolerated by pods.

(1) seems reasonable to me, although I would suggest scoping the taint to a behavior, rather than a scenario.
(2) makes sense to me.
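To make (1) concrete, here is a minimal Go sketch of what a behavior-scoped taint and a matching pod toleration might look like, using the corev1 API types. The `no-mount` key is purely illustrative and not part of any released Kubernetes API.

```go
package nodeshutdown

import corev1 "k8s.io/api/core/v1"

// Hypothetical behavior-scoped taint, per the suggestion above: the key
// names what is forbidden ("no-mount", mirroring NoExecute), not the
// scenario ("node-shutdown"). The key is illustrative only.
var noMountTaint = corev1.Taint{
	Key:    "node.kubernetes.io/no-mount", // assumed key, for illustration
	Effect: corev1.TaintEffectNoExecute,
}

// A pod that explicitly accepts running on such a node carries a matching
// toleration; everything else is kept off by default.
var noMountToleration = corev1.Toleration{
	Key:      "node.kubernetes.io/no-mount",
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectNoExecute,
}
```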


We also need to apply changes to the kubelet; at startup the kubelet will:

1. acquire the Lease:
@dashpole (Contributor):
I'm pretty sure the kubelet tries to acquire its lease today. What is the current behavior of the kubelet when it can't acquire its lease?

@yastij (Member, author):

@dashpole - Yes, the kubelet uses the Lease object today to report its heartbeat, but I don't think it is used as a lock (i.e. at the moment anyone can update a node's lease, even though it's held by a kubelet).
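For readers following along, a minimal sketch of the distinction being made here: today's kubelet renews its Lease unconditionally as a heartbeat, whereas lock semantics would mean checking HolderIdentity before renewing. This is illustrative client-go code, not the kubelet's actual implementation.

```go
package nodeshutdown

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryAcquireLease sketches lock-style semantics on the node Lease.
// Today's kubelet performs only the Get/Update heartbeat and skips the
// HolderIdentity check, which is why the Lease is not a lock yet.
func tryAcquireLease(ctx context.Context, cs kubernetes.Interface, nodeName, me string) error {
	leases := cs.CoordinationV1().Leases("kube-node-lease")
	lease, err := leases.Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if h := lease.Spec.HolderIdentity; h != nil && *h != me {
		return fmt.Errorf("lease for node %s is held by %s", nodeName, *h)
	}
	now := metav1.NewMicroTime(time.Now())
	lease.Spec.HolderIdentity = &me
	lease.Spec.RenewTime = &now
	// An optimistic-concurrency conflict here means we lost the race.
	_, err = leases.Update(ctx, lease, metav1.UpdateOptions{})
	return err
}
```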

@dashpole (Contributor):

Ah, that makes sense. That would be helpful to include in the KEP somewhere.

@yastij (Member, author):

Ack


This behaviour was very racy: while the node was being deleted, it could come back, register itself, list its pods, and start containers while volumes were being detached, leading to data corruption. To mitigate this, we explicitly set requirements on handling node objects for cloud providers[3], i.e. not deleting a node object unless the instance no longer exists.

This was noticeable to users, and several complained about it. As a temporary mitigation we introduced a taint called node-shutdown, enabling cluster administrators to write automation against it, although most solutions at the time were racy.
@dashpole (Contributor):

Can you be more specific on the current state? If I read kubernetes/kubernetes#58635 correctly, the node shutdown taint addition was reverted. When it was last discussed with sig-node, we asked for the taint to be more specific to behavior (e.g. "no-mount" to mirror no-execute), rather than a scenario, "node-shutdown".

@yastij (Member, author):

kubernetes/kubernetes#58635 was reverted due to some issues, but the implementation went in. Given that it's a taint on the node object, I'd expect we cannot remove it anytime soon due to API guarantees.

Deferring to @smarterclayton @liggitt.

@dashpole (Contributor):

It only has API guarantees if it was released with a minor version. Can you point me to the PR(s) for the "implementation" that went in, and weren't reverted?

@yastij (Member, author) replied with a pointer to the implementation PR.

@dashpole (Contributor):

Thanks for the pointer. There were no API changes involved in that PR, and the kubelet doesn't use it, so there isn't anything preventing us from changing it if we think that is appropriate.

@smarterclayton (Contributor):

Adding a taint is an API change, just like adding/removing/renaming a label. It's a slightly different set of constraints, but for instance beta.x.node.kubernetes.io will outlive us all, so thinking about any field a controller sets for the consumption of others as an API is accurate.

Contributor:

I see, thanks @smarterclayton. We can't stop adding the taint to nodes, for API reasons. However, we aren't required to use this particular taint for this KEP, or to add any kubelet behavior based on the node-shutdown taint. For the purposes of this KEP, we should consider the choice and design of the taints used as in-scope, especially given the existing taint has not been approved in a KEP as far as I'm aware.

keps/sig-node/20190626-node-shutdown-state.md: two review threads marked outdated and resolved.

* Kube-Controller-Manager
* Kubelets

Contributor:

I would add "user workloads" here. A workload has to describe the tradeoffs between availability and partition tolerance that it will accept. That includes:

  1. access to exclusive devices/volumes (for instance iSCSI is really an RWM server but they may want kube to enforce RWO)
  2. toleration of node shutdown at the pod level (the amount of outage the pod tolerates)
  3. the type of workload controller (replica set or stateful set or DaemonSet)
  4. additional future signals

The outcome we need to reach to move Kube forward is to

  1. Consider a change to the default behavior that tightens the guarantees on pod behavior so pods are still safe by default (no unanticipated behavior) but users don't have to know to ask for better guarantees
  2. Provide the tunable that allows a workload to relax / tighten those constraints

@yastij (Member, author):

On 2: how do you see users describing constraints around their workload's guarantees?

Contributor:

It has to be something similar to the existing toleration (unready state) which is applied on the control plane side. Questions include: do we need a new toleration to be backwards compatible? is it an error or a feature to have different toleration lengths (tolerate api server down for 600s but tolerate node unready for 30s)? should stateful sets encourage setting both tolerations? does setting the toleration change statefulset behavior?
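For context, the existing toleration mechanism referenced here already lets a pod bound how long it tolerates an unready node, via tolerationSeconds on the real node.kubernetes.io/not-ready taint. A sketch using the corev1 types follows; a shutdown-specific toleration would be the new, separate knob under discussion, and the helper itself is illustrative.

```go
package nodeshutdown

import corev1 "k8s.io/api/core/v1"

// notReadyToleration builds the existing control-plane-side toleration:
// the pod declares how many seconds of node unreadiness it tolerates
// before the taint manager may evict it. The key and field are existing
// Kubernetes API; only this helper is illustrative.
func notReadyToleration(seconds int64) corev1.Toleration {
	return corev1.Toleration{
		Key:               "node.kubernetes.io/not-ready",
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: &seconds, // e.g. 30 for fast failover, 600 for patience
	}
}
```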

Contributor:

For data consistency, SCSI persistent reservations or lockfiles could already be used to provide RWO on RWM media.

Contributor:

If the underlying storage doesn't support it, then maybe a mutex could be implemented by the cloud provider / PV plugin.

i. Tolerate the node shutdown taint
ii. Is a static Pod
b. success: start normally

@smarterclayton (Contributor):

I expected to see something here about shutdown - what part of shutdown produces a different signal than "partitioned from the master"?

Network partition != known shutdown cleanly

@yastij (Member, author):

@smarterclayton - it is implied here that a network partition won't be detected as a shutdown of the node.

b. success: start normally



Contributor:

I also expected to see some discussion of the workload specifying that it only tolerates a specific maximum separation from the master in seconds before the kubelet shuts down the container. That is the necessary component of a partitioned kubelet that allows another actor (on the control plane side) to reschedule the workload.

@yastij (Member, author) commented Jun 27, 2019:

I'm not sure I understand. Are you suggesting adding an example of how stateful workloads can tolerate being separated from the master in case of a node shutdown?

Contributor:

In the latest KEP design, we want to limit this to the node shutdown case (or unrecoverable hardware failure) only. In these cases, the system (admin) is certain that the node is in a shutdown state and that the kubelet and all its workloads are not running.

@yastij (Member, author) commented Jun 27, 2019:

/cc @wojtek-t

@yastij (Member, author) commented Jul 1, 2019:

/cc @cdickmann

@k8s-ci-robot (Contributor):

@yastij: GitHub didn't allow me to request PR reviews from the following users: cdickmann.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @cdickmann

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

2. apply a node-shutdown taint
3. tries to hold node heartbeat lease
3.a. fails: backoff and retry if max retries hit -> aborts
3.b. success: hold the lease for this node: if the node comes back at this moment it won’t be able to acquire the lease -> pods won’t start
Member:

That isn't true currently. The node lease isn't currently used to block anything; we simply write it and don't use any leader-election code. We assumed that no one else would be writing it.

Changing that is potentially a non-trivial task on Kubelet side and would need to be consulted with node folks.
@yujuhong @wangzhen127

Member:

Aaah, OK - you're writing about changing the kubelet code below.

3. tries to hold node heartbeat lease
3.a. fails: backoff and retry if max retries hit -> aborts
3.b. success: hold the lease for this node: if the node comes back at this moment it won’t be able to acquire the lease -> pods won’t start
4. register node as being processed in order to renew the Lease periodically
Member:

I don't understand this one - can you clarify?

@yastij (Member, author):

When we detect that a node is shut down, we'll start operations (deleting the pods, removing the attachments, executing the detach calls, etc.). We need to register that the node is being processed and hold the lease until we're done.
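A rough sketch of what "register and hold the lease while processing" could look like on the controller side, assuming the lease-as-lock semantics proposed in the KEP; the holder identity and renewal period here are illustrative, not settled design.

```go
package nodeshutdown

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// holdLeaseWhileProcessing renews the node's lease while remediation
// (deleting pods, detaching volumes) runs, then stops renewing so the
// node can reclaim the lease if it ever comes back.
func holdLeaseWhileProcessing(ctx context.Context, cs kubernetes.Interface, nodeName string, remediate func(context.Context) error) error {
	holder := "kube-controller-manager/node-shutdown" // assumed holder identity
	done := make(chan error, 1)
	go func() { done <- remediate(ctx) }()

	ticker := time.NewTicker(10 * time.Second) // renew well inside the lease duration
	defer ticker.Stop()
	leases := cs.CoordinationV1().Leases("kube-node-lease")
	for {
		select {
		case err := <-done:
			return err // remediation finished; stop renewing
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			lease, err := leases.Get(ctx, nodeName, metav1.GetOptions{})
			if err != nil {
				continue // transient error; retry on the next tick
			}
			now := metav1.NewMicroTime(time.Now())
			lease.Spec.HolderIdentity = &holder
			lease.Spec.RenewTime = &now
			leases.Update(ctx, lease, metav1.UpdateOptions{}) // conflict => retry next tick
		}
	}
}
```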

@andrewsykim (Member):

/assign


kube-controller-manager needs to provide a controller that:

Calls the cloud provider
@andrewsykim (Member) commented Jul 9, 2019:

This assumes in-tree cloud provider, would it be possible to assume out-of-tree first for this KEP?

@yastij (Member, author):

This assumes out-of-tree cloud provider, this call is made through the cloud interface.

I'll add a clear statement on this.

@andrewsykim (Member):

Right, but if you're calling the CPI from kube-controller-manager that is in-tree right?

@yastij (Member, author):

Maybe we should package it as part of the cloud-controller-manager?

@andrewsykim (Member):

Maybe this is a bit aggressive, but I would take it a step further and say only enable it as part of cloud-controller-manager. Piling more features into in-tree CPI makes the migration process harder than it already is. This is a great incentive to migrate to the out-of-tree model. Would like to hear other people's thoughts on this, maybe the trade-off isn't worth it.

@yastij (Member, author):

I agree on avoiding piling up features in the in-tree CPI.
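For the out-of-tree flavor being discussed, here is a sketch of how a (cloud-)controller-manager loop might ask the provider whether an unreachable node's instance is actually shut down, using the InstanceShutdownByProviderID method the cloud-provider Instances interface already exposed at the time. The surrounding controller wiring is assumed.

```go
package nodeshutdown

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	cloudprovider "k8s.io/cloud-provider"
)

// instanceShutdown asks the cloud provider whether the node's backing
// instance is shut down (as opposed to merely unreachable).
func instanceShutdown(ctx context.Context, cloud cloudprovider.Interface, node *corev1.Node) (bool, error) {
	instances, ok := cloud.Instances()
	if !ok {
		return false, nil // provider exposes no instances API; cannot tell
	}
	// ProviderID is set by the cloud provider at registration,
	// e.g. "aws:///us-east-1a/i-0abc...".
	return instances.InstanceShutdownByProviderID(ctx, node.Spec.ProviderID)
}
```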


### Version Skew Strategy

This feature requires version parity between the control plane and the kubelet; this behavior shouldn't be enabled for older kubelets.
@andrewsykim (Member):

This seems problematic? We need to support version skew for the upgrade case. What happens when we promote NodeShutdownFailover to Beta (enabled by default) and the control plane has it enabled but upgrading kubelets don't? If a kubelet goes down during upgrade, the control plane will hold the lease and the kubelet will just ignore it?

@yastij (Member, author) commented Jul 9, 2019:

@andrewsykim - good point; maybe it'll require adding a flag --node-shutdown-failover that would default to false?

@andrewsykim (Member):

Thinking out loud: is there a way to toggle it on the lease object instead? Maybe an annotation on the lease indicates to the kubelet to use the node shutdown failover mechanism. That way the feature is only enabled on the control plane?

@yastij (Member, author):

That could work. We need a mechanism that won't change from one release to the next.
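A sketch of the annotation-toggle idea, with an assumed (never released) annotation key: the control plane marks the node's Lease, a kubelet that understands the mechanism opts in, and an older kubelet simply never looks for it, which addresses the skew concern above.

```go
package nodeshutdown

import coordinationv1 "k8s.io/api/coordination/v1"

// Hypothetical annotation key; not part of any released Kubernetes API.
const shutdownFailoverAnnotation = "node.kubernetes.io/shutdown-failover"

// failoverEnabled is what a new kubelet would check; an older kubelet
// never reads the annotation, so skewed clusters keep today's behavior.
func failoverEnabled(lease *coordinationv1.Lease) bool {
	return lease.Annotations[shutdownFailoverAnnotation] == "true"
}
```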

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 1, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2022
@xing-yang (Contributor):
@jingxu97 Addressed your comment. PTAL. Thanks.

@jingxu97 (Contributor) commented Feb 2, 2022:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2022
@mrunalp (Contributor) commented Feb 2, 2022:

This makes sense to me overall.
@derekwaynecarr do you want to make a pass?

@derekwaynecarr (Member) left a comment:

Can you clarify in the KEP what happens if the taint is applied when the node is not undergoing shutdown and is healthy? Is there any consequence of misapplying the taint that I may be missing?

If you can clarify the above, the rest seems like a generally good UX improvement over the present state.

1. After 300 seconds (default), the Taint Manager tries to delete Pods on the Node after detecting that the Node is NotReady. The Pods will be stuck in terminating status.

Proposed logic change:
1. [Proposed change] This proposal requires a user to apply an `out-of-service` taint on a node when the user has confirmed that the node is shut down or in a non-recoverable state due to hardware failure or a broken OS. Note that the user should only add this taint if the node is not coming back, at least for some time. If the node is in the middle of restarting, this taint should not be used.
@derekwaynecarr (Member):

Is there any guidance for when a user should apply this taint? I am trying to think about what automated system would apply this taint.
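For illustration, here is how an operator or external automation (for example a fencing agent that has verified the node is powered off) might apply the proposed taint; this client-go sketch is equivalent to `kubectl taint nodes <node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`.

```go
package nodeshutdown

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markOutOfService applies the proposed taint after the operator has
// confirmed out-of-band that the node is truly down. A plain Update
// (no retry-on-conflict) keeps the sketch short.
func markOutOfService(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	})
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```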

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2022
@xing-yang (Contributor):
@derekwaynecarr Addressed your comments. Please take a look. Thanks.

@jingxu97 (Contributor) commented Feb 2, 2022:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2022
@derekwaynecarr (Member):
For sig-node, this generally looks good to me.

Deferring to @gnufied for resolution of his question about the overlapping taint.

/lgtm

@gnufied (Member) commented Feb 3, 2022:

/lgtm

@xing-yang (Contributor):
/hold cancel

Got lgtm from sig-node

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit 9136495 into kubernetes:master Feb 3, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 3, 2022
@thockin (Member) commented Feb 3, 2022:

w00t!
