add non-graceful node shutdown KEP #1116
Conversation
It might be helpful for others to think about the kubelet changes in two parts:
1. Introduce a new taint which will prevent the kubelet from mounting any volumes.
2. The kubelet gains new behavior in scenarios where it can't acquire its node lease. The behavior is equivalent to having a set of taints applied to the node. These taints can still be tolerated by pods.
(1) seems reasonable to me, although I would suggest scoping the taint to a behavior, rather than a scenario.
(2) makes sense to me.
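As an illustration of (1) only — not something specified in this thread — a behavior-scoped taint could look roughly like the sketch below; the taint key, package, and helper names are hypothetical:

```go
package kubeletsketch

import (
	corev1 "k8s.io/api/core/v1"
)

// Hypothetical behavior-scoped taint key ("do not mount volumes"), as opposed
// to a scenario-scoped key such as "node-shutdown". Not a real Kubernetes key.
const noMountTaintKey = "example.kubernetes.io/no-mount"

// hasNoMountTaint reports whether the node carries the hypothetical taint; a
// kubelet honoring it would refuse to mount volumes for pods on this node.
func hasNoMountTaint(node *corev1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == noMountTaintKey && taint.Effect == corev1.TaintEffectNoExecute {
			return true
		}
	}
	return false
}
```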
we also need to apply changes to the kubelet; at startup the kubelet will:
1. acquire the Lease:
I'm pretty sure the kubelet tries to acquire its lease today. What is the current behavior of the kubelet when it can't acquire its lease?
@dashpole - Yes, the kubelet uses the Lease object today to report its heartbeat, but I don't think it is used as a lock (i.e. at the moment anyone can update a node's lease, even though it's held by a kubelet).
Ah, that makes sense. That would be helpful to include in the KEP somewhere
Ack
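To make the distinction above concrete, here is a minimal sketch (assuming a configured client-go clientset; this is not the kubelet's actual code) of a heartbeat-style Lease renewal: the Lease is simply rewritten with a new renew time, with no holder check, so nothing stops another writer from updating it.

```go
package leasesketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// renewNodeLease rewrites spec.renewTime on the node's Lease in the
// kube-node-lease namespace. There is no leader-election or holder check, so
// this acts purely as a heartbeat, not as a lock.
func renewNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	lease, err := cs.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	now := metav1.NewMicroTime(time.Now())
	lease.Spec.RenewTime = &now
	_, err = cs.CoordinationV1().Leases("kube-node-lease").Update(ctx, lease, metav1.UpdateOptions{})
	return err
}
```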
This behaviour was very racy: while the node was being deleted, it could come back, register itself, list its pods and start the containers while volumes were being detached, leading to data corruption. To mitigate this, we explicitly set requirements on handling node objects for cloud providers [3], i.e. not deleting the node object unless the instance no longer exists.
This made it noticeable to users, and several were complaining about it. As a temporary mitigation we introduced a taint called node-shutdown, enabling cluster administrators to write automation against it, although most of the solutions at the time were racy.
Can you be more specific on the current state? If I read kubernetes/kubernetes#58635 correctly, the node shutdown taint addition was reverted. When it was last discussed with sig-node, we asked for the taint to be more specific to behavior (e.g. "no-mount" to mirror no-execute), rather than a scenario, "node-shutdown".
kubernetes/kubernetes#58635 was reverted due to some issues, but the implementation went in. Given that it's a taint on the node object, I'd expect we cannot remove it anytime soon due to API guarantees.
Deferring to @smarterclayton @liggitt.
It only has API guarantees if it was released with a minor version. Can you point me to the PR(s) for the "implementation" that went in, and weren't reverted?
kubernetes/kubernetes#60009 added it back
Thanks for the pointer. There were no API changes involved in that PR, and the kubelet doesn't use it, so there isn't anything preventing us from changing it if we think that is appropriate.
Adding a taint is an API change, just like adding/removing/renaming a label. It's a slightly different set of constraints, but for instance beta.x.node.kubernetes.io will outlive us all, so thinking about any field a controller sets for the consumption of others as an API is accurate.
I see, thanks @smarterclayton. We can't stop adding the taint to nodes for API reasons. However, we aren't required to use this particular taint for this KEP, or add any behavior to the kubelet based on the node shutdown taint. For the purpose of this KEP, we should consider the choice and design of the taints used as in-scope, especially given the existing taint has not been approved in a KEP as far as I'm aware.
* Kube-Controller-Manager
* Kubelets
I would add "user workloads" here. A workload has to describe the tradeoffs between availability and partition tolerance that it will accept. That includes:
- access to exclusive devices/volumes (for instance iSCSI is really an RWM server but they may want kube to enforce RWO)
- toleration of node shutdown at the pod level (the amount of outage the pod tolerates)
- the type of workload controller (replica set or stateful set or DaemonSet)
- additional future signals
The outcome we need to reach to move Kube forward is to
- Consider a change to the default behavior that tightens the guarantees on pod behavior so pods are still safe by default (no unanticipated behavior) but users don't have to know to ask for better guarantees
- Provide the tunable that allows a workload to relax / tighten those constraints
On 2, how do you see users describing constraints around their workload's guarantees?
It has to be something similar to the existing toleration (unready state) which is applied on the control plane side. Questions include: do we need a new toleration to be backwards compatible? is it an error or a feature to have different toleration lengths (tolerate api server down for 600s but tolerate node unready for 30s)? should stateful sets encourage setting both tolerations? does setting the toleration change statefulset behavior?
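For illustration (example values, not a recommendation from this thread), the existing control-plane-applied toleration mentioned above can already be bounded per workload with `tolerationSeconds`; a sketch in Go using the core/v1 types:

```go
package tolerationsketch

import corev1 "k8s.io/api/core/v1"

// unreachableToleration builds a toleration for the built-in
// node.kubernetes.io/unreachable taint, bounding how long a pod stays bound to
// an unreachable node before the control plane may evict it.
func unreachableToleration(seconds int64) corev1.Toleration {
	return corev1.Toleration{
		Key:               "node.kubernetes.io/unreachable",
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}
}
```

Whether a new, shutdown-specific toleration is needed on top of this is exactly the open question above.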
For data consistency, SCSI persistent reservations or lockfiles could already be used to provide RWO on RWM media.
if the underlying storage doesn't support it, then maybe a mutex could be implemented by the cloud provider / PV plugin.
i. Tolerate the node shutdown taint
ii. Is a static Pod
b. success: start normally
I expected to see something here about shutdown - what part of shutdown produces a different signal than "partitioned from the master"?
Network partition != known clean shutdown.
@smarterclayton - it is implied here that a network partition won't be detected as a shutdown of the node.
I also expected to see some discussion of the workload specifying that it only tolerates a specific maximum separation from the master in seconds before the kubelet shuts down the container. That is the necessary component of a partitioned kubelet that allows another actor (on the control plane side) to reschedule the workload.
I'm not sure I understand here. Are you suggesting adding an example of how stateful workloads can tolerate being separated from the master in case of a node shutdown?
In the latest KEP design, we want to limit this to the node shutdown case only (or unrecoverable hardware failure). In these cases, the system (admin) is certain that the node is in a shutdown state and that the kubelet and all its workloads are not running.
/cc @wojtek-t
/cc @cdickmann
@yastij: GitHub didn't allow me to request PR reviews from the following users: cdickmann. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
2. apply a node-shutdown taint
3. tries to hold node heartbeat lease
3.a. fails: backoff and retry; if max retries hit -> aborts
3.b. success: hold the lease for this node: if the node comes back at this moment it won’t be able to acquire the lease -> pods won’t start
That isn't true currently - the node lease isn't used to block anything today - we simply write it and don't use any leader-election code. We assumed that no one else will be writing it.
Changing that is potentially a non-trivial task on the kubelet side and would need to be discussed with node folks.
@yujuhong @wangzhen127
aaah ok - you're writing about changing the kubelet code below.
4. register node as being processed in order to renew the Lease periodically
I don't understand this one - can you clarify?
When detecting that a node is shut down we'll start operations (deleting the pods, removing the attachment, executing the detach call, etc.); we need to register that the node is being processed and hold the lease until we're done.
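A very rough sketch of that "register and hold the lease" step, assuming a configured clientset; the holder identity, the "already held" semantics, and the namespace handling are illustrative only (as noted above, the node Lease is not used as a lock today), and a real implementation would more likely build on client-go's leader-election helpers:

```go
package shutdownsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical identity written into the Lease while the controller is
// deleting pods and detaching volumes for a shut-down node.
const controllerIdentity = "shutdown-controller"

// claimNodeLease marks the node's Lease as held by the controller so that a
// kubelet coming back mid-cleanup could detect the node is being processed.
func claimNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	lease, err := cs.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if lease.Spec.HolderIdentity != nil &&
		*lease.Spec.HolderIdentity != nodeName &&
		*lease.Spec.HolderIdentity != controllerIdentity {
		return fmt.Errorf("lease for node %s already held by %s", nodeName, *lease.Spec.HolderIdentity)
	}
	holder := controllerIdentity
	lease.Spec.HolderIdentity = &holder
	// The Update fails on a resource-version conflict if someone else modified
	// the Lease first; that optimistic concurrency is all the "locking" here.
	_, err = cs.CoordinationV1().Leases("kube-node-lease").Update(ctx, lease, metav1.UpdateOptions{})
	return err
}
```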
/assign
kube-controller-manager needs to provide a controller that:
Calls the cloud provider
This assumes an in-tree cloud provider; would it be possible to assume out-of-tree first for this KEP?
This assumes an out-of-tree cloud provider; this call is made through the cloud interface.
I'll add a clear statement on this.
Right, but if you're calling the CPI from kube-controller-manager, that is in-tree, right?
Maybe we should package it as part of the cloud-controller-manager?
Maybe this is a bit aggressive, but I would take it a step further and say only enable it as part of cloud-controller-manager. Piling more features into in-tree CPI makes the migration process harder than it already is. This is a great incentive to migrate to the out-of-tree model. Would like to hear other people's thoughts on this, maybe the trade-off isn't worth it.
I agree on avoiding piling up features for the in-tree CPI.
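For reference, a sketch of the kind of cloud-provider check being discussed, written against the `InstancesV2` interface from `k8s.io/cloud-provider` (which an out-of-tree cloud-controller-manager can implement); whether this belongs in kube-controller-manager or only in cloud-controller-manager is the open question above, and error handling and wiring are elided:

```go
package cloudsketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	cloudprovider "k8s.io/cloud-provider"
)

// nodeIsShutDown asks the cloud provider whether the instance backing the node
// still exists and, if so, whether it is shut down.
func nodeIsShutDown(ctx context.Context, instances cloudprovider.InstancesV2, node *corev1.Node) (bool, error) {
	exists, err := instances.InstanceExists(ctx, node)
	if err != nil || !exists {
		return false, err
	}
	return instances.InstanceShutdown(ctx, node)
}
```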
### Version Skew Strategy
This feature requires version parity between the control plane and the kubelet; this behavior shouldn't be enabled in the case of older kubelets.
This seems problematic? We need to support version skew for the upgrade case. What happens when we promote `NodeShutdownFailover` to Beta (enabled by default) and the control plane has it enabled but upgrading kubelets don't? If a kubelet goes down during upgrade, the control plane will hold the lease and the kubelet will just ignore it?
@andrewsykim - good point, maybe it'll require adding a flag `--node-shutdown-failover` that would default to `false`?
Thinking out loud: is there a way to toggle it on the lease object instead? Maybe an annotation on the lease indicates to the kubelet to use the node shutdown failover mechanism. That way the feature is only enabled on the control plane?
That could work. We need a mechanism that can't change from one release to the next.
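A purely hypothetical sketch of the Lease-annotation toggle idea; the annotation key below does not exist and is only meant to show the shape of the mechanism the two comments above are describing:

```go
package togglesketch

import coordinationv1 "k8s.io/api/coordination/v1"

// Hypothetical annotation the control plane would set on a node's Lease to
// tell the kubelet that the shutdown-failover mechanism is in effect.
const shutdownFailoverAnnotation = "example.kubernetes.io/node-shutdown-failover"

// failoverEnabled reports whether the control plane has flagged the Lease.
func failoverEnabled(lease *coordinationv1.Lease) bool {
	return lease != nil && lease.Annotations[shutdownFailoverAnnotation] == "true"
}
```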
@jingxu97 Addressed your comment. PTAL. Thanks.
/lgtm
Signed-off-by: Yassine TIJANI <[email protected]>
This makes sense to me overall.
Can you clarify what happens in the KEP if the taint is applied when the node is not undergoing shutdown and is healthy? Is there any consequence of misapplying the taint that I may be missing?
If you can clarify the above, the rest seems like a generally good improved UX over the present state.
1. After 300 seconds (default), the Taint Manager tries to delete Pods on the Node after detecting that the Node is NotReady. The Pods will be stuck in terminating status.
Proposed logic change:
1. [Proposed change] This proposal requires a user to apply an `out-of-service` taint on a node when the user has confirmed that this node is shut down or in a non-recoverable state due to hardware failure or a broken OS. Note that the user should only add this taint if the node is not coming back, at least for some time. If the node is in the middle of restarting, this taint should not be used.
Is there any guidance for when a user should apply this taint? I am trying to think about what automated system would apply this taint.
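As one answer-shaped illustration (not guidance from the thread), an operator or automation that has confirmed the node is shut down might apply the taint roughly like this; the taint key and value are placeholders for whatever the KEP finalizes, and a real implementation would de-duplicate existing taints and use a patch rather than a full update:

```go
package taintsketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markNodeOutOfService adds the proposed out-of-service taint to a node that
// the operator has confirmed is shut down and not coming back soon.
func markNodeOutOfService(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "node.kubernetes.io/out-of-service", // placeholder key for the proposed taint
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	})
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```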
@derekwaynecarr Addressed your comments. Please take a look. Thanks.
/lgtm
For sig-node, this generally looks good to me. Defer to @gnufied for resolution on his question about the overlapping taint. /lgtm
/lgtm
/hold cancel Got
w00t!
Signed-off-by: Yassine TIJANI [email protected]
Opening a PR for a first round of reviews; documented other alternatives and how to handle version skew.
/assign @smarterclayton @liggitt @yujuhong @saad-ali
/cc @jingxu97 @derekwaynecarr @andrewsykim
/sig node
/sig cloud-provider
/sig storage
/sig scalability