add non-graceful node shutdown KEP #1116
Conversation
It might be helpful for others to think about the kubelet changes in two parts:
1. Introduce a new taint which will prevent the kubelet from mounting any volumes.
2. The kubelet gains new behavior in scenarios where it can't acquire its node lease. The behavior is equivalent to having a set of taints applied to the node. These taints can still be tolerated by pods.
(1) seems reasonable to me, although I would suggest scoping the taint to a behavior, rather than a scenario.
(2) makes sense to me.
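As an illustration of (1) only — not something specified in this thread — a behavior-scoped taint could look roughly like the sketch below; the taint key, package, and helper names are hypothetical:

```go
package kubeletsketch

import (
	corev1 "k8s.io/api/core/v1"
)

// Hypothetical behavior-scoped taint key ("do not mount volumes"), as opposed
// to a scenario-scoped key such as "node-shutdown". Not a real Kubernetes key.
const noMountTaintKey = "example.kubernetes.io/no-mount"

// hasNoMountTaint reports whether the node carries the hypothetical taint; a
// kubelet honoring it would refuse to mount volumes for pods on this node.
func hasNoMountTaint(node *corev1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == noMountTaintKey && taint.Effect == corev1.TaintEffectNoExecute {
			return true
		}
	}
	return false
}
```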
we also need to apply changes to the kubelet; at startup the kubelet will:
1. acquire the Lease:
I'm pretty sure the kubelet tries to acquire its lease today. What is the current behavior of the kubelet when it can't acquire its lease?
@dashpole - Yes, the kubelet uses the Lease object today to report its heartbeat, but I don't think it is used as a lock (i.e. at the moment anyone can update a node's lease, even though it's held by a kubelet).
Ah, that makes sense. That would be helpful to include in the KEP somewhere
Ack
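To make the distinction above concrete, here is a minimal sketch (assuming a configured client-go clientset; this is not the kubelet's actual code) of a heartbeat-style Lease renewal: the Lease is simply rewritten with a new renew time, with no holder check, so nothing stops another writer from updating it.

```go
package leasesketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// renewNodeLease rewrites spec.renewTime on the node's Lease in the
// kube-node-lease namespace. There is no leader-election or holder check, so
// this acts purely as a heartbeat, not as a lock.
func renewNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	lease, err := cs.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	now := metav1.NewMicroTime(time.Now())
	lease.Spec.RenewTime = &now
	_, err = cs.CoordinationV1().Leases("kube-node-lease").Update(ctx, lease, metav1.UpdateOptions{})
	return err
}
```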
This behaviour was very racy: while the node was being deleted, it could come back, register itself, list its pods and start the containers while volumes were being detached, leading to data corruption. To mitigate this, we explicitly set requirements on handling node objects for cloud providers [3], i.e. not deleting the node object unless the instance no longer exists.
This made it noticeable to users, and several were complaining about it. As a temporary mitigation we introduced a taint called node-shutdown, enabling cluster administrators to write automation against it, although most of the solutions at the time were racy.
Can you be more specific on the current state? If I read kubernetes/kubernetes#58635 correctly, the node shutdown taint addition was reverted. When it was last discussed with sig-node, we asked for the taint to be more specific to behavior (e.g. "no-mount" to mirror no-execute), rather than a scenario, "node-shutdown".
kubernetes/kubernetes#58635 was reverted due to some issues, but the implementation went in. Given that it's a taint on the node object, I'd expect we cannot remove it anytime soon due to API guarantees.
Deferring to @smarterclayton @liggitt.
It only has API guarantees if it was released with a minor version. Can you point me to the PR(s) for the "implementation" that went in, and weren't reverted?
kubernetes/kubernetes#60009 added it back
Thanks for the pointer. There were no API changes involved in that PR, and the kubelet doesn't use it, so there isn't anything preventing us from changing it if we think that is appropriate.
Adding a taint is an API change, just like adding/removing/renaming a label. It's a slightly different set of constraints, but for instance beta.x.node.kubernetes.io will outlive us all, so thinking about any field a controller sets for the consumption of others as an API is accurate.
I see, thanks @smarterclayton. We can't stop adding the taint to nodes for API reasons. However, we aren't required to use this particular taint for this KEP, or add any behavior to the kubelet based on the node shutdown taint. For the purpose of this KEP, we should consider the choice and design of the taints used as in-scope, especially given the existing taint has not been approved in a KEP as far as I'm aware.
* Kube-Controller-Manager
* Kubelets
I would add "user workloads" here. A workload has to describe the tradeoffs between availability and partition tolerance that it will accept. That includes:
- access to exclusive devices/volumes (for instance iSCSI is really an RWM server but they may want kube to enforce RWO)
- toleration of node shutdown at the pod level (the amount of outage the pod tolerates)
- the type of workload controller (replica set or stateful set or DaemonSet)
- additional future signals
The outcome we need to reach to move Kube forward is to
- Consider a change to the default behavior that tightens the guarantees on pod behavior so pods are still safe by default (no unanticipated behavior) but users don't have to know to ask for better guarantees
- Provide the tunable that allows a workload to relax / tighten those constraints
On 2, how do you see users describing constraints around their workload's guarantees?
It has to be something similar to the existing toleration (unready state) which is applied on the control plane side. Questions include: do we need a new toleration to be backwards compatible? is it an error or a feature to have different toleration lengths (tolerate api server down for 600s but tolerate node unready for 30s)? should stateful sets encourage setting both tolerations? does setting the toleration change statefulset behavior?
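For illustration (example values, not a recommendation from this thread), the existing control-plane-applied toleration mentioned above can already be bounded per workload with `tolerationSeconds`; a sketch in Go using the core/v1 types:

```go
package tolerationsketch

import corev1 "k8s.io/api/core/v1"

// unreachableToleration builds a toleration for the built-in
// node.kubernetes.io/unreachable taint, bounding how long a pod stays bound to
// an unreachable node before the control plane may evict it.
func unreachableToleration(seconds int64) corev1.Toleration {
	return corev1.Toleration{
		Key:               "node.kubernetes.io/unreachable",
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}
}
```

Whether a new, shutdown-specific toleration is needed on top of this is exactly the open question above.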
For data consistency, SCSI persistent reservations or lockfiles could already be used to provide RWO on RWM media.
if the underlying storage doesn't support it, then maybe a mutex could be implemented by the cloud provider / PV plugin.
i. Tolerate the node shutdown taint
ii. Is a static Pod
b. success: start normally
I expected to see something here about shutdown - what part of shutdown produces a different signal than "partitioned from the master"?
Network partition != known clean shutdown.
@smarterclayton - it is implied here that a network partition won't be detected as a shutdown of the node.
I also expected to see some discussion of the workload specifying that it only tolerates a specific maximum separation from the master in seconds before the kubelet shuts down the container. That is the necessary component of a partitioned kubelet that allows another actor (on the control plane side) to reschedule the workload.
I'm not sure I understand here. Are you suggesting adding an example of how stateful workloads can tolerate being separated from the master in case of a node shutdown?
In the latest KEP design, we want to limit this to the node shutdown case only (or unrecoverable hardware failure). In these cases, the system (admin) is certain that the node is in a shutdown state and that the kubelet and all its workloads are not running.
/cc @wojtek-t
/cc @cdickmann
@yastij: GitHub didn't allow me to request PR reviews from the following users: cdickmann. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
2. apply a node-shutdown taint
3. tries to hold node heartbeat lease
3.a. fails: backoff and retry; if max retries hit -> aborts
3.b. success: hold the lease for this node: if the node comes back at this moment it won’t be able to acquire the lease -> pods won’t start
That isn't true currently - the node lease isn't used to block anything today - we simply write it and don't use any leader-election code. We assumed that no one else will be writing it.
Changing that is potentially a non-trivial task on the kubelet side and would need to be discussed with node folks.
@yujuhong @wangzhen127
aaah ok - you're writing about changing the kubelet code below.
4. register node as being processed in order to renew the Lease periodically
I don't understand this one - can you clarify?
When detecting that a node is shut down we'll start operations (deleting the pods, removing the attachment, executing the detach call, etc.); we need to register that the node is being processed and hold the lease until we're done.
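A very rough sketch of that "register and hold the lease" step, assuming a configured clientset; the holder identity, the "already held" semantics, and the namespace handling are illustrative only (as noted above, the node Lease is not used as a lock today), and a real implementation would more likely build on client-go's leader-election helpers:

```go
package shutdownsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical identity written into the Lease while the controller is
// deleting pods and detaching volumes for a shut-down node.
const controllerIdentity = "shutdown-controller"

// claimNodeLease marks the node's Lease as held by the controller so that a
// kubelet coming back mid-cleanup could detect the node is being processed.
func claimNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	lease, err := cs.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if lease.Spec.HolderIdentity != nil &&
		*lease.Spec.HolderIdentity != nodeName &&
		*lease.Spec.HolderIdentity != controllerIdentity {
		return fmt.Errorf("lease for node %s already held by %s", nodeName, *lease.Spec.HolderIdentity)
	}
	holder := controllerIdentity
	lease.Spec.HolderIdentity = &holder
	// The Update fails on a resource-version conflict if someone else modified
	// the Lease first; that optimistic concurrency is all the "locking" here.
	_, err = cs.CoordinationV1().Leases("kube-node-lease").Update(ctx, lease, metav1.UpdateOptions{})
	return err
}
```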
/assign
kube-controller-manager needs to provide a controller that:
Calls the cloud provider
This assumes an in-tree cloud provider; would it be possible to assume out-of-tree first for this KEP?
This assumes an out-of-tree cloud provider; this call is made through the cloud interface.
I'll add a clear statement on this.
Right, but if you're calling the CPI from kube-controller-manager, that is in-tree, right?
Maybe we should package it as part of the cloud-controller-manager?
Maybe this is a bit aggressive, but I would take it a step further and say only enable it as part of cloud-controller-manager. Piling more features into in-tree CPI makes the migration process harder than it already is. This is a great incentive to migrate to the out-of-tree model. Would like to hear other people's thoughts on this, maybe the trade-off isn't worth it.
I agree on avoiding piling up features for the in-tree CPI.
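For reference, a sketch of the kind of cloud-provider check being discussed, written against the `InstancesV2` interface from `k8s.io/cloud-provider` (which an out-of-tree cloud-controller-manager can implement); whether this belongs in kube-controller-manager or only in cloud-controller-manager is the open question above, and error handling and wiring are elided:

```go
package cloudsketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	cloudprovider "k8s.io/cloud-provider"
)

// nodeIsShutDown asks the cloud provider whether the instance backing the node
// still exists and, if so, whether it is shut down.
func nodeIsShutDown(ctx context.Context, instances cloudprovider.InstancesV2, node *corev1.Node) (bool, error) {
	exists, err := instances.InstanceExists(ctx, node)
	if err != nil || !exists {
		return false, err
	}
	return instances.InstanceShutdown(ctx, node)
}
```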
### Version Skew Strategy
This feature requires version parity between the control plane and the kubelet; this behavior shouldn't be enabled in the case of older kubelets.
This seems problematic? We need to support version skew for the upgrade case. What happens when we promote `NodeShutdownFailover` to Beta (enabled by default) and the control plane has it enabled but upgrading kubelets don't? If a kubelet goes down during upgrade, the control plane will hold the lease and the kubelet will just ignore it?
@andrewsykim - good point, maybe it'll require adding a flag `--node-shutdown-failover` that would default to `false`?
Thinking out loud: is there a way to toggle it on the lease object instead? Maybe an annotation on the lease indicates to the kubelet to use the node shutdown failover mechanism. That way the feature is only enabled on the control plane?
That could work. We need a mechanism that can't change from one release to the next.
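A purely hypothetical sketch of the Lease-annotation toggle idea; the annotation key below does not exist and is only meant to show the shape of the mechanism the two comments above are describing:

```go
package togglesketch

import coordinationv1 "k8s.io/api/coordination/v1"

// Hypothetical annotation the control plane would set on a node's Lease to
// tell the kubelet that the shutdown-failover mechanism is in effect.
const shutdownFailoverAnnotation = "example.kubernetes.io/node-shutdown-failover"

// failoverEnabled reports whether the control plane has flagged the Lease.
func failoverEnabled(lease *coordinationv1.Lease) bool {
	return lease != nil && lease.Annotations[shutdownFailoverAnnotation] == "true"
}
```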
@jingxu97 Addressed your comment. PTAL. Thanks.
/lgtm
Signed-off-by: Yassine TIJANI <[email protected]>
This makes sense to me overall.
Can you clarify what happens in the KEP if the taint is applied when the node is not undergoing shutdown and is healthy? Is there any consequence of misapplying the taint that I may be missing?
If you can clarify the above, the rest seems like a generally good improved UX over the present state.
1. After 300 seconds (default), the Taint Manager tries to delete Pods on the Node after detecting that the Node is NotReady. The Pods will be stuck in terminating status.
Proposed logic change:
1. [Proposed change] This proposal requires a user to apply an `out-of-service` taint on a node when the user has confirmed that this node is shut down or in a non-recoverable state due to hardware failure or a broken OS. Note that the user should only add this taint if the node is not coming back, at least for some time. If the node is in the middle of restarting, this taint should not be used.
Is there any guidance for when a user should apply this taint? I am trying to think about what automated system would apply this taint.
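As one answer-shaped illustration (not guidance from the thread), an operator or automation that has confirmed the node is shut down might apply the taint roughly like this; the taint key and value are placeholders for whatever the KEP finalizes, and a real implementation would de-duplicate existing taints and use a patch rather than a full update:

```go
package taintsketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markNodeOutOfService adds the proposed out-of-service taint to a node that
// the operator has confirmed is shut down and not coming back soon.
func markNodeOutOfService(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "node.kubernetes.io/out-of-service", // placeholder key for the proposed taint
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	})
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```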
@derekwaynecarr Addressed your comments. Please take a look. Thanks.
/lgtm
For sig-node, this generally looks good to me. Defer to @gnufied for resolution on his question about the overlapping taint. /lgtm
/lgtm
/hold cancel Got
w00t!
Signed-off-by: Yassine TIJANI [email protected]
Opening a PR for a first round of reviews; documented other alternatives and how to handle version skew.
/assign @smarterclayton @liggitt @yujuhong @saad-ali
/cc @jingxu97 @derekwaynecarr @andrewsykim
/sig node
/sig cloud-provider
/sig storage
/sig scalability