Add node shutdown KEP #2001
Conversation
Force-pushed from 7f8e11e to fa57c03
Force-pushed from 9112cdf to 7dde60e
/cc @karan /cc @SergeyKanzhelev
@bobbypage @mrunalp thanks for putting this together! I have no major issues with the proposal as-is other than mechanical questions about testing. I am happy if we want to merge and iterate on just that section of the KEP as the other parts of the KEP all lgtm. I will let @dchen1107 take a final pass. /assign @dchen1107
Force-pushed from 7dde60e to 32e4c47
Thanks @derekwaynecarr for taking a look! Will follow up with @dchen1107 for a final pass. I put this KEP on the agenda for the upcoming SIG-Node meeting; @mrunalp and I are happy to discuss this in more detail there.
We discussed this KEP at today's SIG Node meeting. A couple more pieces of feedback were raised at the meeting:
We discussed it briefly; we should include Windows Node support in the KEP, but it is not an alpha blocker here.
@bobbypage I approved your KEP for now. Please address all the above comments and ping me for another review. /approve
* Don’t handle node shutdown events at all, and have users drain nodes before shutting them down.
  * This is not always possible, for example if the shutdown is controlled by some external system (e.g. Preemptible VMs).
Does "Preemptible VMs" in all cloud providers trigger the shutdown event on termination?
(assuming the OS for the VM is running a compatible systemd version)
That's a good question -- every cloud provider will eventually have to shut down the VM and terminate it, so at some point the VM shutdown event should be sent.
For example, on GCE when a GCE Preemptible VM is terminated it gets a 30-second window to shut down, and the shutdown event is delivered at t-30s, where t is when it will be forcibly shut down. On AWS spot instances, the period is 2 minutes, but I'm unclear if the shutdown event is delivered at t-2min or at t itself.
In addition to the systemd shutdown event, each cloud provider usually has a specific metadata server local to the VM that can be used to poll for preemption events. If the specific cloud provider doesn't actually trigger shutdown prior to the VM getting preempted, one workaround would be to deploy a cloud-specific daemonset that polls the VM's metadata server for the preemption event and, upon receiving the terminating event, simply triggers a node shutdown, i.e. `systemctl poweroff`, which will initiate kubelet graceful shutdown.
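To make that workaround concrete, here is a minimal, hypothetical sketch (not part of the KEP) of such a per-node agent for GCE: it polls the VM's metadata server for the preemption flag and, when the flag flips, starts a normal systemd shutdown so the kubelet's graceful-shutdown handling takes over. The metadata URL and `Metadata-Flavor` header are GCE-specific; other clouds expose equivalent endpoints.

```go
package main

import (
	"io"
	"net/http"
	"os/exec"
	"strings"
	"time"
)

// GCE metadata endpoint that returns "TRUE" once the VM is scheduled for preemption.
const preemptedURL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"

// preempted polls the metadata server and reports whether preemption has started.
func preempted() bool {
	req, err := http.NewRequest("GET", preemptedURL, nil)
	if err != nil {
		return false
	}
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false
	}
	return strings.TrimSpace(string(body)) == "TRUE"
}

func main() {
	for range time.Tick(5 * time.Second) {
		if preempted() {
			// A normal systemd poweroff; the kubelet observes the shutdown via
			// its inhibitor lock and starts graceful pod termination.
			_ = exec.Command("systemctl", "poweroff").Run()
			return
		}
	}
}
```

Deployed as a privileged DaemonSet (or a host-level unit), this would give clouds without a native shutdown signal the same systemd-based graceful-shutdown path.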
Thanks @dchen1107 and @derekwaynecarr and the rest of the folks from SIG-Node today for your feedback on this proposal. I will follow up to address the comments here and would like to merge this and mark it implementable, so we can get the KEP in before the enhancements freeze for 1.20, which is October 6.
Force-pushed from 32e4c47 to 2fb4744
I will followup to address the comments here and would like to merge this and mark it implementable, so we can get the KEP in before the enhancements freeze for 1.20 which is October 6.
Hi there, if you want this in 1.20, you need to: update the related issue to get it into the milestone, add graduation criteria (alpha, beta, etc.), and mark this as implementable.
@bobbypage Thanks for the KEP, LGTM. Added some simple comments and suggestions, but nothing major :)
Force-pushed from ee49010 to d2c6556
* Change ready status to false during node shutdown
* Add note about new KubeletConfig option, `ShutdownGracePeriodCriticalPods`, to configure shutdown gracePeriod for critical pods
* Update status to implementable
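For illustration, here is a small sketch (not kubelet code; it only assumes the semantics the KEP describes for the `ShutdownGracePeriod` and `ShutdownGracePeriodCriticalPods` options) of how the two durations partition the total shutdown window: critical pods get the critical-pod grace period at the end, and regular pods get whatever remains before it.

```go
package main

import (
	"fmt"
	"time"
)

// shutdownPhases splits the total shutdown grace period into the two phases
// described in the KEP: regular pods are terminated first and get
// total - critical; critical system pods are terminated last and get critical.
func shutdownPhases(total, critical time.Duration) (regularPods, criticalPods time.Duration) {
	if critical > total {
		critical = total // guard against misconfiguration
	}
	return total - critical, critical
}

func main() {
	// Example values only; in the real kubelet these would come from the
	// ShutdownGracePeriod and ShutdownGracePeriodCriticalPods config options.
	regular, critical := shutdownPhases(30*time.Second, 10*time.Second)
	fmt.Printf("regular pods: %v, critical pods: %v\n", regular, critical)
}
```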
Force-pushed from d2c6556 to e49f2ca
/retest
I've updated the KEP based on the feedback so far (changed to use ReadyStatus and added an option to configure the grace period for critical pods) as mentioned in #2001 (comment). I've also updated the KEP and the corresponding enhancement issue (#2000) to implementable status targeting 1.20 as discussed during the SIG-Node meeting. Please let me know if there are any other concerns. Pinging @dchen1107 for final approval.
Noted 2 things from an enhancements team POV
Force-pushed from a80cffb to f419e61
/retest
@bobbypage thanks for addressing our comments except the Windows-specific ones. Mark agreed to send the follow-up PRs to the KEP later. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bobbypage, dchen1107. The full list of commands accepted by this bot can be found here. The pull request process is described here.
“critical system pods”, and regular pods. Critical system pods should be terminated last, because for example, if the logging pod is terminated first, logs from the other workloads will not be captured. Critical system pods are identified as those that are in the `system-cluster-critical` or
I missed this when it came out, but this is super concerning to me. I don't think these priority classes are "special" in that the Kubelet should hardcode their use. By hardcoding these, we FORCE workloads that interact with the kubelet or other system infra pods to be in these two priority classes, which breaks the orthogonality of scheduling and resource behavior from the kubelet.
Either there needs to be something that selects for priority classes to treat "specially", or these need to be configurable at startup time in the kubelet. The former is more flexible, the latter may be more acceptable, but would not allow a service provider to allow users to make that orthogonal.
@bobbypage while reviewing the KEP (I was reading through it and thinking about the implications of the intersection with grace period when I noticed this), I think this has to be addressed before we go to beta.
Thanks so much for providing your feedback and comments.
@mrunalp and I discussed this topic at length as part of the KEP design, and we decided to use the priority classes `system-cluster-critical` and `system-node-critical` to separate core system workloads (e.g. logging daemonsets, etc.) from regular pods and use that information to determine shutdown ordering.
Unfortunately, as I'm sure you're aware, there is no existing declarative mechanism to describe pod shutdown ordering, and as such we decided to use pod priority as a signal instead. This is similar to, for example, pod admission/preemption and OOM score adjustment, which also use `IsCriticalPod()` as a signal today.
@mrunalp and I definitely agree this is not perfect, and we're happy to discuss if you have alternative ideas on how to improve it and a potentially better signal we could use to determine pod shutdown ordering for beta moving forward.
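For clarity, here is a minimal sketch (not the kubelet's actual implementation) of the ordering signal described above, assuming pods are partitioned purely by the two built-in priority class names:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// isCriticalPriorityClass mirrors the rule quoted from the KEP: a pod is
// "critical" for shutdown ordering if it uses one of the two built-in classes.
func isCriticalPriorityClass(pod *v1.Pod) bool {
	return pod.Spec.PriorityClassName == "system-cluster-critical" ||
		pod.Spec.PriorityClassName == "system-node-critical"
}

// splitByShutdownPhase partitions pods into the two shutdown phases:
// regular pods are terminated first, critical system pods last.
func splitByShutdownPhase(pods []*v1.Pod) (regular, critical []*v1.Pod) {
	for _, p := range pods {
		if isCriticalPriorityClass(p) {
			critical = append(critical, p)
		} else {
			regular = append(regular, p)
		}
	}
	return regular, critical
}

func main() {
	pods := []*v1.Pod{
		{Spec: v1.PodSpec{PriorityClassName: "system-node-critical"}},
		{Spec: v1.PodSpec{PriorityClassName: ""}},
	}
	regular, critical := splitByShutdownPhase(pods)
	fmt.Printf("regular: %d, critical: %d\n", len(regular), len(critical))
}
```

The concern raised above is that hardcoding these two class names couples scheduling priority to kubelet shutdown behavior; making the set of "special" classes selectable or configurable would decouple them.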
We had a chat today with @smarterclayton @mrunalp @SergeyKanzhelev regarding some of the questions in #2001 (comment). The main item we discussed was the current design of having two shutdown phases, the first being shutting down user workloads followed by "critical node system workloads", and the current pattern of using the critical priority classes as the signal for that split. Some notes from our discussion:
I think one more small piece of feedback was to allow the leftover time from one phase to extend another, i.e. if user pods terminate very quickly, let critical pods use the rest of the time.
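A tiny sketch of what that suggestion could look like (hypothetical helper, not in the KEP's current design): unused time from the regular-pod phase is carried over into the critical-pod phase.

```go
package main

import (
	"fmt"
	"time"
)

// criticalPhaseBudget adds any unused time from the regular-pod phase to the
// critical-pod phase, as suggested in the comment above.
func criticalPhaseBudget(criticalBudget, regularBudget, regularUsed time.Duration) time.Duration {
	leftover := regularBudget - regularUsed
	if leftover < 0 {
		leftover = 0
	}
	return criticalBudget + leftover
}

func main() {
	// Regular pods had 20s but finished in 5s, so critical pods get their 10s
	// plus the unused 15s.
	fmt.Println(criticalPhaseBudget(10*time.Second, 20*time.Second, 5*time.Second))
}
```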
To circle back on #2001 (comment) regarding supporting "custom" priority classes for node shutdown other than the built-in critical ones: ultimately we decided that it's not clear how many users are actually making use of custom priority classes and would want to partition the shutdown time per specific priority class, making it a bit of a niche requirement and more complicated. We decided to proceed to beta with the current design. If we get more feedback / data that supporting a configurable shutdown time per "custom" priority class is worthwhile, we can always add this capability as a follow-up post-beta.
Enhancement issue: #2000