Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add node shutdown KEP #2001

Merged
merged 3 commits into from
Oct 2, 2020
Merged

Conversation

bobbypage
Copy link
Member

@bobbypage bobbypage commented Sep 21, 2020

Enhancement issue: #2000

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 21, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 21, 2020
@bobbypage bobbypage force-pushed the node-shutdown-kep branch 2 times, most recently from 7f8e11e to fa57c03 Compare September 21, 2020 21:49
@bobbypage bobbypage mentioned this pull request Sep 21, 2020
26 tasks
@bobbypage bobbypage force-pushed the node-shutdown-kep branch 4 times, most recently from 9112cdf to 7dde60e Compare September 21, 2020 22:37
@karan
Copy link

karan commented Sep 22, 2020

/cc @karan

/cc @SergeyKanzhelev

@k8s-ci-robot k8s-ci-robot requested a review from karan September 22, 2020 17:06
@derekwaynecarr
Copy link
Member

@bobbypage @mrunalp thanks for putting this together!

I have no major issues with the proposal as-is other than mechanical questions about testing.

I am happy if we want to merge and iterate on just that section of the KEP as the other parts of the KEP all lgtm.

I will let @dchen1107 take a final pass.

/assign @dchen1107

@bobbypage
Copy link
Member Author

Thanks @derekwaynecarr for taking a look! Will followup with @dchen1107 for a final pass. I put this KEP on agenda for upcoming SIG-Node, me and @mrunalp are happy to discuss this in more detail there.

@dchen1107
Copy link
Member

We discussed this KEP at today's SIG node meeting. A couple more feedback raised at the meeting:

  • Suggested by @marosset: Windows has similar APIs to register for shutdown events and delay shutdown. I'd love to see Windows support added here as well but not sure we'll have resources to work on this at the same time as the Linux implementations. I'll add links to the windows APIs to the doc PR tho.

We discussed it briefly, and we should include Windows Node support in KEP, but not alpha blocker here.

  • Suggested by me in addition to two issues I raised above (1) 2s for critical system pods 2) reset node the config through kubelet's config): Instead of introducing the new taint, and new node condition, Kubelet could send node status by marking the node NotReady with detail message like: the node is shutting down.

@dchen1107
Copy link
Member

@bobbypage I approved your KEP for now. Please address all above comments and ping me for another review.

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2020
* Don’t handle node shutdown events at all, and have users drain nodes before
shutting them down.
* This is not always possible, for example if the shutdown is controlled by
some external system (e.g. Preemptible VMs).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "Preemptible VMs" in all cloud providers trigger the shutdown event on termination?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(assuming the OS for the VM is running a compatible systemd version)

Copy link
Member Author

@bobbypage bobbypage Sep 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question -- every cloud provider will eventually have to shutdown the VM and terminate it, so at some point the VM shutdown event should be sent.

For example, on GCE when a GCE Preemptible VM is terminated it gets 30 second time to shutdown, and the shutdown event is delivered at t-30 where t is when it will be forced shutdown. On AWS spot instances, the period is 2 minutes, but I'm unclear if the shutdown event is delivered a t-2min or at t itself.

In addition to the systemd shutdown event, each cloud provider usually has a specific metadata server local to the VM that can be used to poll for preemption events. If the specific cloud provider doesn't actually trigger shutdown prior to the VM getting preempted, one workaround would be to have a cloud specific daemonset deployed that can poll the metadata server of the VM for preemption event and upon receiving the terminating event, simply trigger a node shutdown, i.e. systemctl poweroff, which will initiate kubelet graceful shutdown.

@bobbypage
Copy link
Member Author

Thanks @dchen1107 and @derekwaynecarr and rest of the folks from SIG-Node today for your feedback on this proposal. I will followup to address the comments here and would like to merge this and mark it implementable, so we can get the KEP in before the enhancements freeze for 1.20 which is October 6.

Copy link
Member

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will followup to address the comments here and would like to merge this and mark it implementable, so we can get the KEP in before the enhancements freeze for 1.20 which is October 6.

Hi there, If you want this in 1.20, you need to: Update the related Issue to get it in the milestone, add graduation criteria (alpha, beta, etc) and mark this as implementable.

Copy link
Member

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bobbypage Thanks for the KEP, LGTM. Added some simple comments and suggestions, but nothing major :)

* Change ready status to false during node shutdown
* Add note about new KubeletConfig option,
`ShutdownGracePeriodCriticalPods`, to configure shutdown
gracePeriod for critical pods
* Update status to implementable
@bobbypage
Copy link
Member Author

/retest

@bobbypage
Copy link
Member Author

bobbypage commented Oct 2, 2020

I've updated the KEP based on the feedback so far (changed to use ReadyStatus and option to configure grace period for critical pods) as mentioned in #2001 (comment) .

I've also updated the KEP and corresponding enhancement issue (#2000) to implementable status targeting 1.20 as discussed during the SIG-Node meeting. Please let me know if there's any other concerns.

Pinging @dchen1107 for final approval.

Copy link
Member

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted 2 things from an enhancements team POV

keps/sig-node/2000-graceful-node-shutdown/kep.yaml Outdated Show resolved Hide resolved
@bobbypage
Copy link
Member Author

/retest

@dchen1107
Copy link
Member

@bobbypage thanks for addressing our comments except Windows specific ones. Mark agreed to send the followup PRs to the KEP later.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 2, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobbypage, dchen1107

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit cd1f616 into kubernetes:master Oct 2, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Oct 2, 2020
“critical system pods”, and regular pods. Critical system pods should be
terminated last, because for example, if the logging pod is terminated first,
logs from the other workloads will not be captured. Critical system pods are
identified as those that are in the `system-cluster-critical` or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this when it came out, but this is super concerning to me. I don't think these priority classes are "special" in that the Kubelet should hardcode their use. By hardcoding these, we FORCE workloads that interact with the kubelet or other system infra pods to be in these two priority classes, which breaks the orthogonality of scheduling and resource behavior from the kubelet.

Either there needs to be something that selects for priority classes to treat "specially", or these need to be configurable at startup time in the kubelet. The former is more flexible, the latter may be more acceptable, but would not allow a service provider to allow users to make that orthogonal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bobbypage while reviewing the KEP (i was reading through it and thinking about the implications of the intersection with grace period when i noticed this), I think this has to be addressed before we go to beta.

Copy link
Member Author

@bobbypage bobbypage Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @smarterclayton

Thanks so much for providing your feedback and comments.

@mrunalp and I discussed this topic in length as part of KEP design and we decided to use priority class of system-cluster-critical and system-node-critical to separate "core system workloads (e.g logging deamonsets, etc) vs regular pods and use that information to determine shutdown ordering.

Unfortunately as I'm sure you're aware there is no existing declarative mechanism to describe pod shutdown ordering and as such we decided to use pod priority as a signal instead. This is similar to for example pod admission/preemption and OOM score adjustment which also uses IsCriticalPod() as a signal today.

@mrunalp and I definitely agree this is not perfect, and happy to discuss if you have some alternative ideas on how to improve it and a potentially better signal for us to use to determine pod shutdown order logic instead for beta moving forward.

@bobbypage
Copy link
Member Author

bobbypage commented Jan 29, 2021

We had a chat today with @smarterclayton @mrunalp @SergeyKanzhelev regarding some of questions in #2001 (comment)

The main item we discussed was around the current design of having two shutdown phases, first being shutting down user workloads followed by "critical node system workloads" and the current pattern of using system-cluster-critical and system-node-critical to separate system workloads vs user workloads.

Some notes from our discussion:

  • Currently users who want to get their workload into the later "critical" shutdown phase need to configure their pod spec's priority class to be system-cluster-critical or system-node-critical. However, some pods may want to get into "critical" shutdown phase, but are using some other priority class.

    • Does it make sense to make configurable what priority classes are considered critical vs not instead of hardcoding system-cluster-critical and system-node-critical as today?
  • Node Graceful shutdown currently splits the total shutdown time into two phases, first user workloads, followed by critical pods.

    • Does it make sense to provide users ability to partition the shutdown time into more than the existing two phases?
      • Perhaps providing ability to configure a list of priority classes (or maybe range of priorities) to a shutdown phase and an associated amount of time for each phase?

@SergeyKanzhelev
Copy link
Member

I think one more small piece of feedback was to allow to use leftover of one phase to extend another. I.e. if user pods terminated very quickly, let critical pods to use the rest of the time.

@bobbypage
Copy link
Member Author

To circle back on #2001 (comment) regarding supporting "custom" priority classes for node shutdown other than system-cluster-critical and system-node-critical, we discussed this point in SIG-node with some other folks and @mrunalp.

Ultimately we decided that it's not clear how many users are actually making use of custom priority classes and would want to partition the shutdown time per specific priority class, making it a bit of niche requirement and more complicated.

We decided to proceed to beta with the current design. If we'll get more feedback / data that supporting configurable shutdown time per "custom" priority class, we can always add this capability as followup post beta.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.